Python for Data Analysis - CH.3 Built-in Data Structures, Functions, and Files
3.1 Data Structures and Sequences
[[Tuple]]
^bb464a
tup = 4, 5, 6
nested_tup = (4, 5, 6), (7, 8)
# You can convert any sequence or iterator to a tuple by invoking tuple:
tuple([4, 0, 2])
tup = tuple('string')
# access the element
tup[0]
While the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot:
tup = tuple(['foo', [1, 2], True])
tup[2] = False
# This won't work: it raises a TypeError because tuples do not support item assignment.
If an object inside a tuple is mutable, such as a list, you can modify it in-place:
tup[1].append(3)
# concatenate
(4, None, 'foo') + (6, 0) + ('bar',)
# Multiplying a tuple by an integer, as with lists, has the effect of concatenating together that many copies of the tuple:
('foo', 'bar') * 4
Note that the objects themselves are not copied, only the references to them.
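A minimal sketch of what "only the references are copied" means in practice. The names inner and repeated are illustrative and not from the book:

```python
inner = [1, 2]
repeated = (inner,) * 3      # three references to the same list object

inner.append(3)              # mutate the single underlying list

print(repeated)              # every slot reflects the change
print(repeated[0] is repeated[2])
```

Because all three slots point at one list, a change made through any of them is visible through the others.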
[[Unpacking tuple]]s
If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign:
tup=(4,5,6)
a,b,c=tup
b
# Even sequences with nested tuples can be unpacked:
tup=4,5,(6,7)
a,b,(c,d)=tup
# Using this functionality you can easily swap variable names
a,b=1,2
b,a=a,b
# A common use of variable unpacking is iterating over sequences of tuples or lists:
seq=[(1,2,3),(4,5,6),(7,8,9)]
for a,b,c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))
# Another common use is returning multiple values from a function.
Python 3 added more advanced tuple unpacking to help with situations where you may want to “pluck” a few elements from the beginning of a tuple. This uses the special syntax *rest, which is also used in function signatures to capture an arbitrarily long list of positional arguments:
values=1,2,3,4,5
a, b, *rest = values
a, b
rest
# This rest bit is sometimes something you want to discard; there is nothing special about the rest name. As a matter of convention, many Python programmers will use the underscore (_) for unwanted variables:
a, b, *_ = values
Tuple methods
Since the size and contents of a tuple cannot be modified, it is very light on instance methods.
A particularly useful one (also available on lists) is count, which counts the number of occurrences of a value:
a=(1,2,2,2,3,4,2)
a.count(2)
[[List]]
Lists are variable-length and their contents can be modified in-place. You can define them using square brackets [] or using the list type function:
a_list = [2, 3, 7, None]
tup = ('foo', 'bar', 'baz')
b_list = list(tup)
b_list[1] = 'peekaboo'
Lists and tuples are semantically similar (though tuples cannot be modified) and can be used interchangeably in many functions.
The list function is frequently used in data processing as a way to materialize an [[iterator]] or [[generator]] expression:
gen = range(10)
list(gen)
Adding and removing elements
b_list.append('dwarf')
b_list.insert(1, 'red')
b_list.pop(2)
b_list.remove('foo') # locates the first such value and removes it from the list
insert is computationally expensive compared with append, because references to subsequent elements have to be shifted internally to make room for the new element. If you need to insert elements at both the beginning and end of a sequence, you may wish to explore collections.deque, a double-ended queue, for this purpose.
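As a rough illustration of the deque suggestion, the sketch below adds to both ends without shifting elements. The variable names are made up for this example:

```python
from collections import deque

queue = deque([2, 3, 4])
queue.appendleft(1)       # cheap insertion at the front
queue.append(5)           # cheap insertion at the back
first = queue.popleft()   # cheap removal from the front

print(first, list(queue))
```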
Concatenating and combining lists
Similar to tuples, adding two lists together with + concatenates them.
If you have a list already defined, you can append multiple elements to it using the extend method:
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable.
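The two approaches can be compared side by side. This is only a sketch with a made-up list_of_lists; both loops produce the same result, but the second builds a new list on every iteration:

```python
list_of_lists = [[1, 2], [3, 4], [5, 6]]

# Usually preferable: grow one list in place
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)

# Slower alternative: each + creates a brand-new list and copies into it
everything_concat = []
for chunk in list_of_lists:
    everything_concat = everything_concat + chunk

print(everything)
```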
Sorting
You can sort a list in-place (without creating a new object) by calling its sort method:
a=[7,2,5,1,3]
a.sort()
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
sort has a few options that will occasionally come in handy. One is the ability to pass a secondary sort key, that is, a function that produces a value to use to sort the objects. In b.sort(key=len) above, the strings are sorted by their lengths.
Binary search and maintaining a sorted list
The built-in bisect module implements [[binary search]] and insertion into a sorted list. bisect.bisect finds the location where an element should be inserted to keep it sorted, while bisect.insort actually inserts the element into that location:
import bisect
c=[1,2,2,2,3,4,7]
bisect.bisect(c, 2)
bisect.bisect(c, 5)
bisect.insort(c, 6)
The bisect module functions do not check whether the list is sorted, as doing so would be computationally expensive. Thus, using them with an unsorted list will succeed without error but may lead to incorrect results.
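For reference, here is the same bisect example with the results captured in variables. The names pos_2 and pos_5 are illustrative:

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

pos_2 = bisect.bisect(c, 2)   # insertion point to the right of the existing 2s
pos_5 = bisect.bisect(c, 5)   # insertion point between 4 and 7
bisect.insort(c, 6)           # insert 6 while keeping c sorted

print(pos_2, pos_5, c)
```

bisect.bisect is an alias for bisect.bisect_right, which is why the position for 2 falls after the run of equal values.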
Slicing
seq=[7,2,3,7,5,6,0,1]
seq[1:5]
# Slices can also be assigned to with a sequence:
seq[3:4] = [6, 3]
# A step can also be used after a second colon to, say, take every other element
seq[::2]
# A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple:
seq[::-1]
While the element at the start index is included, the stop index is not included, so that the number of elements in the result is stop - start.
Either the start or stop can be omitted, in which case they default to the start of the sequence and the end of the sequence, respectively.
Negative indices slice the sequence relative to the end.
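A short sketch of these slicing rules, using a fresh copy of the original seq (before the slice assignment shown earlier):

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

head = seq[:5]         # start omitted: from the beginning
tail = seq[3:]         # stop omitted: through the end
last_four = seq[-4:]   # negative start: counted from the end
middle = seq[-6:-2]    # both indices negative

print(head, tail, last_four, middle)
print(len(seq[1:5]))   # stop - start elements
```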
Built-in Sequence Functions
[[enumerate()]]
It’s common when iterating over a sequence to want to keep track of the index of the current item.
# a DIY approach
i = 0
for value in collection:
    # do something with value
    i += 1
# Python has a built-in function, enumerate, which returns a sequence of (i, value) tuples:
for i, value in enumerate(collection):
    # do something with value
    pass
When you are indexing data, a helpful pattern that uses enumerate is computing a dict mapping the values of a sequence (which are assumed to be unique) to their locations in the sequence:
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping
[[sorted()]]
The sorted function returns a new sorted list from the elements of any sequence:
sorted('horse race')
The sorted function accepts the same arguments as the sort method on lists.
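To illustrate, sorted takes the same key and reverse arguments as list.sort but leaves the input untouched. The sample data here is only for demonstration:

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

by_length = sorted(words, key=len)                         # sort by string length
descending = sorted([7, 1, 2, 6, 0, 3, 2], reverse=True)   # largest first

print(by_length)
print(descending)
print(words)   # the original list is unchanged
```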
[[zip()]]
zip “pairs” up the elements of a number of lists, tuples, or other sequences to create an iterator of tuples, which you can materialize with list:
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2) # return a 'zip' object
list(zipped)
# zip can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence:
seq3 = [False, True]
list(zip(seq1, seq2, seq3))
# A very common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))
Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the sequence. Another way to think about this is converting a list of rows into a list of columns. The syntax, which looks a bit magical, is:
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'),
('Schilling', 'Curt')]
first_names, last_names = zip(*pitchers)
first_names
last_names
[[reversed()]]
reversed iterates over the elements of a sequence in reverse order.
Keep in mind that reversed is lazy, like a [[generator]]: it returns an iterator, so it does not create the reversed sequence until materialized (e.g., with list or a for loop).
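A small sketch of that lazy behavior. Note that the iterator can only be consumed once:

```python
rev = reversed(range(10))   # nothing is materialized yet

as_list = list(rev)         # consume the iterator
second_pass = list(rev)     # the iterator is now exhausted

print(as_list)
print(second_pass)
```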
[[dict]]
dict is likely the most important built-in Python data structure. A more common name for it is [[hash map]] or [[associative array]]. It is a flexibly sized collection of key-value pairs, where key and value are Python objects.
empty_dict = {}
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1[7] = 'an integer'
d1['b']
You can check if a dict contains a key using the same in syntax used for checking whether a list or tuple contains a value.
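A brief sketch of the membership test. Note that in checks keys, not values:

```python
d1 = {'a': 'some value', 'b': [1, 2, 3, 4]}

has_b = 'b' in d1                    # key lookup
has_z = 'z' in d1
value_as_key = 'some value' in d1    # values are not searched

print(has_b, has_z, value_as_key)
```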
You can delete values either using the del keyword or the pop method (which simultaneously returns the value and deletes the key):
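# add two entries first so there is something to delete
d1[5] = 'some value'
d1['dummy'] = 'another value'
del d1[5]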
d1
ret = d1.pop('dummy')
ret
d1
The keys and values methods give you [[iterator]]s of the dict’s keys and values, respectively. In Python 3.7 and later, dicts preserve insertion order, and these methods output the keys and values in the same order:
list(d1.keys())
list(d1.values())
You can merge one dict into another using the update method:
d1.update({'b' : 'foo', 'c' : 12})
The update method changes dicts in-place, so any existing keys in the data passed to update will have their old values discarded.
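The effect of update can be seen on a small dict. This sketch rebuilds d1 from scratch so it runs on its own:

```python
d1 = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

d1.update({'b': 'foo', 'c': 12})   # 'b' is overwritten, 'c' is added

print(d1)
```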
Creating dicts from sequences
Since a dict is essentially a collection of 2-tuples, the dict function accepts a list of 2-tuples:
mapping = dict(zip(range(5), reversed(range(5))))
[[dict comprehension]]
Default values
Methods: get and setdefault
[[dict.setdefault()]]
# value = some_dict.get(key, default_value)
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
by_letter
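# The setdefault method does the same thing in one step (starting again from an empty dict):
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)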
The built-in collections module has a useful class, defaultdict, which makes this even easier. To create one, you pass a type or function for generating the default value for each slot in the dict:
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
Valid dict key types
While the values of a dict can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is [[hashability]].
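One way to check hashability is to call the built-in hash function and see whether it raises TypeError. The helper is_hashable below is a hypothetical convenience written for this note, not a standard library function:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('string'))
print(is_hashable((1, 2, (2, 3))))
print(is_hashable((1, 2, [2, 3])))   # fails because the tuple contains a list

# To use a list as a key, convert it to a tuple first
d = {}
d[tuple([1, 2, 3])] = 5
```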
[[set]]
A set is an unordered collection of unique elements.
You can think of them like dicts, but keys only, no values. A set can be created in two ways: via the set function or via a set literal with curly braces:
set([2, 2, 2, 1, 3, 3])
{2,2,2,1,3,3}
Sets support mathematical set operations like union, intersection, difference, and symmetric difference
a={1,2,3,4,5}
b={3,4,5,6,7,8}
a.union(b)
a|b
a.intersection(b)
a&b
![[Pasted image 20220704163525.png]]
All of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result.
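A sketch of the in-place counterparts, working on copies so the original sets are left alone. The names c and d are illustrative:

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

c = a.copy()
c |= b    # in-place union, same as c.update(b)

d = a.copy()
d &= b    # in-place intersection, same as d.intersection_update(b)

print(c)
print(d)
```

For very large sets, the in-place forms may be more efficient because they avoid allocating a new set.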
Like dicts, set elements generally must be immutable. To store list-like elements, you must convert them to tuples:
my_data = [1, 2, 3, 4]
my_set = {tuple(my_data)}
my_set
You can also check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set:
a_set = {1, 2, 3, 4, 5}
{1, 2, 3}.issubset(a_set)
a_set.issuperset({1, 2, 3})
Sets are equal if and only if their contents are equal:
{1,2,3}=={3,2,1}
List, Set, and Dict Comprehensions
[expr for val in collection if condition]
Set and dict comprehensions are a natural extension, producing sets and dicts in an idiomatically similar way instead of lists.
dict_comp = {key-expr : value-expr for value in collection if condition}
set_comp = {expr for value in collection if condition}
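The three templates above can be filled in as follows. This is a small sketch with a sample list of strings:

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# list comprehension: uppercase the strings longer than two characters
upper_long = [x.upper() for x in strings if len(x) > 2]

# set comprehension: the distinct string lengths
unique_lengths = {len(x) for x in strings}

# dict comprehension: map each string to its position in the list
loc_mapping = {val: index for index, val in enumerate(strings)}

print(upper_long)
print(unique_lengths)
print(loc_mapping)
```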
[[Nested list comprehension]]s
At first, nested list comprehensions are a bit hard to wrap your head around. The for parts of the list comprehension are arranged according to the order of nesting, and any filter condition is put at the end as before. Keep in mind that the order of the for expressions would be the same if you wrote a nested for loop instead of a list comprehension.
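To make the ordering concrete, the sketch below flattens a list of tuples with a nested comprehension and with the equivalent nested for loop:

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# the for parts appear in the same order as in the loop version below
flattened = [x for tup in some_tuples for x in tup]

flattened_loop = []
for tup in some_tuples:
    for x in tup:
        flattened_loop.append(x)

# a filter condition still goes at the end
evens = [x for tup in some_tuples for x in tup if x % 2 == 0]

print(flattened)
print(evens)
```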
3.2 Functions
As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function.
Functions are declared with the def keyword and returned from with the return keyword.
There is no issue with having multiple return statements. If Python reaches the end of a function without encountering a return statement, None is returned automatically.
Each function can have [[positional argument]]s and [[keyword argument]]s. Keyword arguments are most commonly used to specify default values or optional arguments.
The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any).
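A minimal sketch of a function with two positional arguments, one keyword argument with a default, and multiple return statements. The function name and arithmetic are only for illustration:

```python
def my_function(x, y, z=1.5):
    # z is a keyword argument with a default value
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

r1 = my_function(5, 6, z=0.7)    # keyword given by name
r2 = my_function(3.14, 7, 3.5)   # keyword given by position
r3 = my_function(10, 20)         # default z is used

print(r1, r2, r3)
```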
Namespaces, Scope, and Local Functions
Functions can access variables in two different [[scope]]s: global and local. An alternative and more descriptive name describing a variable scope in Python is a [[namespace]].
Any variables that are assigned within a function by default are assigned to the local namespace. The local namespace is created when the function is called and immediately populated by the function’s arguments. After the function is finished, the local namespace is destroyed (with some exceptions that are outside the purview of this chapter).
def func():
    a = []
    for i in range(5):
        a.append(i)
# This differs from the following, where a is defined in the enclosing (global) scope:
a = []
def func():
    for i in range(5):
        a.append(i)
I generally discourage use of the global keyword. Typically global variables are used to store some kind of state in a system. If you find yourself using a lot of them, it may indicate a need for object-oriented programming (using classes).
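For completeness, this is roughly what assigning to a global from inside a function looks like. The function name bind_a_variable is illustrative:

```python
a = None

def bind_a_variable():
    global a   # without this line, the assignment would create a local a
    a = []

bind_a_variable()
print(a)
```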
Returning Multiple Values
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c
a, b, c = f()
# What’s happening here is that the function is actually just returning one object, namely a tuple, which is then being unpacked into the result variables.
# we could have done this instead:
return_value = f()
# A potentially attractive alternative to returning multiple values like before might be to return a dict instead:
def f():
    a = 5
    b = 6
    c = 7
    return {'a':a,'b':b,'c':c}
Functions Are Objects
You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind
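A sketch of treating functions as ordinary objects: a list of string-cleaning functions is applied in turn, and map is used with a built-in method. The helper names and sample data are assumptions made for this example:

```python
def remove_punctuation(value):
    # strip a few punctuation characters (illustrative, not exhaustive)
    return value.replace('!', '').replace('#', '').replace('?', '')

# functions stored in a list like any other objects
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

states = ['   Alabama ', 'Georgia!', 'south carolina##', 'West virginia?']
cleaned = clean_strings(states, clean_ops)

# map applies one function to every element of a sequence
stripped = list(map(str.strip, states))

print(cleaned)
print(stripped)
```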
Anonymous ([[Lambda]]) Functions
Python has support for so-called anonymous or lambda functions, which are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function”:
def short_function(x):
    return x * 2
equiv_anon = lambda x: x * 2
They are especially convenient in data analysis because, as you’ll see, there are many cases where data transformation functions will take functions as arguments.
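As one example of passing a lambda to another function, the sketch below sorts strings by their number of distinct letters:

```python
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

# the lambda computes the sort key: the count of distinct characters
strings.sort(key=lambda x: len(set(x)))

print(strings)
```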
Currying: Partial Argument Application
[[Currying]] is computer science jargon (named after the mathematician [[Haskell Curry]]) that means deriving new functions from existing ones by [[partial argument application]].
# For example, suppose we had a trivial function that adds two numbers together:
def add_numbers(x, y):
    return x + y
# Using this function, we could derive a new function of one variable, add_five, that adds 5 to its argument:
add_five = lambda y: add_numbers(5, y)
The second argument to add_numbers is said to be curried.
There’s nothing very fancy here, as all we’ve really done is define a new function that calls an existing function. The built-in functools module can simplify this process using the partial function:
from functools import partial
add_five = partial(add_numbers, 5)
[[generator]]s
Having a consistent way to iterate over sequences, like objects in a list or lines in a file, is an important Python feature. This is accomplished by means of the [[iterator protocol]], a generic way to make objects iterable.
An [[iterator]] is any object that will yield objects to the Python interpreter when used in a context like a for loop.
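A small sketch of both ideas: getting an iterator from a dict with iter, and writing a generator function with yield. The function name squares is illustrative:

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}

dict_iterator = iter(some_dict)   # iterating a dict yields its keys
keys = list(dict_iterator)

def squares(n=10):
    # a generator function: values are produced lazily, one per request
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(5)      # no code in squares has run yet
values = list(gen)    # consuming the generator runs it

print(keys)
print(values)
```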
References
- [[Python for Data Analysis]]