Python for Data Analysis, 3E -Open Edition
Introduction to Data Science in Python#
Essential Python Libraries#
NumPy (Numerical Python)
pandas
matplotlib
IPython and Jupyter
SciPy
scikit-learn
statsmodels
Miniconda, a minimal installation of the conda package manager, along with conda-forge, a community-maintained software distribution based on conda.
Intro commands#
Running the Jupyter Notebook:
jupyter notebook
Running the IPython shell:
ipython
Activating a conda virtual environment:
conda activate <venv name>
Fundamentals of Data Manipulation with Python#
Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable. Thus:
[27]:
list_of_lists = [[i, i + 1, i + 2] for i in range(5)]
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)
everything
[27]:
[0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6]
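To make the difference concrete, here is a minimal sketch (the variable names are illustrative and not part of the original notes). It checks object identity to show that concatenation with + returns a new list, while extend modifies the existing list in place.

```python
a = [1, 2, 3]
b = [4, 5, 6]

# Concatenation with + builds a brand-new list and copies every element into it.
combined = a + b
print(combined is a)   # False: a different list object was created

# extend appends to the existing list, so no new list is allocated.
target = a             # second reference to the same list object
a.extend(b)
print(target is a)     # True: still the same list object
print(a)               # [1, 2, 3, 4, 5, 6]
```

When a large list is built up inside a loop, the repeated copying done by + is what makes it comparatively expensive.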
You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator []. Slices can also be assigned with a sequence:
[28]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[3:5] = [6, 3]
seq
[28]:
[7, 2, 3, 6, 3, 6, 0, 1]
A step can also be used after a second colon to, say, take every other element:
[29]:
seq[::2]
[29]:
[7, 3, 3, 0]
A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple:
[30]:
seq[::-1]
[30]:
[1, 0, 6, 3, 6, 3, 2, 7]
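The slicing rules can be summarized in a short sketch. The list below is an illustrative example and the variable names are assumptions. The element at start is included and the element at stop is not, so a slice with the default step holds stop - start elements. Negative indices count from the end of the sequence.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

first_three = seq[:3]    # start omitted: begins at index 0
last_four = seq[-4:]     # negative start: counts from the end
middle = seq[-6:-2]      # same as seq[2:6] for a list of length 8
stepped = seq[1:7:3]     # indices 1 and 4

print(first_three)       # [7, 2, 3]
print(last_four)         # [5, 6, 0, 1]
print(middle)            # [3, 7, 5, 6]
print(stepped)           # [2, 5]
```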
You can merge one dictionary into another using the update method:
[31]:
d1 = {"a": 1, "b": 2}
d1.update({"b": "foo", "c": 12})
d1
[31]:
{'a': 1, 'b': 'foo', 'c': 12}
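It’s common to end up with two sequences that you want to pair up element-wise in a dictionary. As a first cut, you might write code like this:
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
Since a dictionary is essentially a collection of 2-tuples, the dict function also accepts a sequence of 2-tuples, such as the pairs produced by zip: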
[32]:
tuples = zip(range(5), reversed(range(5)))
tuples
[32]:
<zip at 0x105303480>
[33]:
mapping = dict(tuples)
mapping
[33]:
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
zip:
zip “pairs” up the elements of a number of lists, tuples, or other sequences to create tuples. It returns a lazy iterator, so wrap it in list to see the tuples as a list. It can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence. A common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate:
[34]:
# shortest sequence
seq1 = ["foo", "bar", "baz"]
seq2 = ["one", "two", "three"]
seq3 = [False, True]
list(zip(seq1, seq2, seq3))
[34]:
[('foo', 'one', False), ('bar', 'two', True)]
[35]:
# simultaneously iterating over multiple sequences
for index, (a, b) in enumerate(zip(seq1, seq2)):
    print(f"{index}: {a}, {b}")
0: foo, one
1: bar, two
2: baz, three
The setdefault dictionary method simplifies the common pattern of grouping values under a key, for example categorizing words by their first letter:
[36]:
words = ["apple", "bat", "bar", "atom", "book"]
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
by_letter
[36]:
{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
The built-in collections module has a useful class, defaultdict, which makes this even easier. To create one, you pass a type or function for generating the default value for each slot in the dictionary:
[37]:
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
by_letter
[37]:
defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})
While the values of a dictionary can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is hashability. You can check whether an object is hashable (can be used as a key in a dictionary) with the hash function:
[38]:
hash("string")
[38]:
-3842245639611595742
[39]:
hash([1,2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[39], line 1
----> 1 hash([1,2])
TypeError: unhashable type: 'list'
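A list can still serve as a dictionary key once it is converted to a tuple, provided its elements are themselves hashable. The following minimal sketch uses illustrative names (d, key) that are not from the original notes.

```python
d = {}
key = [1, 2, 3]

# Using the list directly as a key fails because lists are mutable (unhashable).
try:
    d[key] = "value"
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Converting the list to a tuple gives an immutable, hashable key.
d[tuple(key)] = "value"
print(d)   # {(1, 2, 3): 'value'}
```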
List, Set, and Dictionary Comprehensions#
List comprehensions are a convenient and widely used Python language feature. They allow you to form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one concise expression. They take the basic form:
[expr for value in collection if condition]
[40]:
# example filtering out a list of strings with length 2 or less
strings = ["a", "as", "bat", "car", "dove", "python"]
[x.upper() for x in strings if len(x) > 2]
[40]:
['BAT', 'CAR', 'DOVE', 'PYTHON']
[41]:
# example of set comprehension
unique_lengths = {len(x) for x in strings}
unique_lengths
[41]:
{1, 2, 3, 4, 6}
[42]:
# same expression using map function
set(map(len, strings))
[42]:
{1, 2, 3, 4, 6}
[43]:
# creating a lookup map from these strings to their locations in the list
loc_mapping = {value: index for index, value in enumerate(strings)}
loc_mapping
[43]:
{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}
[44]:
# get a single list containing all names with two or more a’s in them
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
["Maria", "Juan", "Javier", "Natalia", "Pilar"]]
names_of_interest = [name for names in all_data for name in names if name.count("a") >= 2]
names_of_interest
[44]:
['Maria', 'Natalia']
[45]:
# flattening a list of tuples
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened
[45]:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
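The order of the for clauses in a nested comprehension matches the order of the equivalent nested for loops. The sketch below, with illustrative variable names, writes the flattening example both ways and then shows a comprehension inside a comprehension, which keeps the nesting instead of flattening it.

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Equivalent nested loops: the outer loop comes first, just as in the comprehension.
flattened_loop = []
for tup in some_tuples:
    for x in tup:
        flattened_loop.append(x)

flattened = [x for tup in some_tuples for x in tup]
print(flattened == flattened_loop)   # True

# A list comprehension inside a list comprehension produces a list of lists.
nested = [[x for x in tup] for tup in some_tuples]
print(nested)                        # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```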
Functions#
Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. All positional arguments must be specified when calling a function. The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any). You can specify keyword arguments in any order. This frees you from having to remember the order in which the function arguments were specified. You need to remember only what their names are.
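As a small illustration of these rules, consider the hypothetical function below (scale, factor, and offset are made-up names, not from the original notes). It has one positional argument and two keyword arguments with default values.

```python
def scale(x, factor=2.0, offset=0.0):
    """Multiply x by factor and add offset."""
    return x * factor + offset

default_call = scale(3)                            # both defaults are used
reordered = scale(3, offset=1.0, factor=10.0)      # keywords in any order
positional = scale(3, 10.0)                        # factor passed by position

print(default_call)   # 6.0
print(reordered)      # 31.0
print(positional)     # 30.0
```

Calling scale(factor=10.0, 3) would be a SyntaxError, because a positional argument may not follow a keyword argument.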
Functions are objects:
[46]:
# a useful approach to cleaning strings
import re
def remove_punctuation(value):
    return re.sub("[!#?]", "", value)
clean_ops = [str.strip, remove_punctuation, str.title]
def clean_strings(strings, ops):
    result = []
    for value in strings:
        for func in ops:
            value = func(value)
        result.append(value)
    return result
states = [" Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda",
          "south carolina##", "West virginia?"]
clean_strings(states, clean_ops)
[46]:
['Alabama',
'Georgia',
'Georgia',
'Georgia',
'Florida',
'South Carolina',
'West Virginia']
You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind:
[47]:
for x in map(remove_punctuation, states):
    print(x)
Alabama
Georgia
Georgia
georgia
FlOrIda
south carolina
West virginia
Anonymous (Lambda) Functions#
Anonymous (lambda) functions are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than "we are declaring an anonymous function." They are especially convenient in data analysis because many data transformation functions take functions as arguments. Passing a lambda function is often shorter and clearer than writing a full function declaration.
[48]:
# lambda example
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]
ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)
[48]:
[8, 0, 2, 10, 12]
[49]:
# a lambda function could also be passed to the list's `sort` method to sort a
# collection of strings by the number of distinct letters in each string
strings = ["foo", "card", "bar", "aaa", "abad"]
set_of_strings = {tuple(set(string)): len(set(string)) for string in strings}
print(set_of_strings)
strings.sort(key=lambda x: len(set(x)))
print(f"Collection sorted: {strings}")
{('o', 'f'): 2, ('r', 'c', 'd', 'a'): 4, ('r', 'b', 'a'): 3, ('a',): 1, ('d', 'b', 'a'): 3}
Collection sorted: ['aaa', 'foo', 'bar', 'abad', 'card']
Generators#
The iterator protocol is a generic way to make objects iterable. An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop. A generator is a convenient way to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. To create a generator, use the yield keyword instead of return in a function:
[50]:
# defining a generator function using the yield keyword
def squares(n=10):
    print(f"Generating squares from 1 to {n ** 2}:", end=" ")
    for i in range(1, n + 1):
        yield i ** 2
# when calling the generator, no code is immediately executed
gen = squares()
print(gen)
# until the elements are requested from the generator
for x in gen:
    print(x, end=" ")
<generator object squares at 0x105325620>
Generating squares from 1 to 100: 1 4 9 16 25 36 49 64 81 100
Note
Generators help your program use less memory because they produce output one element at a time rather than an entire list all at once.
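The memory claim can be checked roughly with sys.getsizeof. This is a minimal sketch that assumes CPython, and the function name squares_up_to is hypothetical. Note that getsizeof reports only the size of the container object itself, not of the integers it refers to, so the comparison is indicative rather than exact.

```python
import sys

def squares_up_to(n):
    """Yield the squares of 1..n one at a time."""
    for i in range(1, n + 1):
        yield i ** 2

n = 100_000
squares_list = [i ** 2 for i in range(1, n + 1)]   # all values stored at once
squares_gen = squares_up_to(n)                     # nothing computed yet

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(gen_size < list_size)   # True: the generator object stays small

# Values are produced only on request.
first = next(squares_gen)
second = next(squares_gen)
print(first, second)          # 1 4
```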
Generator expressions#
[51]:
# like list, dictionary, and set comprehensions, but enclosed in parentheses instead of brackets
gen = (x ** 2 for x in range(100))
print(gen)
<generator object <genexpr> at 0x1053ec5f0>
[52]:
# can be used as function arguments in some cases
sum(x ** 2 for x in range(100))
[52]:
328350
Some useful itertools functions:#
Function | Description |
---|---|
chain(*iterables) | Generates a sequence by chaining iterators together. Once elements from the first iterator are exhausted, elements from the next iterator are returned, and so on. |
combinations(iterable, k) | Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement (see also the companion function combinations_with_replacement). |
permutations(iterable, k) | Generates a sequence of all possible k-tuples of elements in the iterable, respecting order. |
groupby(iterable[, keyfunc]) | Generates (key, sub-iterator) for each run of consecutive elements that share a key; sort the input by the key first to get one group per unique key. |
product(*iterables, repeat=1) | Generates the Cartesian product of the input iterables as tuples, similar to a nested for loop. |
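A few of these functions in action, as a minimal sketch with illustrative data (the names list and the helper first_letter are assumptions). The groupby part shows that it only groups consecutive elements, which is why unsorted input can produce the same key more than once.

```python
import itertools

chained = list(itertools.chain([1, 2], (3, 4)))
pairs = list(itertools.combinations("abc", 2))
ordered_pairs = list(itertools.permutations("abc", 2))
grid = list(itertools.product([0, 1], "xy"))

print(chained)             # [1, 2, 3, 4]
print(pairs)               # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(len(ordered_pairs))  # 6
print(grid)                # [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]

def first_letter(name):
    return name[0]

names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]

# Unsorted input: "A" shows up twice because "Albert" is not next to the other A names.
runs = [(key, list(group)) for key, group in itertools.groupby(names, first_letter)]
print(runs)

# Sorting by the same key first yields exactly one group per key.
groups = {key: list(group)
          for key, group in itertools.groupby(sorted(names, key=first_letter), first_letter)}
print(groups)
```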
Errors and Exception handling#
[53]:
# basic structure
def attempt_float(x):
    try:  # attempt the next statement
        return float(x)
    except (TypeError, ValueError):  # exception types to catch; a bare except catches every exception
        return x  # returned if one of those errors is caught
attempt_float("something")
[53]:
'something'
# forcing some code to be executed regardless of whether or not the code in the try block succeeds
# (path and write_to_file are placeholders assumed to be defined elsewhere)
f = open(path, mode="w")
try:
    write_to_file(f)
except:
    print("Failed")
else:  # executed only if the try block succeeds
    print("Succeeded")
finally:
    f.close()  # the f object always gets closed
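The same try/except/else/finally structure can be run end to end with a temporary file. In this minimal sketch, write_to_file is a hypothetical stand-in helper, the file name is made up, and the events list exists only to record which branches ran.

```python
import os
import tempfile

def write_to_file(f):
    # Hypothetical stand-in for whatever writing logic is needed.
    f.write("some data\n")

events = []
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    f = open(path, mode="w", encoding="utf-8")
    try:
        write_to_file(f)
    except OSError:
        events.append("failed")      # runs only if writing raised an OSError
    else:
        events.append("succeeded")   # runs only if the try block raised nothing
    finally:
        f.close()                    # runs in every case
        events.append("closed")

print(events)     # ['succeeded', 'closed']
print(f.closed)   # True
```

Catching a specific exception type such as OSError, rather than using a bare except, avoids hiding unrelated errors like a typo in a variable name.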
Files and the Operating System#
To open a file for reading or writing, use the built-in open function. As a best practice, pass the encoding="utf-8" keyword argument, because the default Unicode encoding for reading files varies from platform to platform. By default, the file is opened in read-only mode "r". The file object can then be treated like a list and iterated over; the lines come out of the file with their end-of-line (EOL) markers intact. When you use open to create file objects, make sure to close the file when you are finished with it (f.close()), which releases its resources back to the operating system. One way to make it easier to clean up open files is to use the with statement:
with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]
This will automatically close the file when exiting the with block. The following table lists the Python file modes.
Mode | Description |
---|---|
r | Read-only mode |
w | Write-only mode; creates a new file (erasing the data for any file with the same name) |
x | Write-only mode; creates a new file but fails if the file path already exists |
a | Append to existing file (creates the file if it does not already exist) |
r+ | Read and write |
b | Add to mode for binary files (i.e., "rb" or "wb") |
t | Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified |
Most commonly used methods#
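Method/Attribute | Description |
---|---|
read([size]) | Return data from file as bytes or string depending on the file mode, with the optional size argument indicating the number of bytes or string characters to read. |
readable() | Return True if the file supports read operations. |
readlines([size]) | Return list of lines in the file, with optional size argument. |
write(string) | Write passed string to file. |
writable() | Return True if the file supports write operations. |
writelines(strings) | Write passed sequence of strings to the file. |
close() | Close the file object. |
flush() | Flush the internal I/O buffer to disk. |
seek(pos) | Move to indicated file position (integer). Beware using seek on files opened in text mode, where a position inside a multibyte character can make later reads fail. |
seekable() | Return True if the file object supports seeking and thus random access (some file-like objects do not). |
tell() | Return current file position as integer. |
closed | True if the file is closed. |
encoding | The encoding used to interpret bytes in the file as Unicode (use UTF-8). |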
The default behavior for Python files (whether readable or writable) is text mode, which means that you intend to work with Python strings (i.e., Unicode). This contrasts with binary mode, which you can obtain by appending b to the file mode. UTF-8 is a variable-length Unicode encoding, so when I request some number of characters from the file, Python reads enough bytes from the file to decode that many characters (for 10 characters, that could be as few as 10 or as many as 40 bytes). If I open the file in "rb" mode instead, read requests that exact number of bytes.
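The difference between counting characters and counting bytes shows up as soon as the text contains non-ASCII characters. The sketch below writes a short illustrative string (an assumption, not from the original notes) to a temporary file and reads it back in both modes.

```python
import os
import tempfile

text = "ñandú café"   # 10 characters, 13 bytes in UTF-8 (ñ, ú, é take 2 bytes each)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(text)

    # Text mode: read(5) returns 5 decoded characters.
    with open(path, encoding="utf-8") as f:
        chars = f.read(5)

    # Binary mode: read(5) returns exactly 5 raw bytes.
    with open(path, mode="rb") as f:
        raw = f.read(5)

print(len(chars))                 # 5
print(raw)                        # b'\xc3\xb1and'
print(len(raw.decode("utf-8")))   # 4: five bytes held only four characters
```

Decoding a byte string that stops in the middle of a multibyte character raises UnicodeDecodeError, so byte counts in binary mode need to be chosen with care.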
NumPy Basics: Arrays and Vectorized Computation#
One of the reasons NumPy (Numerical Python) is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
Note
In this chapter and throughout the book, I use the standard NumPy convention of always using import numpy as np. It would be possible to put from numpy import * in your code to avoid having to write np., but I advise against making a habit of this. The numpy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max). Following standard conventions like these is almost always a good idea.
Some important NumPy array creation functions#
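Function | Description |
---|---|
array | Convert input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a data type or by explicitly specifying one; copies the input data by default. |
asarray | Convert input to ndarray, but do not copy if the input is already an ndarray. |
arange | Like the built-in range, but returns an ndarray. |
ones, ones_like | Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array of the same shape and data type. |
zeros, zeros_like | Like ones and ones_like, but producing arrays of 0s instead. |
empty, empty_like | Create new arrays by allocating new memory, but do not populate with any values like ones and zeros. |
full, full_like | Produce an array of the given shape and data type with all values set to the indicated "fill value"; full_like takes another array and produces a filled array of the same shape and data type. |
eye, identity | Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere). |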
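A quick tour of several array creation functions, as a minimal sketch with illustrative inputs (the variable names are assumptions). np.empty is left out of the printed output because its contents are uninitialized and therefore arbitrary.

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]
arr = np.array(data)                 # copies the nested list into a new ndarray
same = np.asarray(arr)               # already an ndarray, so no copy is made
print(same is arr)                   # True

steps = np.arange(5)                 # array counterpart of range(5)
ones = np.ones((2, 3))               # 2 x 3 array of 1.0
zeros_like = np.zeros_like(arr)      # same shape and dtype as arr, filled with 0
filled = np.full((2, 2), 7.5)        # every element set to the fill value
ident = np.eye(3)                    # 3 x 3 identity matrix

print(steps)
print(zeros_like.shape, ones.shape)  # (2, 3) (2, 3)
print(filled)
print(ident)
```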
[11]:
# convert strings representing numbers to numeric form
import numpy as np
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)
print(numeric_strings)
# calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type
print(numeric_strings.astype(float))
[b'1.25' b'-9.6' b'42']
[ 1.25 -9.6 42. ]
Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise. Arithmetic operations with a scalar propagate the scalar argument to each element in the array. Comparisons between arrays of the same size yield Boolean arrays.
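Each of these three behaviors can be seen in a few lines. This is a minimal sketch with illustrative values and variable names.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
arr2 = np.array([[0.0, 4.0, 1.0], [7.0, 2.0, 12.0]])

product = arr * arr    # element-wise multiplication of equal-size arrays
scaled = arr * 0.5     # the scalar is applied to every element
mask = arr2 > arr      # element-wise comparison gives a Boolean array

print(product)
print(scaled)
print(mask)
```

Here mask is [[False, True, False], [True, False, True]], since each element of arr2 is compared with the element of arr at the same position.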
Array slices are views on the original array, so modifying a slice modifies the source array. If you want a copy of a slice of an ndarray instead of a view, you need to copy the array explicitly (e.g., arr[5:8].copy()).
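The following minimal sketch contrasts a view with an explicit copy (the variable names view and independent are illustrative). np.shares_memory is used only to confirm which of the two shares its data with the original array.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]        # a view: no data is copied
view[:] = 12           # writes through to arr
print(arr)             # [ 0  1  2  3  4 12 12 12  8  9]

independent = arr[5:8].copy()   # an explicit copy with its own data
independent[:] = -1             # arr is left untouched
print(arr)                      # [ 0  1  2  3  4 12 12 12  8  9]

print(np.shares_memory(arr, view))          # True
print(np.shares_memory(arr, independent))   # False
```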
In a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays. Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements:
[34]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d, end="\n\n")
print("a) [0][2]: ", arr2d[0][2])
print("b) [0, 2]: ", arr2d[0, 2])
[[1 2 3]
[4 5 6]
[7 8 9]]
a) [0][2]: 3
b) [0, 2]: 3
In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions.
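A three-dimensional array makes this easier to see. The array below is an illustrative 2 × 2 × 3 example and its name is an assumption. Each index that is supplied removes one leading dimension from the result.

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]])
print(arr3d.shape)        # (2, 2, 3)

# One index: a 2 x 3 array holding everything along the remaining two axes.
print(arr3d[0].shape)     # (2, 3)

# Two indices: a one-dimensional array of length 3.
print(arr3d[1, 0])        # [7 8 9]

# Three indices: a single scalar element.
print(arr3d[1, 0, 2])     # 9
```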
Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax (e.g., arr[1:6]). For a two-dimensional array, a single slice selects a range of elements along axis 0. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."
[35]:
print(arr2d[:2])
[[1 2 3]
[4 5 6]]
[41]:
# multiple slices can be passed: first two rows, but only the 2nd and 3rd columns
print(arr2d[:2, 1:])
[[2 3]
[5 6]]
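Integer indices and slices can also be mixed. In this minimal sketch (the result names are illustrative), an integer index drops its axis from the result while a slice keeps it, which is why the shapes differ.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_part = arr2d[1, :2]      # second row, first two columns: one-dimensional
col_part = arr2d[:2, 2]      # third column of the first two rows: one-dimensional
first_col = arr2d[:, :1]     # slices on both axes: still two-dimensional

print(row_part, row_part.shape)   # [4 5] (2,)
print(col_part, col_part.shape)   # [3 6] (2,)
print(first_col.shape)            # (3, 1)
```

A lone colon, as in arr2d[:, :1], selects the entire axis.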