Python for Data Analysis, 3E - Open Edition

Introduction to Data Science in Python#

Essential Python Libraries#

  • NumPy (Numerical Python)

  • pandas

  • matplotlib

  • IPython and Jupyter

  • SciPy

  • scikit-learn

  • statsmodels

  • Miniconda, a minimal installation of the conda package manager, along with conda-forge, a community-maintained software distribution based on conda.

Reference for installation

Intro commands#

  • Running the Jupyter Notebook: jupyter notebook.

  • Running the IPython Shell: ipython.

  • Activate Conda virtual environment: conda activate <venv name>.

Fundamentals of Data Manipulation with Python#

Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable. Thus:

[27]:
list_of_lists = [[i, i + 1, i + 2] for i in range(5)]
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)

everything
[27]:
[0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6]

You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator []. Slices can also be assigned with a sequence:

[28]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[3:5] = [6, 3]

seq
[28]:
[7, 2, 3, 6, 3, 6, 0, 1]

A step can also be used after a second colon to, say, take every other element:

[29]:
seq[::2]
[29]:
[7, 3, 3, 0]

A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple:

[30]:
seq[::-1]
[30]:
[1, 0, 6, 3, 6, 3, 2, 7]

You can merge one dictionary into another using the update method:

[31]:
d1 = {"a": 1, "b": 2}
d1.update({"b": "foo", "c": 12})
d1
[31]:
{'a': 1, 'b': 'foo', 'c': 12}

You may occasionally end up with two sequences whose elements you want to pair up in a dictionary. As a first cut, you might write code like this:

mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
[32]:
tuples = zip(range(5), reversed(range(5)))
tuples
[32]:
<zip at 0x105303480>
[33]:
mapping = dict(tuples)
mapping
[33]:
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

zip:

zip “pairs” up the elements of a number of lists, tuples, or other sequences to create a list of tuples. It can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence. A common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate:

[34]:
# shortest sequence
seq1 = ["foo", "bar", "baz"]
seq2 = ["one", "two", "three"]
seq3 = [False, True]

list(zip(seq1, seq2, seq3))
[34]:
[('foo', 'one', False), ('bar', 'two', True)]
[35]:
# simultaneously iterating over multiple sequences
for index, (a, b) in enumerate(zip(seq1, seq2)):
    print(f"{index}: {a}, {b}")
0: foo, one
1: bar, two
2: baz, three

The setdefault dictionary method can be used to simplify this common pattern of grouping values by key:

[36]:
words = ["apple", "bat", "bar", "atom", "book"]
by_letter = {}

for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)

by_letter
[36]:
{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The built-in collections module has a useful class, defaultdict, which makes this even easier. To create one, you pass a type or function for generating the default value for each slot in the dictionary:

[37]:
from collections import defaultdict
by_letter = defaultdict(list)

for word in words:
    by_letter[word[0]].append(word)

by_letter
[37]:
defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})

While the values of a dictionary can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is hashability. You can check whether an object is hashable (can be used as a key in a dictionary) with the hash function:

[38]:
hash("string")
[38]:
-3842245639611595742
[39]:
hash([1,2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[39], line 1
----> 1 hash([1,2])

TypeError: unhashable type: 'list'

List, Set, and Dictionary Comprehensions#

List comprehensions are a convenient and widely used Python language feature. They allow you to concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression. They take the basic form:

[expr for value in collection if condition]
[40]:
# example filtering out a list of strings with length 2 or less
strings = ["a", "as", "bat", "car", "dove", "python"]
[x.upper() for x in strings if len(x) > 2]
[40]:
['BAT', 'CAR', 'DOVE', 'PYTHON']
[41]:
# example of set comprehension

unique_lengths = {len(x) for x in strings}
unique_lengths
[41]:
{1, 2, 3, 4, 6}
[42]:
# same expression using map function
set(map(len, strings))
[42]:
{1, 2, 3, 4, 6}
[43]:
# creating a lookup map from these strings to their locations in the list

loc_mapping = {value: index for index, value in enumerate(strings)}
loc_mapping
[43]:
{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}
[44]:
# get a single list containing all names with two or more a’s in them
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
            ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

names_of_interest = [name for names in all_data for name in names if name.count("a") >= 2]


names_of_interest

[44]:
['Maria', 'Natalia']
[45]:
# flattening a list of tuples
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened
[45]:
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Functions#

Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. All positional arguments must be specified when calling a function. The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any). You can specify keyword arguments in any order; this frees you from having to remember the order in which the function arguments were specified. You need to remember only what their names are.
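
A quick sketch of this with a made-up function (the name and parameters are illustrative, not from the book):

```python
def describe(value, precision=2, label="result"):
    # value is positional; precision and label are keyword arguments with defaults
    return f"{label}: {round(value, precision)}"

# keyword arguments may appear in any order after the positional ones
print(describe(3.14159, label="pi", precision=3))  # pi: 3.142
print(describe(3.14159))                           # result: 3.14
```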

Functions are objects:

[46]:
# cleaning strings useful approach
import re

def remove_punctuation(value):
    return re.sub("[!#?]", "", value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for func in ops:
            value = func(value)
        result.append(value)
    return result

states = ["   Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda",
          "south   carolina##", "West virginia?"]

clean_strings(states, clean_ops)
[46]:
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind:

[47]:
for x in map(remove_punctuation, states):
    print(x)
   Alabama
Georgia
Georgia
georgia
FlOrIda
south   carolina
West virginia

Anonymous (Lambda) Functions#

Anonymous functions are a way of writing functions consisting of a single statement, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than "we are declaring an anonymous function". They are especially convenient in data analysis because many data transformation functions take functions as arguments, and instead of writing a full function declaration, one can just pass a lambda. It often makes the code clearer, too.

[48]:
# lambda example
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]

apply_to_list(ints, lambda x: x * 2)
[48]:
[8, 0, 2, 10, 12]
[49]:
# a lambda function can also be passed to the list's `sort` method to sort a
# collection of strings by the number of distinct letters in each string
strings = ["foo", "card", "bar", "aaa", "abad"]

set_of_strings = {tuple(set(string)): len(set(string)) for string in strings}

print(set_of_strings)

strings.sort(key=lambda x: len(set(x)))

print(f"Collection sorted: {strings}")
{('o', 'f'): 2, ('r', 'c', 'd', 'a'): 4, ('r', 'b', 'a'): 3, ('a',): 1, ('d', 'b', 'a'): 3}
Collection sorted: ['aaa', 'foo', 'bar', 'abad', 'card']

Generators#

The iterator protocol is a generic way to make objects iterable. An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop. A generator is a convenient way to construct a new iterable object. Generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. To create a generator, use the yield keyword instead of return in a function:

[50]:
# defining a function using the yield keyword
def squares(n=10):
    print(f"Generating squares from 1 to {n ** 2}:", end=" ")
    for i in range(1, n + 1):
        yield i ** 2

# when calling the generator, no code is immediately executed
gen = squares()
print(gen)

# until the elements are requested from the generator
for x in gen:
    print(x, end=" ")

<generator object squares at 0x105325620>
Generating squares from 1 to 100: 1 4 9 16 25 36 49 64 81 100

Note

Generators help your program use less memory because they produce output one element at a time instead of an entire list all at once.

Generator expressions#

[51]:
# like list, dictionary, and set comprehensions, but created with parentheses instead of brackets
gen = (x ** 2 for x in range(100))
print(gen)
<generator object <genexpr> at 0x1053ec5f0>
[52]:
# can be used as function arguments in some cases
sum(x ** 2 for x in range(100))
[52]:
328350

Some useful itertools functions:#

  • chain(*iterables): Generates a sequence by chaining iterators together. Once elements from the first iterator are exhausted, elements from the next iterator are returned, and so on.

  • combinations(iterable, k): Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement (see also the companion function combinations_with_replacement).

  • permutations(iterable, k): Generates a sequence of all possible k-tuples of elements in the iterable, respecting order.

  • groupby(iterable[, keyfunc]): Generates (key, sub-iterator) for each unique key.

  • product(*iterables, repeat=1): Generates the Cartesian product of the input iterables as tuples, similar to a nested for loop.

More at the official Python documentation.
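
A minimal sketch exercising a few of these functions (note that groupby groups only consecutive elements, so the input usually needs to be sorted by the key first):

```python
import itertools

# chain: concatenate iterables lazily
print(list(itertools.chain([1, 2], (3, 4))))   # [1, 2, 3, 4]

# combinations: k-tuples ignoring order, without replacement
print(list(itertools.combinations("abc", 2)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# groupby: (key, sub-iterator) pairs for runs of consecutive equal keys
names = ["Alan", "Adam", "Wes", "Will", "Albert"]
for letter, group in itertools.groupby(names, lambda name: name[0]):
    print(letter, list(group))
# A ['Alan', 'Adam']
# W ['Wes', 'Will']
# A ['Albert']   <- a second 'A' group, because groupby only sees consecutive runs
```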

Errors and Exception handling#

[53]:
# basic structure

def attempt_float(x):
    try: # attempt next statement
        return float(x)
    except (TypeError, ValueError):  # if left blank (no tuple or value) it catches all types of errors
        return x  # statement executed if an error is caught

attempt_float("something")
[53]:
'something'

# forcing some code to be executed regardless of whether or not the code in the try block succeeds

f = open(path, mode="w")

try:
    write_to_file(f)
except:
    print("Failed")
else: # executed if try statement succeeds
    print("Succeeded")
finally:
    f.close() # the f object always get closed

Files and the Operating System#

To open a file for reading or writing, use the built-in open function. As a best practice, pass the encoding="utf-8" keyword argument because the default Unicode encoding for reading files varies from platform to platform. By default, the file is opened in read-only mode "r". The file object can then be iterated over like a list (the lines come out of the file with the end-of-line markers intact). When using open to create file objects, make sure the file is closed with f.close(), which releases its resources back to the operating system. One way to make it easier to clean up open files is to use the with statement:

with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]

This will automatically close the file when exiting the with block. Python file modes:

  • r: Read-only mode

  • w: Write-only mode; creates a new file (erasing the data for any file with the same name)

  • x: Write-only mode; creates a new file but fails if the file path already exists

  • a: Append to existing file (creates the file if it does not already exist)

  • r+: Read and write

  • b: Add to mode for binary files (i.e., "rb" or "wb")

  • t: Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified

Most commonly used methods#

  • read([size]): Return data from file as bytes or string depending on the file mode, with optional size argument indicating the number of bytes or string characters to read.

  • readable(): Return True if the file supports read operations.

  • readlines([size]): Return list of lines in the file, with optional size argument.

  • write(string): Write passed string to file.

  • writable(): Return True if the file supports write operations.

  • writelines(strings): Write passed sequence of strings to the file.

  • close(): Close the file object.

  • flush(): Flush the internal I/O buffer to disk.

  • seek(pos): Move to indicated file position (integer). Beware using seek when opening files in any mode other than binary: if the file position falls in the middle of the bytes defining a Unicode character, subsequent reads will result in an error.

  • seekable(): Return True if the file object supports seeking and thus random access (some file-like objects do not).

  • tell(): Return current file position as integer.

  • closed: True if the file is closed.

  • encoding: The encoding used to interpret bytes in the file as Unicode (use UTF-8).
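
A minimal sketch exercising a few of these methods (the file name is made up; it lives in the system temp directory):

```python
import os
import tempfile

# write a small file first, so there is something to read
path = os.path.join(tempfile.gettempdir(), "methods_demo.txt")
with open(path, mode="w", encoding="utf-8") as f:
    f.write("hello world\nsecond line\n")

f = open(path, encoding="utf-8")
print(f.readable(), f.writable())  # True False (mode "r" is read-only)
print(f.read(5))                   # 'hello' -- the first five characters
print(f.tell())                    # 5 -- current position after the read
f.seek(0)                          # move back to the beginning
print(f.readlines())               # ['hello world\n', 'second line\n']
f.close()
print(f.closed)                    # True
```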

The default behavior for Python files (whether readable or writable) is text mode, which means that you intend to work with Python strings (i.e., Unicode). This contrasts with binary mode, which you can obtain by appending b to the file mode. UTF-8 is a variable-length Unicode encoding, so when I request some number of characters from the file, Python reads enough bytes (which could be as few as 10 or as many as 40 bytes) from the file to decode that many characters. If I open the file in "rb" mode instead, read requests that exact number of bytes.
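
A quick sketch of that difference (the file name is made up): with non-ASCII text, character counts and byte counts diverge:

```python
import os
import tempfile

# "español" is 7 characters but 8 bytes in UTF-8 (the ñ takes two bytes)
path = os.path.join(tempfile.gettempdir(), "utf8_demo.txt")
with open(path, mode="w", encoding="utf-8") as f:
    f.write("español")

with open(path, encoding="utf-8") as f:
    print(f.read(5))               # 'españ' -- five *characters* in text mode

with open(path, mode="rb") as f:
    print(f.read(5))               # b'espa\xc3' -- exactly five bytes, splitting the ñ

print("español".encode("utf-8"))   # b'espa\xc3\xb1ol'
```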

NumPy Basics: Arrays and Vectorized Computation#

One of the reasons NumPy (Numerical Python) is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

Note

In this chapter and throughout the book, I use the standard NumPy convention of always using import numpy as np. It would be possible to put from numpy import * in your code to avoid having to write np., but I advise against making a habit of this. The numpy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max). Following standard conventions like these is almost always a good idea.

Some important NumPy array creation functions#


  • array: Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a data type or explicitly specifying a data type; copies the input data by default.

  • asarray: Convert input to ndarray, but do not copy if the input is already an ndarray.

  • arange: Like the built-in range but returns an ndarray instead of a list.

  • ones, ones_like: Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array of the same shape and data type.

  • zeros, zeros_like: Like ones and ones_like but producing arrays of 0s instead.

  • empty, empty_like: Create new arrays by allocating new memory, but do not populate with any values like ones and zeros do.

  • full, full_like: Produce an array of the given shape and data type with all values set to the indicated "fill value"; full_like takes another array and produces a filled array of the same shape and data type.

  • eye, identity: Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere).
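
A minimal sketch of these creation functions in action:

```python
import numpy as np

print(np.array([1, 2, 3]))          # from a list: [1 2 3]
print(np.arange(4))                 # like range: [0 1 2 3]
print(np.zeros((2, 3)))             # 2 x 3 array of 0s
print(np.ones_like(np.arange(3)))   # same shape/dtype as the input: [1 1 1]
print(np.full((2, 2), 7))           # filled with the "fill value" 7
print(np.eye(2))                    # 2 x 2 identity matrix
```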

[11]:
# convert strings representing numbers to numeric form
import numpy as np
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)

print(numeric_strings)
# calling astype always creates a new array (a copy of the data), even if the new data is the same as the old data type
print(numeric_strings.astype(float))
[b'1.25' b'-9.6' b'42']
[ 1.25 -9.6  42.  ]

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies element-wise. Arithmetic operations with scalars propagate the scalar argument to each element in the array. Comparisons between arrays of the same size yield Boolean arrays.
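
A brief sketch of all three behaviors:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(arr * arr)   # element-wise multiplication of equal-size arrays
print(1 / arr)     # the scalar is propagated to every element
print(arr > 3)     # comparison yields a Boolean array
```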

If you want a copy of a slice of an ndarray instead of a view, you need to explicitly copy the array (e.g., arr[5:8].copy()).
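
A short sketch of the view-versus-copy distinction:

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]
view[0] = 99          # a slice is a view: this mutates the original array
print(arr)            # [ 0  1  2  3  4 99  6  7  8  9]

arr = np.arange(10)
sliced = arr[5:8].copy()
sliced[0] = 99        # an explicit copy leaves the original untouched
print(arr)            # [0 1 2 3 4 5 6 7 8 9]
```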

In a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays. Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements:

[34]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d, end="\n\n")
print("a) [0][2]: ", arr2d[0][2])
print("b) [0, 2]: ", arr2d[0, 2])
[[1 2 3]
 [4 5 6]
 [7 8 9]]

a) [0][2]:  3
b) [0, 2]:  3

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions.
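
For example, with a 2 × 2 × 3 array, omitting later indices returns lower dimensional arrays:

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]])

print(arr3d.shape)    # (2, 2, 3)
print(arr3d[0])       # a 2 x 3 array: all the data along the later dimensions
print(arr3d[1, 0])    # a one-dimensional array: [7 8 9]
```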

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax (i.e., arr[1:6]). In a two-dimensional array, a slice selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d":

[35]:
print(arr2d[:2])
[[1 2 3]
 [4 5 6]]
[41]:
# multiple slices can be passed: first two rows, but only the 2nd and 3rd columns
print(arr2d[:2, 1:])
[[2 3]
 [5 6]]