# Chapter 11 The `pandas`

Library

This chapter introduces the *Python Data Analysis* library ** pandas**—a set of modules, functions, and classes used to for easily and efficiently performing data analysis—

`panda`

’s speciality is its highly optimized performance when working with large data sets. `pandas`

is the most common library used with Python for Data Science (and mirrors the `R`

language in many ways, allowing programmers to easily move between the two).
This chapter will discuss the two main data structures used by `pandas`

(*Series*and

*DataFrames*) and how to use them to organize and work with data.

## 11.1 Setting up `pandas`

`pandas`

is a **third-party** library (not built into Python!), but is included by default with Anaconda and so can be imported directly. Additionally, Pandas is built on top of the `numpy`

scientific computing library which supports highly optimized mathematical operations. Thus many `pandas`

operations involve working with `numpy`

data structures, and the `pandas`

library requires `numpy`

(also included in Anaconda) to also be imported:

```
# import libraries
import pandas as pd # standard shortcut names
import numpy as np
```

Normal practice is to `import`

the module and reference types and methods using dot notation, rather than importing them into the global namespace. Also note that this module will focus primarily on `pandas`

, leaving `numpy`

-specific data structures and functions for the reader to explore.

## 11.2 Series

The first basic `pandas`

data structure is a **Series**. A Series represents a *one-dimensional ordered collection of values*, making them somewhat similar to a regular Python *list*. However, elements can also be given *labels* (called the **index**), which can be non-numeric values, similar to a *key* in a Python *dictionary*. This makes a Series somewhat like an ordered dictionary—one that supports additional methods and efficient data-processing behaviors.

Series can be created using the `Series()`

function (a *constructor* for instances of the class):

```
# create a Series from a list
number_series = pd.Series([1, 2, 2, 3, 5, 8])
print(number_series)
```

This code would print out:

```
0 1
1 2
2 2
3 3
4 5
5 8
dtype: int64
```

Printing a Series will display it like a *table*: the first value in each row is the **index** (label) of that element, and the second is the value of the element in the Series. Printing will also display the *type* of the elements in the Series. All elements in the Series will be treated as “same” type—if you create a Series from mixed elements (e.g., numbers and strings), the type will be the a generic `object`

. In practice, you almost always create Series from a single type.

If you create a Series from a list, each element will be given an *index* (label) that is that values’s index in the list. You can also create a Series from a *dictionary*, in which case the keys will be used as the index labels:

```
# create a Series from a dictionary
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
print(age_series)
```

```
amit 35
sarah 42
zhang 13
dtype: int64
```

Notice that the Series is automatically **sorted** by the keys of the dictionary! This means that the order of the elements in the Series will always be the same for a given dictionary (which cannot be said for the dictionary items themselves).

### 11.2.1 Series Operations and Methods

The main benefit of Series (as opposed to normal lists or dictionaries) is that they provide a number of operations and methods that make it easy to consider and modify the entire Series, rather than needing to work with each element individually. These functions include built-in *mapping*, *reducing*, and *filtering* style operations.

In particular, basic operators (whether math operators such as `+`

and `-`

, or relational operators such as `>`

or `==`

) function as **vectorized operations**, meaning that they are applied to the entire Series **member-wise**: the operation is applied to the first element in the Series, then the second, then the third, and so forth:

```
sample = pd.Series(range(1,6)) # Series of numbers from 1 to 5 (6 is excluded)
result = sample + 4 # add 4 to each element (produces new Series)
print(result)
# 0 5
# 1 6
# 2 7
# 3 8
# 4 9
# dtype: int64
is_greater_than_3 = sample > 3 # compare each element
print(is_greater_than_3)
# 0 False
# 1 False
# 2 False
# 3 True # note index and value are not the same
# 4 True
# dtype: bool
```

Having a Series operation apply to a *scalar* (a single value) is referred to as **broadcasting**. The idea is that the smaller “set” of elements (e.g., a single value) is *broadcast* so that it has a comparable size, thereby allowing different “sized” data structures to interact. Technically, operating on a Series with a *scalar* is actually a specific case of operating on it with another Series!

If the second operand is *another Series*, then mathematical and relational operations are still applied **member-wise**, with the elements of each operand being “matched” by their index label. This means that for most Series whose indices are list indices, operators will be applied “in order”.

```
s1 = pd.Series([2, 2, 2, 2, 2])
s2 = pd.Series([1, 2, 3, 4, 5])
# Examples of operations (list only includes values)
list(s1 + s2) # [3, 4, 5, 6, 7]
list(s1 / s2) # [2.0, 1.0, 0.66666666666666663, 0.5, 0.40000000000000002]
list(s1 < s2) # [False, False, True, True, True]
# Add a Series to itself (why not?)
list(s2 + s2) # [2, 4, 6, 8, 10]
# Perform more advanced arithmetic!
s3 = (s1 + s2) / (s1 + s1)
list(s3) # [0.75, 1.0, 1.25, 1.5, 1.75]
```

And note that these operations will be *fast*, even for very large Series, allowing for effective data manipulations.

`pandas`

Series also include a number of *methods* for inspecting and manipulating their data. Some useful examples are detailed below (not a comprehensive listing):

Function | Description |
---|---|

`index` |
an attribute; the sequence of index labels (convert to a list to use) |

`head(n)` |
returns a Series containing only the first `n` elements |

`tail(n)` |
returns a Series containing only the last `n` elements |

`any()` |
returns whether ANY of the elements are `True` (or “truthy”) |

`all()` |
returns whether ALL of the elements are `True` (or “truthy”) |

`mean()` |
returns the statistical mean of the elements in the Series |

`std()` |
returns the standard deviation of the elements in the Series |

`describe()` |
returns a Series of descriptive statistics |

`idxmax()` |
returns the index label of the element with the max value |

Series support many more methods as well: see the full documentation for a complete list.

One particularly useful method to mention is the `apply()`

method. This method is used to *apply* a particular **callback function** to each element in the series. This is a *mapping* operation, similar to what you’ve done with the `map()`

function:

```
def square(n): # a function that squares a number
return n**2
number_series = pd.Series([1,2,3,4,5]) # an initial series
square_series = number_series.apply(square)
list(square_series) # [1, 4, 9, 16, 25]
# Can also apply built-in functions
import math
sqrt_series = number_series.apply(math.sqrt)
list(sqrt_series) # [1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.2360679774997898]
# Pass additional arguments as keyword args (or `args` for a single argument)
cubed_series = number_series.apply(math.pow, args=(3,)) # call math.pow(n, 3) on each
list(cubed_series) # [1.0, 8.0, 27.0, 64.0, 125.0]
```

### 11.2.2 Accessing Series

Just like lists and dictionaries, elements in a Series can be accessed using **bracket notation**, putting the index label inside the brackets:

```
letter_series = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
# Get the 1th element from the letter_series
letter_series[1] # 'b'
# Get the 'sarah' element from age_series
age_series['amit'] # 35
# Get the 0th element from age_series
# (Series are ordered, so can always be accessed positionally!)
age_series[0] # 42
```

Note that the returned values are not technically basic `int`

or `float`

or `string`

types, but are rather specific `numpy`

objects that work almost identically to their normal type (but with some additional optimization).

You can also use list-style *slices* using the colon operator (e.g., elements ** 1:3**). Additionally, it is possible to specify

**(i.e., a**

*a sequence of indicies**list*or

*range*or even a

*Series*of indices) to access using bracket notation. This will produce a new Series object that contains only the elements that have those labels:

```
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
index_list = ['sarah', 'zhang']
print(age_series[index_list])
# sarah 42
# zhang 13
# dtype: int64
# using an anonymous variable for the index list (notice the brackets!)
print(age_series[['sarah', 'zhang']])
# sarah 42
# zhang 13
# dtype: int64
```

This also means that you can use something like a *list comprehension* (or even a Series operation!) to determine which elements to select from a Series!

```
letter_series = pd.Series(['a','b','c','d','e','f'])
even_numbers = [num for num in range(0,6) if num%2 == 0] # [0, 2, 4]
# Get letters with even numbered indices
letter_series[even_numbers]
# 0 a
# 2 c
# 4 e
# dtype: object
# In one line (check the brackets!)
letter_series[[num for num in range(0,6) if num%2 == 0]]
```

Finally, using a ** sequence of booleans** with bracket notation will produce a new Series containing the elements whose position

*corresponds*with

`True`

values. This is called **boolean indexing**.

```
shoe_sizes = pd.Series([7, 6.5, 4, 11, 8]) # a series of shoe sizes
index_filter = [True, False, False, True, True] # list of which elements to extract
# Extract every element in an index that is True
shoe_sizes[index_filter] # has values 7.0, 11.0, 8.0
```

In this example, since `index_filter`

is `True`

at index 0, 3, and 4, then `shoe_sizes[index_filter]`

returns a Series with the elements from index numbers 0, 3, and 4.

This technique is incredibly powerful because it allows you to easily perform **filtering** operations on a Series:

```
shoe_sizes = pd.Series([7, 6.5, 4, 11, 8]) # a series of shoe sizes
big_sizes = shoe_sizes > 6.5 # has values True, False, False, True, True
big_shoes = shoe_sizes[big_sizes] # has values 7, 11, 8
# In one line
big_shoes = shoe_sizes[shoe_sizes > 6.5]
```

You can think of the last statement as saying *shoe sizes where shoe size is greater than 6.5*.

Note that you can include *logical operators* (“and” and “or”) when defining series by using the operators `&`

for “and” and `|`

for “or”. Be sure to wrap each relational expression in `()`

to enforce order of operations.

While it is perfectly possible to do similar filtering with a list comprehension, the boolean indexing expression can be very simple to read and runs quickly. (This is also the normal style of doing filtering in the `R`

programming language).

## 11.3 DataFrames

The most common data structure used in `pandas`

(more common than Series) is a **DataFrame**. A DataFrame represents a **table**, where data is organized into rows and columns. You can think of a DataFrame as being like a Excel spreadsheet or a SQL table.
This book has previously represented tabular data using a *list of dictionaries*. However, this required you to be careful to make sure that all of the dictionaries shared keys, and did not offer easy ways to interact with the table in terms of its rows or columns. DataFrames give you that functionality!

A DataFrame can also be understood as a *dictionary of Series*, where each Series represents a **column** of the table. The keys of this dictionary are the *index labels* of the columns, while the the index labels of the Series serve as the labels for the row.

This is distinct from spreadsheets or SQL tables, which are often seen as a collection of *observations* (rows). Programmatically, DataFrames should primarily be considered as a collection of *features* (columns), which happen to be sequenced to correspond to observations.

A DataFrame can be created using the `DataFrame()`

function (a *constructor* for instances of the class). This function usually takes as an argument a *dictionary* whose values are Series (or values that can be converted into a Series, such as a list or a dictionary):

```
name_series = pd.Series(['Ada', 'Bob', 'Chris', 'Diya', 'Emma'])
heights = [64, 74, 69, 69, 71]
weights = [135, 156, 139, 144, 152]
df = pd.DataFrame({'name': name_series, 'height': heights, 'weight': weights})
print(df)
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
# 2 Chris 69 139
# 3 Diya 69 144
# 4 Emma 71 152
```

Although DataFrames variables are often named `df`

in `pandas`

examples, this is ** not** a good variable name. You can and should use much more descriptive names for your DataFrames (e.g.,

`person_size_table`

) when doing actual analysis.Note that you can specify the order of columns in the table using the `columns`

keyword argument, and the order of the rows using the `index`

keyword argument.

It is also possible to create a DataFrame directly from a spreadsheet—such as from ** .csv** file (containing

**c**omma

**s**separated

**v**alues) by using the

`pandas.read_csv()`

function:`my_dataframe = pd.read_csv('path/to/my/file.csv')`

See the IO Tools documentation for details and other file-reading functions.

### 11.3.1 DataFrame Operations and Methods

Much like Series, DataFrames support a **vectorized** form of mathematical and relational operators: when the other operand is a *scalar* (e.g., a single number or a string), then the operation is applied member-wise to each value in the DataFrame:

```
# data frame of test scores
test_scores = pd.DataFrame({
'math':[91, 82, 93, 100, 78, 91],
'spanish':[88, 79, 77, 99, 88, 93]
})
curved_scores = test_scores * 1.02 # curve scores up by 2%
print(curved_scores)
# math spanish
# 0 92.82 89.76
# 1 83.64 80.58
# 2 94.86 78.54
# 3 102.00 100.98
# 4 79.56 89.76
# 5 92.82 94.86
print(curved_scores > 90)
# math spanish
# 0 True False
# 1 False False
# 2 True False
# 3 True True
# 4 False False
# 5 True True
```

It is possible for both operands be DataFrames. In this case, the operation is applied **member-wise**, where values are matched if they have the same row *and* column label. Note that any value that doesn’t have a pair will instead produce the value `NaN`

(Not a Number). This is not a normal way of working with DataFrames—it is much more common to access individual rows and columns and work with those (e.g., make a new column that is the sum of two others); see below for details.

Also like Series, DataFrames objects support a large number of methods, including:

Function | Description |
---|---|

`index` |
an attribute; the sequence of row index labels (convert to a list to use) |

`columns` |
an attribute; the sequence of column index labels (convert to a list to use) |

`head(n)` |
returns a DataFrame containing only the first `n` rows |

`tail(n)` |
returns a DataFrame containing only the last `n` rows |

`assign(...)` |
returns a new DataFrame with an additional column; call as `df.assign(new_label=new_column)` |

`drop(label, row_or_col)` |
returns a new DataFrame with the given row or column removed |

`mean()` |
returns a Series of the statistical means of the values of each column |

`all()` |
returns a Series of whether ALL the elemnts in each column are `True` (or “truthy”) |

`describe()` |
returns a DataFrame whose columns are Series of descriptive statistics for each column in the original DataFrame |

You may notice that many of these methods (e.g., `head()`

, `mean()`

, `describe()`

, `any()`

) also exist for Series! In fact, most every method that Series support are supported by DataFrames as well. These methods are all applied **per column** (not per row)—that is, calling `mean()`

on a DataFrame will calculate the *mean* of **each column** in that DataFrame:

```
df = pd.DataFrame({
'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
'height': [64, 74, 69, 69, 71],
'weights': [135, 156, 139, 144, 152]
})
df.mean()
# height 69.4
# weights 145.2
# dtype: float64
```

If the Series method would return a *scalar* (a single value, as with `mean()`

or `any()`

), then the DataFrame method returns a Series whose labels are the column labels, as above. If the Series method instead would return a Series (multiple values, as with `head()`

or `describe()`

), then the DataFrame method returns a new DataFrame whose columns are each of the resulting Series:

```
df = pd.DataFrame({
'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
'height': [64, 74, 69, 69, 71],
'weights': [135, 156, 139, 144, 152]
})
df.describe()
# height weights
# count 5.000000 5.000000
# mean 69.400000 145.200000
# std 3.646917 8.757854
# min 64.000000 135.000000
# 25% 69.000000 139.000000
# 50% 69.000000 144.000000
# 75% 71.000000 152.000000
# max 74.000000 156.000000
```

Notice that the `height`

column is the result of calling `describe()`

on the DataFrame’s `height`

column Series!

As a general rule: if you’re expecting one value per column, you’ll get a Series of those values; if you’re expecting multiple values per column, you’ll get a DataFrame of those values.

This also means that you can sometimes “double-call” methods to reduce them further. For example, `df.all()`

returns a Series of whether each column contains only `True`

values; `df.all().all()`

would check if *that* Series contains all `True`

values (thus checking *all* columns have all `True`

value, i.e., the entire table is all `True`

values).

### 11.3.2 Accessing DataFrames

DataFrames make it possible to quickly access individual or a subset of values, though these methods use a variety of syntax structures. For this explanation, refer to the following sample DataFrame initially described above:

```
# all examples in this section
df = pd.DataFrame({
'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
'height': [64, 74, 69, 69, 71],
'weights': [135, 156, 139, 144, 152]
})
print(df)
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
# 2 Chris 69 139
# 3 Diya 69 144
# 4 Emma 71 152
```

Since DataFrames are most commonly viewed as a *dictionary of columns*, it is possible to access tho as such using **bracket notation** (using the index label of the column):

```
print( df['height'] ) # get height column
# 0 64
# 1 74
# 2 69
# 3 69
# 4 71
# Name: height, dtype: int64
```

However, it is often more common to refer to individual columns using **dot notation**, treating each column as an *attribute* or *property* of the DataFrame object:

```
# same results as above
print( df.height ) # get height column
```

It is also possible to select *multiple* columns by using a *list* or sequence inside the **bracket notation** (similar to selecting multiple values from a Series). This will produce a new DataFrame (a “sub-table”)

```
# count the brackets carefully!
print( df[['name', 'height']] ) # get name and height columns
# can also select multiple columns with a list of their positions
print( df[[1,2]] ) # get 1st (name) and 2nd (weight) columns
```

*Watch out though*! Specifying a **slice** (using a colon ** :**) will actually select by

*row*position, not column position! I do not know wherefore this inconsistency, other than “convenience”.

```
print( df[0:2] ) # get ROWS 0 through 2 (not inclusive)
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
```

Because DataFrames support multiple indexes, it is possible to use **boolean indexing** (as with Series), allowing you to *filter* for rows based the values in their columns:

```
# Access rows where the `height` column has a value greater than 70
print( df[df.height > 70] )
# name height weight
# 1 Bob 74 156
# 4 Emma 71 152
```

Note that `df.height`

is a Series (a column), so `df.height > 70`

produces a Series of boolean values (`True`

and `False`

). This Series is used to determine *which* rows to return from the DataFrame—each row that corresponds with a `True`

index.

Finally, DataFrames also provide two *attributes* (properties) used to “quick access” values: ** loc**, which provides an “index” (lookup table) based on index labels, and

**, which provides an “index” (lookup table) based on row and column positions. Each of these “indexes” can be thought of as a**

`iloc`

*dictionary*whose values are the individual elements in the DataFrame, and whose keys can therefore be used to access those values using

**bracket notation**. The dictionaries support multiple types of keys (using label-based

`loc`

as an example):Key Type | Description | Example |
---|---|---|

`df.loc[row_label]` |
An individual value | `df.loc['Ada']` (the row labeled `Ada` ) |

`df.loc[row_label_list]` |
A list of row labels | `df.loc[['Ada','Bob']]` (the rows labeled `Ada` and `Bob` ) |

`df.loc[row_label_slice]` |
A slice of row labels |
`df.loc['Bob':'Diya']` (the rows from `Bob` to `Diya` . Note that this is an inclusive slice!) |

`df.loc[row_label, col_label]` |
A tuple of `(row, column)` |
`df.loc['Ada', 'height']` (the value at row `Ada` , column `height` ) |

`df.loc[row_label_seq, col_label_seq]` |
A tuple of label lists or slices |
`df.loc['Bob':'Diya', ['height','weight']]` (the rows from `Bob` to `Diya` with the columns `height` and `weight` ) |

Note that the example

`df`

table doesn’t have row labels beyond`0`

to`4`

Using a

*tuple*makes it easy to access a particular value in the table, or a range of values (*selecting*rows and columns ).You can also use the boundless slice

`:`

to refer to “all elements”. So for example:`df.loc[:, 'height'] # get all rows, but just the 'height' column`

This is a basic summary of how to create and access DataFrames; for more detailed usage, additional methods, and specific “recipes”, see the official `pandas`

documentation.

## 11.4 Aggregation and Grouping

https://jakevdp.github.io/PythonDataScienceHandbook/index.html https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

//this is about taking lots of different things and *reducing* them to one thing.

//aggregation functions exist (mean, std, min, max) – when called on a data frame, applies per row

//groupby explanation

– lab section example from slides

## 11.5 Joining DataFrames Together

//concat/append are things

//merge(join) explanation
- default is “inner”
- compare to “left” & “right” [`how`

] key

## Resources

- 10 minutes to pandas (pandas docs) a basic set of examples
- Tutorials (pandas docs) a list and guide to various tutorials (of mixed quality)
- Intro to Data Structure (pandas docs)
- Essential Basic Functionality (pandas docs) not really basic, but a complete set of examples
- Pandas. Data Processing (Data Analysis in Python)
- Python Data Science Handbook a full textbook with extensive details about using Pandas (and related libraries).