Chapter 9 The pandas Library
This chapter introduces the Python Data Analysis library pandas—a set of modules, functions, and classes used to easily and efficiently perform data analysis. pandas’s speciality is its highly optimized performance when working with large data sets. pandas is the most common library used with Python for Data Science (and mirrors the R language in many ways, allowing programmers to easily move between the two).
This chapter will give you an introduction to this library. It will discuss the two main data structures used by pandas (Series and DataFrames) and how to use them to organize and work with data, as well as to perform basic grouping and aggregation functions. Note that this should not be considered a complete reference or tutorial for the library—there’s a lot it can do, and this text gives you just a taste!
9.1 Setting up pandas
pandas is a third-party library (not built into Python!), but it is included by default with most Python setups such as Anaconda, and so can usually be imported without additional installation. Additionally, pandas is built on top of the numpy scientific computing library, which supports highly optimized mathematical operations. Thus many pandas operations involve working with numpy data structures, and the pandas library requires numpy (which is also included in Anaconda) to be imported alongside it:
# import libraries
import pandas as pd # standard shortcut names
import numpy as np
Normal practice is to import the module and reference types and methods using dot notation, rather than importing them into the global namespace. This helps keep the global namespace from being overfilled with variables. Also note that this chapter will focus primarily on pandas, leaving numpy-specific data structures and functions for the reader to explore.
9.2 Series
The first basic pandas data structure is a Series. A Series represents a one-dimensional ordered collection of values, making it similar to a regular Python list. However, elements can also be given labels (called the index), which can be non-numeric values similar to the keys in a Python dictionary. This makes a Series a bit like an “ordered dictionary”—one that supports additional methods and efficient data-processing behaviors.
Series can be created using the Series() function (a constructor for instances of the class):
# create a Series from a list
number_series = pd.Series([1, 2, 2, 3, 5, 8])
print(number_series)
This code would print out:
0 1
1 2
2 2
3 3
4 5
5 8
dtype: int64
Printing a Series will display it like a table: the first value in each row is the index (label) of that element, and the second is the value of the element in the Series. Printing will also display the type of the elements in the Series. All elements in the Series will be treated as the “same” type—if you create a Series from mixed elements (e.g., numbers and strings), the type will be the generic object. In practice, you almost always create Series from a single type.
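For example, a minimal sketch of such a mixed-type Series:
mixed_series = pd.Series([1, 'two', 3.0]) # a mix of numbers and a string
print(mixed_series)
# 0      1
# 1    two
# 2    3.0
# dtype: object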
If you create a Series from a list, each element will be given an index (label) that is that value’s index in the list. You can also create a Series from a dictionary, in which case the keys will be used as the index labels:
# create a Series from a dictionary
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
print(age_series)
sarah 42
amit 35
zhang 13
dtype: int64
Notice that the Series keeps the dictionary’s keys as its index labels, in the order they appeared in the dictionary! (In older versions of pandas, the elements were instead automatically sorted by key.) This means that the order of the elements in the Series will always be the same for a given dictionary.
9.2.1 Series Operations
The main benefit of Series (as opposed to normal lists or dictionaries) is that they provide a number of operations and methods that make it easy to consider and modify the entire Series, rather than needing to work with each element individually. These functions include built-in mapping and filtering style operations, as well as reducing aggregations.
Basic operators (whether math operators such as + and -, or relational operators such as > or ==) applied to a Series are what are called vectorized operations, meaning the operation is applied to Series elements pair-wise. This means that each element from the first Series operand is combined with the element at the corresponding position in the second Series operand, producing the value at the corresponding position of the resulting Series. In other words, if you want to add two Series, then the value of the first element in the result will be the sum of the first elements in each Series, the second element in the result will be the sum of the second elements in each Series, and so on.
# Create two Series to combine
s1 = pd.Series([3, 1, 4, 1, 5])
s2 = pd.Series([1, 6, 1, 8, 0])
s3 = s1 + s2 # add together
print(s3)
# 0 4 # 3 + 1
# 1 7 # 1 + 6
# 2 5 # 4 + 1
# 3 9 # 1 + 8
# 4 5 # 5 + 0
# dtype: int64
# Create two Series to combine
s1 = pd.Series([2, 2, 2, 2, 2])
s2 = pd.Series([1, 2, 3, 4, 5])
# Examples of operations (list version only includes values)
list(s1 + s2) # [3, 4, 5, 6, 7]
list(s1 / s2) # [2.0, 1.0, 0.6666666666666666, 0.5, 0.4]
list(s1 < s2) # [False, False, True, True, True]
# Add a Series to itself (why not?)
list(s2 + s2) # [2, 4, 6, 8, 10]
# Perform more advanced arithmetic!
s3 = (s1 + s2) / (s1 + s1)
list(s3) # [0.75, 1.0, 1.25, 1.5, 1.75]
These operations will be fast, even for very large Series, allowing for effective data manipulations.
Remember that Series operations are performed on matching index labels. If a label doesn’t appear in both operands, then the result will have an undefined value for it:
alpha_series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'm': 13})
omega_series = pd.Series({'m': 13, 'z': 26})
result = alpha_series + omega_series
print(result)
# a NaN
# b NaN
# c NaN
# m 26.0
# z NaN
# dtype: float64
In this example, only the 'm' label was shared between the Series, so only that one matched; the other values weren’t defined, and so produce NaN (Not A Number) values as a result. Thus you can perform mathematical operations on Series of different sizes; it’s just that some indices won’t match. But if you try to compare (using < or ==) Series with different indices, you will get an error.
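For example, using the Series defined above (the exact error message may vary by pandas version):
# comparing Series with different indices raises an error
alpha_series < omega_series # ValueError: Can only compare identically-labeled Series objects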
It is also possible to use a scalar (a single value) as an operand with a Series. This is referred to as broadcasting. The idea is that the smaller “set” of elements (e.g., a single value) is broadcast — expanded — so that it has a comparable size, thereby allowing different “sized” data structures to interact. Technically, operating on a Series with a scalar is actually a specific case of operating on it with another Series!
sample = pd.Series(range(1,6)) # Series of numbers from 1 to 5 (6 is excluded)
result = sample + 4 # add 4 to each element (produces new Series)
print(result)
# 0 5
# 1 6
# 2 7
# 3 8
# 4 9
# dtype: int64
is_less_than_3 = sample < 3 # compare each element
print(is_less_than_3)
# 0 True
# 1 True
# 2 False # comparing to the value, not the index!
# 3 False
# 4 False
# dtype: bool
Series containing booleans also support logical operators (“and” and “or”), using the operator & for “and” and | for “or”:
days_it_rained = pd.Series({"mon": True, "tue": True, "wed": False, "thu": True, "fri": False})
days_it_snowed = pd.Series({"mon": False, "tue": True, "wed": True, "thu": True, "fri": False})
days_it_rained_and_snowed = days_it_rained & days_it_snowed # combine with "and"
print(days_it_rained_and_snowed)
# mon False
# tue True
# wed False
# thu True
# fri False
# dtype: bool
9.2.2 Series Methods
pandas Series also include a number of methods for inspecting and manipulating their data. Some useful examples are shown below (this is not a comprehensive listing):
sample_numbers = pd.Series([8, 6, 7, 5, 3, 0, 9]) # an example to work with
# The `head(n)` method returns a Series containing only the first `n` elements
# Note that this is a *new Series*
# The `tail(n)` method works similarly, but returns the last `n` elements.
first_3 = sample_numbers.head(3)
print(first_3)
# 0 8
# 1 6
# 2 7
# dtype: int64
# You can use simple aggregation methods, such as:
maximum_value = sample_numbers.max() # gets the maximum value of the series
maximum_value_index = sample_numbers.idxmax() # get the label (index) where the maximum value is
mean_of_values = sample_numbers.mean() # gets the average (statistical mean) of the values
std_dev_of_values = sample_numbers.std() # gets the standard deviation of the values
# There are also aggregation methods for Series of booleans
# The `any()` method returns whether ANY of the elements are `True`
# The `all()` method returns whether ALL of the elements are `True`
larger_than_3_series = sample_numbers > 3 # make a new series of Booleans
any_bigger = larger_than_3_series.any()
all_bigger = larger_than_3_series.all()
print(any_bigger) # True
print(all_bigger) # False
# get a Series of descriptive statistics
# you would need to access the individual values to use them
description_series = sample_numbers.describe()
print(description_series)
# count 7.000000
# mean 5.428571
# std 3.101459
# min 0.000000
# 25% 4.000000
# 50% 6.000000
# 75% 7.500000
# max 9.000000
# dtype: float64
Notice that these are methods (called with dot notation), and generally they’re used to do some kind of processing on the whole Series!
Series support many more methods as well: see the full documentation for a complete list.
9.2.3 Series Indexing
Like dictionaries, each element in a Series has an index (a key or label). If the Series was produced from a list, the indices will just be numeric positions; if it was produced from a dictionary, then the indices will be the keys of that dictionary (but they are still called indices in pandas!). You can access the sequence of indices by using the index attribute (it’s not a method!); you will need to convert that sequence into a list to make use of it.
letter_series = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
letter_indices = list(letter_series.index)
print(letter_indices) # [0, 1, 2, 3, 4, 5]
age_indices = list(age_series.index)
print(age_indices) # ['sarah', 'amit', 'zhang']
Just like lists and dictionaries, elements in a Series can be accessed using bracket notation, putting the index label inside the brackets:
letter_series = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
# Get the 1th element from the letter_series
letter_series[1] # 'b'
# Get the 'amit' element from age_series
age_series['amit'] # 35
# Get the 0th element from age_series
# (Series are ordered, so can always be accessed positionally!)
# Note: newer versions of pandas deprecate this positional bracket access;
# age_series.iloc[0] is the preferred form
age_series[0] # 42
(The returned values from using bracket notation are not technically basic int or float or string types, but are rather specific numpy objects that work almost identically to their normal type, but with some additional optimization. This shouldn’t impact anything you do.)
You can also use list-style slices using the colon operator (e.g., elements 1:3). Additionally, it is possible to specify a sequence of indices (i.e., a list or range or even a Series of indices) to access using bracket notation. This will produce a new Series object that contains only the elements that have those labels:
age_series = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
index_list = ['sarah', 'zhang']
print(age_series[index_list])
# sarah 42
# zhang 13
# dtype: int64
# using an anonymous variable for the index list (notice the brackets!)
print(age_series[['sarah', 'zhang']])
# sarah 42
# zhang 13
# dtype: int64
This also means that you can use something like a list comprehension (or even a Series operation!) to determine which elements to select from a Series!
letter_series = pd.Series(['a','b','c','d','e','f'])
even_numbers = [num for num in range(0,6) if num%2 == 0] # [0, 2, 4]
# Get letters with even numbered indices
letter_series[even_numbers]
# 0 a
# 2 c
# 4 e
# dtype: object
# The same thing, but in one line (check the brackets!)
letter_series[[num for num in range(0,6) if num%2 == 0]]
Note in the last line that there are two sets of brackets: one set being used to access the values of letter_series, and one used to specify a list (using a comprehension).
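You can also use the colon-operator slices mentioned above; a minimal sketch:
letter_series = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
print(letter_series[1:3]) # elements at positions 1 and 2 (3 is excluded)
# 1    b
# 2    c
# dtype: object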
Finally, using a sequence of booleans with bracket notation will produce a new Series containing the elements whose position corresponds with the position of a True value. So if the first element of the boolean sequence is True, then the first element of the Series will be accessed; if the second element of the boolean sequence is True, then the second element of the Series will be accessed, and so on. This process is called boolean indexing.
shoe_sizes = pd.Series([7, 6.5, 4, 11, 8]) # a series of shoe sizes
index_filter = [True, False, False, True, True] # list of which elements to extract
# Extract every element in an index that is True
shoe_sizes[index_filter] # has values 7.0, 11.0, 8.0
In this example, since index_filter is True at index 0, 3, and 4, shoe_sizes[index_filter] returns a Series with the elements from index numbers 0, 3, and 4.
This technique is incredibly powerful because it allows you to easily perform filtering operations on a Series:
shoe_sizes = pd.Series([7, 6.5, 4, 11, 8]) # a Series of shoe sizes
big_sizes = shoe_sizes > 6.5 # a Series of booleans: True, False, False, True, True
big_shoes = shoe_sizes[big_sizes] # a Series with values 7, 11, 8
# In one line
big_shoes = shoe_sizes[shoe_sizes > 6.5]
You can think of the last statement as saying shoe sizes where shoe size is greater than 6.5.
Remember that you can use the logical operators & (“and”) and | (“or”) to combine multiple Series of booleans, allowing you to create complex “filtering” statements out of multiple Series:
# access elements from "daily temperatures" where it was raining
# or the wind was high
# (assume all three values are Series)
# note the parentheses: relational operators have lower precedence than |
daily_temperature[days_rained | (days_windspeed == "high")]
While it is perfectly possible to do similar filtering with a list comprehension, the boolean indexing syntax can be very simple to read and runs quickly. (This is also the normal style of doing filtering in the R programming language.)
9.3 DataFrames
The most common data structure used in pandas (more common than Series, in fact!) is a DataFrame. A DataFrame represents a table, where data is organized into rows and columns. You can think of a DataFrame as being like an Excel spreadsheet or a SQL table.
This book has previously represented tabular data using a list of dictionaries. However, this required you to be careful to make sure that all of the dictionaries shared keys, and did not offer easy ways to interact with the table in terms of its rows or columns. DataFrames give you that functionality!
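In fact, the DataFrame() constructor (introduced below) can convert such a list of dictionaries directly; a minimal sketch:
# a table as a list of dictionaries (the representation used previously)
people_list = [
    {'name': 'Ada', 'height': 64},
    {'name': 'Bob', 'height': 74}
]
people_from_records_df = pd.DataFrame(people_list) # convert to a DataFrame
print(people_from_records_df)
#   name  height
# 0  Ada      64
# 1  Bob      74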
A DataFrame can be understood as a dictionary of Series, where each Series represents a column of the table. The keys of this dictionary are the index labels of the columns, while the index labels of the Series themselves serve as the labels for the rows.
This structure is distinct from spreadsheets or SQL tables or even lists-of-dictionaries, which are often seen as a collection of observations (rows). Programmatically, DataFrames should primarily be considered as a collection of features (columns), which happen to be sequenced to correspond to observations.
A DataFrame can be created using the DataFrame() function (a constructor for instances of the class). This function usually takes as an argument a dictionary whose values are Series (or values that can be converted into a Series, such as a list or a dictionary):
name_series = pd.Series(['Ada', 'Bob', 'Chris', 'Diya', 'Emma'])
heights = [64, 74, 69, 69, 71] # lists can be made into Series
weights = [135, 156, 139, 144, 152]
people_df = pd.DataFrame({'name': name_series, 'height': heights, 'weight': weights})
print(people_df)
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
# 2 Chris 69 139
# 3 Diya 69 144
# 4 Emma 71 152
Notice that the rows are labeled with indices 0, 1, 2, 3, 4 (the indices used by the individual Series—which were made from lists), and the columns are labeled with indices name, height, weight (the keys used in the dictionary that created the DataFrame).
If you end a Jupyter cell with an expression that is a DataFrame, Jupyter will format it as a graphical table—but it’s still a DataFrame!
Although DataFrame variables are often named just df in pandas examples, this is not a good variable name! You can and should use much more descriptive names for your DataFrames (e.g., person_size_table or person_size_df) when doing actual analysis.
It is possible to specify the order of columns in the table using the columns keyword argument, and the order of the rows using the index keyword argument. However, the ordering of rows and columns is often not necessary for data analysis.
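For example (a minimal sketch; the row labels here are arbitrary):
# specify column order and row labels explicitly
sized_df = pd.DataFrame(
    {'name': ['Ada', 'Bob'], 'height': [64, 74]},
    columns=['height', 'name'],  # put the height column first
    index=['row1', 'row2']       # use custom row labels
)
print(sized_df)
#       height name
# row1      64  Ada
# row2      74  Bob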
It is also possible to create a DataFrame directly from a spreadsheet—such as from a .csv file (containing comma-separated values)—by using the pandas.read_csv() function:
my_dataframe = pd.read_csv('path/to/my/file.csv')
See the IO Tools documentation for details and other file-reading functions.
9.3.1 DataFrame Operations
Much like Series, DataFrames support a vectorized form of mathematical and relational operators: when the other operand is a scalar (e.g., a single number or a string), then the operation is applied member-wise to each value in the DataFrame:
# data frame of test scores
test_scores = pd.DataFrame({
'math':[91, 82, 93, 100, 78, 91],
'spanish':[88, 79, 77, 99, 88, 93]
})
# A mathematical operator applies to each element in the data frame
curved_scores = test_scores * 1.02 # curve scores up by 2%
print(curved_scores)
# math spanish
# 0 92.82 89.76
# 1 83.64 80.58
# 2 94.86 78.54
# 3 102.00 100.98
# 4 79.56 89.76
# 5 92.82 94.86
# A relational operator applies to each element in the data frame
print(curved_scores > 90)
# math spanish
# 0 True False
# 1 False False
# 2 True False
# 3 True True
# 4 False False
# 5 True True
It is possible to apply such operators when both operands are DataFrames. In this case, the operation is applied member-wise, where values are matched if they have the same row and column label. Note that any value that doesn’t have a pair will instead produce the value NaN (Not a Number). This is not a normal way of working with DataFrames—it is much more common to access individual rows and columns and work with those (e.g., make a new column that is the sum of two others); see below for details.
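To illustrate (a minimal sketch):
first_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
second_df = pd.DataFrame({'a': [10, 20], 'c': [30, 40]})
print(first_df + second_df) # only column `a` has pairs in both tables
#     a   b   c
# 0  11 NaN NaN
# 1  22 NaN NaN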
9.3.2 DataFrame Methods
Also like Series, DataFrame objects support a large number of methods that can be used to modify or interrogate the data structure.
people_df = pd.DataFrame({
'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
'height': [64, 74, 69, 69, 71],
'weight': [135, 156, 139, 144, 152]
})
print(people_df) # for reference
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
# 2 Chris 69 139
# 3 Diya 69 144
# 4 Emma 71 152
# The `head(n)` method returns a DataFrame containing only the first `n` ROWS
# Note that this is a *new DataFrame*
# The `tail(n)` method works similarly, but returns the last `n` ROWS.
first_2_rows = people_df.head(2)
print(first_2_rows)
# name height weight
# 0 Ada 64 135
# 1 Bob 74 156
# The `assign()` method returns a new DataFrame with additional columns
# Each new column is named by the argument name and given the argument value
# This does not modify the original DataFrame!
people_with_hats_df = people_df.assign(wearing_hat=[False, True, False, True, True])
print(people_with_hats_df)
# name height weight wearing_hat
# 0 Ada 64 135 False
# 1 Bob 74 156 True
# 2 Chris 69 139 False
# 3 Diya 69 144 True
# 4 Emma 71 152 True
extra_columns_df = people_df.assign(col1=1, col2=2, col3=3)
print(extra_columns_df)
# name height weight col1 col2 col3
# 0 Ada 64 135 1 2 3
# 1 Bob 74 156 1 2 3
# 2 Chris 69 139 1 2 3
# 3 Diya 69 144 1 2 3
# 4 Emma 71 152 1 2 3
# The `drop()` method returns a new DataFrame with rows or columns removed
# The `axis` argument indicates whether to drop rows (`axis="index"`)
# or columns (`axis="columns"`)
names_only_df = people_df.drop(['height', 'weight'], axis="columns")
print(names_only_df) # A DataFrame with a single column!
# name
# 0 Ada
# 1 Bob
# 2 Chris
# 3 Diya
# 4 Emma
select_rows_df = people_df.drop([1, 2], axis="index")
print(select_rows_df)
# name height weight
# 0 Ada 64 135
# 3 Diya 69 144
# 4 Emma 71 152
While the drop() method can be used to remove rows or columns, it’s more common to use boolean indexing to access the desired rows and columns; see below for details. When working with data structures in programming, it’s better to think about “what you want to keep” rather than “what you want to remove”.
DataFrames also support most of the same methods that Series do, such as mean(), any(), describe(), etc. With a DataFrame, these methods are applied per column (not per row). So calling mean() on a DataFrame will calculate the mean of each column in that DataFrame, producing a new Series of those values:
# continuing from previous example
people_measures_only_df = people_df.drop("name", axis="columns")
# Get the statistical mean (average) of each column of the DataFrame
people_means_series = people_measures_only_df.mean()
print(people_means_series)
# height 69.4
# weight 145.2
# dtype: float64
To be clear: the result of calling e.g., .mean() is a Series, not a DataFrame (remember that Series can have indices that are not numeric, like a dictionary does!).
Series methods like .describe() that return a Series are also supported: in this case, they return a Series for each column in the DataFrame… compiled into a new DataFrame!
people_description_df = people_measures_only_df.describe()
print(people_description_df)
# height weight
# count 5.000000 5.000000
# mean 69.400000 145.200000
# std 3.646917 8.757854
# min 64.000000 135.000000
# 25% 69.000000 139.000000
# 50% 69.000000 144.000000
# 75% 71.000000 152.000000
# max 74.000000 156.000000
The height column of people_description_df is the result of calling describe() on the DataFrame’s height column Series.
If a Series method would return a scalar (a single value, as with mean() or any()), then calling that method on a DataFrame returns a Series whose labels are the column labels. If the Series method instead would return a Series (multiple values, as with head() or describe()), then calling that method on a DataFrame returns a new DataFrame whose columns are each of the resulting Series. So as a general rule: if you’re expecting one value per column, you’ll get a Series of those values; if you’re expecting multiple values per column, you’ll get a DataFrame of those values.
This also means that you can sometimes “double-call” methods to reduce them further. For example, df.all() returns a Series of whether each column contains only True values; df.all().all() would check if that Series contains all True values (thus checking that all columns have all True values, i.e., the entire table is all True values).
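For example (a minimal sketch):
flags_df = pd.DataFrame({'x': [True, True], 'y': [True, False]})
print(flags_df.all()) # one value per column: a Series
# x     True
# y    False
# dtype: bool
print(flags_df.all().all()) # reduce that Series to a single value
# False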
9.3.3 Accessing DataFrames
It is possible to access values of a DataFrame by index, just as you can with Series. But DataFrames are two-dimensional data structures, so they have both an index (which represents the rows) and columns (which represent the columns). Thus you can talk about the index or column of a DataFrame to refer to a specific element. Similar to the Series’ index attribute, you can access a list of indices or a list of columns by using the index and columns attributes respectively:
people_df = pd.DataFrame({
'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
'height': [64, 74, 69, 69, 71],
'weight': [135, 156, 139, 144, 152]
})
people_indices = list(people_df.index)
print(people_indices) # [0, 1, 2, 3, 4]
people_columns = list(people_df.columns)
print(people_columns) # ['name', 'height', 'weight']
It is also possible to access DataFrame values using bracket notation. As DataFrames are commonly viewed as a dictionary of columns, the value used inside the brackets is the label of the column:
# get the height column using bracket notation
height_column = people_df['height']
print(height_column)
# 0 64
# 1 74
# 2 69
# 3 69
# 4 71
# Name: height, dtype: int64
If you try to put a row label or index inside of the brackets, you’ll get a KeyError that the key isn’t defined—similar to if you tried to access a key that wasn’t in a dictionary.
While bracket notation works, it is often more common to refer to individual columns using dot notation, treating each column as an attribute or property of the DataFrame object:
# get the height column using dot notation
height_column = people_df.height
print(height_column)
# 0 64
# 1 74
# 2 69
# 3 69
# 4 71
# Name: height, dtype: int64
This alternate syntax does exactly the same thing—it’s just written in a different way! The dot notation can be slightly easier to type and to read (less punctuation is involved), making it often preferred. Being able to use dot notation this way is a feature of DataFrames, and doesn’t apply to normal dictionaries.
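One caveat worth knowing: dot notation only works when the column name is a valid Python identifier and doesn’t clash with an existing DataFrame attribute or method; otherwise you need bracket notation. A minimal sketch with a hypothetical table:
# a hypothetical DataFrame with an awkward column name
sizes_df = pd.DataFrame({'shoe size': [7, 11], 'count': [2, 3]})
sizes_df['shoe size'] # works
# sizes_df.shoe size  # SyntaxError! not a valid identifier
sizes_df['count']     # works: the column
# note: sizes_df.count refers to the DataFrame's count() method, not the column!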
As with Series, it is also possible to select multiple columns by using a list or sequence inside the brackets. This will produce a new DataFrame (a “sub-table”):
list_of_columns = ['name', 'height']
subtable = people_df[list_of_columns]
print(subtable)
# name height
# 0 Ada 64
# 1 Bob 74
# 2 Chris 69
# 3 Diya 69
# 4 Emma 71
# same as above, but in one line:
subtable = people_df[['name', 'height']]
print(subtable)
You can’t access columns by their position, so you cannot specify a list of numbers—it would need to be a list of column names (usually strings).
However: specifying a slice (using a colon :) will select by row position, not column position! I do not know the reason for this inconsistency, other than “convenience”.
# get ROWS 1 through 4 (not inclusive)
subtable = people_df[1:4]
print(subtable)
# name height weight
# 1 Bob 74 156
# 2 Chris 69 139
# 3 Diya 69 144
It is also possible to use bracket notation when assigning values to a DataFrame. This is a useful way to add additional columns to the DataFrame if you don’t want to use the assign() method—in particular, this will modify the DataFrame “in place” (without making a copy)… though this can potentially lead to errors with changing referenced values.
people_df['wearing_hat'] = [False, True, False, True, True]
print(people_df)
# name height weight wearing_hat
# 0 Ada 64 135 False
# 1 Bob 74 156 True
# 2 Chris 69 139 False
# 3 Diya 69 144 True
# 4 Emma 71 152 True
Note that if you try to use an assignment operator to modify a subset (called a “slice”) of a DataFrame—such as one you selected with bracket notation—you will get a SettingWithCopyWarning.
# Change the element in the `wearing_hat` column and row 3 to be False
# Diya is no longer wearing a hat
people_df['wearing_hat'][3] = False # <SettingWithCopyWarning>
This is a warning, not an error; your operation will succeed but it will look like something went wrong. This warning is shown because such operations have a chance of creating unpredictable results, so pandas gives you a warning not to do that. See the user guide for a more detailed discussion. If you get this warning, you should adjust your indexing so you’re not getting a subset… or just use a different method of assigning new values to a variable.
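One such method: the loc attribute (described below) lets you specify the row and column together in a single assignment, which avoids the warning. A minimal sketch:
# Change the element in the `wearing_hat` column and row 3 in one step
people_df.loc[3, 'wearing_hat'] = False # no warning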
Because DataFrames support multiple indexes, it is possible to use boolean indexing (as with Series), allowing you to filter for rows based on the values in their columns. Note carefully that boolean indexing selects by row (not by column)—a list of column names selects columns, but a list of booleans selects rows!
# Access the `height` column (a Series) and compare to 70
are_tall_series = people_df.height > 70
print(are_tall_series)
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Name: height, dtype: bool
tall_people_df = people_df[are_tall_series]
print(tall_people_df)
# name height weight
# 1 Bob 74 156
# 4 Emma 71 152
# As a single line:
tall_people_df = people_df[people_df.height > 70]
Note that people_df.height is a Series (a column), so people_df.height > 70 produces a Series of boolean values (True and False). This Series is used to determine which rows to return from the DataFrame—each row that corresponds with a True index.
Finally, DataFrames also provide two attributes (properties) used to “quick access” values: loc, which provides an “index” (lookup table) based on index labels, and iloc, which provides an “index” (lookup table) based on row and column positions. Each of these “indexes” can be thought of as a dictionary whose values are the individual elements in the DataFrame, and whose keys can therefore be used to access those values using bracket notation. The dictionaries support multiple types of keys (using label-based loc as an example):
Key Type | Description | Example |
---|---|---|
df.loc[row_label] | An individual row | df.loc['Ada'] (the row labeled Ada) |
df.loc[row_label_list] | A list of row labels | df.loc[['Ada','Bob']] (the rows labeled Ada and Bob) |
df.loc[row_label_slice] | A slice of row labels | df.loc['Bob':'Diya'] (the rows from Bob to Diya. Note that this is an inclusive slice!) |
df.loc[row_label, col_label] | A tuple of (row, column) | df.loc['Ada', 'height'] (the value at row Ada, column height) |
df.loc[row_label_seq, col_label_seq] | A tuple of label lists or slices | df.loc['Bob':'Diya', ['height','weight']] (the rows from Bob to Diya with the columns height and weight) |
df.iloc[row_index, col_index] | A tuple of (row number, column number) | df.iloc[0, 2] (the value at the 0th row and 2th column) |
df.iloc[row_index_seq, col_index_seq] | A tuple of positional lists or slices | df.iloc[1:3, [1,2]] (the 1th to 3th rows (exclusive) with the 1th and 2th columns) |
The loc attribute thus lets you use bracket notation, but specifying the rows and columns at the same time, separated by a comma (because you’re passing in a tuple)! The iloc attribute works similarly, but instead of specifying the labels for the rows and columns, you specify the positional indices.
You can also use the boundless slice : to refer to “all elements”. So for example:
people_df.loc[:, 'height'] # get all rows, but just the 'height' column
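Note that the loc examples in the table above assume a DataFrame whose row labels are names (like 'Ada'). A minimal sketch of producing and indexing such a table, using the set_index() method (not otherwise covered in this chapter) on the people_df example:
# index the people table by name instead of by row number
named_people_df = people_df.set_index('name')
print(named_people_df.loc['Ada', 'height']) # 64 (row Ada, column height)
print(named_people_df.loc['Bob':'Diya', ['height', 'weight']]) # inclusive slice of rows
#        height  weight
# name
# Bob        74     156
# Chris      69     139
# Diya       69     144
print(named_people_df.iloc[0, 0]) # 64 (0th row, 0th column)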
9.4 Grouping
The pandas library is designed for, and most useful when, working with large data sets—which might have hundreds or thousands of rows. And when working with such data, you often want to perform aggregations and get statistics on only part of the data set, rather than the whole. For example, rather than calculate the average temperature across an entire year’s worth of recordings, maybe you want to get the average of each month, or the average on each day of the week. What you’d like to do is to break the dataset into different groups (subsets of the data) and then apply existing methods like max() or mean() to each of those groups. While it’s possible to do this by accessing the subsets you want and then using e.g., a loop to process each one, pandas provides additional functionality to create and work with subgroups of a DataFrame.
As a worked example, consider the below DataFrame of student exam scores. You might be interested in comparing scores within or between sections, or within or between student years.
# A DataFrame of student scores on exams
scores_df = pd.DataFrame({
"name": ['Ada', 'Bob', 'Chris', 'Diya', 'Emma', 'Fred', 'George'],
"section": ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
"year": [4, 3, 1, 3, 2, 1, 4],
"midterm": [92, 90, 99, 95, 83, 96, 87],
"final": [88, 86, 80, 95, 85, 94, 92],
})
print(scores_df)
# name section year midterm final
# 0 Ada A 4 92 88
# 1 Bob A 3 90 86
# 2 Chris A 1 99 80
# 3 Diya A 3 95 95
# 4 Emma B 2 83 85
# 5 Fred B 1 96 94
# 6 George B 4 87 92
The groupby() method is called on a DataFrame and “separates” the rows into groups. At its most basic, the groupby() method takes as an argument the column name to “group” the rows by—all rows that have the same value in that column will be placed in the same group. The groupby() method returns a new GroupBy value, which is a special data structure that tracks which rows have been put into different groups. You can use the .groups attribute to get a dictionary of groups and which row indices are in each, or use the get_group() method (passing in a group label) to get a DataFrame of just that group:
# Separate into groups by "section" column
by_section_groups = scores_df.groupby('section')
# Get a dictionary of which rows are in which group
# The key is the group label; the value is a list of rows indices in the group
group_dict = by_section_groups.groups
print(group_dict) # {'A': [0, 1, 2, 3], 'B': [4, 5, 6]}
# Get the DataFrame subset representing a single group
section_a_df = by_section_groups.get_group('A')
print(section_a_df)
# name section year midterm final
# 0 Ada A 4 92 88
# 1 Bob A 3 90 86
# 2 Chris A 1 99 80
# 3 Diya A 3 95 95
Although we talk about the groupby() method as “separating” or “breaking” a DataFrame into groups, it doesn’t actually modify the DataFrame at all! Similarly, the GroupBy value it produces is a single value; it just has metadata indicating which rows belong to which groups.
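Since the GroupBy value tracks its groups, you can also iterate over it with a for loop, getting a tuple of the group label and a DataFrame of that group’s rows; a minimal sketch:
# loop through the groups; each iteration gives a (label, DataFrame) tuple
for group_label, group_df in by_section_groups:
    print(group_label, len(group_df)) # print label and number of rows
# A 4
# B 3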
It is possible to specify more complex grouping criteria as argument(s) to the groupby() method. For example, you can specify a list of columns as an argument, in which case the rows will be grouped first by one column, and then by the next:
# Group by "section" and "year"
section_and_year_groups = scores_df.groupby(['section', 'year'])
print(section_and_year_groups.groups)
# {
# ('A', 1): [2],
# ('A', 3): [1, 3],
# ('A', 4): [0],
# ('B', 1): [5],
# ('B', 2): [4],
# ('B', 4): [6]
# }
In the above example, there is a group for each “section and year” combination. (In this small example dataset, many groups have just a single row).
9.4.1 Aggregation
Grouping rows of a DataFrame isn’t particularly useful on its own; you can achieve the same ends just by using regular access techniques (bracket notation, etc). The real purpose of grouping is to be able to apply aggregation methods per group: you can quickly calculate statistics for each group at once in order to compare them!
The pandas library calls this the split-apply-combine process, which has 3 steps:
- Split the data into groups
- Apply some operation to each individual group
- Combine the results back into a single data structure
You split data into groups by using the groupby() method. You apply aggregation functions (such as mean(), max(), etc) to each group. And then because you’re working on a GroupBy value, those results are automatically combined back into a DataFrame. In a way, an aggregation combines steps 2 and 3.
You can apply an aggregation operation to each of the groups by calling that method on the GroupBy value. The method will automatically be applied to each group (as if they were separate DataFrames). The method will then return a new DataFrame, whose rows are the results per group.
# A DataFrame with only the section and numeric columns
scores_only_df = scores_df[['section', 'midterm', 'final']]
# Split into groups
by_section_groups = scores_only_df.groupby('section')
# Apply the `mean()` function and Combine into new DataFrame
mean_by_section_df = by_section_groups.mean()
print(mean_by_section_df)
# midterm final
# section
# A 94.000000 87.250000
# B 88.666667 90.333333
# Apply the `max()` function and Combine into new DataFrame
max_by_section_df = by_section_groups.max()
print(max_by_section_df)
# midterm final
# section
# A 99 95
# B 96 94
The above code calculates the statistics for each column (exactly as happens when the method is called on a DataFrame), but it calculates them for each group separately. The results for each group are a different row in the resulting DataFrame.
There are a number of built-in aggregation methods such as max() and mean(); some common ones are demonstrated in the previous sections.
A useful “generic” method for performing such aggregations is the agg() method. This method supports a variety of arguments allowing you to customize what aggregations are performed on the groups. For example, you can pass in a string with the name of the aggregation method to call, or a list of strings of methods to call:
# Apply multiple aggregation functions at once
range_stats_df = by_section_groups.agg(['min', 'mean', 'max'])
print(range_stats_df)
# midterm final
# min mean max min mean max
# section
# A 90 94.000000 99 80 87.250000 95
# B 83 88.666667 96 85 90.333333 94
This is a normal DataFrame; it’s just that the columns are tuples (e.g., ('midterm', 'min')) instead of strings. The pandas library will print it out in a readable way.
You can further customize what aggregations you perform by instead passing the agg() method multiple named arguments. The name of each argument will be the desired column label in the resulting DataFrame, and the value of the argument should be a tuple of the column to aggregate and which aggregate function to apply. These are called named aggregations:
# Apply specific aggregations to specific columns, producing named results
# (each argument is on its own line just for readability)
custom_stats_df = by_section_groups.agg(
avg_mid=('midterm', 'mean'),
avg_final=('final', 'mean'),
max_final=('final', 'max')
)
print(custom_stats_df)
# avg_mid avg_final max_final
# section
# A 94.000000 87.250000 95
# B 88.666667 90.333333 94
This syntax is particularly useful for creating “clean” DataFrames that can be used for further processing.
There are many other options for using the agg() method as well. See the pandas user guide for more details and examples.
When aggregating after grouping by multiple columns, you can end up with a DataFrame whose indices are tuples:
final_by_year_section_df = scores_df.groupby(['section', 'year']).agg(
max_final=('final', 'max')
)
print(final_by_year_section_df.index)
# MultiIndex([('A', 1),
# ('A', 3),
# ('A', 4),
# ('B', 1),
# ('B', 2),
# ('B', 4)],
# names=['section', 'year'])
A MultiIndex is an advanced concept from pandas; while there are lots of ways of working with such indices, at a basic level you can think of them as just being a situation where each index is a tuple.
This can work fine if you’re considering data grouped by multiple columns. However, sometimes you will want to “go back” and only have data indexed by a single value. You can achieve this by using the reset_index() method. This method can be used to pull a column out of the index and make it into a regular column of the data. It takes as an argument which column you want to remove from the index:
print(final_by_year_section_df)
# max_final
# section year
# A 1 80
# 3 95
# 4 88
# B 1 94
# 2 85
# 4 92
final_by_section_df = final_by_year_section_df.reset_index('year')
print(final_by_section_df)
# year max_final
# section
# A 1 80
# A 3 95
# A 4 88
# B 1 94
# B 2 85
# B 4 92
print(final_by_section_df.index)
# Index(['A', 'A', 'A', 'B', 'B', 'B'], dtype='object', name='section')
The difference between these DataFrames is subtle. In the second one, year has been moved to a column (while section is the index label—note that each element has a section listed on its row). This would allow you to do further analysis on the data (which is now organized only by section, with year as a feature rather than an index).
Overall, grouping can allow you to quickly compare different subsets of your data. In doing so, you’re redefining your unit of analysis. Grouping lets you frame your analysis question in terms of comparing groups of observations, rather than individual observations. This form of abstraction makes it easier to ask and answer complex questions about your data.
Resources
This chapter provides only an introduction to the pandas library; there is a lot more to it, with many more functions, structures, approaches, etc. To learn more about using pandas, check out the official user guide, tutorials, and “recipes”.
- 10 minutes to pandas (pandas docs): a basic set of examples
- Tutorials (pandas docs): a list and guide to various tutorials (of mixed quality)
- User guide (pandas docs)
- Cookbook (pandas docs): a set of short examples and patterns for writing idiomatic pandas code
- Python Data Science Handbook: a full textbook with extensive details about using Pandas (and related libraries)
- Pandas. Data Processing (Data Analysis in Python)