1. Slicing & Dicing Data with pandas#
Learning Objectives
Explain what pandas indexes are and how they’re used
Get, modify, and set indexes on series and data frames
Explain and use pandas’ index alignment
Describe and use the 4 major modes of indexing in pandas
Explain the difference between
.loc
and.at
Identify problems where using
.set_index
to make a column an index is helpfulExplain and use pandas’ query mini-language
This chapter is a deep dive into how indexing—the process of getting or setting elements of a data structure—works in the pandas package. Indexing is sometimes also called subsetting or (element) extraction, and is a fundamental operation when using a programming language to solve problems in data science.
1.1. Prerequisites#
This chapter assumes you already have basic familiarity with Python and pandas.
In particular, you should be comfortable indexing Python lists and dictionaries
with the square bracket operator [ ]
, and should be able to explain what a
Series
and DataFrame
are and how the they differ. DataLab’s Python Basics
Reader and its accompanying workshop provide a suitable
introduction to these topics.
To follow along, you’ll need the following software versions (or newer) installed on your computer:
One way to install these is to install the Anaconda Python distribution.
Chapter 2 provides more details about Anaconda and the conda
package manager.
CLICK HERE to get the data set used in the examples for this chapter. After clicking the link, you’ll need to right-click on the page and select “Save Page As…”. Make sure to save the page somewhere you can easily access from your Python session.
1.2. What’s an Index?#
In pandas, an Index
is a Python object that stores a sequence of labels or
names. The labels usually correspond to the elements of a data structure along
a given dimension or axis. Indexes are a key feature that differentiates
pandas from other software for programming with tabular data (such as R).
According to the pandas documentation, indexes serve three important roles:
As metadata to provide additional context about a data set
As a way to explicitly and automatically align data, avoiding bugs due to misalignment (see Section 1.2.3)
As a convenience for getting and setting subsets of the data (see Section 1.4)
1.2.1. Indexes on Series#
As an example of where pandas uses indexes, Series
are 1-dimensional and
always have exactly one index. You can see the index just to the left of the
elements when you print out a series:
import pandas as pd
x = pd.Series([4, 5, 3, 1])
x
0 4
1 5
2 3
3 1
dtype: int64
You can get or set the index on a series with the .index
attribute:
x.index
RangeIndex(start=0, stop=4, step=1)
The labels in an index can be numbers, strings, dates, or other values. Pandas
provides subclasses of Index
for specific purposes, including:
RangeIndex
, where the labels are a monotonic sequence of integers with a fixed beginning, end, and step sizeCategoricalIndex
, where the labels are categories (and not necessarily unique)IntervalIndex
, where the labels are non-overlapping intervals of the real number lineDateTimeIndex
, where the labels are dates and timesPeriodIndex
, where each label is a date and time span
You may also encounter Int64Index
, UInt64Index
, and Float64Index
, which
represent an arbitrary list of labels of the respective type. For example:
x[[1, 3]].index
Int64Index([1, 3], dtype='int64')
Beginning in pandas 2.0, these three subclasses (but not subclasses in the
bulleted list above) will be replaced with a generic Index
class. See this
warning in the pandas documentation for details.
You can access the labels in an index with standard Python indexing operations:
idx = x.index
# Get the 2nd label
idx[1]
1
Indexes are immutable, which means the labels in an index can’t be changed. For instance:
idx[0] = 5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 idx[0] = 5
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexes/base.py:5302, in Index.__setitem__(self, key, value)
5300 @final
5301 def __setitem__(self, key, value):
-> 5302 raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
Although labels in an index can’t be changed, every index also has a name,
which can be changed. Think of the name as a name for the dimension or axis to
which the index corresponds. You can get or set the name of an index with the
.name
attribute (on the index itself):
idx.name
The default value of .name
is None
, which means the index doesn’t have a
name. When pandas prints a series or data frame, it will also print the names
of any attached indexes. For example:
x.index.name = "labels"
x
labels
0 4
1 5
2 3
3 1
dtype: int64
Pandas automatically creates indexes whenever you create (or load) a series or
data frame. Pandas also provides functions to construct indexes manually. These
are usually named after the type of index. For instance, you can create an
ordinary Index
from a list or NumPy array:
new_idx = pd.Index(["a", "b", "c", "d"])
new_idx
Index(['a', 'b', 'c', 'd'], dtype='object')
You can use assignment to replace the index on a series. For example:
x.index = new_idx
x
a 4
b 5
c 3
d 1
dtype: int64
Note that the new index must have the same length as the index it replaces (and the series itself).
Checkpoint Exercise
Replace the index on x
with one like the original (in variable idx
), but
where the first label is 5 instead of 0. Try to do this without typing out a
list of all 4 labels.
Hint: the Python list
function can convert many types of objects into lists,
including pandas indexes.
1.2.2. Indexes on Data Frames#
Pandas DataFrame
s are 2-dimensional and always have exactly two indexes. The
indexes correspond to the rows and the columns, respectively. To see this, use
pandas to load the data from the sports.csv
file:
dtype = {"classification_other": str}
sports = pd.read_csv("data/sports.csv", dtype = dtype)
sports.head()
year | unitid | institution_name | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ... | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | sports | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 31 | 0 | 345592.0 | NaN | 345592.0 | 397818.0 | NaN | 397818.0 | Baseball |
1 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 19 | 16 | 1211095.0 | 748833.0 | 1959928.0 | 817868.0 | 742460.0 | 1560328.0 | Basketball |
2 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 61 | 46 | 183333.0 | 315574.0 | 498907.0 | 246949.0 | 251184.0 | 498133.0 | All Track Combined |
3 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 99 | 0 | 2808949.0 | NaN | 2808949.0 | 3059353.0 | NaN | 3059353.0 | Football |
4 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 9 | 0 | 78270.0 | NaN | 78270.0 | 83913.0 | NaN | 83913.0 | Golf |
5 rows × 28 columns
This data set contains information about funding and staffing of athletics programs at U.S. universities from 2015 to 2019 (inclusive). The source for the data set is the U.S. Department of Education’s Equity in Athletics Data Analysis project, and the data was cleaned by the Tidy Tuesday community. This article describes some of the conclusions that can be drawn from the data set.
On data frames, you can use the .index
attribute to get and set the row
index. For example:
sports.index
RangeIndex(start=0, stop=132327, step=1)
The default index created by the read_csv
function counts from 0 to the
number of rows.
The other index on data frames corresponds to the columns. You can use the
.columns
attribute to get and set the column index. For instance:
sports.columns
Index(['year', 'unitid', 'institution_name', 'city_txt', 'state_cd',
'zip_text', 'classification_code', 'classification_name',
'classification_other', 'ef_male_count', 'ef_female_count',
'ef_total_count', 'sector_cd', 'sector_name', 'sportscode',
'partic_men', 'partic_women', 'partic_coed_men', 'partic_coed_women',
'sum_partic_men', 'sum_partic_women', 'rev_men', 'rev_women',
'total_rev_menwomen', 'exp_men', 'exp_women', 'total_exp_menwomen',
'sports'],
dtype='object')
The indexes stored in the .index
and .columns
attributes of a data frame
have the same classes and basic properties as indexes on series.
1.2.3. Alignment#
For arithmetic operations that involve more than one series or data frame, pandas automatically aligns the elements based on the indexes. As an example, consider these two series:
u = pd.Series([1, 2, 3], index = ["a", "b", "c"])
v = pd.Series([1, 2, 3], index = ["c", "b", "a"])
The elements of the series are identical, but the indexes differ. If you’re an experienced NumPy (or R) user, the result of adding these two series together might surprise you:
u + v
a 4
b 4
c 4
dtype: int64
This is the result because pandas aligns the elements, adding element a
from
u
to element a
from v
, element b
to element b
, and so on.
Sometimes you might want to do arithmetic on series or data frames without any
alignment. The canonical solution in this case is to use the .to_numpy
method
to convert one or both of the objects into a NumPy array:
u + v.to_numpy()
a 2
b 4
c 6
dtype: int64
Pandas tries to align elements and carry out arithmetic operations even for objects with mismatched shapes and labels. Pandas fills elements for which the result is unknown with missing values. For example:
w = pd.Series([-10, 0], index = ["z", "a"])
u - w
a 1.0
b NaN
c NaN
z NaN
dtype: float64
In this example, only element a
has two operands; for each other element, the
result is a missing value because only one operand is known.
Checkpoint Exercise
What happens if you use an arithmetic operation on two series or data frames where one has repeated/duplicated labels in its index?
Hint: construct an example and see what happens!
Pandas does not align series and data frames in comparison operations, and raises an error if they’re not already aligned:
u > v
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[18], line 1
----> 1 u > v
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
68 return NotImplemented
70 other = item_from_zerodim(other)
---> 72 return method(self, other)
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/arraylike.py:58, in OpsMixin.__gt__(self, other)
56 @unpack_zerodim_and_defer("__gt__")
57 def __gt__(self, other):
---> 58 return self._cmp_method(other, operator.gt)
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/series.py:6237, in Series._cmp_method(self, other, op)
6234 res_name = ops.get_op_result_name(self, other)
6236 if isinstance(other, Series) and not self._indexed_same(other):
-> 6237 raise ValueError("Can only compare identically-labeled Series objects")
6239 lvalues = self._values
6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
ValueError: Can only compare identically-labeled Series objects
See this GitHub issue for a developer discussion of whether this is a helpful feature for preventing bugs or a nuisance.
Finally, the align
method manually aligns two series or data frames. The
method returns two objects of the same type, but with elements sorted so that
the indexes match. For example:
u.align(v)
(a 1
b 2
c 3
dtype: int64,
a 3
b 2
c 1
dtype: int64)
For objects with mismatched shapes, the align
method inserts missing values
to make their shapes and indexes match:
u.align(w)
(a 1.0
b 2.0
c 3.0
z NaN
dtype: float64,
a 0.0
b NaN
c NaN
z -10.0
dtype: float64)
The align
method provides a way to align to objects so that they can be
compared:
ua, va = u.align(v)
ua > va
a False
b False
c True
dtype: bool
Moreover, concatenating the two results from the align
method is the same as
performing a join. The align
method even provides a parameter join
to
select the type of join ("outer"
, "left"
, "right"
, or "inner"
). For
relational series and data frames, you can break a join into smaller steps by
using a combination of the align
function , column indexing, and the pandas
concat
function.
1.3. Indexing Operators#
Pandas provides several different operators for indexing series and data
frames, as well as several different ways to use each. The primary indexing
operator is the square bracket [ ]
.
For a Series
, the square bracket operator usually selects elements by
label. For instance, integer arguments are treated as labels, not positions, if
the labels in the index are integers:
x = pd.Series([1, 2, 3], index = [1, 0, 3])
x[0]
2
You can use a list to select multiple elements at once or repeat elements:
x[[0, 3, 0]]
0 2
3 3
0 2
dtype: int64
You can also use the square bracket operator to select elements by label when the labels are strings:
y = pd.Series([10, 20, 30], index = ["a", "b", "c"])
y["a"]
10
In this case, passing an integer argument selects an element by position:
y[0]
10
Using the square bracket operator this way in non-interactive code is risky, because it can cause bugs if some of the series or data frames your code operates on unexpectedly have integer labels.
The dot .
serves as a secondary indexing operator for series and data frames
with string indexes. For instance, to get the element b
:
y.b
20
Since the dot .
is also used to access attributes and methods, it cannot be
used to access elements with a label that’s the same as the name of an
attribute or method. When in doubt, it’s safer to use [ ]
to access elements
by label.
1.3.1. Slices#
A slice selects a range of elements. The syntax is start:stop:step
, with
the second colon :
and arguments optional, the same as for Python’s built-in
slicing. The default start value is the beginning of the series, the default
stop value is one past the end, and the default step value is 1.
Slices can be by position or by label. With [ ]
, when the start or stop value
is an integer, pandas assumes you want to slice by position:
x[1:3]
0 2
3 3
dtype: int64
In a slice by position, the element at the stop position is not included.
Slicing with only the step can be useful in a variety of situations, such as getting every second element:
x[::2]
1 1
3 3
dtype: int64
Negative values in a slice are counted backward from the end of the object. For example, this code gets the last two elements of the series:
x[-2:]
0 2
3 3
dtype: int64
Checkpoint Exercise
What happens if you use a negative step value in a slice?
When the start or stop value is a string, pandas slices by label:
y["b":]
b 20
c 30
dtype: int64
Be aware that when you slice by label, the element at the stop position is always included:
y["b":"c"]
b 20
c 30
dtype: int64
This is a case where the pandas developers decided convenience outweighs consistency. See this section of the pandas documentation for more details.
Checkpoint Exercise
What happens if you include an integer step value in a slice by label?
1.3.2. By Position with .iloc
#
When you want to index a series or data frame by position, use [ ]
on the
.iloc
attribute. Then any integer arguments are assumed to be positions,
regardless of the type of index on the object.
For example:
x.iloc[0]
1
x.iloc[[1, 0, 2]]
0 2
1 1
3 3
dtype: int64
x.iloc[0:]
1 1
0 2
3 3
dtype: int64
You can use this attribute regardless of the index type:
y.iloc[0:]
a 10
b 20
c 30
dtype: int64
As a result, the .iloc
attribute is a more reliable way to index by position
than using [ ]
alone.
Pandas also provides an .iat
attribute to get or set scalar values in a series
or data frame by position:
x.iat[0]
1
Compared to .iloc
, the .iat
attribute is typically much faster (in terms of
CPU time), so it’s important to use .iat
in performance-critical sections of
code. In addition, .iat
raises an error if your code attempts to get or set
more than one value, which can potentially alert you to bugs that would go
undetected with .iloc
.
1.3.3. By Label with .loc
#
When you want to index a series or data frame by label, use [ ]
on the
.loc
attribute. Then any integer arguments are assumed to be labels.
For example:
x.loc[0]
2
x.loc[0]
2
x.loc[[0, 1]]
0 2
1 1
dtype: int64
y.loc["b"]
20
y.loc[["b", "a", "b"]]
b 20
a 10
b 20
dtype: int64
Pandas also provides an .at
attribute to get or set scalar values in a series
or data frame by label:
x.at[0]
2
The advantages of .at
compared to .loc
are the same as .iat
compared to
.iloc
: better run-time speed and stricter requirements on the argument.
When setting elements by label with [ ]
, .loc
, or .at
, if you use a
label that’s not already in the series index, pandas will append the label and
value. For instance:
x.loc[10] = -1
x
1 1
0 2
3 3
10 -1
dtype: int64
This doesn’t work when setting elements by position.
1.3.4. By a Condition#
The square bracket operator [ ]
, the .iloc
attribute, and the .loc
attribute all support indexing a series by a condition. The condition must be
represented by a Boolean array with the same length as the series.
You can use comparison operators to get a suitable Boolean array. For example:
is_positive = x > 0
is_positive
1 True
0 True
3 True
10 False
dtype: bool
x[is_positive]
1 1
0 2
3 3
dtype: int64
x.loc[is_positive]
1 1
0 2
3 3
dtype: int64
If the condition has an index, the square bracket operator and .loc
operator
will try to align the condition with the series being indexed:
z = x.iloc[[1, 0, 3]]
z
0 2
1 1
10 -1
dtype: int64
z[is_positive]
0 2
1 1
dtype: int64
The .iloc
operator raises an error if the Boolean array has an index:
x.iloc[is_positive]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[49], line 1
----> 1 x.iloc[is_positive]
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
1070 axis = self.axis or 0
1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1611, in _iLocIndexer._getitem_axis(self, key, axis)
1608 key = np.asarray(key)
1610 if com.is_bool_indexer(key):
-> 1611 self._validate_key(key, axis)
1612 return self._getbool_axis(key, axis=axis)
1614 # a list of integers
File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1452, in _iLocIndexer._validate_key(self, key, axis)
1450 if hasattr(key, "index") and isinstance(key.index, Index):
1451 if key.index.inferred_type == "integer":
-> 1452 raise NotImplementedError(
1453 "iLocation based boolean "
1454 "indexing on an integer type "
1455 "is not available"
1456 )
1457 raise ValueError(
1458 "iLocation based boolean indexing cannot use "
1459 "an indexable as a mask"
1460 )
1461 return
NotImplementedError: iLocation based boolean indexing on an integer type is not available
As described in Section 1.2.3, the canonical way to avoid index alignment
is to use the .to_numpy
method to convert to a NumPy array:
x.iloc[is_positive.to_numpy()]
1 1
0 2
3 3
dtype: int64
The logic operators for inverting and combining conditions in pandas differ from Python’s usual logic operators:
|
is logical or&
is logical and~
is logical not
For example:
x[is_positive & (x % 2 == 1)]
1 1
3 3
dtype: int64
In compound conditions, each sub-condition must be enclosed in parentheses
()
, because in Python’s order of operations, the |
, &
, and ~
operators are evaluated before comparison operators such as ==
, >
, and <
.
1.3.5. By a Callable#
The square bracket operator [ ]
, the .iloc
attribute, and the .loc
attribute also all support indexing a series by a function (or callable
object defined with def
or lambda
. The function must accept one argument:
the series itself. The function must return a valid argument for the indexing
operator (positions, labels, or a Boolean array).
The primary use case for this form of indexing is chaining together multiple operations without defining intermediate variables. For example, this code adds each element at an even-numbered position to the next element, and then gets only the results greater than 2:
(x.iloc[::2] + x.iloc[1::2].to_numpy())[lambda a: a > 2]
1 3
dtype: int64
As you can probably see, chaining operations makes code relatively difficult to understand. Whenever possible, it’s better to use simple indexing strategies rather than indexing by a callable.
Checkpoint Exercise
Use indexing by a callable to set the elements of x
with even values (not
positions) to twice their current value.
What’s a simpler way to do this using other indexing modes?
1.3.6. Data Frames#
The indexing operators [ ]
, .iloc
, and .loc
can all be used with data
frames in addition to series. Since data frames are 2-dimensional, you can pass
the indexing operators a separate argument for each dimension. The rows come
first.
For instance, to get (1st row, 2nd and 3rd columns) of the athletics data frame the code is:
sports.iloc[0, [1, 2]]
unitid 100654
institution_name Alabama A & M University
Name: 0, dtype: object
When the result has a lower dimension than the object being indexed, pandas automatically converts to an appropriate data type. In this case, the result is converted to a series with the names of the two columns as its index.
As another example, the code to get (rows labeled 10 and 11, column labeled
"year"
) is:
sports.loc[[10, 11], "year"]
10 2015
11 2015
Name: year, dtype: int64
With .iloc
and .loc
, you can select all of the elements along a given
dimension by passing a colon :
as the argument for that dimension. For
instance, this code gets all rows in the columns labeled "year"
and
"institution_name"
:
sports.loc[:, ["year", "institution_name"]]
year | institution_name | |
---|---|---|
0 | 2015 | Alabama A & M University |
1 | 2015 | Alabama A & M University |
2 | 2015 | Alabama A & M University |
3 | 2015 | Alabama A & M University |
4 | 2015 | Alabama A & M University |
... | ... | ... |
132322 | 2019 | Simon Fraser University |
132323 | 2019 | Simon Fraser University |
132324 | 2019 | Simon Fraser University |
132325 | 2019 | Simon Fraser University |
132326 | 2019 | Simon Fraser University |
132327 rows × 2 columns
You can mix indexing by condition with indexing by position or label. The most common way to do this is to use a condition to select rows and labels to select columns. For example:
sports.loc[sports.year == 2018, ["year", "institution_name", "sports"]]
year | institution_name | sports | |
---|---|---|---|
52387 | 2018 | Alabama A & M University | Baseball |
52388 | 2018 | Alabama A & M University | Basketball |
52389 | 2018 | Alabama A & M University | Football |
52390 | 2018 | Alabama A & M University | Golf |
52391 | 2018 | Alabama A & M University | Softball |
... | ... | ... | ... |
70154 | 2018 | Simon Fraser University | Soccer |
70155 | 2018 | Simon Fraser University | Softball |
70156 | 2018 | Simon Fraser University | Swimming |
70157 | 2018 | Simon Fraser University | Volleyball |
70158 | 2018 | Simon Fraser University | Wrestling |
17772 rows × 3 columns
Checkpoint Exercise
Use indexing to get only the rows where year
is 2016
and institution_name
is "University of California-Davis"
. Limit the columns to year
,
institution_name
, sports
, and total_rev_menwomen
(the total revenue from
the program in USD).
Indexing a data frame with only one argument generally accesses rows. For example:
sports.iloc[0]
year 2015
unitid 100654
institution_name Alabama A & M University
city_txt Normal
state_cd AL
zip_text 35762.0
classification_code 2
classification_name NCAA Division I-FCS
classification_other NaN
ef_male_count 1923
ef_female_count 2300
ef_total_count 4223
sector_cd 1
sector_name Public, 4-year or above
sportscode 1
partic_men 31.0
partic_women NaN
partic_coed_men NaN
partic_coed_women NaN
sum_partic_men 31
sum_partic_women 0
rev_men 345592.0
rev_women NaN
total_rev_menwomen 345592.0
exp_men 397818.0
exp_women NaN
total_exp_menwomen 397818.0
sports Baseball
Name: 0, dtype: object
As an exception, using the square bracket operator [ ]
to index a data frame
with a string argument (or list of strings) accesses columns:
sports[["year", "institution_name"]]
year | institution_name | |
---|---|---|
0 | 2015 | Alabama A & M University |
1 | 2015 | Alabama A & M University |
2 | 2015 | Alabama A & M University |
3 | 2015 | Alabama A & M University |
4 | 2015 | Alabama A & M University |
... | ... | ... |
132322 | 2019 | Simon Fraser University |
132323 | 2019 | Simon Fraser University |
132324 | 2019 | Simon Fraser University |
132325 | 2019 | Simon Fraser University |
132326 | 2019 | Simon Fraser University |
132327 rows × 2 columns
1.4. Columns as Indexes#
An important feature of pandas is that you can use arbitrary columns in a data
frame as indexes. Generally the columns should contain identifiers for the rows
or other categorical data. For example, in the sports
data frame, the column
institution_name
is an identifier.
You can use the .set_index
method to set a column as an index:
sports = sports.set_index("institution_name")
sports.head()
year | unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ... | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | sports | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
institution_name | |||||||||||||||||||||
Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | 2300 | ... | NaN | 31 | 0 | 345592.0 | NaN | 345592.0 | 397818.0 | NaN | 397818.0 | Baseball |
Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | 2300 | ... | NaN | 19 | 16 | 1211095.0 | 748833.0 | 1959928.0 | 817868.0 | 742460.0 | 1560328.0 | Basketball |
Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | 2300 | ... | NaN | 61 | 46 | 183333.0 | 315574.0 | 498907.0 | 246949.0 | 251184.0 | 498133.0 | All Track Combined |
Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | 2300 | ... | NaN | 99 | 0 | 2808949.0 | NaN | 2808949.0 | 3059353.0 | NaN | 3059353.0 | Football |
Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | 2300 | ... | NaN | 9 | 0 | 78270.0 | NaN | 78270.0 | 83913.0 | NaN | 83913.0 | Golf |
5 rows × 27 columns
After setting a categorical column as an index, you can use indexing by label as a convenient way to get rows in specific categories:
sports.loc["University of California-Davis"]
year | unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ... | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | sports | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
institution_name | |||||||||||||||||||||
University of California-Davis | 2015 | 110644 | Davis | CA | 956168678.0 | 2 | NCAA Division I-FCS | NaN | 11215 | 16223 | ... | NaN | 36 | 0 | 884627.0 | NaN | 884627.0 | 884627.0 | NaN | 884627.0 | Baseball |
University of California-Davis | 2015 | 110644 | Davis | CA | 956168678.0 | 2 | NCAA Division I-FCS | NaN | 11215 | 16223 | ... | NaN | 15 | 19 | 1862783.0 | 1340541.0 | 3203324.0 | 1862783.0 | 1340541.0 | 3203324.0 | Basketball |
University of California-Davis | 2015 | 110644 | Davis | CA | 956168678.0 | 2 | NCAA Division I-FCS | NaN | 11215 | 16223 | ... | NaN | 62 | 94 | 571582.0 | 590416.0 | 1161998.0 | 571582.0 | 590416.0 | 1161998.0 | All Track Combined |
University of California-Davis | 2015 | 110644 | Davis | CA | 956168678.0 | 2 | NCAA Division I-FCS | NaN | 11215 | 16223 | ... | NaN | 0 | 25 | NaN | 659996.0 | 659996.0 | NaN | 659996.0 | 659996.0 | Field Hockey |
University of California-Davis | 2015 | 110644 | Davis | CA | 956168678.0 | 2 | NCAA Division I-FCS | NaN | 11215 | 16223 | ... | NaN | 105 | 0 | 4478087.0 | NaN | 4478087.0 | 4478087.0 | NaN | 4478087.0 | Football |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
University of California-Davis | 2019 | 110644 | Davis | CA | 95616.0 | 2 | NCAA Division I-FCS | NaN | 11780 | 18371 | ... | NaN | 0 | 27 | NaN | 945919.0 | 945919.0 | NaN | 945919.0 | 945919.0 | Equestrian |
University of California-Davis | 2019 | 110644 | Davis | CA | 95616.0 | 2 | NCAA Division I-FCS | NaN | 11780 | 18371 | ... | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | Rodeo |
University of California-Davis | 2019 | 110644 | Davis | CA | 95616.0 | 2 | NCAA Division I-FCS | NaN | 11780 | 18371 | ... | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | Sailing |
University of California-Davis | 2019 | 110644 | Davis | CA | 95616.0 | 2 | NCAA Division I-FCS | NaN | 11780 | 18371 | ... | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | Table Tennis |
University of California-Davis | 2019 | 110644 | Davis | CA | 95616.0 | 2 | NCAA Division I-FCS | NaN | 11780 | 18371 | ... | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | Weight Lifting |
96 rows × 27 columns
You can use the .reset_index
method to restore the institution_name
column
to the data frame and reset the index to a range:
sports = sports.reset_index()
sports.head()
institution_name | year | unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ... | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | sports | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 31 | 0 | 345592.0 | NaN | 345592.0 | 397818.0 | NaN | 397818.0 | Baseball |
1 | Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 19 | 16 | 1211095.0 | 748833.0 | 1959928.0 | 817868.0 | 742460.0 | 1560328.0 | Basketball |
2 | Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 61 | 46 | 183333.0 | 315574.0 | 498907.0 | 246949.0 | 251184.0 | 498133.0 | All Track Combined |
3 | Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 99 | 0 | 2808949.0 | NaN | 2808949.0 | 3059353.0 | NaN | 3059353.0 | Football |
4 | Alabama A & M University | 2015 | 100654 | Normal | AL | 35762.0 | 2 | NCAA Division I-FCS | NaN | 1923 | ... | NaN | 9 | 0 | 78270.0 | NaN | 78270.0 | 83913.0 | NaN | 83913.0 | Golf |
5 rows × 28 columns
Setting indexes appropriately makes it much more convenient to filter the rows of a data frame.
1.4.1. MultiIndexes#
One of pandas’ most powerful features is the ability to set multiple columns as
the index at the same time. This creates a MultiIndex
, a hierarchical index
with multiple levels. For example, this code creates a MultiIndex from year
,
institution_name
and sports
:
sports = sports.set_index(["year", "institution_name", "sports"])
The MultiIndex makes it easy to get all of the information for a specific year, institution, and sport. When working with a MultiIndex, the corresponding argument to the indexing operator should be a tuple with one element for each level of the MultiIndex:
sports.loc[(2015, "Stanford University", "Football")]
/tmp/ipykernel_12885/1887213640.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
sports.loc[(2015, "Stanford University", "Football")]
unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ef_total_count | ... | partic_coed_men | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | institution_name | sports | |||||||||||||||||||||
2015 | Stanford University | Football | 243744 | Stanford | CA | 94305.0 | 1 | NCAA Division I-FBS | NaN | 3663 | 3331 | 6994 | ... | NaN | NaN | 100 | 0 | 43744639.0 | NaN | 43744639.0 | 23724407.0 | NaN | 23724407.0 |
1 rows × 25 columns
You can use the .xs
(read: “cross section”) method to select rows from only
one particular level of a MultiIndex. For instance, to get all rows for Harvard
university:
sports.xs("Harvard University", level = 1)
unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ef_total_count | ... | partic_coed_men | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | sports | |||||||||||||||||||||
2015 | Baseball | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3637 | 3256 | 6893 | ... | NaN | NaN | 35 | 0 | 543855.0 | NaN | 543855.0 | 543855.0 | NaN | 543855.0 |
Basketball | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3637 | 3256 | 6893 | ... | NaN | NaN | 16 | 19 | 1296186.0 | 743673.0 | 2039859.0 | 1296186.0 | 743673.0 | 2039859.0 | |
All Track Combined | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3637 | 3256 | 6893 | ... | NaN | NaN | 138 | 95 | 721675.0 | 721128.0 | 1442803.0 | 721675.0 | 721128.0 | 1442803.0 | |
Fencing | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3637 | 3256 | 6893 | ... | NaN | NaN | 13 | 12 | 151656.0 | 164450.0 | 316106.0 | 151656.0 | 164450.0 | 316106.0 | |
Field Hockey | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3637 | 3256 | 6893 | ... | NaN | NaN | 0 | 24 | NaN | 363711.0 | 363711.0 | NaN | 363711.0 | 363711.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019 | Equestrian | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3531 | 3454 | 6985 | ... | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
Rodeo | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3531 | 3454 | 6985 | ... | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | |
Sailing | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3531 | 3454 | 6985 | ... | NaN | NaN | 14 | 23 | 108409.0 | 178101.0 | 286510.0 | 108409.0 | 178101.0 | 286510.0 | |
Table Tennis | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3531 | 3454 | 6985 | ... | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | |
Weight Lifting | 166027 | Cambridge | MA | 2138.0 | 2 | NCAA Division I-FCS | NaN | 3531 | 3454 | 6985 | ... | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
122 rows × 25 columns
The level argument is 1
because institution_name
is the 2nd level of the
MultiIndex, and indexing starts from 0. You can also use the name of the level
instead:
sports.xs("Princeton University", level = "institution_name")
unitid | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ef_total_count | ... | partic_coed_men | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | sports | |||||||||||||||||||||
2015 | Baseball | 186131 | Princeton | NJ | 85440070.0 | 2 | NCAA Division I-FCS | NaN | 2717 | 2543 | 5260 | ... | NaN | NaN | 31 | 0 | 454767.0 | NaN | 454767.0 | 454767.0 | NaN | 454767.0 |
Basketball | 186131 | Princeton | NJ | 85440070.0 | 2 | NCAA Division I-FCS | NaN | 2717 | 2543 | 5260 | ... | NaN | NaN | 16 | 17 | 1117914.0 | 1170976.0 | 2288890.0 | 1117914.0 | 1170976.0 | 2288890.0 | |
All Track Combined | 186131 | Princeton | NJ | 85440070.0 | 2 | NCAA Division I-FCS | NaN | 2717 | 2543 | 5260 | ... | NaN | NaN | 159 | 175 | 776687.0 | 665927.0 | 1442614.0 | 776687.0 | 665927.0 | 1442614.0 | |
Fencing | 186131 | Princeton | NJ | 85440070.0 | 2 | NCAA Division I-FCS | NaN | 2717 | 2543 | 5260 | ... | NaN | NaN | 16 | 13 | 238594.0 | 219334.0 | 457928.0 | 238594.0 | 219334.0 | 457928.0 | |
Field Hockey | 186131 | Princeton | NJ | 85440070.0 | 2 | NCAA Division I-FCS | NaN | 2717 | 2543 | 5260 | ... | NaN | NaN | 0 | 23 | NaN | 500755.0 | 500755.0 | NaN | 500755.0 | 500755.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019 | Swimming and Diving | 186131 | Princeton | NJ | 8544.0 | 2 | NCAA Division I-FCS | NaN | 2670 | 2638 | 5308 | ... | NaN | NaN | 38 | 37 | 685836.0 | 710247.0 | 1396083.0 | 685836.0 | 710247.0 | 1396083.0 |
Tennis | 186131 | Princeton | NJ | 8544.0 | 2 | NCAA Division I-FCS | NaN | 2670 | 2638 | 5308 | ... | NaN | NaN | 11 | 11 | 550389.0 | 472485.0 | 1022874.0 | 550389.0 | 472485.0 | 1022874.0 | |
Volleyball | 186131 | Princeton | NJ | 8544.0 | 2 | NCAA Division I-FCS | NaN | 2670 | 2638 | 5308 | ... | NaN | NaN | 17 | 16 | 324540.0 | 423386.0 | 747926.0 | 324540.0 | 423386.0 | 747926.0 | |
Water Polo | 186131 | Princeton | NJ | 8544.0 | 2 | NCAA Division I-FCS | NaN | 2670 | 2638 | 5308 | ... | NaN | NaN | 22 | 17 | 408972.0 | 335028.0 | 744000.0 | 408972.0 | 335028.0 | 744000.0 | |
Wrestling | 186131 | Princeton | NJ | 8544.0 | 2 | NCAA Division I-FCS | NaN | 2670 | 2638 | 5308 | ... | NaN | NaN | 31 | 0 | 840184.0 | NaN | 840184.0 | 840184.0 | NaN | 840184.0 |
91 rows × 25 columns
The pandas documentation on MultiIndexes describes a wide variety of ways MultiIndexes can be used to slice and dice data frames.
Checkpoint Exercise
Use a MultiIndex to get a series that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.
1.5. The .query
Method#
The .query
method is another powerful way to extract information from a data
frame. The idea of .query
is that you write a query string using a query
mini-language to select a subset of the rows. This has several advantages over
other ways of indexing:
For large data sets, the
.query
method is slightly faster than other ways of indexingYou don’t have to repeat the name of the data frame every time you refer to a column in a condition
You can reuse a single query across multiple data frames, provided they all have the queried columns
Conditions are easier to read:
You can use
not
,or
andand
to invert and combine conditions instead of~
,|
, and&
You can use
in
andnot in
in conditionsYou can write
a < b < c
rather than(a < b) & (b < c)
It’s not necessary to enclose sub-conditions in parentheses, since the query mini-language defines its own order of operations
Here’s an example of a query:
sports = sports.reset_index()
qs = "city_txt == 'Davis' and state_cd == 'CA' and sports == 'Basketball'"
ucd_bball = sports.query(qs)
ucd_bball[["institution_name", "sports", "year", "total_rev_menwomen"]]
institution_name | sports | year | total_rev_menwomen | |
---|---|---|---|---|
913 | University of California-Davis | Basketball | 2015 | 3203324.0 |
18263 | University of California-Davis | Basketball | 2016 | 3817436.0 |
35698 | University of California-Davis | Basketball | 2017 | 4021500.0 |
53319 | University of California-Davis | Basketball | 2018 | 4240907.0 |
73947 | University of California-Davis | Basketball | 2019 | 4141401.0 |
Checkpoint Exercise
Use a query string to get a data frame that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.
The pandas documentation provides further details about how to write query strings.