1. Slicing & Dicing Data with pandas#

Learning Objectives

  • Explain what pandas indexes are and how they’re used

  • Get, modify, and set indexes on series and data frames

  • Explain and use pandas’ index alignment

  • Describe and use the 4 major modes of indexing in pandas

  • Explain the difference between .loc and .at

  • Identify problems where using .set_index to make a column an index is helpful

  • Explain and use pandas’ query mini-language

This chapter is a deep dive into how indexing—the process of getting or setting elements of a data structure—works in the pandas package. Indexing is sometimes also called subsetting or (element) extraction, and is a fundamental operation when using a programming language to solve problems in data science.

1.1. Prerequisites#

This chapter assumes you already have basic familiarity with Python and pandas. In particular, you should be comfortable indexing Python lists and dictionaries with the square bracket operator [ ], and should be able to explain what a Series and DataFrame are and how the they differ. DataLab’s Python Basics Reader and its accompanying workshop provide a suitable introduction to these topics.

To follow along, you’ll need the following software versions (or newer) installed on your computer:

One way to install these is to install the Anaconda Python distribution. Chapter 2 provides more details about Anaconda and the conda package manager.

CLICK HERE to get the data set used in the examples for this chapter. After clicking the link, you’ll need to right-click on the page and select “Save Page As…”. Make sure to save the page somewhere you can easily access from your Python session.

1.2. What’s an Index?#

In pandas, an Index is a Python object that stores a sequence of labels or names. The labels usually correspond to the elements of a data structure along a given dimension or axis. Indexes are a key feature that differentiates pandas from other software for programming with tabular data (such as R).

According to the pandas documentation, indexes serve three important roles:

  1. As metadata to provide additional context about a data set

  2. As a way to explicitly and automatically align data, avoiding bugs due to misalignment (see Section 1.2.3)

  3. As a convenience for getting and setting subsets of the data (see Section 1.4)

1.2.1. Indexes on Series#

As an example of where pandas uses indexes, Series are 1-dimensional and always have exactly one index. You can see the index just to the left of the elements when you print out a series:

import pandas as pd

x = pd.Series([4, 5, 3, 1])
x
0    4
1    5
2    3
3    1
dtype: int64

You can get or set the index on a series with the .index attribute:

x.index
RangeIndex(start=0, stop=4, step=1)

The labels in an index can be numbers, strings, dates, or other values. Pandas provides subclasses of Index for specific purposes, including:

  • RangeIndex, where the labels are a monotonic sequence of integers with a fixed beginning, end, and step size

  • CategoricalIndex, where the labels are categories (and not necessarily unique)

  • IntervalIndex, where the labels are non-overlapping intervals of the real number line

  • DateTimeIndex, where the labels are dates and times

  • PeriodIndex, where each label is a date and time span

You may also encounter Int64Index, UInt64Index, and Float64Index, which represent an arbitrary list of labels of the respective type. For example:

x[[1, 3]].index
Int64Index([1, 3], dtype='int64')

Beginning in pandas 2.0, these three subclasses (but not subclasses in the bulleted list above) will be replaced with a generic Index class. See this warning in the pandas documentation for details.

You can access the labels in an index with standard Python indexing operations:

idx = x.index
# Get the 2nd label
idx[1]
1

Indexes are immutable, which means the labels in an index can’t be changed. For instance:

idx[0] = 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 idx[0] = 5

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexes/base.py:5302, in Index.__setitem__(self, key, value)
   5300 @final
   5301 def __setitem__(self, key, value):
-> 5302     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

Although labels in an index can’t be changed, every index also has a name, which can be changed. Think of the name as a name for the dimension or axis to which the index corresponds. You can get or set the name of an index with the .name attribute (on the index itself):

idx.name

The default value of .name is None, which means the index doesn’t have a name. When pandas prints a series or data frame, it will also print the names of any attached indexes. For example:

x.index.name = "labels"
x
labels
0    4
1    5
2    3
3    1
dtype: int64

Pandas automatically creates indexes whenever you create (or load) a series or data frame. Pandas also provides functions to construct indexes manually. These are usually named after the type of index. For instance, you can create an ordinary Index from a list or NumPy array:

new_idx = pd.Index(["a", "b", "c", "d"])
new_idx
Index(['a', 'b', 'c', 'd'], dtype='object')

You can use assignment to replace the index on a series. For example:

x.index = new_idx
x
a    4
b    5
c    3
d    1
dtype: int64

Note that the new index must have the same length as the index it replaces (and the series itself).

Checkpoint Exercise

Replace the index on x with one like the original (in variable idx), but where the first label is 5 instead of 0. Try to do this without typing out a list of all 4 labels.

Hint: the Python list function can convert many types of objects into lists, including pandas indexes.

1.2.2. Indexes on Data Frames#

Pandas DataFrames are 2-dimensional and always have exactly two indexes. The indexes correspond to the rows and the columns, respectively. To see this, use pandas to load the data from the sports.csv file:

dtype = {"classification_other": str}
sports = pd.read_csv("data/sports.csv", dtype = dtype)
sports.head()
year unitid institution_name city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ... partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen sports
0 2015 100654 Alabama A & M University Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 31 0 345592.0 NaN 345592.0 397818.0 NaN 397818.0 Baseball
1 2015 100654 Alabama A & M University Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 19 16 1211095.0 748833.0 1959928.0 817868.0 742460.0 1560328.0 Basketball
2 2015 100654 Alabama A & M University Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 61 46 183333.0 315574.0 498907.0 246949.0 251184.0 498133.0 All Track Combined
3 2015 100654 Alabama A & M University Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 99 0 2808949.0 NaN 2808949.0 3059353.0 NaN 3059353.0 Football
4 2015 100654 Alabama A & M University Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 9 0 78270.0 NaN 78270.0 83913.0 NaN 83913.0 Golf

5 rows × 28 columns

This data set contains information about funding and staffing of athletics programs at U.S. universities from 2015 to 2019 (inclusive). The source for the data set is the U.S. Department of Education’s Equity in Athletics Data Analysis project, and the data was cleaned by the Tidy Tuesday community. This article describes some of the conclusions that can be drawn from the data set.

On data frames, you can use the .index attribute to get and set the row index. For example:

sports.index
RangeIndex(start=0, stop=132327, step=1)

The default index created by the read_csv function counts from 0 to the number of rows.

The other index on data frames corresponds to the columns. You can use the .columns attribute to get and set the column index. For instance:

sports.columns
Index(['year', 'unitid', 'institution_name', 'city_txt', 'state_cd',
       'zip_text', 'classification_code', 'classification_name',
       'classification_other', 'ef_male_count', 'ef_female_count',
       'ef_total_count', 'sector_cd', 'sector_name', 'sportscode',
       'partic_men', 'partic_women', 'partic_coed_men', 'partic_coed_women',
       'sum_partic_men', 'sum_partic_women', 'rev_men', 'rev_women',
       'total_rev_menwomen', 'exp_men', 'exp_women', 'total_exp_menwomen',
       'sports'],
      dtype='object')

The indexes stored in the .index and .columns attributes of a data frame have the same classes and basic properties as indexes on series.

1.2.3. Alignment#

For arithmetic operations that involve more than one series or data frame, pandas automatically aligns the elements based on the indexes. As an example, consider these two series:

u = pd.Series([1, 2, 3], index = ["a", "b", "c"])
v = pd.Series([1, 2, 3], index = ["c", "b", "a"])

The elements of the series are identical, but the indexes differ. If you’re an experienced NumPy (or R) user, the result of adding these two series together might surprise you:

u + v
a    4
b    4
c    4
dtype: int64

This is the result because pandas aligns the elements, adding element a from u to element a from v, element b to element b, and so on.

Sometimes you might want to do arithmetic on series or data frames without any alignment. The canonical solution in this case is to use the .to_numpy method to convert one or both of the objects into a NumPy array:

u + v.to_numpy()
a    2
b    4
c    6
dtype: int64

Pandas tries to align elements and carry out arithmetic operations even for objects with mismatched shapes and labels. Pandas fills elements for which the result is unknown with missing values. For example:

w = pd.Series([-10, 0], index = ["z", "a"])

u - w
a    1.0
b    NaN
c    NaN
z    NaN
dtype: float64

In this example, only element a has two operands; for each other element, the result is a missing value because only one operand is known.

Checkpoint Exercise

What happens if you use an arithmetic operation on two series or data frames where one has repeated/duplicated labels in its index?

Hint: construct an example and see what happens!

Pandas does not align series and data frames in comparison operations, and raises an error if they’re not already aligned:

u > v
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 u > v

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     68             return NotImplemented
     70 other = item_from_zerodim(other)
---> 72 return method(self, other)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/arraylike.py:58, in OpsMixin.__gt__(self, other)
     56 @unpack_zerodim_and_defer("__gt__")
     57 def __gt__(self, other):
---> 58     return self._cmp_method(other, operator.gt)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/series.py:6237, in Series._cmp_method(self, other, op)
   6234 res_name = ops.get_op_result_name(self, other)
   6236 if isinstance(other, Series) and not self._indexed_same(other):
-> 6237     raise ValueError("Can only compare identically-labeled Series objects")
   6239 lvalues = self._values
   6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True)

ValueError: Can only compare identically-labeled Series objects

See this GitHub issue for a developer discussion of whether this is a helpful feature for preventing bugs or a nuisance.

Finally, the align method manually aligns two series or data frames. The method returns two objects of the same type, but with elements sorted so that the indexes match. For example:

u.align(v)
(a    1
 b    2
 c    3
 dtype: int64,
 a    3
 b    2
 c    1
 dtype: int64)

For objects with mismatched shapes, the align method inserts missing values to make their shapes and indexes match:

u.align(w)
(a    1.0
 b    2.0
 c    3.0
 z    NaN
 dtype: float64,
 a     0.0
 b     NaN
 c     NaN
 z   -10.0
 dtype: float64)

The align method provides a way to align to objects so that they can be compared:

ua, va = u.align(v)
ua > va
a    False
b    False
c     True
dtype: bool

Moreover, concatenating the two results from the align method is the same as performing a join. The align method even provides a parameter join to select the type of join ("outer", "left", "right", or "inner"). For relational series and data frames, you can break a join into smaller steps by using a combination of the align function , column indexing, and the pandas concat function.

1.3. Indexing Operators#

Pandas provides several different operators for indexing series and data frames, as well as several different ways to use each. The primary indexing operator is the square bracket [ ].

For a Series, the square bracket operator usually selects elements by label. For instance, integer arguments are treated as labels, not positions, if the labels in the index are integers:

x = pd.Series([1, 2, 3], index = [1, 0, 3])
x[0]
2

You can use a list to select multiple elements at once or repeat elements:

x[[0, 3, 0]]
0    2
3    3
0    2
dtype: int64

You can also use the square bracket operator to select elements by label when the labels are strings:

y = pd.Series([10, 20, 30], index = ["a", "b", "c"])
y["a"]
10

In this case, passing an integer argument selects an element by position:

y[0]
10

Using the square bracket operator this way in non-interactive code is risky, because it can cause bugs if some of the series or data frames your code operates on unexpectedly have integer labels.

The dot . serves as a secondary indexing operator for series and data frames with string indexes. For instance, to get the element b:

y.b
20

Since the dot . is also used to access attributes and methods, it cannot be used to access elements with a label that’s the same as the name of an attribute or method. When in doubt, it’s safer to use [ ] to access elements by label.

1.3.1. Slices#

A slice selects a range of elements. The syntax is start:stop:step, with the second colon : and arguments optional, the same as for Python’s built-in slicing. The default start value is the beginning of the series, the default stop value is one past the end, and the default step value is 1.

Slices can be by position or by label. With [ ], when the start or stop value is an integer, pandas assumes you want to slice by position:

x[1:3]
0    2
3    3
dtype: int64

In a slice by position, the element at the stop position is not included.

Slicing with only the step can be useful in a variety of situations, such as getting every second element:

x[::2]
1    1
3    3
dtype: int64

Negative values in a slice are counted backward from the end of the object. For example, this code gets the last two elements of the series:

x[-2:]
0    2
3    3
dtype: int64

Checkpoint Exercise

What happens if you use a negative step value in a slice?

When the start or stop value is a string, pandas slices by label:

y["b":]
b    20
c    30
dtype: int64

Be aware that when you slice by label, the element at the stop position is always included:

y["b":"c"]
b    20
c    30
dtype: int64

This is a case where the pandas developers decided convenience outweighs consistency. See this section of the pandas documentation for more details.

Checkpoint Exercise

What happens if you include an integer step value in a slice by label?

1.3.2. By Position with .iloc#

When you want to index a series or data frame by position, use [ ] on the .iloc attribute. Then any integer arguments are assumed to be positions, regardless of the type of index on the object.

For example:

x.iloc[0]
1
x.iloc[[1, 0, 2]]
0    2
1    1
3    3
dtype: int64
x.iloc[0:]
1    1
0    2
3    3
dtype: int64

You can use this attribute regardless of the index type:

y.iloc[0:]
a    10
b    20
c    30
dtype: int64

As a result, the .iloc attribute is a more reliable way to index by position than using [ ] alone.

Pandas also provides an .iat attribute to get or set scalar values in a series or data frame by position:

x.iat[0]
1

Compared to .iloc, the .iat attribute is typically much faster (in terms of CPU time), so it’s important to use .iat in performance-critical sections of code. In addition, .iat raises an error if your code attempts to get or set more than one value, which can potentially alert you to bugs that would go undetected with .iloc.

1.3.3. By Label with .loc#

When you want to index a series or data frame by label, use [ ] on the .loc attribute. Then any integer arguments are assumed to be labels.

For example:

x.loc[0]
2
x.loc[0]
2
x.loc[[0, 1]]
0    2
1    1
dtype: int64
y.loc["b"]
20
y.loc[["b", "a", "b"]]
b    20
a    10
b    20
dtype: int64

Pandas also provides an .at attribute to get or set scalar values in a series or data frame by label:

x.at[0]
2

The advantages of .at compared to .loc are the same as .iat compared to .iloc: better run-time speed and stricter requirements on the argument.

When setting elements by label with [ ], .loc, or .at, if you use a label that’s not already in the series index, pandas will append the label and value. For instance:

x.loc[10] = -1
x
1     1
0     2
3     3
10   -1
dtype: int64

This doesn’t work when setting elements by position.

1.3.4. By a Condition#

The square bracket operator [ ], the .iloc attribute, and the .loc attribute all support indexing a series by a condition. The condition must be represented by a Boolean array with the same length as the series.

You can use comparison operators to get a suitable Boolean array. For example:

is_positive = x > 0
is_positive
1      True
0      True
3      True
10    False
dtype: bool
x[is_positive]
1    1
0    2
3    3
dtype: int64
x.loc[is_positive]
1    1
0    2
3    3
dtype: int64

If the condition has an index, the square bracket operator and .loc operator will try to align the condition with the series being indexed:

z = x.iloc[[1, 0, 3]]
z
0     2
1     1
10   -1
dtype: int64
z[is_positive]
0    2
1    1
dtype: int64

The .iloc operator raises an error if the Boolean array has an index:

x.iloc[is_positive]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[49], line 1
----> 1 x.iloc[is_positive]

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
   1070 axis = self.axis or 0
   1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1611, in _iLocIndexer._getitem_axis(self, key, axis)
   1608     key = np.asarray(key)
   1610 if com.is_bool_indexer(key):
-> 1611     self._validate_key(key, axis)
   1612     return self._getbool_axis(key, axis=axis)
   1614 # a list of integers

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1452, in _iLocIndexer._validate_key(self, key, axis)
   1450 if hasattr(key, "index") and isinstance(key.index, Index):
   1451     if key.index.inferred_type == "integer":
-> 1452         raise NotImplementedError(
   1453             "iLocation based boolean "
   1454             "indexing on an integer type "
   1455             "is not available"
   1456         )
   1457     raise ValueError(
   1458         "iLocation based boolean indexing cannot use "
   1459         "an indexable as a mask"
   1460     )
   1461 return

NotImplementedError: iLocation based boolean indexing on an integer type is not available

As described in Section 1.2.3, the canonical way to avoid index alignment is to use the .to_numpy method to convert to a NumPy array:

x.iloc[is_positive.to_numpy()]
1    1
0    2
3    3
dtype: int64

The logic operators for inverting and combining conditions in pandas differ from Python’s usual logic operators:

  • | is logical or

  • & is logical and

  • ~ is logical not

For example:

x[is_positive & (x % 2 == 1)]
1    1
3    3
dtype: int64

In compound conditions, each sub-condition must be enclosed in parentheses (), because in Python’s order of operations, the |, &, and ~ operators are evaluated before comparison operators such as ==, >, and <.

1.3.5. By a Callable#

The square bracket operator [ ], the .iloc attribute, and the .loc attribute also all support indexing a series by a function (or callable object defined with def or lambda. The function must accept one argument: the series itself. The function must return a valid argument for the indexing operator (positions, labels, or a Boolean array).

The primary use case for this form of indexing is chaining together multiple operations without defining intermediate variables. For example, this code adds each element at an even-numbered position to the next element, and then gets only the results greater than 2:

(x.iloc[::2] + x.iloc[1::2].to_numpy())[lambda a: a > 2]
1    3
dtype: int64

As you can probably see, chaining operations makes code relatively difficult to understand. Whenever possible, it’s better to use simple indexing strategies rather than indexing by a callable.

Checkpoint Exercise

Use indexing by a callable to set the elements of x with even values (not positions) to twice their current value.

What’s a simpler way to do this using other indexing modes?

1.3.6. Data Frames#

The indexing operators [ ], .iloc, and .loc can all be used with data frames in addition to series. Since data frames are 2-dimensional, you can pass the indexing operators a separate argument for each dimension. The rows come first.

For instance, to get (1st row, 2nd and 3rd columns) of the athletics data frame the code is:

sports.iloc[0, [1, 2]]
unitid                                100654
institution_name    Alabama A & M University
Name: 0, dtype: object

When the result has a lower dimension than the object being indexed, pandas automatically converts to an appropriate data type. In this case, the result is converted to a series with the names of the two columns as its index.

As another example, the code to get (rows labeled 10 and 11, column labeled "year") is:

sports.loc[[10, 11], "year"]
10    2015
11    2015
Name: year, dtype: int64

With .iloc and .loc, you can select all of the elements along a given dimension by passing a colon : as the argument for that dimension. For instance, this code gets all rows in the columns labeled "year" and "institution_name":

sports.loc[:, ["year", "institution_name"]]
year institution_name
0 2015 Alabama A & M University
1 2015 Alabama A & M University
2 2015 Alabama A & M University
3 2015 Alabama A & M University
4 2015 Alabama A & M University
... ... ...
132322 2019 Simon Fraser University
132323 2019 Simon Fraser University
132324 2019 Simon Fraser University
132325 2019 Simon Fraser University
132326 2019 Simon Fraser University

132327 rows × 2 columns

You can mix indexing by condition with indexing by position or label. The most common way to do this is to use a condition to select rows and labels to select columns. For example:

sports.loc[sports.year == 2018, ["year", "institution_name", "sports"]]
year institution_name sports
52387 2018 Alabama A & M University Baseball
52388 2018 Alabama A & M University Basketball
52389 2018 Alabama A & M University Football
52390 2018 Alabama A & M University Golf
52391 2018 Alabama A & M University Softball
... ... ... ...
70154 2018 Simon Fraser University Soccer
70155 2018 Simon Fraser University Softball
70156 2018 Simon Fraser University Swimming
70157 2018 Simon Fraser University Volleyball
70158 2018 Simon Fraser University Wrestling

17772 rows × 3 columns

Checkpoint Exercise

Use indexing to get only the rows where year is 2016 and institution_name is "University of California-Davis". Limit the columns to year, institution_name, sports, and total_rev_menwomen (the total revenue from the program in USD).

Indexing a data frame with only one argument generally accesses rows. For example:

sports.iloc[0]
year                                        2015
unitid                                    100654
institution_name        Alabama A & M University
city_txt                                  Normal
state_cd                                      AL
zip_text                                 35762.0
classification_code                            2
classification_name          NCAA Division I-FCS
classification_other                         NaN
ef_male_count                               1923
ef_female_count                             2300
ef_total_count                              4223
sector_cd                                      1
sector_name              Public, 4-year or above
sportscode                                     1
partic_men                                  31.0
partic_women                                 NaN
partic_coed_men                              NaN
partic_coed_women                            NaN
sum_partic_men                                31
sum_partic_women                               0
rev_men                                 345592.0
rev_women                                    NaN
total_rev_menwomen                      345592.0
exp_men                                 397818.0
exp_women                                    NaN
total_exp_menwomen                      397818.0
sports                                  Baseball
Name: 0, dtype: object

As an exception, using the square bracket operator [ ] to index a data frame with a string argument (or list of strings) accesses columns:

sports[["year", "institution_name"]]
year institution_name
0 2015 Alabama A & M University
1 2015 Alabama A & M University
2 2015 Alabama A & M University
3 2015 Alabama A & M University
4 2015 Alabama A & M University
... ... ...
132322 2019 Simon Fraser University
132323 2019 Simon Fraser University
132324 2019 Simon Fraser University
132325 2019 Simon Fraser University
132326 2019 Simon Fraser University

132327 rows × 2 columns

1.4. Columns as Indexes#

An important feature of pandas is that you can use arbitrary columns in a data frame as indexes. Generally the columns should contain identifiers for the rows or other categorical data. For example, in the sports data frame, the column institution_name is an identifier.

You can use the .set_index method to set a column as an index:

sports = sports.set_index("institution_name")
sports.head()
year unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ef_female_count ... partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen sports
institution_name
Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 2300 ... NaN 31 0 345592.0 NaN 345592.0 397818.0 NaN 397818.0 Baseball
Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 2300 ... NaN 19 16 1211095.0 748833.0 1959928.0 817868.0 742460.0 1560328.0 Basketball
Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 2300 ... NaN 61 46 183333.0 315574.0 498907.0 246949.0 251184.0 498133.0 All Track Combined
Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 2300 ... NaN 99 0 2808949.0 NaN 2808949.0 3059353.0 NaN 3059353.0 Football
Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 2300 ... NaN 9 0 78270.0 NaN 78270.0 83913.0 NaN 83913.0 Golf

5 rows × 27 columns

After setting a categorical column as an index, you can use indexing by label as a convenient way to get rows in specific categories:

sports.loc["University of California-Davis"]
year unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ef_female_count ... partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen sports
institution_name
University of California-Davis 2015 110644 Davis CA 956168678.0 2 NCAA Division I-FCS NaN 11215 16223 ... NaN 36 0 884627.0 NaN 884627.0 884627.0 NaN 884627.0 Baseball
University of California-Davis 2015 110644 Davis CA 956168678.0 2 NCAA Division I-FCS NaN 11215 16223 ... NaN 15 19 1862783.0 1340541.0 3203324.0 1862783.0 1340541.0 3203324.0 Basketball
University of California-Davis 2015 110644 Davis CA 956168678.0 2 NCAA Division I-FCS NaN 11215 16223 ... NaN 62 94 571582.0 590416.0 1161998.0 571582.0 590416.0 1161998.0 All Track Combined
University of California-Davis 2015 110644 Davis CA 956168678.0 2 NCAA Division I-FCS NaN 11215 16223 ... NaN 0 25 NaN 659996.0 659996.0 NaN 659996.0 659996.0 Field Hockey
University of California-Davis 2015 110644 Davis CA 956168678.0 2 NCAA Division I-FCS NaN 11215 16223 ... NaN 105 0 4478087.0 NaN 4478087.0 4478087.0 NaN 4478087.0 Football
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
University of California-Davis 2019 110644 Davis CA 95616.0 2 NCAA Division I-FCS NaN 11780 18371 ... NaN 0 27 NaN 945919.0 945919.0 NaN 945919.0 945919.0 Equestrian
University of California-Davis 2019 110644 Davis CA 95616.0 2 NCAA Division I-FCS NaN 11780 18371 ... NaN 0 0 NaN NaN NaN NaN NaN NaN Rodeo
University of California-Davis 2019 110644 Davis CA 95616.0 2 NCAA Division I-FCS NaN 11780 18371 ... NaN 0 0 NaN NaN NaN NaN NaN NaN Sailing
University of California-Davis 2019 110644 Davis CA 95616.0 2 NCAA Division I-FCS NaN 11780 18371 ... NaN 0 0 NaN NaN NaN NaN NaN NaN Table Tennis
University of California-Davis 2019 110644 Davis CA 95616.0 2 NCAA Division I-FCS NaN 11780 18371 ... NaN 0 0 NaN NaN NaN NaN NaN NaN Weight Lifting

96 rows × 27 columns

You can use the .reset_index method to restore the institution_name column to the data frame and reset the index to a range:

sports = sports.reset_index()
sports.head()
institution_name year unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ... partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen sports
0 Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 31 0 345592.0 NaN 345592.0 397818.0 NaN 397818.0 Baseball
1 Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 19 16 1211095.0 748833.0 1959928.0 817868.0 742460.0 1560328.0 Basketball
2 Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 61 46 183333.0 315574.0 498907.0 246949.0 251184.0 498133.0 All Track Combined
3 Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 99 0 2808949.0 NaN 2808949.0 3059353.0 NaN 3059353.0 Football
4 Alabama A & M University 2015 100654 Normal AL 35762.0 2 NCAA Division I-FCS NaN 1923 ... NaN 9 0 78270.0 NaN 78270.0 83913.0 NaN 83913.0 Golf

5 rows × 28 columns

Setting indexes appropriately makes it much more convenient to filter the rows of a data frame.

1.4.1. MultiIndexes#

One of pandas’ most powerful features is the ability to set multiple columns as the index at the same time. This creates a MultiIndex, a hierarchical index with multiple levels. For example, this code creates a MultiIndex from year, institution_name and sports:

sports = sports.set_index(["year", "institution_name", "sports"])

The MultiIndex makes it easy to get all of the information for a specific year, institution, and sport. When working with a MultiIndex, the corresponding argument to the indexing operator should be a tuple with one element for each level of the MultiIndex:

sports.loc[(2015, "Stanford University", "Football")]
/tmp/ipykernel_12885/1887213640.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  sports.loc[(2015, "Stanford University", "Football")]
unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ef_female_count ef_total_count ... partic_coed_men partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen
year institution_name sports
2015 Stanford University Football 243744 Stanford CA 94305.0 1 NCAA Division I-FBS NaN 3663 3331 6994 ... NaN NaN 100 0 43744639.0 NaN 43744639.0 23724407.0 NaN 23724407.0

1 rows × 25 columns

You can use the .xs (read: “cross section”) method to select rows from only one particular level of a MultiIndex. For instance, to get all rows for Harvard university:

sports.xs("Harvard University", level = 1)
unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ef_female_count ef_total_count ... partic_coed_men partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen
year sports
2015 Baseball 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3637 3256 6893 ... NaN NaN 35 0 543855.0 NaN 543855.0 543855.0 NaN 543855.0
Basketball 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3637 3256 6893 ... NaN NaN 16 19 1296186.0 743673.0 2039859.0 1296186.0 743673.0 2039859.0
All Track Combined 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3637 3256 6893 ... NaN NaN 138 95 721675.0 721128.0 1442803.0 721675.0 721128.0 1442803.0
Fencing 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3637 3256 6893 ... NaN NaN 13 12 151656.0 164450.0 316106.0 151656.0 164450.0 316106.0
Field Hockey 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3637 3256 6893 ... NaN NaN 0 24 NaN 363711.0 363711.0 NaN 363711.0 363711.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019 Equestrian 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3531 3454 6985 ... NaN NaN 0 0 NaN NaN NaN NaN NaN NaN
Rodeo 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3531 3454 6985 ... NaN NaN 0 0 NaN NaN NaN NaN NaN NaN
Sailing 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3531 3454 6985 ... NaN NaN 14 23 108409.0 178101.0 286510.0 108409.0 178101.0 286510.0
Table Tennis 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3531 3454 6985 ... NaN NaN 0 0 NaN NaN NaN NaN NaN NaN
Weight Lifting 166027 Cambridge MA 2138.0 2 NCAA Division I-FCS NaN 3531 3454 6985 ... NaN NaN 0 0 NaN NaN NaN NaN NaN NaN

122 rows × 25 columns

The level argument is 1 because institution_name is the 2nd level of the MultiIndex, and indexing starts from 0. You can also use the name of the level instead:

sports.xs("Princeton University", level = "institution_name")
unitid city_txt state_cd zip_text classification_code classification_name classification_other ef_male_count ef_female_count ef_total_count ... partic_coed_men partic_coed_women sum_partic_men sum_partic_women rev_men rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen
year sports
2015 Baseball 186131 Princeton NJ 85440070.0 2 NCAA Division I-FCS NaN 2717 2543 5260 ... NaN NaN 31 0 454767.0 NaN 454767.0 454767.0 NaN 454767.0
Basketball 186131 Princeton NJ 85440070.0 2 NCAA Division I-FCS NaN 2717 2543 5260 ... NaN NaN 16 17 1117914.0 1170976.0 2288890.0 1117914.0 1170976.0 2288890.0
All Track Combined 186131 Princeton NJ 85440070.0 2 NCAA Division I-FCS NaN 2717 2543 5260 ... NaN NaN 159 175 776687.0 665927.0 1442614.0 776687.0 665927.0 1442614.0
Fencing 186131 Princeton NJ 85440070.0 2 NCAA Division I-FCS NaN 2717 2543 5260 ... NaN NaN 16 13 238594.0 219334.0 457928.0 238594.0 219334.0 457928.0
Field Hockey 186131 Princeton NJ 85440070.0 2 NCAA Division I-FCS NaN 2717 2543 5260 ... NaN NaN 0 23 NaN 500755.0 500755.0 NaN 500755.0 500755.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019 Swimming and Diving 186131 Princeton NJ 8544.0 2 NCAA Division I-FCS NaN 2670 2638 5308 ... NaN NaN 38 37 685836.0 710247.0 1396083.0 685836.0 710247.0 1396083.0
Tennis 186131 Princeton NJ 8544.0 2 NCAA Division I-FCS NaN 2670 2638 5308 ... NaN NaN 11 11 550389.0 472485.0 1022874.0 550389.0 472485.0 1022874.0
Volleyball 186131 Princeton NJ 8544.0 2 NCAA Division I-FCS NaN 2670 2638 5308 ... NaN NaN 17 16 324540.0 423386.0 747926.0 324540.0 423386.0 747926.0
Water Polo 186131 Princeton NJ 8544.0 2 NCAA Division I-FCS NaN 2670 2638 5308 ... NaN NaN 22 17 408972.0 335028.0 744000.0 408972.0 335028.0 744000.0
Wrestling 186131 Princeton NJ 8544.0 2 NCAA Division I-FCS NaN 2670 2638 5308 ... NaN NaN 31 0 840184.0 NaN 840184.0 840184.0 NaN 840184.0

91 rows × 25 columns

The pandas documentation on MultiIndexes describes a wide variety of ways MultiIndexes can be used to slice and dice data frames.

Checkpoint Exercise

Use a MultiIndex to get a series that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.

1.5. The .query Method#

The .query method is another powerful way to extract information from a data frame. The idea of .query is that you write a query string using a query mini-language to select a subset of the rows. This has several advantages over other ways of indexing:

  • For large data sets, the .query method is slightly faster than other ways of indexing

  • You don’t have to repeat the name of the data frame every time you refer to a column in a condition

  • You can reuse a single query across multiple data frames, provided they all have the queried columns

  • Conditions are easier to read:

    • You can use not, or and and to invert and combine conditions instead of ~, |, and &

    • You can use in and not in in conditions

    • You can write a < b < c rather than (a < b) & (b < c)

    • It’s not necessary to enclose sub-conditions in parentheses, since the query mini-language defines its own order of operations

Here’s an example of a query:

sports = sports.reset_index()

qs = "city_txt == 'Davis' and state_cd == 'CA' and sports == 'Basketball'"
ucd_bball = sports.query(qs)
ucd_bball[["institution_name", "sports", "year", "total_rev_menwomen"]]
institution_name sports year total_rev_menwomen
913 University of California-Davis Basketball 2015 3203324.0
18263 University of California-Davis Basketball 2016 3817436.0
35698 University of California-Davis Basketball 2017 4021500.0
53319 University of California-Davis Basketball 2018 4240907.0
73947 University of California-Davis Basketball 2019 4141401.0

Checkpoint Exercise

Use a query string to get a data frame that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.

The pandas documentation provides further details about how to write query strings.