Data Types & Structures

2. Data Types & Structures#

Learning Objectives

After this lesson, you should be able to:

Check the type of an object
Cast an object to a different type
Describe and differentiate lists, series, tuples, sets, dicts, and arrays
Explain what a comprehension is
Identify and cast categorical data
Explain what broadcasting is
Describe and differentiate None, null, and NaN
Locate missing values in a series
Select columns of a data frame
Filter rows of a data frame on a condition
Negate or combine conditions with logic operators

The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it’s a deep dive into data types and data structures in Python and Polars. Working knowledge of these will make you more effective at analyzing data and solving problems.

2.1. Data Types#

In Summarizing Data, we used the .glimpse method to get a structural summary of the California least tern data set:

terns.glimpse()

Rows: 791
Columns: 43
$ year                <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name           <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr           <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event               <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min              <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max              <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min              <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max              <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests         <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs        <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks      <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl          <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad          <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control        <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs           <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks         <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox        <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp         <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid         <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor   <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian    <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox       <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp        <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid        <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor  <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian   <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed      <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed       <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest          <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick         <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge        <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'

The first two rows describe the shape of the data set. After that, each row lists a column name, the type of data in that column, and that column’s first few values. For instance, the site_name column contains str, or string, data.

We categorize data into different types based on sets of shared characteristics because types are useful for reasoning about what we can do with the data. For example, statisticians conventionally categorize data as one of four types within two larger categories:

numeric
- continuous (real or complex numbers)
- discrete (integers)
categorical
- nominal (categories with no ordering)
- ordinal (categories with some ordering)

Which approaches and statistical techniques are appropriate depends on the type of the data. Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible.

Most programming languages, including Python, also categorize data by type. The following table lists some of Python’s built-in types:

Type	Example	Description
`bool`	`True`, `False`	Boolean values
`int`	`-8`, `0`, `42`	Integers
`float`	`-2.1`, `0.5`	Real numbers
`complex`	`3j`, `1-2j`	Complex numbers
`str`	`"hi"`, `"2.1"`	Strings

You can check the type of an object in Python with the built-in type function. Take a look at the types of a few objects:

type("hi")

str

type(True)

bool

type(-8.3)

float

In Python, class is just another word for type. So we can also say that the type function returns the class of an object.

Note

Python provides a class keyword to create your own classes. Creating classes is beyond the scope of this reader, but is explained in detail in most Python programming textbooks.

Tip

You can use the isinstance function to test whether a value is of a particular class. For example, to test whether 5 is a string:

isinstance(5, str)

False

2.1.1. Coercion & Casting#

Although bool, int, and float are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:

n = 3.1 * 2 + True
n

7.2

First, the integer 2 is converted to a floating point number and multiplied by 3.1, yielding 6.2. Then the Boolean True is converted to a floating point number and added to 6.2. In Python and most other programming languages, False corresponds to 0 and True corresponds to 1. Thus the result is 7.2, a floating point number:

type(n)

float

This automatic conversion of types is known as implicit coercion. Conversion always proceeds from less general to more general types, so that no information is lost.

Implicit coercion only applies in situations where the intent of the code is relatively unambiguous, such as arithmetic between different types of numbers (including Booleans). For example, you can’t add a number to a string, because it’s unclear what the result should be:

"hi" + 1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 "hi" + 1

TypeError: can only concatenate str (not "int") to str

A cast explicitly converts an object from one type to another, sometimes losing information. You can cast an object to a particular type with the function of the same name. For example, to cast to the bool type:

bool(0)

False

Or to cast to the int type:

int(4.67)

Casts are especially useful for converting to and from the str type:

"hi" + str(1)

'hi1'

float("7.3")

7.3

Python will raise an error if a cast is not possible. For example, this will not work:

int("Hello world!")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 int("Hello world!")

ValueError: invalid literal for int() with base 10: 'Hello world!'

2.2. Lists#

A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. Data frames, introduced in the previous chapter, are an example of a two-dimensional data structure.

A list is a general-purpose one-dimensional data structure. Lists are built into Python; you don’t even need to import a module in order to use them. You can create a list by enclosing any number of comma-separated values in square brackets [], like this:

x = [10, 20, 30, 40, 50]
x

[10, 20, 30, 40, 50]

Lists are ordered, which means the values, or elements, have specific positions. The first element is 10, the second is 20, the fifth is 50, and so on.

The elements of a list can be of different types, so we say lists are heterogeneous. For instance, this list contains a number, a string, and another list (with one element):

li = [8, "hello", [4.2]]
li

[8, 'hello', [4.2]]

A list can have no elements, in which case we say it’s empty. For example:

empty = []
empty

[]

You can get the length of a list with the len function:

len(empty)

The list function converts other containers into lists. Strings are technically containers for individual characters, so:

list("data science")

['d', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e']

2.2.1. Indexing#

So far you’ve learned two ways to use square brackets []:

To select columns from a data frame, as in terns["year"]
To create lists, as in ["a", "b", 1]

The first case is an example of indexing, which means getting or setting elements of a data structure. The square brackets [] are Python’s indexing operator.

You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a series is at position 0, the second is at position 1, and so on.

Note

Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.

The indexing operator requires at least one argument, called the index, which goes inside of the square brackets []. The index says which elements you want to get. For a data frame, you can use a position or a column name as the index. For a list, you can only use a position.

As an example, consider the list li we created earlier:

li = [8, "hello", [4.2]]
li

[8, 'hello', [4.2]]

The code to get the first element is:

li[0]

Likewise, to get the third element:

li[2]

[4.2]

The third element is a list too. If you want to get its first element, you can chain, or repeat, the indexing operator:

li[2][0]

4.2

Read this code from right to left as “get the first element of the third element of the variable li.”

You can use a slice to select a range of elements. The syntax for a slice is lower:upper:stride, where all of the arguments and the second colon : are optional. The lower bound defaults to 0, the upper bound defaults to the length of the list, and the stride defaults to 1. For example, to get the first two elements:

li[:2]

[8, 'hello']

As another example, you can use a slice to get every other element:

li[::2]

[8, [4.2]]

Negative values in a slice index backwards from the end of the list. For instance, to get the last 2 elements:

li[-2:]

['hello', [4.2]]

You can set an element of a list by assigning a value at a given index. So the code to change the first element of li to the string “hi” is:

li[0] = "hi"
li

['hi', 'hello', [4.2]]

Indexing isn’t just for lists: most of the examples in this section also apply to series, data frames, and other data structures.

2.2.2. References#

Assigning elements of a container is not without complication. Suppose you assign a list to a variable x and then create a new variable y from x. If you change an element of y, it will also change x:

x = [1, 2]
y = x
y[0] = 10
x

[10, 2]

This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.

The example above uses lists, but other containers—such as data frames—behave the same way. If you want to assign an independent copy of a container to a variable rather than a reference, you need to use a function or method to explicitly make a copy. Many containers have a .copy or .clone method that makes a copy:

x = [1, 2]
y = x.copy()
y[0] = 10
x

[1, 2]

2.2.3. Comprehensions#

A list comprehension creates a new list from the elements of an existing list. We’ll use this list to demonstrate comprehensions:

values = [10, 11, 12, -5, 13, 14]

If you want, for example, to add 1 to each element of the list, you can use a comprehension to do it. Here’s how:

[v + 1 for v in values]

[11, 12, 13, -4, 14, 15]

In words, this code tells Python to create a new list where the elements are v + 1 for each element v in values. The enclosing square brackets [] indicate that the result should be a list. More generally, the syntax for a comprehension is:

EXPRESSION for ELEMENT in CONTAINER

Replace CONTAINER with a data structure, ELEMENT with a variable name for the elements, and EXPRESSION with an expression to compute (typically using the elements).

For instance, you can also use a comprehension to compute the type of each element in a container:

[type(v) for v in values]

[int, int, int, int, int, int]

Comprehensions can also filter out some elements based on a condition. Suppose we only want the positive elements of the list:

[v for v in values if v > 0]

[10, 11, 12, 13, 14]

The syntax for a comprehension with a condition is:

EXPRESSION for ELEMENT in CONTAINER if CONDITION

The CONDITION must be an expression that evaluates to True or False (typically using the elements).

Comprehensions are an efficient way to compute (and compute on) lists. You can also use comprehensions with Python’s other built-in data structures, which you’ll learn about in Built-in Data Structures.

2.3. Series#

A series is an ordered, one-dimensional data structure. Series are a fundamental data structure in Polars, because each column in a data frame is a series.

For example, in the California least tern data set, the site_name column is a series. Take a look at the first few elements with its .head method:

terns["site_name"].head()

shape: (10,)

site_name
str
"PITTSBURG POWER PLANT"
"ALBANY CENTRAL AVE"
"ALAMEDA POINT"
"KETTLEMAN CITY"
"OCEANO DUNES STATE VEHICULAR R…
"RANCHO GUADALUPE DUNES PRESERV…
"VANDENBERG SFB"
"SANTA CLARA RIVER MCGRATH STAT…
"ORMOND BEACH"
"NBVC POINT MUGU"

Series and data frames have many attributes and methods in common; the .head method is one of these.

Notice that the elements of the site_name series are all strings. Unlike a list, in a series all elements must be of the same type, so we say series are homogeneous. A series can contain strings, integers, decimal numbers, or any of several other types of data, but not a mix of these all at once.

The other columns in the least tern data are also series. For instance, the year column is a series of integers:

terns["year"]

shape: (791,)

year
i64
2000
2000
2000
2000
2000
…
2023
2023
2023
2023
2023

Series can contain any number of elements, including 0 or 1 element. You can check the number of elements, or length, of a series with Python’s built-in len function:

len(terns["year"])

Since this is a column from the terns data frame, its length is the same as the number of rows in terns.

Note

You can also check the length of a series with the .shape attribute:

terns["year"].shape

(791,)

Python prints the value of .shape differently from the result of len because they are different types of data. Most of the time, it’s more convenient to use the len function to check lengths of one-dimensional objects like series, because it returns an integer.

2.3.1. Creating Series#

Sometimes you’ll want to create series by manually inputting data, perhaps because your data set isn’t digitized or because you want a toy data set to test out some code. You can create a series from a list (or other sequence) with the pl.Series function:

pl.Series([1, 2, 19, -3])

shape: (4,)


i64
1
2
19
-3

pl.Series(["hi", "hello"])

shape: (2,)


str
"hi"
"hello"

The Polars documentation recommends setting a name for every series. To do this with pl.Series, pass the name as the first argument and the elements as the second argument:

pl.Series("tens", [10, 20, 30])

shape: (3,)

tens
i64
10
20
30

Polars will print the name when you print the series and will use the name as a column name if you put the series in a data frame. You can get or set the name on a series through the .name attribute:

x = pl.Series("tens", [10, 20, 30])
x.name

'tens'

Tip

If you want to create a series that contains a sequence of numbers, there are several helper functions you can use. Python’s built-in range function creates a sequence of integers. NumPy’s np.arange and np.linspace functions can create sequences of integers or decimal numbers. You can pass the result from any of these functions to pl.Series to create a series.

2.3.2. Data Types in Polars#

Series are homogeneous, so if you try to create a series from elements of different types, the pl.Series function will raise an error:

pl.Series([1, "cool", 2.3])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    315 try:
--> 316     return constructor(name, values, strict)
    317 except (TypeError, OverflowError) as e:
    318     # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
    319     # # Essentially, when given a [0, u64::MAX] then it would Overflow.

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[40], line 1
----> 1 pl.Series([1, "cool", 2.3])

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    292         raise TypeError(msg)
    294 if isinstance(values, Sequence):
--> 295     self._s = sequence_to_pyseries(
    296         name,
    297         values,
    298         dtype=dtype,
    299         strict=strict,
    300         nan_to_null=nan_to_null,
    301     )
    303 elif values is None:
    304     self._s = sequence_to_pyseries(name, [], dtype=dtype)

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    298     except RuntimeError:
    299         return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
    302     constructor, name, values, dtype, strict=strict
    303 )

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    325     return _construct_series_with_fallbacks(
    326         PySeries.new_opt_u64, name, values, dtype, strict=strict
    327     )
    328 elif dtype is None:
--> 329     return PySeries.new_from_any_values(name, values, strict=strict)
    330 else:
    331     return PySeries.new_from_any_values_and_dtype(
    332         name, values, dtype, strict=strict
    333     )

TypeError: unexpected value while building Series of type Int64; found value of type String: "cool"

Hint: Try setting `strict=False` to allow passing data with mixed types.

By default, Polars infers the data type of a series’ elements from the first element. This can lead to errors you might not expect:

pl.Series([1, 9.2, 2.3])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    315 try:
--> 316     return constructor(name, values, strict)
    317 except (TypeError, OverflowError) as e:
    318     # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
    319     # # Essentially, when given a [0, u64::MAX] then it would Overflow.

TypeError: 'float' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[41], line 1
----> 1 pl.Series([1, 9.2, 2.3])

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    292         raise TypeError(msg)
    294 if isinstance(values, Sequence):
--> 295     self._s = sequence_to_pyseries(
    296         name,
    297         values,
    298         dtype=dtype,
    299         strict=strict,
    300         nan_to_null=nan_to_null,
    301     )
    303 elif values is None:
    304     self._s = sequence_to_pyseries(name, [], dtype=dtype)

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    298     except RuntimeError:
    299         return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
    302     constructor, name, values, dtype, strict=strict
    303 )

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    325     return _construct_series_with_fallbacks(
    326         PySeries.new_opt_u64, name, values, dtype, strict=strict
    327     )
    328 elif dtype is None:
--> 329     return PySeries.new_from_any_values(name, values, strict=strict)
    330 else:
    331     return PySeries.new_from_any_values_and_dtype(
    332         name, values, dtype, strict=strict
    333     )

TypeError: unexpected value while building Series of type Int64; found value of type Float64: 9.2

Hint: Try setting `strict=False` to allow passing data with mixed types.

You can explicitly specify a data type for a series with the pl.Series function’s third parameter, dtype:

pl.Series([1, 9.2, 2.3], dtype = float)

shape: (3,)


f64
1.0
9.2
2.3

Polars uses its own data types for series elements, so that:

It can efficiently support types of data that are not built into Python, such as categorical data.
Every numeric type has an explicit bit size: the number of bits of memory necessary to store an element. Bit sizes appear as suffixes in the name of the type. For instance, Float32 stores a floating point number in 32 bits.
A special value, null, can be present in any series to indicate missing data. We’ll explain null in Special Values.

When you create or access the elements of a series, Polars silently converts between its types and Python’s built-in types. Some of the Polars types and their Python equivalents are listed in the following table:

Type	Python Equivalent	Description
`Boolean`	`bool`	Boolean values
`Int8`, `Int16`, `Int32`, `Int64`	`int`	Integers
`Float32`, `Float64`	`float`	Real numbers (base-2 floating point)
Not yet supported	`complex`	Complex numbers
`String`	`str`	Strings
`Categorical`, `Enum`	No equivalent	Categorical data

The Polars documentation has the complete list.

Why does bit size matter?

By using more bits to store values, you can express a wider range of values. This is best illustrated by the integer types: Int8 can only express values from -128 to 127 (inclusive), whereas Int64 can express values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.

If a computation produces a value too small or too large to express as its given type, the value overflows, leading to an inaccurate result. The exact effect of overflow depends on the type, software, and hardware, but one common outcome is that the value “wraps around” to the other side of the type’s range:

pl.Series([127], dtype = pl.Int8) + 1

shape: (1,)
Series: '' [i8]
[
        -128
]

The tradeoff is that the more bits you use to store values, the more memory you need. For computations that generate or process large quantities of data, as is often the case in research computing, memory efficiency is a major concern—computers have a limited amount of memory.

You can use bit size to estimate the amount of memory a series will require. For example, since a single element of a Float64 series requires about 64 bits, the bp_min column in the least terns data requires roughly this many bytes:

64 * len(terns["bp_min"]) / 8  # 8 bits per byte 

6328.0

For series and data frames, you can use the .estimated_size method to have Python do this calculation for you. When a computation runs out of memory, an estimate of how much memory is necessary can help you decide whether to change your computing strategy or get more memory.

Most of Python’s built-in data types don’t specify bit sizes, and their sizes can even vary depending on your computer’s hardware and operating system!

If you call Python’s type function on a data structure, it returns the type of the data structure:

type(terns["site_name"])

polars.series.series.Series

For a series, you can get the element type with the .dtype attribute:

terns["site_name"].dtype

String

Note

Data frames don’t have a .dtype attribute since they can consist of multiple series. Instead, they have a .dtypes attribute, a list with the element type for each column.

If your goal is to summarize a data frame, the .glimpse method is usually more convenient.

You can use the .cast method to cast the elements of a series to a specific type. For example, here’s how to cast the total_nests column to a String series:

terns["total_nests"].cast(pl.String)

shape: (791,)

total_nests
str
"15"
"20"
"312"
"3"
"5"
…
"717"
"44"
"59"
"48"
"171"

2.3.3. Categorical Data#

A feature is categorical if it measures a qualitative category. For example, the genres rock, blues, alternative, folk, pop are categories.

Polars uses the Categorical and Enum data types to represent categorical data. Visualizations and statistical models sometimes treat categorical data differently than other data types, so it’s important to make sure you have the right data type.

When it reads a data set, Polars usually can’t tell which features are categorical. That means identifying and converting the categorical features is up to you. For beginners, it can be difficult to understand whether a feature is categorical or not. The key is to think about whether you want to use the feature to divide the data into groups.

For example, if you want to know how many songs are in the rock genre, you first need to divide the songs by genre, and then count the number of songs in each group (or at least the rock group).

As a second example, months recorded as numbers can be categorical or not, depending on how you want to use them. You might want to treat them as categorical (for example, to compute max rainfall in each month) or you might want to treat them as numbers (for example, to compute the number of months time between two events).

The bottom line is that you have to think about what you’ll be doing in the analysis. In some cases, you might treat a feature as categorical only for part of the analysis.

Let’s think about which features are categorical in least terns data set. To refresh your memory of what’s in the data set, take a look at the structural summary:

terns.glimpse()

Rows: 791
Columns: 43
$ year                <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name           <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr           <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event               <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min              <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max              <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min              <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max              <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests         <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs        <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks      <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl          <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad          <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control        <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs           <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks         <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox        <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp         <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid         <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor   <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian    <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox       <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp        <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid        <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor  <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian   <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed      <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed       <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest          <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick         <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge        <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'

The site_name, site_abbr, and event columns are all examples of categorical data. The region_ columns and some of the pred_ columns also contain categorical data.

One way to check whether a feature is useful for grouping (and thus effectively categorical) is to count the number of times each value appears. For a series, you can do this with the .value_counts method. For instance, to count the number of times each category of event appears:

terns["event"].value_counts()

shape: (3, 2)

event	count
str	u32
"EL_NINO"	120
"NEUTRAL"	413
"LA_NINA"	258

Features with only a few unique values, repeated many times, are ideal for grouping. Numerical features, like total_nests, usually aren’t good for grouping, both because of what they measure and because they tend to have many unique values, which leads to very small groups.

The year column can be treated as categorical or quantitative data. It’s easy to imagine grouping observations by year, but years are also numerical: they have an order and we might want to do math on them. The most appropriate type for year depends on how we want to use it for analysis.

You can cast a column to the Categorical type with the .cast method. Try this for the event column:

event = terns["event"].cast(pl.Categorical)
event

shape: (791,)

event
cat
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
…
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"

Polars organizes attributes and methods for categorical data under the .cat attribute of series. These raise errors if the element type of the series is not Categorical (or Enum). You can get the categories of a categorical series with the .cat.get_categories method:

event.cat.get_categories()

shape: (3,)

event
str
"LA_NINA"
"NEUTRAL"
"EL_NINO"

A categorical series remembers all possible categories even if you take a subset where some of the categories aren’t present:

event[:3]

shape: (3,)

event
cat
"LA_NINA"
"LA_NINA"
"LA_NINA"

event[:3].cat.get_categories()

shape: (3,)

event
str
"LA_NINA"
"NEUTRAL"
"EL_NINO"

This is one way the Categorical type is different from the String type, and ensures that when you, for example, plot a categorical series, missing categories are represented.

Note

The Categorical and Enum types both represent categorical data. The Categorical type is more flexible, allowing you to add categories as needed. The Enum type is more memory-efficient, but requires that you specify all possible categories up front. In practice, the Categorical type is more convenient for interactive work.

2.3.4. Broadcasting#

If you use an arithmetic operator on a series, Polars broadcasts the operation to each element:

x = pl.Series([1, 3, 0])
x - 3

shape: (3,)


i64
-2
0
-3

The result is the same as if you had applied the operation element-by-element. That is:

pl.Series([1 - 3, 3 - 3, 0 - 3])

shape: (3,)


i64
-2
0
-3

Most NumPy (and SciPy) functions also broadcast. For instance:

import numpy as np

x = pl.Series([1.0, 3.0, 0.0, np.pi])
np.sin(x)

shape: (4,)


f64
0.841471
0.14112
0.0
1.2246e-16

Some examples of functions that broadcast are np.sin, np.cos, np.tan, np.log, np.exp, and np.sqrt.

NumPy functions that combine or aggregate values usually don’t broadcast. For example, np.sum, np.mean, and np.median don’t broadcast.

Tip

Broadcasting is the counterpart to comprehensions (introduced in Comprehensions). Both are highly efficient. Generally, you should:

Use broadcasting with data structures that support it, such as series and NumPy arrays (explained in NumPy Arrays).
Use comprehensions with lists and Python’s other built-in data structures (explained in Built-in Data Structures).

A function can broadcast across multiple arguments. To demonstrate this, suppose we want to estimate number of nests per breeding pair for the least terns data. The total_nests column contains the total number of nests at each site, and the bp_max column contains the maximum reported number of breeding pairs. So to compute nests per breeding pair:

terns["total_nests"] / terns["bp_max"]

shape: (791,)

total_nests
f64
1.0
1.666667
1.036545
1.0
1.0
…
1.113354
1.157895
1.092593
1.170732
1.036364

The elements are paired up and divided according to their positions. Notice that the result is a Float64 series. The total_nests column is an Int64 series, so besides broadcasting, the example also demonstrates that Series are subject to implicit coercion (introduced in Coercion & Casting).

If you try to broadcast a function across two series of different lengths, Polars raises an error:

x = pl.Series([1, 2])
y = pl.Series([9, 8, 7])
x - y

---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[56], line 3
      1 x = pl.Series([1, 2])
      2 y = pl.Series([9, 8, 7])
----> 3 x - y

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1077, in Series.__sub__(self, other)
   1075 if self.dtype.is_decimal() and isinstance(other, (float, int)):
   1076     return self.to_frame().select(F.col(self.name) - other).to_series()
-> 1077 return self._arithmetic(other, "sub", "sub_<>")

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1011, in Series._arithmetic(self, other, op_s, op_ffi)
   1008     other = pl.Series("", [None])
   1010 if isinstance(other, Series):
-> 1011     return self._from_pyseries(getattr(self._s, op_s)(other._s))
   1012 elif _check_for_numpy(other) and isinstance(other, np.ndarray):
   1013     return self._from_pyseries(getattr(self._s, op_s)(Series(other)._s))

InvalidOperationError: cannot do arithmetic operation on series of different lengths: got 2 and 3

2.4. Other Data Structures#

In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.

2.4.1. Built-in Data Structures#

Besides lists, Python provides several other useful data structures:

Like lists, tuples are ordered and heterogeneous. The main difference is that tuples are immutable: once you create a tuple, you can’t change it. This makes tuples safer and more efficient than lists.

You can make a tuple by enclosing comma-separated values in parentheses ():
```
(True, 1, "hi")
```
You can cast other data structures to a tuple with the tuple function. Use a tuple when the number of elements is constant and known in advance.
A set is unordered and heterogeneous. As in a mathematical set, the elements in a set must be unique. Python automatically discards any duplicates added to a set. Sets support set theoretic operations such as unions and intersections.

You can make a set by enclosing comma-separated values in curly braces {}:
```
{True, 1, "hi"}
```
You can convert other data structures to a set with the set function. Use a set when you need a guarantee that the elements are unique.
A dict is an ordered, heterogeneous collection of key-value pairs. Keys must be distinct and many different types of keys are valid. The indexing operator [] gets elements by key rather than position.

You can make a dict by enclosing comma-separated key: value pairs in curly braces {}:
```
{"hi": -3.5}
```
Use a dict when you need to index elements by something other than position or need a mapping from one collection of data to another.

2.4.2. NumPy Arrays#

An array (or ndarray) is an ordered, homogeneous data structure, similar to a series. Arrays are a fundamental data structure in NumPy.

You can create an array with the np.array function and a list of elements:

x = np.array([10, 20, 30])
x

array([10, 20, 30])

You can convert an array into a series with the pl.Series function:

pl.Series(x)

shape: (3,)


i64
10
20
30

Conversely, you can convert a series to an array with the .to_numpy method:

terns["total_nests"][:5].to_numpy()

array([ 15,  20, 312,   3,   5])

Tip

Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).

Note

NumPy uses its own data types for array elements, for many of the same reasons Polars does for series elements. The NumPy documentation has more details.

NumPy is primarily designed for numerical computing, so working with strings in NumPy can be tricky. See the documentation for details about its string types. If you need to work with strings, Polars is more convenient than NumPy.

2.5. Special Values#

2.5.1. None#

In Python, None represents an absent or undefined value. It is useful:

As a way to explicitly indicate a value is absent.
As the return value for functions that are useful for their side effects and don’t need to return anything.
As a default argument for optional parameters in functions.

For example, Python’s built-in print function, which prints a string to the console, returns None:

print("Hello!")

Hello!

The Python console doesn’t print anything when an expression produces None:

None

None is the only value of type NoneType:

type(None)

NoneType

You can check if a value is None with Python’s is keyword:

x = None
x is None

True

2.5.2. Missing Values#

In the least terns data set, notice that some of the entries are null. For instance, look at the second element of the nonpred_eggs column:

terns.head()

shape: (5, 43)

year	site_name	site_name_2013_2018	site_name_1988_2001	site_abbr	region_3	region_4	event	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_control	pred_eggs	pred_chicks	pred_fl	pred_ad	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc	first_observed	last_observed	first_nest	first_chick	first_fledge
i64	str	str	str	str	str	str	str	f64	f64	i64	i64	i64	i64	i64	i64	i64	str	i64	i64	i64	i64	str	str	str	str	str	str	str	str	i64	i64	i64	i64	i64	i64	i64	i64	str	str	str	str	str
2000	"PITTSBURG POWER PLANT"	"Pittsburg Power Plant"	"NA_2013_2018 POLYGON"	"PITT_POWER"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	15.0	15.0	16	18	15	3	0	0	0	null	4	2	0	0	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	"N"	0	0	0	0	4	2	0	0	"2000-05-11"	"2000-08-05"	"2000-05-26"	"2000-06-18"	"2000-07-08"
2000	"ALBANY CENTRAL AVE"	"NA_NO POLYGON"	"Albany Central Avenue"	"AL_CENTAVE"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	6.0	12.0	1	1	20	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
2000	"ALAMEDA POINT"	"Alameda Point"	"NA_2013_2018 POLYGON"	"ALAM_PT"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	282.0	301.0	200	230	312	124	81	2	1	null	17	0	0	0	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	0	0	0	0	0	6	11	0	"2000-05-01"	"2000-08-19"	"2000-05-16"	"2000-06-07"	"2000-06-30"
2000	"KETTLEMAN CITY"	"Kettleman"	"NA_2013_2018 POLYGON"	"KET_CTY"	"KINGS"	"KINGS"	"LA_NINA"	2.0	3.0	1	2	3	null	3	1	6	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	"2000-06-10"	"2000-09-24"	"2000-06-17"	"2000-07-22"	"2000-08-06"
2000	"OCEANO DUNES STATE VEHICULAR R…	"Oceano Dunes State Vehicular R…	"NA_2013_2018 POLYGON"	"OCEANO_DUNES"	"CENTRAL"	"CENTRAL"	"LA_NINA"	4.0	5.0	4	4	5	2	0	0	0	null	0	4	0	0	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"N"	0	0	0	0	0	0	4	0	"2000-05-04"	"2000-08-30"	"2000-05-28"	"2000-06-20"	"2000-07-13"

Polars uses null, called the missing value, to represent missing entries in a data set. It’s implied that the entries are missing due to how the data was collected, although there are exceptions. As an example, imagine the data came from a survey, and respondents chose not to answer some questions. In the data set, their answers for those questions can be recorded as null.

The missing value null is a chameleon: it can be of an element of any type in a series. Polars implicitly converts null to and from None when you get or set an element in a series. This means you can use None to create a series with null elements:

x = pl.Series([1, 2, None])
x

shape: (3,)


i64
1
2
null

And you get back None if you access a null element:

terns["nonpred_eggs"][1]

The missing value null is also contagious: it represents an unknown quantity, so computing on it usually produces another missing value. The idea is that if the inputs to a computation are unknown, generally so is the output:

x - 3

shape: (3,)


i64
-2
-1
null

Polars makes an exception for aggregation functions, which automatically filter out missing values:

x.mean()

1.5

You can use the .is_null method to test if elements of a series are null:

x.is_null()

shape: (3,)


bool
false
false
true

Polars also provides an .is_not_null method and a .fill_null method to fill missing values with a different value.

2.5.3. Infinity#

NumPy (and Polars) use np.inf to represent infinity. It is of type float. You’re most likely to encounter it as the result of certain computations:

pl.Series([13]) / 0

shape: (1,)


f64
inf

You can us the .is_infinite method to test if elements of a series are infinite:

x = pl.Series([1.0, 2.0, np.inf])
x.is_infinite()

shape: (3,)


bool
false
false
true

2.5.4. Not a Number#

NumPy (and Polars) use np.nan, called not a number and also written NaN, to represent mathematically undefined results. It is of type float. As an example, dividing 0 by 0 is undefined:

pl.Series([0]) / 0

shape: (1,)


f64
NaN

You can use the .is_nan method to test if elements of a series are NaN:

x = pl.Series([0, 1, 2]) / 0
x.is_nan()

shape: (3,)


bool
true
false
false

2.6. Data Frames#

2.6.1. Selecting Columns#

An excellent starting point for selecting and transforming columns in a data frame is the .select method. Summarizing Columns already showed how to select a single column by name with the indexing operator [], but the .select method is much more flexible. You can use it to select multiple columns at once, by name or type, and can transform or rename them.

As with the indexing operator, you can use .select to select a single column by providing the column name as an argument. Here’s an example (with .head to limit the output):

terns.select("year").head()

shape: (5, 1)

year
i64
2000
2000
2000
2000
2000

Unlike the indexing operator, .select returns a data frame rather than a series.

You can also select multiple columns this way:

terns.select("year", "site_name").head()

shape: (5, 2)

year	site_name
i64	str
2000	"PITTSBURG POWER PLANT"
2000	"ALBANY CENTRAL AVE"
2000	"ALAMEDA POINT"
2000	"KETTLEMAN CITY"
2000	"OCEANO DUNES STATE VEHICULAR R…

The .select method is flexible because it can evaluate a Polars expression: instructions for how to select or transform data. One way to create an expression is with the pl.col function, which represents a column or set of columns. So another way to select the year and site_name column from the least terns data is:

terns.select(
    pl.col("year", "site_name")
).head()

shape: (5, 2)

year	site_name
i64	str
2000	"PITTSBURG POWER PLANT"
2000	"ALBANY CENTRAL AVE"
2000	"ALAMEDA POINT"
2000	"KETTLEMAN CITY"
2000	"OCEANO DUNES STATE VEHICULAR R…

An advantage of using pl.col is that you’re not limited to selecting columns by name: you can also select columns by type. Here’s how to get only the Int64 and Float64 columns in the least terns data:

terns.select(
    pl.col(pl.Int64, pl.Float64)
).head()

shape: (5, 22)

year	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_eggs	pred_chicks	pred_fl	pred_ad	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc
i64	f64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
2000	15.0	15.0	16	18	15	3	0	0	0	4	2	0	0	0	0	0	0	4	2	0	0
2000	6.0	12.0	1	1	20	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
2000	282.0	301.0	200	230	312	124	81	2	1	17	0	0	0	0	0	0	0	0	6	11	0
2000	2.0	3.0	1	2	3	null	3	1	6	null	null	null	null	null	null	null	null	null	null	null	null
2000	4.0	5.0	4	4	5	2	0	0	0	0	4	0	0	0	0	0	0	0	0	4	0

Selecting columns this way is useful for doing things like computing summaries of only numeric columns. In fact, to compute widely-used summaries, like the mean, all you need to do is call the corresponding method on the Polars expression:

terns.select(
    pl.col(pl.Int64, pl.Float64).mean()
).head()

shape: (1, 22)

year	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_eggs	pred_chicks	pred_fl	pred_ad	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
2013.082174	129.319923	151.043423	40.815148	50.349166	162.845466	60.291866	44.372681	4.181488	0.850987	41.574074	8.518519	2.365385	2.689655	1.740741	9.464286	5.555556	1.454545	7.961538	1.711538	8.897959	6.566038

A third way to select columns with pl.col is with a pattern. Patterns are strings, and must begin with a caret ^ and end with a dollar sign $. Within a pattern, you can use .* as a wild card that matches any characters.

Note

Technically, Polars’ patterns are regular expressions, a widely-used language for describing patterns in text. You can learn more about regular expressions in the Date & String Processing chapter of DataLab’s Intermediate Python workshop reader.

As motivation to demonstrate patterns, most of the columns with names that start with pred_ are categorical but currently have string elements. It would be good to cast them to the Categorical type. To begin, select all of the columns with names that start with pred_:

terns.select(
    pl.col("^pred_.*$")
).head()

shape: (5, 13)

pred_control	pred_eggs	pred_chicks	pred_fl	pred_ad	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc
str	i64	i64	i64	i64	str	str	str	str	str	str	str	str
null	4	2	0	0	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	"N"
null	null	null	null	null	null	null	null	null	null	null	null	null
null	17	0	0	0	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"
null	null	null	null	null	null	null	null	null	null	null	null	null
null	0	4	0	0	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"N"

There are a few columns in the result, such as pred_eggs, with Int64 elements. These columns aren’t categorical, so we should exclude them before casting. You can exclude columns from an expression with the .exclude method:

terns.select(
    pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
).head()

shape: (5, 9)

pred_control	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc
cat	cat	cat	cat	cat	cat	cat	cat	cat
null	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	"N"
null	null	null	null	null	null	null	null	null
null	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"
null	null	null	null	null	null	null	null	null
null	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"N"

To make this change permanent, we need to reassign the terns data frame. The .select method only returns the selected columns, so assigning the result to terns would mean losing all of the other columns.

Instead of using .select, you can use .with_columns to transform some columns but return all of the columns. In all other respects, .with_columns works the same way as .select. So to make the cast permanent:

terns = terns.with_columns(
    pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
)

terns.head()

shape: (5, 43)

year	site_name	site_name_2013_2018	site_name_1988_2001	site_abbr	region_3	region_4	event	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_control	pred_eggs	pred_chicks	pred_fl	pred_ad	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc	first_observed	last_observed	first_nest	first_chick	first_fledge
i64	str	str	str	str	str	str	str	f64	f64	i64	i64	i64	i64	i64	i64	i64	cat	i64	i64	i64	i64	cat	cat	cat	cat	cat	cat	cat	cat	i64	i64	i64	i64	i64	i64	i64	i64	str	str	str	str	str
2000	"PITTSBURG POWER PLANT"	"Pittsburg Power Plant"	"NA_2013_2018 POLYGON"	"PITT_POWER"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	15.0	15.0	16	18	15	3	0	0	0	null	4	2	0	0	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	"N"	0	0	0	0	4	2	0	0	"2000-05-11"	"2000-08-05"	"2000-05-26"	"2000-06-18"	"2000-07-08"
2000	"ALBANY CENTRAL AVE"	"NA_NO POLYGON"	"Albany Central Avenue"	"AL_CENTAVE"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	6.0	12.0	1	1	20	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
2000	"ALAMEDA POINT"	"Alameda Point"	"NA_2013_2018 POLYGON"	"ALAM_PT"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	282.0	301.0	200	230	312	124	81	2	1	null	17	0	0	0	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	0	0	0	0	0	6	11	0	"2000-05-01"	"2000-08-19"	"2000-05-16"	"2000-06-07"	"2000-06-30"
2000	"KETTLEMAN CITY"	"Kettleman"	"NA_2013_2018 POLYGON"	"KET_CTY"	"KINGS"	"KINGS"	"LA_NINA"	2.0	3.0	1	2	3	null	3	1	6	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	"2000-06-10"	"2000-09-24"	"2000-06-17"	"2000-07-22"	"2000-08-06"
2000	"OCEANO DUNES STATE VEHICULAR R…	"Oceano Dunes State Vehicular R…	"NA_2013_2018 POLYGON"	"OCEANO_DUNES"	"CENTRAL"	"CENTRAL"	"LA_NINA"	4.0	5.0	4	4	5	2	0	0	0	null	0	4	0	0	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"N"	0	0	0	0	0	0	4	0	"2000-05-04"	"2000-08-30"	"2000-05-28"	"2000-06-20"	"2000-07-13"

Tip

In general, choose:

.select if you only want to get back the selected columns.
.with_columns if you want get back all of the columns.

The .select method is also useful for testing expressions before switching to the .with_columns method.

You can use columns to transform other columns. As a final example, suppose we want to compute eggs per breeding pair and nests per breeding pair for the least terns data. Non-predated egg counts are in the nonpred_eggs column and nest counts are in the total_nests column. For now, we’ll use the maximum reported breeding pairs, bp_max, as the number of breeding pairs:

terns.select(
    pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
).head()

shape: (5, 2)

nonpred_eggs	total_nests
f64	f64
0.2	1.0
null	1.666667
0.41196	1.036545
null	1.0
0.4	1.0

There’s also a bp_min column with the minimum reported breeding pairs. To be thorough, we should compute the rates with both bp_min and bp_max, not just bp_max. We’ll also need to rename the resulting columns, so that each column has a unique name. You can use the .alias method to rename a single column, or the .name.prefix and .name.suffix methods, respectively, to prefix or suffix a column’s name. Let’s add a suffix to the column names to identify which breeding pair column was used:

terns.select(
    (
        pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
    ).name.suffix("_per_bp_max"),
    (
        pl.col("nonpred_eggs", "total_nests") / pl.col("bp_min")
    ).name.suffix("_per_bp_min")
).head()

shape: (5, 4)

nonpred_eggs_per_bp_max	total_nests_per_bp_max	nonpred_eggs_per_bp_min	total_nests_per_bp_min
f64	f64	f64	f64
0.2	1.0	0.2	1.0
null	1.666667	null	3.333333
0.41196	1.036545	0.439716	1.106383
null	1.0	null	1.5
0.4	1.0	0.5	1.25

2.6.2. Filtering Rows#

Filtering the rows of a data frame is the counterpart selecting columns. The .filter method filters rows based on one or more conditions: expressions that evaluate to a series of Boolean values.

As an example, suppose we want to find all sites in the least terns data where the number of nests in the total_nests column is greater than 5. Here’s the code:

terns.filter(pl.col("total_nests") > 5).head()

shape: (5, 43)

year	site_name	site_name_2013_2018	site_name_1988_2001	site_abbr	region_3	region_4	event	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_control	pred_eggs	pred_chicks	pred_fl	pred_ad	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc	first_observed	last_observed	first_nest	first_chick	first_fledge
i64	str	str	str	str	str	str	str	f64	f64	i64	i64	i64	i64	i64	i64	i64	cat	i64	i64	i64	i64	cat	cat	cat	cat	cat	cat	cat	cat	i64	i64	i64	i64	i64	i64	i64	i64	str	str	str	str	str
2000	"PITTSBURG POWER PLANT"	"Pittsburg Power Plant"	"NA_2013_2018 POLYGON"	"PITT_POWER"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	15.0	15.0	16	18	15	3	0	0	0	null	4	2	0	0	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	"N"	0	0	0	0	4	2	0	0	"2000-05-11"	"2000-08-05"	"2000-05-26"	"2000-06-18"	"2000-07-08"
2000	"ALBANY CENTRAL AVE"	"NA_NO POLYGON"	"Albany Central Avenue"	"AL_CENTAVE"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	6.0	12.0	1	1	20	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
2000	"ALAMEDA POINT"	"Alameda Point"	"NA_2013_2018 POLYGON"	"ALAM_PT"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	282.0	301.0	200	230	312	124	81	2	1	null	17	0	0	0	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	"N"	0	0	0	0	0	6	11	0	"2000-05-01"	"2000-08-19"	"2000-05-16"	"2000-06-07"	"2000-06-30"
2000	"RANCHO GUADALUPE DUNES PRESERV…	"Rancho Guadalupe Dunes Preserv…	"NA_2013_2018 POLYGON"	"RGDP"	"CENTRAL"	"CENTRAL"	"LA_NINA"	9.0	9.0	17	17	9	0	1	0	0	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	"2000-05-07"	"2000-08-13"	"2000-05-31"	"2000-06-22"	"2000-07-20"
2000	"VANDENBERG SFB"	"Vandenberg AFB"	"NA_2013_2018 POLYGON"	"VAN_SFB"	"CENTRAL"	"CENTRAL"	"LA_NINA"	30.0	32.0	11	11	32	null	27	0	0	null	0	3	0	0	"N"	"N"	"N"	"N"	"N"	"Y"	"N"	"N"	0	0	0	0	0	3	0	0	"2000-05-07"	"2000-08-17"	"2000-05-28"	"2000-06-20"	"2000-07-15"

If we only want the site names, we can chain this with a call to .select and use the .unique method on the site_name column:

terns.filter(
    pl.col("total_nests") > 5
).select(
    pl.col("site_name").unique()
)

shape: (40, 1)

site_name
str
"SEAL BEACH NWR ANAHEIM BAY"
"NAVAL AMPHIBIOUS BASE CORONADO"
"NAS NORTH ISLAND"
"HAYWARD REGIONAL SHORELINE"
"CHULA VISTA WILDLIFE RESERVE"
…
"NBVC POINT MUGU"
"MISSION BAY MARINERS POINT"
"MISSION BAY STONY POINT"
"MCB CAMP PENDLETON"
"DSTREET FILL SWEETWATER MARSH …

We can conclude that there are 40 sites that had at least 5 nests at some point.

2.6.2.1. Logic Operators#

Series with Boolean elements, such as conditions, can be inverted or combined with logic operators. All of the logic operators broadcast to series elements. For demonstration, we’ll use the following series:

x1 = pl.Series([True, False, True, False])
x2 = pl.Series([True, True, False, False])

The NOT operator ~ inverts values, so True becomes False and False becomes True:

~x1

shape: (4,)


bool
false
true
false
true

The OR operator | combines two values, returning True unless both values are False:

x1 | x2

shape: (4,)


bool
true
true
true
false

The AND operator & combines two values, returning False unless both values are True:

x1 & x2

shape: (4,)


bool
true
false
false
false

Caution

The logic operators ~, |, and & only work on Polars series, NumPy arrays, and other homogeneous data structures. If you use them on Python’s built-in bool values, Python will return an unexpected result or produce an error.

Python instead uses the keywords not, or, and and as the respective logic operators on bool values. Polars, NumPy, and other packages don’t use these keywords because their behavior can’t be customized for data structures.

2.6.2.2. Multiple Conditions#

As a final example, let’s filter the least terns data with multiple conditions. We’ll get all rows for 2023 where there were at least 10 fledglings reported. We’ll use the fl_min column for the minimum reported fledgling count. Here’s the call to .filter:

terns.filter(
    (pl.col("year") == 2023) &
    (pl.col("fl_min") > 10)
)

shape: (14, 43)

year	site_name	site_name_2013_2018	site_name_1988_2001	site_abbr	region_3	region_4	event	bp_min	bp_max	fl_min	fl_max	total_nests	nonpred_eggs	nonpred_chicks	nonpred_fl	nonpred_ad	pred_control	pred_eggs	pred_chicks	pred_fl	pred_ad	pred_pefa	pred_coy_fox	pred_meso	pred_owlspp	pred_corvid	pred_other_raptor	pred_other_avian	pred_misc	total_pefa	total_coy_fox	total_meso	total_owlspp	total_corvid	total_other_raptor	total_other_avian	total_misc	first_observed	last_observed	first_nest	first_chick	first_fledge
i64	str	str	str	str	str	str	str	f64	f64	i64	i64	i64	i64	i64	i64	i64	cat	i64	i64	i64	i64	cat	cat	cat	cat	cat	cat	cat	cat	i64	i64	i64	i64	i64	i64	i64	i64	str	str	str	str	str
2023	"ALAMEDA POINT"	"Alameda Point"	"NA_2013_2018 POLYGON"	"ALAM_PT"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	169.0	339.0	109	131	339	112	56	5	0	"Y"	null	null	null	null	"Y"	"N"	"N"	"N"	"Y"	"N"	"Y"	"N"	null	null	null	null	null	null	null	null	"2023-04-28"	"2023-09-09"	"2023-06-17"	"2023-07-09"	null
2023	"HAYWARD REGIONAL SHORELINE"	"Hayward Regional Shoreline"	"NA_2013_2018 POLYGON"	"HAY_REG_SHOR"	"S.F._BAY"	"S.F._BAY"	"LA_NINA"	88.0	143.0	127	130	144	17	1	0	0	"Y"	null	null	null	null	"Y"	"N"	"Y"	"N"	"N"	"N"	"Y"	"Y"	null	null	null	null	null	null	null	null	"2023-05-01"	"2023-09-11"	"2023-05-30"	"2023-06-17"	null
2023	"OCEANO DUNES STATE VEHICULAR R…	"Oceano Dunes State Vehicular R…	"NA_2013_2018 POLYGON"	"OCEANO_DUNES"	"CENTRAL"	"CENTRAL"	"LA_NINA"	40.0	42.0	35	35	42	8	2	0	0	"Y"	null	null	null	null	"Y"	"N"	"N"	"N"	"N"	"N"	"N"	"N"	null	null	null	null	null	null	null	null	"2023-05-07"	"2023-08-20"	"2023-06-01"	"2023-06-19"	null
2023	"VANDENBERG SFB"	"Vandenberg AFB"	"NA_2013_2018 POLYGON"	"VAN_SFB"	"CENTRAL"	"CENTRAL"	"LA_NINA"	33.0	39.0	17	17	42	3	28	0	0	"Y"	null	null	null	null	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"N"	null	null	null	null	null	null	null	null	"2023-05-11"	"2023-08-26"	"2023-06-01"	"2023-06-25"	null
2023	"NBVC POINT MUGU"	"NBVC Point Mugu"	"NA_2013_2018 POLYGON"	"PT_MUGU"	"SOUTHERN"	"VENTURA"	"LA_NINA"	168.0	177.0	51	100	188	40	3	0	1	"Y"	null	null	null	null	"N"	"Y"	"N"	"Y"	"N"	"N"	"N"	"Y"	null	null	null	null	null	null	null	null	"2023-05-03"	"2023-09-02"	"2023-05-21"	"2023-06-14"	null
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2023	"BATIQUITOS LAGOON ECOLOGICAL R…	"Batiquitos Lagoon"	"NA_2013_2018 POLYGON"	"BLER"	"SOUTHERN"	"SOUTHERN"	"LA_NINA"	205.0	218.0	77	89	228	67	94	7	0	"Y"	null	null	null	null	"Y"	"N"	"N"	"Y"	"N"	"Y"	"N"	"Y"	null	null	null	null	null	null	null	null	"2023-04-27"	"2023-08-03"	"2023-05-11"	"2023-06-05"	null
2023	"SAN DIEGUITO LAGOON ECOLOGICAL…	"San Diequito Lagoon"	"NA_2013_2018 POLYGON"	"SANDIEGU_LAG"	"SOUTHERN"	"SOUTHERN"	"LA_NINA"	34.0	47.0	16	22	47	0	0	0	0	"Y"	null	null	null	null	"Y"	"Y"	"N"	"Y"	"N"	"Y"	"N"	"N"	null	null	null	null	null	null	null	null	"2023-05-24"	"2023-09-02"	"2023-05-24"	null	null
2023	"MISSION BAY FAA ISLAND"	"FAA Island"	"NA_2013_2018 POLYGON"	"MB_FAA"	"SOUTHERN"	"SOUTHERN"	"LA_NINA"	141.0	144.0	44	48	156	50	36	4	0	"Y"	null	null	null	null	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	null	null	null	null	null	null	null	null	"2023-04-23"	"2023-08-17"	"2023-05-10"	"2023-05-28"	null
2023	"NAVAL AMPHIBIOUS BASE CORONADO"	"Naval Base Coronado"	"NA_2013_2018 POLYGON"	"NAB"	"SOUTHERN"	"SOUTHERN"	"LA_NINA"	596.0	644.0	90	128	717	329	185	6	6	"Y"	null	null	null	null	"N"	"N"	"N"	"N"	"Y"	"N"	"Y"	"Y"	null	null	null	null	null	null	null	null	"2023-04-22"	"2023-09-09"	"2023-05-07"	"2023-05-31"	null
2023	"TIJUANA ESTUARY NERR"	"Tijuana Estuary"	"NA_2013_2018 POLYGON"	"TJ_RIV"	"SOUTHERN"	"SOUTHERN"	"LA_NINA"	144.0	165.0	35	35	171	65	44	1	1	"Y"	null	null	null	null	"N"	"N"	"N"	"N"	"N"	"N"	"Y"	"Y"	null	null	null	null	null	null	null	null	"2023-04-26"	"2023-08-28"	"2023-05-12"	"2023-06-10"	null

This gives us 14 sites with at least 10 fledglings in 2023. Notice that we had to put the conditions in parentheses () so that Python gets the order of operations right.

2.7. Exercises#

2.7.1. Exercise#

Python’s range function offers another way to create a sequence of numbers. Read the help file for this function.

Create an example range. How does this differ from a list?
Describe the three arguments that you can use in range. Give examples of each.
Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?

2.7.2. Exercise#

Return to the discussion in Coercion & Casting.

Why does "3" + 4 raise an error?
Why does True - 1 return 0?
Why does int(4.6) < 4.6 return True?

2.7.3. Exercise#

Create a new data frame from the least terns data with the following characteristics:
- Each entry’s year is between 2010 and 2019 (inclusive).
- Each entry reports at least 100 breeding pairs.
- The columns are year, site_name, bp_min, bp_max, total_nests.
Use this data frame for the remaining questions.
Count the number of entries for each site. How many sites have at least 100 breeding pairs across all 10 years?
Which site-year combination has the highest number of nests?

Data Types & Structures

Contents

2. Data Types & Structures#

2.1. Data Types#

2.1.1. Coercion & Casting#

2.2. Lists#

2.2.1. Indexing#

2.2.2. References#

2.2.3. Comprehensions#

2.3. Series#

2.3.1. Creating Series#

2.3.2. Data Types in Polars#

2.3.3. Categorical Data#

2.3.4. Broadcasting#

2.4. Other Data Structures#

2.4.1. Built-in Data Structures#

2.4.2. NumPy Arrays#

2.5. Special Values#

2.5.1. None#

2.5.2. Missing Values#

2.5.3. Infinity#

2.5.4. Not a Number#

2.6. Data Frames#

2.6.1. Selecting Columns#

2.6.2. Filtering Rows#

2.6.2.1. Logic Operators#

2.6.2.2. Multiple Conditions#

2.7. Exercises#

2.7.1. Exercise#

2.7.2. Exercise#

2.7.3. Exercise#