1. Date & String Processing#

Learning Goals

After this lesson, you should be able to:

  • Explain why we use special data structures for dates & times

  • Identify the correct data structure for a given date/time

  • Use Polars and the datetime module to parse dates

  • Use the date format string mini-language

  • Use escape codes in strings to represent non-keyboard characters

  • Explain what a text encoding is

  • Use Polars to detect, extract, and change patterns in strings

  • Use the regular expressions mini-language

This chapter is part 1 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to convert dates and times into appropriate data types and how to extract and clean up data from strings (including numbers with non-numeric characters such as $, %, and ,).

1.1. Parsing Dates & Times#

Given a data set with dates or times, you might want to:

  • Sort the observations chronologically

  • Compute the times between observations

  • Add or subtract a time offset

  • Get just the month, year, hour, etc. of each observation

You can do all of these things and more with Polars, but only if your dates and times are represented by appropriate data types. Many popular file formats for data, such as CSV, do not store metadata about which values are dates and times, so these values get typed as strings. Dates and times extracted from text are also naturally strings.

The first step to working with dates and times typed as strings is to parse them: break them down into their components and cast them to more suitable data types. To demonstrate parsing dates, consider a series of date strings:

import polars as pl

date_strings = pl.Series(["Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"])
date_strings
shape: (3,)
str
"Jan 10, 2021"
"Sep 3, 2018"
"Feb 28, 1982"

We can use Polars to parse the strings into dates. For series (including columns), string methods are listed under the .str namespace, and the .str.to_date method parses strings as dates. In some cases, the method can infer the format of the dates automatically, but that’s not the case for our dates:

date_strings.str.to_date()
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[3], line 1
----> 1 date_strings.str.to_date()

File ~/mill/datalab/teaching/intermediate_python/.pixi/envs/default/lib/python3.13/site-packages/polars/series/utils.py:106, in call_expr.<locals>.wrapper(self, *args, **kwargs)
    104     expr = getattr(expr, namespace)
    105 f = getattr(expr, func.__name__)
--> 106 return s.to_frame().select_seq(f(*args, **kwargs)).to_series()

File ~/mill/datalab/teaching/intermediate_python/.pixi/envs/default/lib/python3.13/site-packages/polars/dataframe/frame.py:9657, in DataFrame.select_seq(self, *exprs, **named_exprs)
   9634 def select_seq(
   9635     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   9636 ) -> DataFrame:
   9637     """
   9638     Select columns from this DataFrame.
   9639 
   (...)   9655     select
   9656     """
-> 9657     return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)

File ~/mill/datalab/teaching/intermediate_python/.pixi/envs/default/lib/python3.13/site-packages/polars/lazyframe/frame.py:2065, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _check_order, _eager, **_kwargs)
   2063 # Only for testing purposes
   2064 callback = _kwargs.get("post_opt_callback", callback)
-> 2065 return wrap_df(ldf.collect(callback))

ComputeError: could not find an appropriate format to parse dates, please define a format

We need to supply the method with a format string that describes the format of the dates. In a format string, a percent sign % followed by a character is called a specifier and represents a component of a date or time. Here are some useful specifiers:

Specifier   Description        Example (2015-01-29 21:32:55)
%Y          4-digit year       2015
%m          2-digit month      01
%d          2-digit day        29
%H          2-digit hour       21
%M          2-digit minute     32
%S          2-digit second     55
%%          literal %          %
%y          2-digit year       15
%B          full month name    January
%b          short month name   Jan

Note

Most programming languages use format strings to parse and format dates and times. There isn’t a standard, but the idea seems to have originated with the strptime and strftime functions in the C programming language. Polars uses the same specifiers as Rust’s chrono crate.

Our date strings have a short month name (%b), a 2-digit day (%d), and a 4-digit year (%Y). The format string must match the format of the dates exactly, so we also need to include whitespace and punctuation. Thus the format string is %b %d, %Y. Use this to parse the dates:

dates = date_strings.str.to_date("%b %d, %Y")
dates
shape: (3,)
date
2021-01-10
2018-09-03
1982-02-28

Polars represents dates with the Date type, which you can see by inspecting the series’ .dtype attribute:

dates.dtype
Date

Polars can also parse times and datetimes (which combine a date and a time). The respective methods are .str.to_time and .str.to_datetime. Let’s try parsing an unusual time format:

time_string = pl.Series(["6 minutes, 32 seconds after 10 o'clock"])
time = time_string.str.to_time("%M minutes, %S seconds after %H o'clock")
time
shape: (1,)
time
10:06:32

Since this is a time, the data type is Time:

time.dtype
Time

Polars is compatible with Python’s built-in datetime module, which provides similar data types and functions for working with dates and times. The equivalents of Polars’ Date, Time, and Datetime types are Python’s date, time, and datetime types. In fact, Polars returns these types if you access the elements of a series individually. For example:

type(time[0])
datetime.time

The datetime module is also the recommended way to create literal date and time values for Polars. For instance, you can use the dt.datetime function to explicitly create datetimes for a series:

import datetime as dt

events = pl.Series([
    dt.datetime(year = 2023, month = 1, day = 10, hour = 8, minute = 3),
    dt.datetime(2002, 8, 16, 14, 59)
])
events
shape: (2,)
datetime[μs]
2023-01-10 08:03:00
2002-08-16 14:59:00

The module also provides dt.date and dt.time functions to create dates and times.

Caution

The similarly-named Polars functions pl.date, pl.time, and pl.datetime only create dates and times in calls to .select or .with_columns. If you aren’t calling .select or .with_columns, make sure to use the functions from the datetime module!

1.1.1. Extracting Components#

Once you have dates or times, you might want to extract components from them. For instance, suppose we want to get the years from the dates we parsed earlier. For series (including columns), date and time methods are listed under the .dt namespace. There are methods for a variety of different date and time components. For instance, the .dt.year method gets the year:

dates.dt.year()
shape: (3,)
i32
2021
2018
1982

Similarly, the .dt.month method gets the month:

dates.dt.month()
shape: (3,)
i8
1
9
2

The Polars documentation provides a complete list of .dt methods.

Note

Python’s date and time types (from the datetime module) expose their components as attributes rather than methods, and don’t use the .dt prefix.

For example, you can get the year of a date called x with x.year. No parentheses () are necessary, since .year is an attribute.

See the documentation for the datetime module for more details.
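For example, with a date from the datetime module:

```python
import datetime as dt

x = dt.date(2021, 1, 10)

# Components are attributes, not methods, so no parentheses:
print(x.year, x.month, x.day)
```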

1.1.2. Durations#

Occasionally, you might need to adjust dates or times by adding an offset. For example, suppose we want to add 30 days to the dates from earlier. The dt.timedelta function creates a timedelta object, which represents a fixed period of time. So to add 30 days:

dates + dt.timedelta(days = 30)
shape: (3,)
date
2021-02-09
2018-10-03
1982-03-30

What if we want to add 1 month to the dates instead of 30 days? The length of a month varies, so we can’t use a timedelta. Polars provides a solution: the .dt.offset_by method. The method takes a string argument that specifies the period of time by which to offset. The string for 1 month is "1mo", so:

dates.dt.offset_by("1mo")
shape: (3,)
date
2021-02-10
2018-10-03
1982-03-28

This increments each date by exactly 1 month, regardless of how many days that is. The documentation for .dt.offset_by describes how to specify other periods of time.

1.1.3. Case Study: CA Parks & Recreation Fleet#

The government of California publishes data about its fleet of vehicles on the California Open Data portal. As of March 2025, the data set includes all non-confidential vehicles owned by agencies from 2015-2023. We’ll use a subset of this data to compute how many vehicles the CA Department of Parks and Recreation purchased each year from 2019 to 2023. The data set is published as a messy CSV, so we’ll need to do some cleaning and parse the dates in order to use it.

Important

Click here to download the CA Parks & Recreation Fleet data set.

If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ subdirectory. Download and save the data set in the data/ subdirectory.

To get started, read the data set from wherever you saved it:

fleet = pl.read_csv("data/2015-2023_ca_parks_fleet.csv")
fleet.head()
shape: (5, 34)
agencyreport_yeardisposedequipment_numberasset_categorymodel_yearmake_modelpostal_codeasset_typeweight_classpassenger_vehiclepayload_ratingshipping_weightwheel_typetire_sizefuel_typeengine_configurationemissions_type_codeprimary_applicationsecondary_applicationacquisition_delivery_datesuv_justificationpickup_4_by_4_justificationacquisition_methodpurchase_priceannual_lease_rateacquisition_mileagedisposition_datetransferred_todisposition_methoddisposition_reasondisposition_mileagedisposition_sold_amounttotal_miles
stri64strstrstri64stri64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstri64strstrstrstrstrstri64i64str
"Parks and Recreation, Departme…2015"No""93462""Ground"1976"TUCKER 1643"96142"Construction Equipment"null"No"nullnullnullnull"Diesel""Dedicated"null"Maintenance of public faciliti…null"1/1/2001"nullnull"Purchase"2000null"Null "nullnullnullnullnullnullnull
"Parks and Recreation, Departme…2015"No""100380""Ground"1979"FLEETWOOD HOMETTE"94954"Non-Self Propelled"null"Yes"nullnullnullnullnullnullnull"Maintenance of public faciliti…null"1/1/1979"nullnull"Purchase"11483null"Null "nullnullnullnullnullnullnull
"Parks and Recreation, Departme…2015"Yes""117844""Ground"2001"MACHETE CHIPPER"91302"Non-Self Propelled"null"Yes"nullnullnullnull"Gasoline""Dedicated"null"Maintenance of public faciliti…null"1/31/2001"nullnull"Purchase"26870null"Null ""6/7/2018"nullnullnullnullnullnull
"Parks and Recreation, Departme…2015"No""115503""Ground"1987"KIT OFFICE TRAILER"95430"Non-Self Propelled"null"Yes"nullnullnullnullnullnullnull"Maintenance of public faciliti…null"1/1/2001"nullnull"Purchase"10197null"Null "nullnullnullnullnullnullnull
"Parks and Recreation, Departme…2015"No""3554""Ground"1960"MAVERICK MFG CORP WELDER"95430"Non-Self Propelled""Light Duty""Yes"nullnullnullnullnullnullnull"Maintenance of public faciliti…null"1/1/2001"nullnull"Purchase"500null"Null "nullnullnullnullnullnullnull

Since we want to compute the number of acquisitions each year, we’ll focus on two columns: acquisition_method and acquisition_delivery_date. The acquisition_method column indicates whether the vehicle was purchased, donated, transferred from a different agency, or something else. We’re only interested in purchases, so we can filter the others out:

fleet = fleet.filter(pl.col("acquisition_method") == "Purchase")

In order to determine how many vehicles were purchased each year, we need to extract the year from acquisition_delivery_date. Let’s take a look at this column:

fleet["acquisition_delivery_date"].head()
shape: (10,)
acquisition_delivery_date
str
"1/1/2001"
"1/1/1979"
"1/31/2001"
"1/1/2001"
"1/1/2001"
"1/1/1991"
"1/1/2003"
"1/1/1998"
"1/1/1990"
"1/1/1997"

It looks like the dates are in month-day-year format. The format string for this is "%m/%d/%Y", so:

fleet.select(
    pl.col("acquisition_delivery_date").str.to_date("%m/%d/%Y")
)
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[17], line 1
----> 1 fleet.select(
      2     pl.col("acquisition_delivery_date").str.to_date("%m/%d/%Y")
      3 )

File ~/mill/datalab/teaching/intermediate_python/.pixi/envs/default/lib/python3.13/site-packages/polars/dataframe/frame.py:9632, in DataFrame.select(self, *exprs, **named_exprs)
   9532 def select(
   9533     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   9534 ) -> DataFrame:
   9535     """
   9536     Select columns from this DataFrame.
   9537 
   (...)   9630     └──────────────┘
   9631     """
-> 9632     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File ~/mill/datalab/teaching/intermediate_python/.pixi/envs/default/lib/python3.13/site-packages/polars/lazyframe/frame.py:2065, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _check_order, _eager, **_kwargs)
   2063 # Only for testing purposes
   2064 callback = _kwargs.get("post_opt_callback", callback)
-> 2065 return wrap_df(ldf.collect(callback))

InvalidOperationError: conversion from `str` to `date` failed in column 'acquisition_delivery_date' for 365 out of 627 values: ["44832", "44462", … "42628"]

You might want to try:
- setting `strict=False` to set values that cannot be converted to `null`
- using `str.strptime`, `str.to_date`, or `str.to_datetime` and providing a format string

Uh-oh. According to the error message from Polars, some of the date values don’t follow the month-day-year format. Instead, they’re numbers like 44832, 44462, and 42628. These numbers probably seem inscrutable, but there is an explanation for them: Microsoft Excel, perhaps the most popular tool for working with tabular data, stores dates by counting days from 31 December 1899. So 1 is 1 January 1900, 32 is 1 February 1900, and so on. Unfortunately, the Excel developers incorrectly assumed 1900 was a leap year, so all of the counts were off by 1 after 28 February 1900. Most modern Excel-compatible spreadsheet programs fix this by counting from 30 December 1899 rather than 31 December, so that only dates before 1 March 1900 have different numbers. We can convert the numbers in the acquisition_delivery_date column into dates by treating them as day offsets from 30 December 1899.
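We can sanity-check this with the datetime module: adding one of the mystery numbers as a day offset to 30 December 1899 should land on a plausible delivery date:

```python
import datetime as dt

excel_epoch = dt.date(1899, 12, 30)

# Excel serial 44832, interpreted as days after the epoch:
print(excel_epoch + dt.timedelta(days=44832))  # 2022-09-28
```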

Transforming a series in two different ways at once is not (yet) an easy task with Polars, but it is possible. We’ll do it by splitting the series into two series: one for the month-day-year dates and one for the Excel dates. The former are easy to identify because they contain slashes (/). We can use the .str.contains method, which tests whether a string contains a pattern, to test for slashes. So a condition for the month-day-year dates of the acquisition_delivery_date column is:

is_mdy = pl.col("acquisition_delivery_date").str.contains("/")

We can use the condition with the pl.when function to separate the month-day-year dates from the Excel dates. A pl.when(...).then(...) expression returns the value specified in .then wherever the condition is true and returns null otherwise. So to create a new data frame dates with columns mdy and excel for the month-day-year and Excel dates, respectively:

dates = fleet.select(
    mdy = pl.when(is_mdy).then(pl.col("acquisition_delivery_date")),
    excel = pl.when(~is_mdy).then(pl.col("acquisition_delivery_date"))
)
dates.head()
shape: (5, 2)
mdy          excel
str          str
"1/1/2001"   null
"1/1/1979"   null
"1/31/2001"  null
"1/1/2001"   null
"1/1/2001"   null

We can convert the mdy column to dates with .str.to_date and the "%m/%d/%Y" format string from earlier.

To convert the excel column to dates, we have to start with the literal date 30 December 1899 and offset it by the value of the column. We can use dt.date to create the date and .dt.offset_by to offset it. The code to transform both the mdy and excel columns is:

start_date = dt.date(1899, 12, 30)

dates = dates.with_columns(
    mdy = pl.col("mdy").str.to_date("%m/%d/%Y"),
    excel = pl.lit(start_date).dt.offset_by(pl.col("excel") + "d")
)
dates.head()
shape: (5, 2)
mdy         excel
date        date
2001-01-01  null
1979-01-01  null
2001-01-31  null
2001-01-01  null
2001-01-01  null

We can combine the two columns by replacing all of the null values in mdy with values from excel. This becomes the new acquisition_delivery_date column:

fleet = fleet.with_columns(
    acquisition_delivery_date = dates["mdy"].fill_null(dates["excel"])
)
fleet.head()
shape: (5, 34)
[5-row preview of 34 columns, too wide to display cleanly; acquisition_delivery_date is now type date, with values like 2001-01-01 and 1979-01-01]

With the dates in hand, we can compute the number of purchases from 2019 to 2023. Since each row in the data set represents a vehicle-year, and we just want to count vehicles, we must first get a subset of unique vehicles. We can do this by grouping on the equipment_number column, which is a unique identifier for each vehicle:

vehicles = fleet.group_by("equipment_number").first()

We can compute the number of purchases by extracting the years with the .dt.year method and then counting them with the .value_counts method:

year = vehicles["acquisition_delivery_date"].dt.year().alias("year")
year.value_counts().sort("year").tail()
shape: (5, 2)
year  count
i32   u32
2019  345
2020  254
2021  189
2022  227
2023  192

It’s unclear whether the data set contains complete records for 2023, so the count for that year might be too small. It’s also likely that the agency purchases vehicles months or years in advance of their delivery, so these counts are more accurately described as deliveries rather than purchases.

1.2. String Fundamentals#

Strings represent text, but even if your data sets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), a panoply of markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats, sometimes Python will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself.

This section introduces fundamental concepts for computing on strings. The next section, Section 1.3, describes how to find, extract, and replace patterns in strings.

1.2.1. Escape Sequences#

In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:

  1. Write quotes or backslashes within a string

  2. Write characters that don’t appear on your keyboard (for example, characters in a foreign language)

For example, the escape sequence \n corresponds to the newline character. Notice that the print function prints \n as a literal new line:

x = "Hello\nNick"
print(x)
Hello
Nick

On the other hand, when you let Python print a string or other object automatically (without explicitly calling print), it prints a programmer-friendly representation. For strings, the representation shows escape codes:

x
'Hello\nNick'

This makes it easy to identify characters that are not normally visible, such as whitespace, and also makes it easy to copy the string. You can use the built-in repr function to get a representation for any object.

Important

Make sure to call print explicitly if you want to see what a string actually looks like.

As another example, suppose we want to put literal quotation marks in a string. We can either enclose the string in the other kind of quotation marks (single versus double), or escape the quotation marks in the string:

x = 'She said, "Hi"'
print(x)

y = "She said, \"Hi\""
print(y)
She said, "Hi"
She said, "Hi"

Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:

x = "\\"
print(x)
\

The Python documentation provides a complete list of escape sequences. Other programming languages also use escape sequences, and many of them are the same.

Tip

You can mark a string as raw by prefixing it with r or R (lowercase r is more common) to make Python ignore escape sequences. For example, r"backslashes \ are okay" is a raw string.

Raw strings are especially useful for writing regular expressions (Section 1.3).

A quirk of raw strings is that they can’t end with an odd number of backslashes, since the next character after a backslash is treated as part of the string.
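For example, a raw string and an escaped string can spell out the same Windows-style path:

```python
# In a normal string, each literal backslash must be escaped
# (otherwise \t and \n would be a tab and a newline)...
normal = "C:\\temp\\new"

# ...but in a raw string, backslashes are kept as-is.
raw = r"C:\temp\new"

print(normal == raw)  # True
print(raw)
```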

1.2.2. Character Encodings#

Computers store data as numbers. In order to store text on a computer, people have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97.

Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).

Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that implement Unicode. The most popular of these is UTF-8.

Python 3 uses Unicode by default. You can write Unicode characters with the escape sequence \u (or \U) followed by a 4-digit (or 8-digit) number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So you can write an a as:

"\u0061"
'a'

Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So "\U0001f408" is the cat emoji:

"\U0001f408"
'🐈'
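The built-in ord and chr functions convert between characters and their Unicode numbers, which is handy for checking these escape sequences:

```python
# ord maps a character to its Unicode number; chr goes the other way.
print(ord("a"))        # 97
print(hex(ord("🐈")))  # 0x1f408
print(chr(0x1F408))    # 🐈
```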

Tip

Whether you can see a Unicode character depends on whether the current font has a glyph (image representation) for that character. Most fonts only have glyphs for a few languages.

Make sure to use a font with good Unicode coverage if you love emoji or expect to work with many different languages. The Noto Fonts project aims to create a collection of fonts with a common style and complete language coverage. The Nerd Fonts project patches fonts commonly used for programming so that they have better coverage of symbols.

Note

If you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.

Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read an ISO-8859-1 encoded CSV file with Polars is:

pl.read_csv("my_data.csv", encoding = "iso-8859-1")

Other reader functions might have a different parameter for the encoding, so always check the documentation.
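You can see how an encoding mismatch produces gibberish by round-tripping a string through bytes yourself. This sketch encodes a string as UTF-8 and then decodes the bytes with the wrong encoding:

```python
text = "café"

# Encode to bytes with UTF-8; 'é' becomes the two bytes 0xC3 0xA9.
data = text.encode("utf-8")
print(data)

# Decoding with the correct encoding recovers the text...
print(data.decode("utf-8"))

# ...but decoding with the wrong one garbles it.
print(data.decode("iso-8859-1"))  # cafÃ©
```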

1.3. Processing Strings#

String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on). If you work with data, sooner or later you’ll run into a data set in text format that needs a few corrections, and for that you’ll find familiarity with string processing invaluable.

1.3.1. Strings in Polars#

We recommend using Polars for most string processing, although Python’s built-in str data type and re module are also excellent options. Polars is slightly more convenient and concise for handling many strings at once (in a data frame or series), and it is also more efficient in some cases.

See also

The Polars user guide has a section about strings.

For series (including columns), string methods are listed under the .str namespace. For example, the .str.contains method tests whether a pattern appears within a string. The method returns True if the pattern is found and False if it isn’t. Try it out on a sample series:

words = pl.Series(["help", "kelp", "grow", "tall"])
words.str.contains("el")
shape: (4,)
bool
true
true
false
false
words.str.contains("g")
shape: (4,)
bool
false
false
true
false

As another example, the .str.slice method extracts a substring from a string, given the substring’s position and length. So to extract 9 characters beginning at position 5:

quotes = pl.Series([
    "It's dangerous to go alone!", "Tombs with piped in music."
])

quotes.str.slice(5, 9)
shape: (2,)
str
"dangerous"
" with pip"

The .str.slice method is especially useful for extracting data from strings that have a fixed width.

There are a lot of methods in the .str namespace, but five that we’ll use are especially important:

  • .str.contains, to test whether a string contains a pattern

  • .str.slice, to extract a substring at a given position

  • .str.replace, to replace or remove part of a string

  • .str.split, to split a string into parts

  • .str.extract_groups to extract parts of a string that match a pattern

You can find a complete list of methods with examples in the Polars API reference.

1.3.2. Regular Expressions#

Many of Polars’ .str methods use a special language called regular expressions or regex to describe patterns in strings. Python’s re module and many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a valuable, transferable skill.

You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning.

This table lists some of the most useful metacharacters:

Metacharacter   Meaning
.               any one character (wildcard)
\               escape character (in both Python and regex); see Section 1.2.1
^               the beginning of the string (not a character)
$               the end of the string (not a character)
[ab]            one character, either 'a' or 'b'
[^ab]           one character, anything except 'a' or 'b'
?               the previous character appears 0 or 1 times
*               the previous character appears 0 or more times
+               the previous character appears 1 or more times
()              make a group
|               match left OR right side (not a character)

Section 1.4 provides examples of how most of the metacharacters work.

See also

You can find even more examples and a complete listing of regex metacharacters in the documentation for the re module and the documentation for Rust’s regex crate. The former is more approachable if you’re not comfortable reading Rust code, but the latter is what Polars actually uses. Differences between the two are minor.

Polars methods that support regex treat their pattern argument as a regular expression by default. You can disable this by setting the literal parameter to True. For example, to test whether a string contains a literal dot .:

x = pl.Series(["No dot", "Lotsa dots..."])
x.str.contains(".", literal = True)
shape: (2,)
bool
false
true

Tip

Make it a habit to set literal = True anytime you use a pattern that doesn’t contain regex metacharacters. Doing so communicates to the reader that you’re not using regex, prevents bugs, and makes your code more efficient.

1.3.3. Replacing Parts of Strings#

Replacing part of a string is a common string processing task. For instance, quantitative data often contain non-numeric characters such as commas, currency symbols, and percent signs. These must be removed before converting to numeric data types. Replacement and removal go hand-in-hand, since removal is equivalent to replacing part of a string with the empty string "".

The .str.replace method replaces the first part of a string that matches a pattern (from left to right), while the related .str.replace_all method replaces every part of a string that matches a pattern. Several Polars .str methods that do pattern matching come in a pair like this: one to process only the first match and one to process every match.

As an example, suppose you want to remove commas from a series of numbers (in strings) so that you can cast them to a numeric data type. You want to remove all of the commas, so .str.replace_all is the method to use. The first argument is the pattern and the second argument is the replacement. In this case, the pattern is "," and the replacement is the empty string "". So the code to remove the commas and cast the numbers is:

x = pl.Series(["1,000,000", "525,600", "42"])
x.str.replace_all(",", "", literal = True).cast(pl.Int64)
shape: (3,)
i64
1000000
525600
42

The .str.replace method wouldn’t work as well for this task, since it only replaces the first match to the pattern:

x.str.replace(",", "", literal = True)
shape: (3,)
str
"1000,000"
"525600"
"42"

You can also use these methods to replace or remove longer patterns within words. For instance, suppose you want to change the word "dog" to "cat":

x = pl.Series(["dogs are great, dogs are fun", "dogs are fluffy"])
x.str.replace("dog", "cat", literal = True)
shape: (2,)
str
"cats are great, dogs are fun"
"cats are fluffy"
x.str.replace_all("dog", "cat", literal = True)
shape: (2,)
str
"cats are great, cats are fun"
"cats are fluffy"

As a final example, you can use the replacement methods and a regex pattern to replace repeated spaces with a single space. This is a good standardization step if you’re working with text. The key is to use the regex quantifier +, which means a character “repeats one or more times” in the pattern, and to use a single space " " as the replacement:

x = pl.Series(["This    sentence  has  extra      space."])
x.str.replace_all(" +", " ")
shape: (1,)
str
"This sentence has extra space."

If you just want to remove all whitespace (or other characters) from the beginning and end of a string, you can use the .str.strip_chars method instead.

1.3.4. Splitting Strings#

Distinct data in a text are generally separated by a character like a space or a comma, to make them easy for people to read. Often these separators also make the data easy to parse with Polars. The idea is to split the string at each separator, so that each piece becomes a separate value.

The .str.split method splits a string at each match to a pattern. The matching characters—that is, the separators—are discarded.

Important

The .str.split method and Polars’ other methods for splitting strings don’t support regular expressions yet (but this is a planned feature).

For example, suppose you want to split some sentences into words:

sentences = pl.Series([
    "The wonderful, wonderful cat!",
    "Who is this duke of zill anyway?"
])

words = sentences.str.split(" ")
words
shape: (2,)
list[str]
["The", "wonderful,", … "cat!"]
["Who", "is", … "anyway?"]

The .str.split method returns the splits for each string in a list. You can use the .list.get method to get all of the splits at a particular position:

words.list.get(3)
shape: (2,)
str
"cat!"
"duke"

When the lists have different lengths, you can set null_on_oob = True in .list.get to request a position that's beyond the end of some of the lists without raising an error. The method returns null for any list shorter than the requested position. For example:

words.list.get(5, null_on_oob = True)
shape: (2,)
str
null
"zill"

Note

The .list namespace has other methods for working with series of lists as well.

Tip

When you know exactly how many splits you expect a string to have, use the .str.splitn method instead of .str.split. It requires a second argument for the maximum number of splits. It returns the results in structs (fixed-length data structures) rather than lists. You can expand the structs into columns with the .struct.unnest method.

Specifying the number of splits can help prevent bugs in your code and make splitting more efficient. In some cases it’s also more convenient to work with a data frame (from chaining .struct.unnest) than a series of lists.

1.3.5. Extracting Matches#

Occasionally, you might need to extract parts of a string in a more complicated way than string splitting allows. One solution is to write a regular expression that will match all of the data you want to capture, with parentheses ( ), the regex metacharacter for a group, around each distinct value. Then you can use the .str.extract_groups method to extract the groups. Section 1.4.6 presents some examples of regex groups.

For example, suppose you want to split an email address into three parts: the user name, the domain name, and the top-level domain. To create a regular expression that matches email addresses, you can use the @ and . in the address as anchors. The surrounding characters are generally alphanumeric, which you can represent with the “word” metacharacter \w:

\w+@\w+[.]\w+

Next, put parentheses ( ) around each part that you want to extract:

(\w+)@(\w+)[.](\w+)

Finally, use this pattern in .str.extract_groups:

x = pl.Series(["datalab@ucdavis.edu"])

# Note this is a raw string:
pattern = r"(\w+)@(\w+)[.](\w+)"

result = x.str.extract_groups(pattern)
result
shape: (1,)
struct[3]
{"datalab","ucdavis","edu"}

The method returns a series of structs (fixed-length data structures) with one field for each group. You can use the .struct.unnest method to expand the result into a data frame with a column for each group:

result.struct.unnest()
shape: (1, 3)
1            2            3
str          str          str
"datalab"    "ucdavis"    "edu"

Note

The .struct namespace has other methods for working with series of structs as well.

The pattern in this example doesn’t work for all possible email addresses, since user names can contain dots and other characters that are not alphanumeric. You could generalize the pattern if necessary. The point is that the .str.extract_groups method and regex groups provide an extremely flexible way to extract data from strings.

1.4. Regular Expression Examples#

Important

This section is intended as a reference and is not taught in the workshop.

This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the .str.extract_all method, which extracts all matches to a pattern.

The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.

1.4.1. The Wildcard#

The regex wildcard character is . and matches any single character. For example:

x = pl.Series(["dog"])
x.str.extract_all("d.g")
shape: (1,)
list[str]
["dog"]

Regex searches a string from left to right, so a pattern consisting of just the wildcard matches each character in turn:

x.str.extract_all(".")
shape: (1,)
list[str]
["d", "o", "g"]

1.4.2. Escape Sequences#

Like Python, regular expressions can contain escape sequences that begin with a backslash. These are processed separately from Python escape sequences, and after them. The main use for escape sequences in regex is to turn a metacharacter into a literal character.

For example, suppose you want to match a literal dot .. The regex for a literal dot is \.. Since backslashes in Python strings have to be escaped, the Python string for this regex is "\\.". For example:

pl.Series(["this.string"]).str.extract_all("\\.")
shape: (1,)
list[str]
["."]

The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in Python (because backslash is also the Python escape character). So to match a single literal backslash, the code is:

pl.Series(["this\\that"]).str.extract_all("\\\\")
shape: (1,)
list[str]
["\"]

Raw strings (see raw-strings) make regular expressions easier to read, because they make backslashes literal (but they still mark the beginning of an escape sequence in regex). You can use a raw string to write the above as:

pl.Series([r"this\that"]).str.extract_all(r"\\")
shape: (1,)
list[str]
["\"]

1.4.3. Anchors#

By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.

The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern.

For example, suppose you want to match an a at the beginning of the string:

x = pl.Series(["abc", "cab"])

x.str.extract_all("a")
shape: (2,)
list[str]
["a"]
["a"]
x.str.extract_all("^a")
shape: (2,)
list[str]
["a"]
[]

It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.

Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern.

1.4.4. Character Classes#

In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern.

For example, suppose you want to match c followed by either a or t:

x = pl.Series(["ca", "ct", "cat", "cta"])

x.str.extract_all("c[ta]")
shape: (4,)
list[str]
["ca"]
["ct"]
["ca"]
["ct"]

You can use a dash - in a character class to create a range. For example, to match letters p through z:

x.str.extract_all("c[p-z]")
shape: (4,)
list[str]
[]
["ct"]
[]
["ct"]

Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].

Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.

A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c:

pl.Series(["abcdef"]).str.extract_all("[^abc]")
shape: (1,)
list[str]
["d", "e", "f"]

1.4.5. Quantifiers#

Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.

For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example:

x = pl.Series(["abc", "ab", "ac", "abbc"])

x.str.extract_all("ab?c")
shape: (4,)
list[str]
["abc"]
[]
["ac"]
[]

The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance:

x.str.extract_all("ab*c")
shape: (4,)
list[str]
["abc"]
[]
["ac"]
["abbc"]

The plus + quantifier means the preceding character must appear 1 or more times.

Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba:

pl.Series(["abbabbba"]).str.extract_all(".+a")
shape: (1,)
list[str]
["abbabbba"]

You can add a question mark ? after another quantifier to make it non-greedy:

pl.Series(["abbabbba"]).str.extract_all(".+?a")
shape: (1,)
list[str]
["abba", "bbba"]

1.4.6. Groups#

In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 1.3.5).

Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match:

x = pl.Series(["cats, dogs, and frogs", "cats and frogs"])

x.str.extract_all("cats(, dogs,)? and frogs")
shape: (2,)
list[str]
["cats, dogs, and frogs"]
["cats and frogs"]