4. Organizing Code#

Learning Objectives

After this lesson, you should be able to:

  • Write functions to organize and encapsulate reusable code

  • Create code that only runs when a condition is satisfied

  • Identify when a problem requires iteration

  • Select appropriate iteration strategies for problems

4.1. Functions#

The main way to interact with Python is by calling functions, which was first explained back in Calling Functions. This section explains how to write your own functions.

First, a review of what functions are and some of the vocabulary associated with them:

  • Parameters are placeholder variables for inputs.

    • Arguments are the actual values assigned to the parameters in a call.

  • The return value is the output.

  • Calling a function means using a function to compute something.

  • The body is the code inside.

It’s useful to think of functions as factories, meaning arguments go in and a return value comes out. Here’s a visual representation of the idea for a function f:

A diagram which shows arguments go into a function and a return value comes out.

A function definition begins with the def keyword, followed by:

  • The name of the function

  • A list of parameters surrounded by parentheses

  • A colon :

A function can have any number of parameters. Code in the body of the function must be indented by 4 spaces. Use the return keyword to return a result. The return keyword causes the function to return a result immediately, without running any subsequent code in its body.

For example, let’s create a function that detects negative numbers. It should take a series of numbers as input, compare them to zero, and then return the logical result from the comparison as output. Here’s the code to do that:

def is_negative(x):
    return x < 0

The name of the function, is_negative, describes what the function does and includes a verb. The parameter x is the input. The return value is the result of x < 0.

Tip

Choosing descriptive names is a good habit. For functions, that means choosing a name that describes what the function does. It often makes sense to use verbs in function names.

Any time you write a function, the first thing you should do afterwards is test that it actually works. Try the is_negative function on a few test cases:

import polars as pl

x = pl.Series([5, -1, -2, 0, 3])

is_negative(6)
False
is_negative(-1.1)
True
is_negative(x)
shape: (5,)
bool
false
true
true
false
false

Notice that the parameter x inside the function is different from the variable x you created outside the function. Remember that parameters and variables inside of a function are separate from variables outside of a function.

Recall that a default argument is an argument assigned to a parameter if no argument is assigned in the call to the function. You can use = to assign default arguments to parameters when you define a function with the def keyword.

For example, suppose you want to write a function that gets the largest values in a series. You can make a parameter for the number of values to get, with a default argument of 5. Here’s the code and some test cases:

def get_largest(x, n = 5):
    return x.sort(descending=True).head(n)

y = pl.Series([-6, 7, 10, 3, 1, 15, -2])

get_largest(y, 3)
shape: (3,)
i64
15
10
7
get_largest(y)
shape: (5,)
i64
15
10
7
3
1

Tip

The return keyword causes a function to return a result immediately, without running any subsequent code in its body. So before the end of the function, it only makes sense to use return from inside of Conditional Statements.

A function returns one object, but sometimes computations have multiple results. In that case, return the results in a container such as a tuple or list.

For example, let’s make a function that computes the mean and median for a vector. We’ll return the results in a tuple:

def compute_mean_med(x):
    m1 = x.mean()
    m2 = x.median()
    return m1, m2

compute_mean_med(pl.Series([1, 2, 3, 1]))
(1.75, 1.5)

Tip

Before you write a function, it’s useful to go through several steps:

  1. Write down what you want to do, in detail. It can also help to draw a picture of what needs to happen.

  2. Check whether there’s already a built-in function. Search online and in the Python documentation.

  3. Write the code to handle a simple case first. For data science problems, use a small dataset at this step.

Functions are the building blocks for solving larger problems. Take a divide-and-conquer approach, breaking large problems into smaller steps. Use a short function for each step. This approach makes it easier to:

  • Test that each step works correctly.

  • Modify, reuse, or repurpose a step.

4.2. Conditional Statements#

Sometimes you’ll need code to do different things depending on a condition. You can use an if-statement to write conditional code.

An if-statement begins with the if keyword, followed by a condition and a colon :. The condition must be an expression that returns a Boolean value (False or True). The body of the if-statement is the code that will run when the condition is True. Code in the body must be indented by 4 spaces.

For example, suppose you want your code to generate a different greeting depending on an input name:

name = "Nick"

# Default greeting
greeting = "Nice to meet you!"

if name == "Nick":
    greeting = "Hi Nick, nice to see you again!"

greeting
'Hi Nick, nice to see you again!'

Use the else keyword (and a colon :) if you want to add an alternative when the condition is false. So the previous code can also be written as:

name = "Nick"

if name == "Nick":
    greeting = "Hi Nick, nice to see you again!"
else:
    # Default greeting
    greeting = "Nice to meet you!"

greeting
'Hi Nick, nice to see you again!'

Use the elif keyword with a condition (and a colon :) if you want to add an alternative to the first condition that also has its own condition. Only the first case where a condition is True will run. You can use elif as many times as you want, and can also use else. For example:

name = "Susan"

if name == "Nick":
   greeting = "Hi Nick, nice to see you again!"
elif name == "Peter":
   greeting = "Go away Peter, I'm busy!"
else:
   greeting = "Nice to meet you!"

greeting
'Nice to meet you!'

You can create compound conditions with the keywords not, and and or. The not keyword inverts a condition. The and keyword combines two conditions and returns True only if both are True. The or keyword combines two conditions and returns True if either or both are True. For example:

name1 = "Arthur"
name2 = "Nick"

if name1 == "Arthur" and name2 == "Nick":
  greeting = "These are the authors."
else:
  greeting = "Who are these people?!"

greeting
'These are the authors.'

You can write an if-statement inside of another if-statement. This is called nesting if-statements. Nesting is useful when you want to check a condition, do some computations, and then check another condition under the assumption that the first condition was True.

Tip

If-statements correspond to special cases in your code. Lots of special cases in code makes the code harder to understand and maintain. If you find yourself using lots of if-statements, especially nested if-statements, consider whether there is a more general strategy or way to write the code.

4.3. Iteration#

Python is powerful tool for automating tasks that have repetitive steps. For example, you can:

  • Apply a transformation to an entire column of data.

  • Compute distances between all pairs from a set of points.

  • Read a large collection of files from disk in order to combine and analyze the data they contain.

  • Simulate how a system evolves over time from a specific set of starting parameters.

  • Scrape data from the pages of a website.

You can implement concise, efficient solutions for these kinds of tasks in Python by using iteration, which means repeating a computation many times. Python provides four different strategies for writing iterative code:

  1. Broadcasting, where a function is implicitly called on each element of a data structure. This was introduced in Broadcasting.

  2. Comprehensions, where a function is explicitly called on each element of a vector or array. This was introduced in Comprehensions.

  3. Loops, where an expression is evaluated repeatedly until some condition is met.

  4. Recursion, where a function calls itself.

Broadcasting is the most efficient and most concise iteration strategy, but also the least flexible, because it only works with a few functions and data structures. Comprehensions are more flexible—they work with any function and any data structure with elements—but less efficient and less concise. Loops and recursion provide the most flexibility but are the least concise. In recent versions of Python, comprehensions are slightly more efficient than loops. Recursion tends to be the least efficient iteration strategy in Python.

The rest of this section explains how to write loops and how to choose which iteration strategy to use. We assume you’re already comfortable with broadcasting and have at least some familiarity with comprehensions.

4.3.1. For-Loops#

A for-loop evaluates the expressions in its body once for each element of a data structure. A for-loop begins with the for keyword, followed by:

  • A placeholder variable, which will be automatically signed to an element at the beginning of each iteration

  • The in keyword

  • An object with elements

  • A colon :

Code in the body of the loop must be indented by 4 spaces.

For example, to print out all the column names in terns.columns, you can write:

for column in terns.columns:
    print(column)
year
site_name
site_name_2013_2018
site_name_1988_2001
site_abbr
region_3
region_4
event
bp_min
bp_max
fl_min
fl_max
total_nests
nonpred_eggs
nonpred_chicks
nonpred_fl
nonpred_ad
pred_control
pred_eggs
pred_chicks
pred_fl
pred_ad
pred_pefa
pred_coy_fox
pred_meso
pred_owlspp
pred_corvid
pred_other_raptor
pred_other_avian
pred_misc
total_pefa
total_coy_fox
total_meso
total_owlspp
total_corvid
total_other_raptor
total_other_avian
total_misc
first_observed
last_observed
first_nest
first_chick
first_fledge

Within the indented part of a for-loop, you can compute values, check conditions, etc.

Oftentimes you want to save the result of the code you perform within a for-loop. The easiest way to do this is by creating an empty list and using append to add values to it.

diffs = []
values = [10, 12, 11, 2, 3]

for a, b in zip(values[:-1], values[1:]):
    diffs.append(a - b)

diffs
[-2, 1, 9, -1]

4.3.2. Planning for Iteration#

At first it might seem difficult to decide if and what kind of iteration to use. Start by thinking about whether you need to do something over and over. If you don’t, then you probably don’t need to use iteration. If you do, then try iteration strategies in this order:

  1. Broadcasting

  2. Comprehensions

    • Try a comprehension if iterations are independent.

  3. Loops

    • Try a for-loop if some iterations depend on others.

  4. Recursion (which isn’t covered here)

    • Convenient for naturally recursive tasks (like Fibonacci), but often there are faster solutions.

Start by writing the code for just one iteration. Make sure that code works; it’s easy to test code for one iteration.

When you have one iteration working, then try using the code with an iteration strategy (you will have to make some small changes). If it doesn’t work, try to figure out which iteration is causing the problem. One way to do this is to use print to print out information. Then try to write the code for the broken iteration, get that iteration working, and repeat this whole process.

4.4. Case Study: CA Hospital Utilization#

The California Department of Health Care Access and Information (HCAI) requires hospitals in the state to submit detailed information each year about how many beds they have and the total number of days for which each bed was occupied. The HCAI publishes the data to the California Open Data Portal. Let’s use Python to read data from 2016 to 2023 and investigate whether hospital utilization is noticeably different in and after 2020.

The data set consists of a separate Microsoft Excel file for each year. Before 2018, HCAI used a data format (in Excel) called ALIRTS. In 2018, they started collecting more data and switched to a data format called SIERA. The 2018 data file contains a crosswalk that shows the correspondence between SIERA columns and ALIRTS columns.

Important

Click here to download the CA Hospital Utilization data set (8 Excel files).

If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ca_hospitals subdirectory. Download and save the data set in the data/ca_hospitals subdirectory.

When you need to solve a programming problem, get started by writing some comments that describe the problem, the inputs, and the expected output. Try to be concrete. This will help you clarify what you’re trying to achieve and serve as a guiding light while you work.

As a programmer (or any kind of problem-solver), you should always be on the lookout for ways to break problems into smaller, simpler steps. Think about this when you frame a problem. Small steps are easier to reason about, implement, and test. When you complete one, you also get a nice sense of progress towards your goal.

For the CA Hospital Utilization data set, our goal is to investigate whether there was a change in hospital utilization in 2020. Before we can do any investigation, we need to read the files into Python. The files all contain tabular data and have similar formats, so let’s try to combine them into a single data frame. We’ll say this in the framing comments:

# Read the CA Hospital Utilization data set into Python. The inputs are
# yearly Excel files (2016-2023) that need to be combined. The pre-2018 files
# have a different format from the others. The result should be a single data
# frame with information about bed and patient counts.
#
# After reading the data set, we'll investigate utilization in 2020.

“Investigate utilization” is a little vague, but for an exploratory data analysis, it’s hard to say exactly what to do until you’ve started working with the data.

We need to read multiple files, but we can simplify the problem by starting with just one. Let’s start with the 2023 data. It’s in an Excel file, which you can read with Polars’ pl.read_excel function (note: you’ll first need to install the fastexcel package). The function requires the path to the file; you can also optionally provide the sheet name or number (starting from 1). Open up the Excel file in your computer’s spreadsheet program and take a look. There are multiple sheets, and the data about beds and patients are in the second sheet. Back in Python, read just the second sheet:

path = "data/ca_hospitals/hosp23_util_data_final.xlsx"
sheet = pl.read_excel(path, sheet_id=2)
sheet.head()
shape: (5, 355)
DescriptionFAC_NOFAC_NAMEFAC_STR_ADDRFAC_CITYFAC_ZIPFAC_PHONEFAC_ADMIN_NAMEFAC_OPERATED_THIS_YRFAC_OP_PER_BEGIN_DTFAC_OP_PER_END_DTFAC_PAR_CORP_NAMEFAC_PAR_CORP_BUS_ADDRFAC_PAR_CORP_CITYFAC_PAR_CORP_STATEFAC_PAR_CORP_ZIPREPT_PREP_NAMESUBMITTED_DTREV_REPT_PREP_NAMEREVISED_DTCORRECTED_DTLICENSE_NOLICENSE_EFF_DATELICENSE_EXP_DATELICENSE_STATUSFACILITY_LEVELTRAUMA_CTRTEACH_HOSPTEACH_RURALLONGITUDELATITUDEASSEMBLY_DISTSENATE_DISTCONGRESS_DISTCENSUS_KEYMED_SVC_STUDY_AREALA_COUNTY_SVC_PLAN_AREAEQUIP_VAL_05EQUIP_VAL_06EQUIP_VAL_07EQUIP_VAL_08EQUIP_VAL_09EQUIP_VAL_10PROJ_EXPENDITURES_01PROJ_EXPENDITURES_02PROJ_EXPENDITURES_03PROJ_EXPENDITURES_04PROJ_EXPENDITURES_05DT_AQUIRE_EQUIP_01DT_AQUIRE_EQUIP_02DT_AQUIRE_EQUIP_03DT_AQUIRE_EQUIP_04DT_AQUIRE_EQUIP_05DT_AQUIRE_EQUIP_06DT_AQUIRE_EQUIP_07DT_AQUIRE_EQUIP_08DT_AQUIRE_EQUIP_09DT_AQUIRE_EQUIP_10HCAI_PROJ_NO_01HCAI_PROJ_NO_02HCAI_PROJ_NO_03HCAI_PROJ_NO_04HCAI_PROJ_NO_05MEANS_FOR_ACQUISITION_01MEANS_FOR_ACQUISITION_02MEANS_FOR_ACQUISITION_03MEANS_FOR_ACQUISITION_04MEANS_FOR_ACQUISITION_05MEANS_FOR_ACQUISITION_06MEANS_FOR_ACQUISITION_07MEANS_FOR_ACQUISITION_08MEANS_FOR_ACQUISITION_09MEANS_FOR_ACQUISITION_10__UNNAMED__354
strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…"FINAL 2023 UTILIZATION DATABAS…
"Page""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0""6.0"null
"Column""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""1.0""2.0""2.0""2.0""2.0""2.0""2.0""2.0""2.0""2.0""2.0""2.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""3.0""4.0""4.0""4.0""4.0""4.0""4.0""4.0""4.0""4.0""4.0"null
"Line""2.0""1.0""3.0""4.0""5.0""6.0""7.0""9.0""10.0""11.0""12.0""13.0""14.0""15.0""16.0""20.0""24.0""25.0""29.0""30.0""100.0""101.0""102.0""103.0""104.0""105.0""106.0""107.0""108.0""109.0""110.0""111.0""112.0""113.0""114.0""115.0""6.0""7.0""8.0""9.0""10.0""11.0""26.0""27.0""28.0""29.0""30.0""2.0""3.0""4.0""5.0""6.0""7.0""8.0""9.0""10.0""11.0""26.0""27.0""28.0""29.0""30.0""2.0""3.0""4.0""5.0""6.0""7.0""8.0""9.0""10.0""11.0"null
null"106010735""ALAMEDA HOSPITAL""2070 CLINTON AVE""ALAMEDA""94501""5102333700""Mario Harding""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Alameda Health System""1431 East 31st Street""Oakland""CA - California""94602""Ramneet Kaur""2024-02-14 16:11:00""Ramneet Kaur""02/15/2024 01:35 PM"null"140000002""2023-05-01 00:00:00""2024-04-30 00:00:00""Open""Parent Facility"null"No""No""-122.253491""37.762781""District 18""District 7""District 12""06001428500""2e"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

The first four rows contain metadata about the columns; the first hospital, Alameda Hospital, is listed in the fifth row. So let’s remove the first four rows:

sheet = sheet[4:, :]
sheet.head()
shape: (5, 355)
DescriptionFAC_NOFAC_NAMEFAC_STR_ADDRFAC_CITYFAC_ZIPFAC_PHONEFAC_ADMIN_NAMEFAC_OPERATED_THIS_YRFAC_OP_PER_BEGIN_DTFAC_OP_PER_END_DTFAC_PAR_CORP_NAMEFAC_PAR_CORP_BUS_ADDRFAC_PAR_CORP_CITYFAC_PAR_CORP_STATEFAC_PAR_CORP_ZIPREPT_PREP_NAMESUBMITTED_DTREV_REPT_PREP_NAMEREVISED_DTCORRECTED_DTLICENSE_NOLICENSE_EFF_DATELICENSE_EXP_DATELICENSE_STATUSFACILITY_LEVELTRAUMA_CTRTEACH_HOSPTEACH_RURALLONGITUDELATITUDEASSEMBLY_DISTSENATE_DISTCONGRESS_DISTCENSUS_KEYMED_SVC_STUDY_AREALA_COUNTY_SVC_PLAN_AREAEQUIP_VAL_05EQUIP_VAL_06EQUIP_VAL_07EQUIP_VAL_08EQUIP_VAL_09EQUIP_VAL_10PROJ_EXPENDITURES_01PROJ_EXPENDITURES_02PROJ_EXPENDITURES_03PROJ_EXPENDITURES_04PROJ_EXPENDITURES_05DT_AQUIRE_EQUIP_01DT_AQUIRE_EQUIP_02DT_AQUIRE_EQUIP_03DT_AQUIRE_EQUIP_04DT_AQUIRE_EQUIP_05DT_AQUIRE_EQUIP_06DT_AQUIRE_EQUIP_07DT_AQUIRE_EQUIP_08DT_AQUIRE_EQUIP_09DT_AQUIRE_EQUIP_10HCAI_PROJ_NO_01HCAI_PROJ_NO_02HCAI_PROJ_NO_03HCAI_PROJ_NO_04HCAI_PROJ_NO_05MEANS_FOR_ACQUISITION_01MEANS_FOR_ACQUISITION_02MEANS_FOR_ACQUISITION_03MEANS_FOR_ACQUISITION_04MEANS_FOR_ACQUISITION_05MEANS_FOR_ACQUISITION_06MEANS_FOR_ACQUISITION_07MEANS_FOR_ACQUISITION_08MEANS_FOR_ACQUISITION_09MEANS_FOR_ACQUISITION_10__UNNAMED__354
strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
null"106010735""ALAMEDA HOSPITAL""2070 CLINTON AVE""ALAMEDA""94501""5102333700""Mario Harding""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Alameda Health System""1431 East 31st Street""Oakland""CA - California""94602""Ramneet Kaur""2024-02-14 16:11:00""Ramneet Kaur""02/15/2024 01:35 PM"null"140000002""2023-05-01 00:00:00""2024-04-30 00:00:00""Open""Parent Facility"null"No""No""-122.253491""37.762781""District 18""District 7""District 12""06001428500""2e"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"106010739""ALTA BATES SUMMIT MEDICAL CENT…"2450 ASHBY AVENUE""BERKELEY""94705""510-655-4000""David D. Clark, Chief Executiv…"Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Sutter Health""2000 Powell Street, 10th  Floo…"Emeryville""CA - California""94608""Jeannie Cornelius""2024-02-15 18:55:00"nullnullnull"140000004""2023-11-01 00:00:00""2024-10-31 00:00:00""Open""Parent Facility"null"No""No""-122.257501""37.855628""District 14""District 7""District 12""06001423902""2b"nullnullnullnullnullnullnull"3451229"nullnullnullnull"2023-03-23 00:00:00"nullnullnullnullnullnullnullnullnullnullnullnullnullnull"Purchase"nullnullnullnullnullnullnullnullnullnull
null"106010776""UCSF BENIOFF CHILDREN'S HOSPIT…"747 52ND STREET""OAKLAND""94609""510-428-3000""Joan Zoltanski""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00"nullnullnullnullnull"LAURA CHERRY""2024-02-15 14:02:00"nullnullnull"140000015""2023-05-17 00:00:00""2023-12-31 00:00:00""Open""Parent Facility""Level I - Pediatric""No""No""-122.267028""37.83719""District 18""District 7""District 12""06001401000""2c"nullnullnullnullnullnullnull"106700000""3500000""2935000""6388650"null"2023-09-30 00:00:00""2023-11-30 00:00:00"nullnullnullnullnullnullnullnullnullnullnullnullnull"Purchase""Purchase"nullnullnullnullnullnullnullnullnull
null"106010811""FAIRMONT HOSPITAL""15400 FOOTHILL BOULEVARD""SAN LEANDRO""94578""5104374800""Richard Espinoza""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Alameda Health System""1431 East 31st Street""Oakland""CA - California""94702""Ramneet Kaur""2024-02-12 11:58:00"nullnullnull"140000046""2023-11-01 00:00:00""2024-10-31 00:00:00""Open""Consolidated Facility"null"No""No""-122.119823""37.708168""District 20""District 9""District 14""06001430500""2f"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"106010844""ALTA BATES SUMMIT MEDICAL CENT…"2001 DWIGHT WAY""BERKELEY""94704""510-655-4000""David D. Clark, Chief Executiv…"Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Sutter Health""2000 Powell Street, 10th  Floo…"Emeryville""CA - California""94608""Jeannie Cornelius""2024-02-15 18:55:00"nullnullnull"140000004""2023-11-01 00:00:00""2024-10-31 00:00:00""Open""Consolidated Facility"null"No""No""-122.268712""37.864072""District 14""District 7""District 12""06001422900""2a"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

Some data sets also have metadata in the last rows, so let’s check for that here:

sheet.tail()
shape: (5, 355)
DescriptionFAC_NOFAC_NAMEFAC_STR_ADDRFAC_CITYFAC_ZIPFAC_PHONEFAC_ADMIN_NAMEFAC_OPERATED_THIS_YRFAC_OP_PER_BEGIN_DTFAC_OP_PER_END_DTFAC_PAR_CORP_NAMEFAC_PAR_CORP_BUS_ADDRFAC_PAR_CORP_CITYFAC_PAR_CORP_STATEFAC_PAR_CORP_ZIPREPT_PREP_NAMESUBMITTED_DTREV_REPT_PREP_NAMEREVISED_DTCORRECTED_DTLICENSE_NOLICENSE_EFF_DATELICENSE_EXP_DATELICENSE_STATUSFACILITY_LEVELTRAUMA_CTRTEACH_HOSPTEACH_RURALLONGITUDELATITUDEASSEMBLY_DISTSENATE_DISTCONGRESS_DISTCENSUS_KEYMED_SVC_STUDY_AREALA_COUNTY_SVC_PLAN_AREAEQUIP_VAL_05EQUIP_VAL_06EQUIP_VAL_07EQUIP_VAL_08EQUIP_VAL_09EQUIP_VAL_10PROJ_EXPENDITURES_01PROJ_EXPENDITURES_02PROJ_EXPENDITURES_03PROJ_EXPENDITURES_04PROJ_EXPENDITURES_05DT_AQUIRE_EQUIP_01DT_AQUIRE_EQUIP_02DT_AQUIRE_EQUIP_03DT_AQUIRE_EQUIP_04DT_AQUIRE_EQUIP_05DT_AQUIRE_EQUIP_06DT_AQUIRE_EQUIP_07DT_AQUIRE_EQUIP_08DT_AQUIRE_EQUIP_09DT_AQUIRE_EQUIP_10HCAI_PROJ_NO_01HCAI_PROJ_NO_02HCAI_PROJ_NO_03HCAI_PROJ_NO_04HCAI_PROJ_NO_05MEANS_FOR_ACQUISITION_01MEANS_FOR_ACQUISITION_02MEANS_FOR_ACQUISITION_03MEANS_FOR_ACQUISITION_04MEANS_FOR_ACQUISITION_05MEANS_FOR_ACQUISITION_06MEANS_FOR_ACQUISITION_07MEANS_FOR_ACQUISITION_08MEANS_FOR_ACQUISITION_09MEANS_FOR_ACQUISITION_10__UNNAMED__354
strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
null"106580996""ADVENTIST HEALTH AND RIDEOUT""726 FOURTH ST""MARYSVILLE""95901""530-751-4300""Chris Champlin""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00"nullnullnullnullnull"Erika Collazo""2024-02-15 17:09:00"nullnullnull"230000126""2023-11-01 00:00:00""2024-10-31 00:00:00""Open""Parent Facility""Level III""No""No""-121.594228""39.138174""District 3""District 1""District 1""06115040100""249"nullnullnullnullnullnullnull"2000000"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206100718""COMMUNITY SUBACUTE AND TRANSIT…"3003 NORTH MARIPOSA STREET""FRESNO""93703""559-459-1844""Leslie Cotham""No""2023-01-01 00:00:00""2023-12-31 00:00:00""Fresno Community Hospital and …"789 N Medical Center Drive Eas…"Clovis""CA - California""93611""Irma Lorenzo""2024-02-15 16:08:00"nullnullnull"040000096""2023-11-07 00:00:00""2024-11-06 00:00:00""Open""Distinct Part Facility"null"No""No""-119.779788""36.777715""District 31""District 14""District 21""06019003400""35c"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206274027""WESTLAND HOUSE""100 BARNET SEGAL LANE""MONTEREY""93940""8316245311""Dr. Steven Packer""No""2023-01-01 00:00:00""2023-12-31 00:00:00""Montage Health""PO Box HH""Monterey""CA - California""93940""John W. Thibeau""2024-02-19 15:45:00"nullnullnull"070000026""2023-12-01 00:00:00""2024-11-30 00:00:00""Open""Distinct Part Facility"null"No""No""-121.885834""36.5812""District 30""District 17""District 19""06053013200""110"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206351814""HAZEL HAWKINS MEMORIAL HOSPITA…"900 SUNSET DRIVE""HOLLISTER""95023""831-635-1106""Mary Casillas""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""San Benito Health Care Distric…"911 Sunset Drive""Hollister""CA - California""95023""SANDRA DILAURA""2024-02-21 15:35:00"nullnullnull"070000004""2023-01-01 00:00:00""2023-12-31 00:00:00""Open""Distinct Part Facility"null"No""No""-121.386815""36.83558""District 29""District 17""District 18""06069000600""140"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
nullnull"n = 506"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

Sure enough, the last row contains what appears to be a count of the hospitals rather than a hospital. Let’s remove it by indexing with a negative value, which counts backward from the end:

sheet = sheet[:-1, :]
sheet.tail()
shape: (5, 355)
DescriptionFAC_NOFAC_NAMEFAC_STR_ADDRFAC_CITYFAC_ZIPFAC_PHONEFAC_ADMIN_NAMEFAC_OPERATED_THIS_YRFAC_OP_PER_BEGIN_DTFAC_OP_PER_END_DTFAC_PAR_CORP_NAMEFAC_PAR_CORP_BUS_ADDRFAC_PAR_CORP_CITYFAC_PAR_CORP_STATEFAC_PAR_CORP_ZIPREPT_PREP_NAMESUBMITTED_DTREV_REPT_PREP_NAMEREVISED_DTCORRECTED_DTLICENSE_NOLICENSE_EFF_DATELICENSE_EXP_DATELICENSE_STATUSFACILITY_LEVELTRAUMA_CTRTEACH_HOSPTEACH_RURALLONGITUDELATITUDEASSEMBLY_DISTSENATE_DISTCONGRESS_DISTCENSUS_KEYMED_SVC_STUDY_AREALA_COUNTY_SVC_PLAN_AREAEQUIP_VAL_05EQUIP_VAL_06EQUIP_VAL_07EQUIP_VAL_08EQUIP_VAL_09EQUIP_VAL_10PROJ_EXPENDITURES_01PROJ_EXPENDITURES_02PROJ_EXPENDITURES_03PROJ_EXPENDITURES_04PROJ_EXPENDITURES_05DT_AQUIRE_EQUIP_01DT_AQUIRE_EQUIP_02DT_AQUIRE_EQUIP_03DT_AQUIRE_EQUIP_04DT_AQUIRE_EQUIP_05DT_AQUIRE_EQUIP_06DT_AQUIRE_EQUIP_07DT_AQUIRE_EQUIP_08DT_AQUIRE_EQUIP_09DT_AQUIRE_EQUIP_10HCAI_PROJ_NO_01HCAI_PROJ_NO_02HCAI_PROJ_NO_03HCAI_PROJ_NO_04HCAI_PROJ_NO_05MEANS_FOR_ACQUISITION_01MEANS_FOR_ACQUISITION_02MEANS_FOR_ACQUISITION_03MEANS_FOR_ACQUISITION_04MEANS_FOR_ACQUISITION_05MEANS_FOR_ACQUISITION_06MEANS_FOR_ACQUISITION_07MEANS_FOR_ACQUISITION_08MEANS_FOR_ACQUISITION_09MEANS_FOR_ACQUISITION_10__UNNAMED__354
strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
null"106574010""SUTTER DAVIS HOSPITAL""2000 SUTTER PLACE""DAVIS""95616""(530) 756-6440""Michael Cureton""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""Sutter Health""2300 River Plaza Dr""Sacramento""CA - California""95833""Luke D. Peterson""2024-01-25 11:04:00"nullnullnull"030000124""2023-06-05 00:00:00""2024-05-30 00:00:00""Open""Parent Facility"null"No""No""-121.770684""38.562688""District 4""District 3""District 4""06113010505""244"nullnullnullnullnullnullnull"4847000""1999354"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"106580996""ADVENTIST HEALTH AND RIDEOUT""726 FOURTH ST""MARYSVILLE""95901""530-751-4300""Chris Champlin""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00"nullnullnullnullnull"Erika Collazo""2024-02-15 17:09:00"nullnullnull"230000126""2023-11-01 00:00:00""2024-10-31 00:00:00""Open""Parent Facility""Level III""No""No""-121.594228""39.138174""District 3""District 1""District 1""06115040100""249"nullnullnullnullnullnullnull"2000000"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206100718""COMMUNITY SUBACUTE AND TRANSIT…"3003 NORTH MARIPOSA STREET""FRESNO""93703""559-459-1844""Leslie Cotham""No""2023-01-01 00:00:00""2023-12-31 00:00:00""Fresno Community Hospital and …"789 N Medical Center Drive Eas…"Clovis""CA - California""93611""Irma Lorenzo""2024-02-15 16:08:00"nullnullnull"040000096""2023-11-07 00:00:00""2024-11-06 00:00:00""Open""Distinct Part Facility"null"No""No""-119.779788""36.777715""District 31""District 14""District 21""06019003400""35c"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206274027""WESTLAND HOUSE""100 BARNET SEGAL LANE""MONTEREY""93940""8316245311""Dr. Steven Packer""No""2023-01-01 00:00:00""2023-12-31 00:00:00""Montage Health""PO Box HH""Monterey""CA - California""93940""John W. Thibeau""2024-02-19 15:45:00"nullnullnull"070000026""2023-12-01 00:00:00""2024-11-30 00:00:00""Open""Distinct Part Facility"null"No""No""-121.885834""36.5812""District 30""District 17""District 19""06053013200""110"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
null"206351814""HAZEL HAWKINS MEMORIAL HOSPITA…"900 SUNSET DRIVE""HOLLISTER""95023""831-635-1106""Mary Casillas""Yes""2023-01-01 00:00:00""2023-12-31 00:00:00""San Benito Health Care Distric…"911 Sunset Drive""Hollister""CA - California""95023""SANDRA DILAURA""2024-02-21 15:35:00"nullnullnull"070000004""2023-01-01 00:00:00""2023-12-31 00:00:00""Open""Distinct Part Facility"null"No""No""-121.386815""36.83558""District 29""District 17""District 18""06069000600""140"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

There are a lot of columns in sheet, so let’s make a list of just a few that we’ll use for analysis. We’ll keep:

  • Columns with facility name, location, and operating status

  • All of the columns whose names start with TOT, because these are totals for number of beds, number of census-days, and so on.

  • Columns about acute respiratory beds, with names that contain RESPIRATORY, because they might also be relevant.

We can get all of these columns with the .select method. The TOT and RESPIRATORY columns all contain numbers, but the element type is str, so we’ll cast them to floats. We’ll also add a column with the year:

facility_cols = pl.col(
    "FAC_NAME", "FAC_CITY", "FAC_ZIP", "FAC_OPERATED_THIS_YR", "FACILITY_LEVEL",
    "TEACH_HOSP", "COUNTY", "PRIN_SERVICE_TYPE",
)

sheet = sheet.select(
    pl.lit(2023).alias("year"),
    facility_cols,
    pl.col("^TOT.*$", "^.*RESPIRATORY.*$").cast(pl.Float64),
)
sheet.head()
shape: (5, 22)
yearFAC_NAMEFAC_CITYFAC_ZIPFAC_OPERATED_THIS_YRFACILITY_LEVELTEACH_HOSPCOUNTYPRIN_SERVICE_TYPETOT_LIC_BEDSTOT_LIC_BED_DAYSTOT_DISCHARGESTOT_CEN_DAYSTOT_ALOS_CYTOT_ALOS_PYACUTE_RESPIRATORY_CARE_LIC_BEDSACUTE_RESPIRATORY_CARE_LIC_BED_DAYSACUTE_RESPIRATORY_CARE_DISCHARGESACUTE_RESPIRATORY_CARE_INTRA_TRANSFERSACUTE_RESPIRATORY_CARE_CEN_DAYSACUTE_RESPIRATORY_CARE_ALOS_CYACUTE_RESPIRATORY_CARE_ALOS_PY
i32strstrstrstrstrstrstrstrf64f64f64f64f64f64f64f64f64f64f64f64f64
2023"ALAMEDA HOSPITAL""ALAMEDA""94501""Yes""Parent Facility""No""Alameda""General Medical / Surgical"247.090155.02974.066163.022.224.20.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94705""Yes""Parent Facility""No""Alameda""General Medical / Surgical"339.0123735.011295.056442.05.04.70.00.0nullnullnullnull0.0
2023"UCSF BENIOFF CHILDREN'S HOSPIT…"OAKLAND""94609""Yes""Parent Facility""No""Alameda""Pediatric"155.059465.07537.040891.05.45.70.00.0nullnullnullnull0.0
2023"FAIRMONT HOSPITAL""SAN LEANDRO""94578""Yes""Consolidated Facility""No""Alameda""Long-Term Care (SN / IC)"109.039785.0176.045336.0257.6371.70.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94704""Yes""Consolidated Facility""No""Alameda""Psychiatric"64.024576.01853.017413.09.410.20.00.0nullnullnullnull0.0

Lowercase names are easier to type, so let’s also make all of the names lowercase with the .rename and str.lower methods:

sheet = sheet.rename(str.lower)
sheet.head()
shape: (5, 22)
yearfac_namefac_cityfac_zipfac_operated_this_yrfacility_levelteach_hospcountyprin_service_typetot_lic_bedstot_lic_bed_daystot_dischargestot_cen_daystot_alos_cytot_alos_pyacute_respiratory_care_lic_bedsacute_respiratory_care_lic_bed_daysacute_respiratory_care_dischargesacute_respiratory_care_intra_transfersacute_respiratory_care_cen_daysacute_respiratory_care_alos_cyacute_respiratory_care_alos_py
i32strstrstrstrstrstrstrstrf64f64f64f64f64f64f64f64f64f64f64f64f64
2023"ALAMEDA HOSPITAL""ALAMEDA""94501""Yes""Parent Facility""No""Alameda""General Medical / Surgical"247.090155.02974.066163.022.224.20.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94705""Yes""Parent Facility""No""Alameda""General Medical / Surgical"339.0123735.011295.056442.05.04.70.00.0nullnullnullnull0.0
2023"UCSF BENIOFF CHILDREN'S HOSPIT…"OAKLAND""94609""Yes""Parent Facility""No""Alameda""Pediatric"155.059465.07537.040891.05.45.70.00.0nullnullnullnull0.0
2023"FAIRMONT HOSPITAL""SAN LEANDRO""94578""Yes""Consolidated Facility""No""Alameda""Long-Term Care (SN / IC)"109.039785.0176.045336.0257.6371.70.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94704""Yes""Consolidated Facility""No""Alameda""Psychiatric"64.024576.01853.017413.09.410.20.00.0nullnullnullnull0.0

We’ve successfully read one of the files! Since the 2018-2023 files all have the same format, it’s likely that we can use almost the same code for all of them. Any time you want to reuse code, it’s a sign that you should write a function, so that’s what we’ll do. We’ll take all of the code we have so far and put it in the body of a function called read_hospital_data, adding some comments to indicate the steps and a return statement at the end:

def read_hospital_data():
    # Read the 2nd sheet of the file.
    path = "data/ca_hospitals/hosp23_util_data_final.xlsx"
    sheet = pl.read_excel(path, sheet_id=2)
    
    # Remove the first 4 and last row.
    sheet = sheet[4:, :]
    sheet = sheet[:-1, :]

    # Select only a few columns of interest.
    facility_cols = pl.col(
        "FAC_NAME", "FAC_CITY", "FAC_ZIP", "FAC_OPERATED_THIS_YR",
        "FACILITY_LEVEL", "TEACH_HOSP", "COUNTY", "PRIN_SERVICE_TYPE",
    )

    sheet = sheet.select(
        pl.lit(2023).alias("year"),
        facility_cols,
        pl.col("^TOT.*$", "^.*RESPIRATORY.*$").cast(pl.Float64),
    )

    # Rename the columns to lowercase.
    sheet = sheet.rename(str.lower)

    return sheet

As it is, the function still only reads the 2023 file. The other files have different paths, so the first thing we need to do is make the path variable a parameter. We’ll also make a year parameter, for the year value inserted as a column:

def read_hospital_data(path, year):
    # Read the 2nd sheet of the file.
    sheet = pl.read_excel(path, sheet_id=2)
    
    # Remove the first 4 and last row.
    sheet = sheet[4:-1, :]

    # Select only a few columns of interest.
    facility_cols = pl.col(
        "FAC_NAME", "FAC_CITY", "FAC_ZIP", "FAC_OPERATED_THIS_YR",
        "FACILITY_LEVEL", "TEACH_HOSP", "COUNTY", "PRIN_SERVICE_TYPE",
    )

    sheet = sheet.select(
        pl.lit(year).alias("year"),
        facility_cols,
        pl.col("^TOT.*$", "^.*RESPIRATORY.*$").cast(pl.Float64),
    )

    # Rename the columns to lowercase.
    sheet = sheet.rename(str.lower)

    return sheet

Test the function out on a few of the files to make sure it works correctly:

read_hospital_data(
    "data/ca_hospitals/hosp23_util_data_final.xlsx", 2023
).head()
shape: (5, 22)
yearfac_namefac_cityfac_zipfac_operated_this_yrfacility_levelteach_hospcountyprin_service_typetot_lic_bedstot_lic_bed_daystot_dischargestot_cen_daystot_alos_cytot_alos_pyacute_respiratory_care_lic_bedsacute_respiratory_care_lic_bed_daysacute_respiratory_care_dischargesacute_respiratory_care_intra_transfersacute_respiratory_care_cen_daysacute_respiratory_care_alos_cyacute_respiratory_care_alos_py
i32strstrstrstrstrstrstrstrf64f64f64f64f64f64f64f64f64f64f64f64f64
2023"ALAMEDA HOSPITAL""ALAMEDA""94501""Yes""Parent Facility""No""Alameda""General Medical / Surgical"247.090155.02974.066163.022.224.20.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94705""Yes""Parent Facility""No""Alameda""General Medical / Surgical"339.0123735.011295.056442.05.04.70.00.0nullnullnullnull0.0
2023"UCSF BENIOFF CHILDREN'S HOSPIT…"OAKLAND""94609""Yes""Parent Facility""No""Alameda""Pediatric"155.059465.07537.040891.05.45.70.00.0nullnullnullnull0.0
2023"FAIRMONT HOSPITAL""SAN LEANDRO""94578""Yes""Consolidated Facility""No""Alameda""Long-Term Care (SN / IC)"109.039785.0176.045336.0257.6371.70.00.0nullnullnullnull0.0
2023"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94704""Yes""Consolidated Facility""No""Alameda""Psychiatric"64.024576.01853.017413.09.410.20.00.0nullnullnullnull0.0
read_hospital_data(
    "data/ca_hospitals/hosp21_util_data_final-revised-06.15.2023.xlsx", 2021
).head()
shape: (5, 22)
yearfac_namefac_cityfac_zipfac_operated_this_yrfacility_levelteach_hospcountyprin_service_typetot_lic_bedstot_lic_bed_daystot_dischargestot_cen_daystot_alos_cytot_alos_pyacute_respiratory_care_lic_bedsacute_respiratory_care_lic_bed_daysacute_respiratory_care_dischargesacute_respiratory_care_intra_transfersacute_respiratory_care_cen_daysacute_respiratory_care_alos_cyacute_respiratory_care_alos_py
i32strstrstrstrstrstrstrstrf64f64f64f64f64f64f64f64f64f64f64f64f64
2021"ALAMEDA HOSPITAL""ALAMEDA""94501""Yes""Parent Facility""No""Alameda""General Medical / Surgical"247.090155.02431.066473.027.326.80.00.0nullnullnullnull0.0
2021"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94705""Yes""Parent Facility""No""Alameda""General Medical / Surgical"339.0123735.011344.053710.04.74.80.00.0nullnullnullnull0.0
2021"UCSF BENIOFF CHILDREN'S HOSPIT…"OAKLAND""94609""Yes""Parent Facility""No""Alameda""Pediatric"215.078475.07256.038381.05.35.80.00.0nullnullnullnull0.0
2021"FAIRMONT HOSPITAL""SAN LEANDRO""94578""Yes""Consolidated Facility""No""Alameda""Long-Term Care (SN / IC)"109.039785.0179.038652.0215.9542.40.00.0nullnullnullnull0.0
2021"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94704""Yes""Consolidated Facility""No""Alameda""Psychiatric"68.024820.02236.019950.08.98.30.00.0nullnullnullnull0.0

The function appears to work correctly for two of the files, so let’s try it on all of the 2018-2023 files. We can use the built-in pathlib module’s Path class to help get a list of the files. The class’ .glob method does a wildcard search for files:

from pathlib import Path

paths = Path("data/ca_hospitals/").glob("*.xlsx")
paths = list(paths)
paths
[PosixPath('data/ca_hospitals/hosp21_util_data_final-revised-06.15.2023.xlsx'),
 PosixPath('data/ca_hospitals/hosp18_util_data_final.xlsx'),
 PosixPath('data/ca_hospitals/hosp20_util_data_final-revised-06.15.2023.xlsx'),
 PosixPath('data/ca_hospitals/hosp17_util_data_final.xlsx'),
 PosixPath('data/ca_hospitals/hosp22_util_data_final_revised_11.28.2023.xlsx'),
 PosixPath('data/ca_hospitals/hosp16_util_data_final.xlsx'),
 PosixPath('data/ca_hospitals/hosp23_util_data_final.xlsx'),
 PosixPath('data/ca_hospitals/hosp19_util_data_final.xlsx')]

In addition to the file paths, we also need the year for each file. Fortunately, the last two digits of the year are included in each file’s name. We can write a function to get these:

def get_hospital_year(path):
    year = Path(path).name[4:6]
    return int(year) + 2000


get_hospital_year(paths[0])
2021

Now we can use a for-loop to iterate over all of the files. We’ll skip the pre-2018 files for now:

hosps = []

for path in paths:
    year = get_hospital_year(path)
    # Only years 2018-2023.
    if year >= 2018:
        hosp = read_hospital_data(path, year)
        hosps.append(hosp)

len(hosps)
6

With that working, we need to read the pre-2018 files. In the 2018 file, the fourth sheet is a crosswalk that shows which columns in the pre-2018 files correspond to columns in the later files. Let’s write some code to read the crosswalk. First, read the sheet:

cwalk = pl.read_excel(path, sheet_id=4)
cwalk.head()
shape: (5, 6)
PageLineColumnSIERA Dataset Header (2019)ALIRTS Dataset Header (2017)Notes
i64i64i64strstrstr
111"FAC_NAME""FAC_NAME"null
121"FAC_NO""OSHPD_ID"null
131"FAC_STR_ADDR"null"Address 1 and Address 2 combin…
141"FAC_CITY""FAC_CITY"null
151"FAC_ZIP""FAC_ZIPCODE"null

The new and old column names are in the fourth and fifth columns, respectively, so we’ll get just those:

cwalk = cwalk[:, 3:5]
cwalk.head()
shape: (5, 2)
SIERA Dataset Header (2019)ALIRTS Dataset Header (2017)
strstr
"FAC_NAME""FAC_NAME"
"FAC_NO""OSHPD_ID"
"FAC_STR_ADDR"null
"FAC_CITY""FAC_CITY"
"FAC_ZIP""FAC_ZIPCODE"

Finally, use a comprehension to turn the cwalk data frame into a dictionary with the old column names as the keys and the new column names as the values. This way we can easily look up the new name for any of the old columns. Some of the column names in cwalk have extra spaces at the end, so we’ll use .strip method to remove them. We’ll also skip any rows where there’s no old column name (because the column is only in the new data):

cwalk = {
    old.strip(): new.strip()
    for new, old in cwalk.iter_rows()
    if old is not None
}

cwalk
{'FAC_NAME': 'FAC_NAME',
 'OSHPD_ID': 'FAC_NO',
 'FAC_CITY': 'FAC_CITY',
 'FAC_ZIPCODE': 'FAC_ZIP',
 'FAC_PHONE': 'FAC_PHONE',
 'FAC_ADMIN_NAME': 'FAC_ADMIN_NAME',
 'FAC_OPER_CURRYR': 'FAC_OPERATED_THIS_YR',
 'BEG_DATE': 'FAC_OP_PER_BEGIN_DT',
 'END_DATE': 'FAC_OP_PER_END_DT',
 'PARENT_NAME': 'FAC_PAR_CORP_NAME',
 'PARENT_CITY': 'FAC_PAR_CORP_CITY',
 'PARENT_STATE': 'FAC_PAR_CORP_STATE',
 'PARENT_ZIP_9': 'FAC_PAR_CORP_ZIP',
 'REPORT_PREP_NAME': 'REPT_PREP_NAME',
 'SUBMITTED_DT_TIME': 'SUBMITTED_DT',
 'LICENSE_NO': 'LICENSE_NO',
 'LIC_STATUS': 'LICENSE_STATUS',
 'FACILITY_LEVEL': 'FACILITY_LEVEL',
 'TRAUMA_CTR': 'TRAUMA_CTR',
 'TEACH_HOSP': 'TEACH_HOSP',
 'LONGITUDE': 'LONGITUDE',
 'LATITUDE': 'LATITUDE',
 'ASSEMBLY_DIST': 'ASSEMBLY_DIST',
 'SENATE_DIST': 'SENATE_DIST',
 'CONGRESS_DIST': 'CONGRESS_DIST',
 'CENS_TRACT': 'CENS_TRACT',
 'MED_SVC_STUDY_AREA': 'MED_SVC_STUDY_AREA',
 'LACO_SVC_PLAN_AREA': 'LA_COUNTY_SVC_PLAN_AREA',
 'HEALTH_SVC_AREA': 'HEALTH_SVC_AREA',
 'COUNTY': 'COUNTY',
 'TYPE_LIC': 'LIC_CAT',
 'TYPE_CNTRL': 'LICEE_TOC',
 'TYPE_SVC_PRINCIPAL': 'PRIN_SERVICE_TYPE',
 'MED_SURG_LICBED_DAY': 'MED_SURG_LIC_BED_DAYS',
 'MED_SURG_CENS_DAY': 'MED_SURG_CEN_DAYS',
 'MED_SURG_DIS': 'MED_SURG_DISCHARGES',
 'MED_SURG_BED_LIC': 'MED_SURG_LIC_BEDS',
 'PERINATL_CENS_DAY': 'PERINATAL_CEN_DAYS',
 'PERINATL_LICBED_DAY': 'PERINATAL_LIC_BED_DAYS',
 'PERINATL_BED_LIC': 'PERINATAL_LIC_BEDS',
 'PERINATL_DIS': 'PERINATAL_DISCHARGES',
 'PED_CENS_DAY': 'PEDIATRIC_CEN_DAYS',
 'PED_DIS': 'PEDIATRIC_DISCHARGES',
 'PED_BED_LIC': 'PEDIATRIC_LIC_BEDS',
 'PED_LICBED_DAY': 'PEDIATRIC_LIC_BED_DAYS',
 'ICU_TFR_INHOSP': 'IC_INTRA_TRANSFERS',
 'ICU_LICBED_DAY': 'IC_LIC_BED_DAYS',
 'ICU_BED_LIC': 'IC_LIC_BEDS',
 'ICU_CENS_DAY': 'IC_CEN_DAYS',
 'ICU_DIS': 'IC_DISCHARGES',
 'CCU_CENS_DAY': 'CORONARY_CARE_CEN_DAYS',
 'CCU_BED_LIC': 'CORONARY_CARE_LIC_BEDS',
 'CCU_LICBED_DAY': 'CORONARY_CARE_LIC_BED_DAYS',
 'CCU_DIS': 'CORONARY_CARE_DISCHARGES',
 'CCU_TFR_INHOSP': 'CORONARY_CARE_INTRA_TRANSFERS',
 'RESP_DIS': 'ACUTE_RESPIRATORY_CARE_DISCHARGES',
 'RESP_LICBED_DAY': 'ACUTE_RESPIRATORY_CARE_LIC_BED_DAYS',
 'RESP_BED_LIC': 'ACUTE_RESPIRATORY_CARE_LIC_BEDS',
 'RESP_CENS_DAY': 'ACUTE_RESPIRATORY_CARE_CEN_DAYS',
 'RESP_TFR_INHOSP': 'ACUTE_RESPIRATORY_CARE_INTRA_TRANSFERS',
 'BURN_LICBED_DAY': 'BURN_LIC_BED_DAYS',
 'BURN_CENS_DAY': 'BURN_CEN_DAYS',
 'BURN_DIS': 'BURN_DISCHARGES',
 'BURN_TFR_INHOSP': 'BURN_INTRA_TRANSFERS',
 'BURN_BED_LIC': 'BURN_LIC_BEDS',
 'NICU_DIS': 'IC_NEWBORN_DISCHARGES',
 'NICU_LICBED_DAY': 'IC_NEWBORN_LIC_BED_DAYS',
 'NICU_BED_LIC': 'IC_NEWBORN_LIC_BEDS',
 'NICU_CENS_DAY': 'IC_NEWBORN_CEN_DAYS',
 'NICU_TFR_INHOSP': 'IC_NEWBORN_INTRA_TRANSFERS',
 'REHAB_LICBED_DAY': 'REHAB_CTR_LIC_BED_DAYS',
 'REHAB_CENS_DAY': 'REHAB_CTR_CEN_DAYS',
 'REHAB_DIS': 'REHAB_CTR_DISCHARGES',
 'REHAB_BED_LIC': 'REHAB_CTR_LIC_BEDS',
 'GAC_DIS_SUBTOTL': 'GAC_SUBTOT_DISCHARGES',
 'GAC_LICBED_DAY_SUBTOTL': 'GAC_SUBTOT_LIC_BED_DAYS',
 'GAC_BED_LIC_SUBTOTL': 'GAC_SUBTOT_LIC_BEDS',
 'GAC_CENS_DAY_SUBTOTL': 'GAC_SUBTOT_CEN_DAYS',
 'CHEM_LICBED_DAY': 'CHEM_DEPEND_RECOVERY_LIC_BED_DAYS',
 'CHEM_DIS': 'CHEM_DEPEND_RECOVERY_DISCHARGES',
 'CHEM_CENS_DAY': 'CHEM_DEPEND_RECOV_CEN_DAYS',
 'CHEM_BED_LIC': 'CHEM_DEPEND_RECOVERY_LIC_BEDS',
 'PSY_BED_LIC': 'ACUTE_PSYCHIATRIC_LIC_BEDS',
 'PSY_DIS': 'ACUTE_PSYCHIATRIC_DISCHARGES',
 'PSY_CENS_DAY': 'ACUTE_PSYCHIATRIC_CEN_DAYS',
 'PSY_LICBED_DAY': 'ACUTE_PSYCHIATRIC_LIC_BED_DAYS',
 'SN_DIS': 'SN_DISCHARGES',
 'SN_TFR_INHOSP': 'SN_INTRA_TRANSFERS',
 'SN_LICBED_DAY': 'SN_LIC_BED_DAYS',
 'SN_CENS_DAY': 'SN_CEN_DAYS',
 'SN_BED_LIC': 'SN_LIC_BEDS',
 'IC_DIS': 'INTERMEDIATE_CARE_DISCHARGES',
 'IC_BED_LIC': 'INTERMEDIATE_CARE_LIC_BEDS',
 'IC_CENS_DAY': 'INTERMEDIATE_CARE_CEN_DAYS',
 'IC_LICBED_DAY': 'INTERMEDIATE_CARE_LIC_BED_DAYS',
 'IC_DEV_DISBL_DIS': 'INTERMEDIATE_CARE_DEV_DIS_DISCHARGES',
 'IC_DEV_DISBL_BED_LIC': 'INTERMEDIATE_CARE_DEV_DIS_LIC_BEDS',
 'IC_DEV_DISBL_LICBED_DAY': 'INTERMEDIATE_CARE_DEV_DIS_LIC_BED_DAYS',
 'IC_DEV_DISBL_CENS_DAY': 'INTERMEDIATE_CARE_DEV_DIS_CEN_DAYS',
 'HOSP_BED_LIC_TOTL': 'TOT_LIC_BEDS',
 'HOSP_CENS_DAY_TOTL': 'TOT_CEN_DAYS',
 'HOSP_LICBED_DAY_TOTL': 'TOT_LIC_BED_DAYS',
 'HOSP_DIS_TOTL': 'TOT_DISCHARGES',
 'CHEM_GAC_CENS_DAY': 'GAC_CDRS_CEN_DAYS',
 'CHEM_GAC_BED_LIC': 'GAC_CDRS_LIC_BEDS',
 'CHEM_GAC_DIS': 'GAC_CDRS_DISCHARGES',
 'CHEM_PSY_CENS_DAY': 'ACUTE_PSYCH_CEN_DAYS',
 'CHEM_PSY_DIS': 'ACUTE_PSYCH_DISCHARGES',
 'CHEM_PSY_BED_LIC': 'ACUTE_PSYCH_LIC_BEDS',
 'NEWBORN_NURSRY_BASSINETS': 'NEWBORN_NURSERY_BASSINETS',
 'NEWBORN_NURSRY_INFANTS': 'NEWBORN_NURSERY_INFANTS',
 'NEWBORN_NURSRY_CENS_DAY': 'NEWBORN_NURSERY_CEN_DAYS',
 'BED_SWING_SN': 'GEN_ACUTE_CARE_SN_SWING_BEDS',
 'PSY_LCK_CENS_PATIENT': 'ACUTE_PSYCHIATRIC_PATS_LOCKED_ON_1231',
 'PSY_OPN_CENS_PATIENT': 'ACUTE_PSYCHIATRIC_PATS_OPEN_ON_1231',
 'PSY_CENS_PATIENT_TOTL': 'ACUTE_PSYCHIATRIC_PATS_TOT_BY_PAYOR',
 'PSY_CENS_PATIENT_<=17': 'ACUTE_PSYCHIATRIC_PATS_0_TO_17_ON_1231',
 'PSY_CENS_PATIENT_18-64': 'ACUTE_PSYCHIATRIC_PATS_18_TO_64_ON_1231',
 'PSY_CENS_PATIENT_=65': 'ACUTE_PSYCHIATRIC_PATS_65_AND_UP_ON_1231',
 'PSY_CENS_PATIENT_MCAR': 'ACUTE_PSYCHIATRIC_PATS_MED_TRAD_ON_1231',
 'PSY_CENS_PATIENT_MNG_MCAR': 'ACUTE_PSYCHIATRIC_PATS_MED_MANAGED_CARE_ON_1231',
 'PSY_CENS_PATIENT_MCAL': 'ACUTE_PSYCHIATRIC_PATS_MED_CAL_TRAD_ON_1231',
 'PSY_CENS_PATIENT_MNG_MCAL': 'ACUTE_PSYCHIATRIC_PATS_MED_CAL_MANAGED_CARE_ON_1231',
 'PSY_CENS_PATIENT_CO_INDIG': 'ACUTE_PSYCHIATRIC_PATS_COUNTY_INDIGENT_PROG',
 'PSY_CENS_PATIENT_OTHR_THIRDPTY': 'ACUTE_PSYCHIATRIC_PATS_3RD_PARTIES_TRAD',
 'PSY_CENS_PATIENT_MNG_OTHR_THIRDPTY': 'ACUTE_PSYCHIATRIC_PATS_3RD_PARTIES_MANAGED_CARE',
 'PSY_CENS_PATIENT_SHDOYL': 'ACUTE_PSYCHIATRIC_PATS_SHORT_DOYLE',
 'PSY_CENS_PATIENT_OTHR_INDIG': 'ACUTE_PSYCHIATRIC_PATS_OTHER_INDIGENT',
 'PSY_CENS_PATIENT_OTHR_PAYER': 'ACUTE_PSYCHIATRIC_PATS_OTHER_PAYERS',
 'PSY_PROG_SHDOYL': 'SHORT_DOYLE_SERVICES_OFFERED',
 'HOSPICE_PROG': 'INPATIENT_HOSPICE_PROG_OFFERED',
 'HOSPICE_CLASS_GAC_BED': 'BED_CLASS_GEN_ACUTE_CARE_SERVICE',
 'HOSPICE_CLASS_SN_BED': 'BED_CLASS_SN_HOSPICE_SERVICE',
 'HOSPICE_CLASS_IC_BED': 'BED_CLASS_IC_HOSPICE_SERVICE',
 'INPATIENT_PALLIATIVE_CARE_PROG_OFFERED': 'INPATIENT_PALLIATIVE_CARE_PROG_OFFERED',
 'INPATIENT_PALLIATIVE_CARE_PROG_NURSES': 'INPATIENT_PALLIATIVE_CARE_PROG_NURSES',
 'INPATIENT_PALLIATIVE_CARE_PROG_NURSES_CERTIFIED': 'INPATIENT_PALLIATIVE_CARE_PROG_NURSES_CERTIFIED',
 'INPATIENT_PALLIATIVE_CARE_PROG_PHYSICIAN': 'INPATIENT_PALLIATIVE_CARE_PROG_PHYSICIAN',
 'INPATIENT_PALLIATIVE_CARE_PROG_PHYSICIAN_CERTIFIED': 'INPATIENT_PALLIATIVE_CARE_PROG_PHYSICIAN_CERTIFIED',
 'INPATIENT_PALLIATIVE_CARE_PROG_SOCIAL_WORKER': 'INPATIENT_PALLIATIVE_CARE_PROG_SOCIAL_WORKER',
 'INPATIENT_PALLIATIVE_CARE_PROG_SOCIAL_WORKER_CERTIFIED': 'INPATIENT_PALLIATIVE_CARE_PROG_SOCIAL_WORKER_CERTIFIED',
 'INPATIENT_PALLIATIVE_CARE_PROG_CHAPLAINS': 'INPATIENT_PALLIATIVE_CARE_PROG_CHAPLAINS',
 'OUTPATIENT_PALLIATIVE_CARE_SERV_OFFERED': 'OUTPATIENT_PALLIATIVE_CARE_SERV_OFFERED',
 'EMSA_TRAUMA_CTR_DESIG': 'EMSA_TRAUMA_DESIGNATION',
 'EMSA_TRAUMA_PEDS_CTR_DESIG': 'EMSA_TRAUMA_DESIGNATION_PEDIATRIC',
 'ED_LIC_LEVL_END': 'LIC_ED_LEV_END',
 'ED_LIC_LEVL_BEGIN': 'LIC_ED_LEV_BEGIN',
 'ED_ANESTH_AVAIL24HRS': 'AVAIL_SERVICES_ANESTHESIOLOGIST_24HR',
 'ED_ANESTH_AVAIL_ON_CALL': 'AVAIL_SERVICES_ANESTHESIOLOGIST_ON_CALL',
 'ED_LAB_SVCS_AVAIL24HRS': 'AVAIL_SERVICES_LAB_24HR',
 'ED_LAB_SVCS_AVAIL_ON_CALL': 'AVAIL_SERVICES_LAB_ON_CALL',
 'ED_OP_RM_AVAIL_ON_CALL': 'AVAIL_SERVICES_OPER_RM_ON_CALL',
 'ED_OP_RM_AVAIL24HRS': 'AVAIL_SERVICES_OPER_RM_24HR',
 'ED_PHARM_AVAIL24HRS': 'AVAIL_SERVICES_PHARMACIST_24HR',
 'ED_PHARM_AVAIL_ON_CALL': 'AVAIL_SERVICES_PHARMACIST_ON_CALL',
 'ED_PHYSN_AVAIL24HRS': 'AVAIL_SERVICES_PHYSICIAN_24HR',
 'ED_PHYSN_AVAIL_ON_CALL': 'AVAIL_SERVICES_PHYSICIAN_ON_CALL',
 'ED_PSYCH_ER_AVAIL24HRS': 'AVAIL_SERVICES_PSYCHIATRIC_ER_24HR',
 'ED_PSYCH_ER_AVAIL_ON_CALL': 'AVAIL_SERVICES_PSYCHIATRIC_ER_ON_CALL',
 'ED_RADIOL_SVCS_AVAIL_ON_CALL': 'AVAIL_SERVICES_RADIOLOGY_ON_CALL',
 'ED_RADIOL_SVCS_AVAIL24HRS': 'AVAIL_SERVICES_RADIOLOGY_24HR',
 'EMS_VISITS_NON_URGENT_ADMITTED': 'EMS_VISITS_NON_URGENT_ADMITTED',
 'EMS_VISITS_NON_URGENT_NOT_ADMITTED': 'EMS_VISITS_NON_URGENT_TOT',
 'EMS_VISITS_URGENT_ADMITTED': 'EMS_VISITS_URGENT_ADMITTED',
 'EMS_VISITS_URGENT_NOT_ADMITTED': 'EMS_VISITS_URGENT_TOT',
 'EMS_VISITS_MODERATE_ADMITTED': 'EMS_VISITS_MODERATE_ADMITTED',
 'EMS_VISITS_MODERATE_NOT_ADMITTED': 'EMS_VISITS_MODERATE_TOT',
 'EMS_VISITS_SEVERE_ADMITTED': 'EMS_VISITS_SEVERE_ADMITTED',
 'EMS_VISITS_SEVERE_NOT_ADMITTED': 'EMS_VISITS_SEVERE_TOT',
 'EMS_VISITS_CRITICAL_NOT_ADMITTED': 'EMS_VISITS_CRITICAL_TOT',
 'EMS_VISITS_CRITICAL_ADMITTED': 'EMS_VISITS_CRITICAL_ADMITTED',
 'EMS_NO_ADM_VIS_TOTL': 'EMER_DEPT_VISITS_NOT_RESULT_ADMISSIONS_TOT',
 'ED_TRAFFIC_TOTL': 'ER_TRAFFIC_TOT',
 'EMS_ADM_VIS_TOTL': 'ADMITTED_FROM_EMER_DEPT_TOT',
 'EMS_STATION': 'EMER_MED_TREAT_STATIONS_ON_1231',
 'EMS_NON_EMERG_VIS': 'NON_EMER_VISITS_IN_EMER_DEPT',
 'EMS_REGISTERS_NO_TREAT': 'EMER_REGISTRATIONS_PATS_LEAVE_WO_BEING_SEEN',
 'EMS_AMB_DIVERS': 'EMER_DEPT_AMBULANCE_DIVERSION_HOURS',
 'EMS_AMB_DIVERS_JAN_HOURS': 'EMER_DEPT_HR_DIVERSION_JAN',
 'EMS_AMB_DIVERS_FEB_HOURS': 'EMER_DEPT_HR_DIVERSION_FEB',
 'EMS_AMB_DIVERS_MAR_HOURS': 'EMER_DEPT_HR_DIVERSION_MAR',
 'EMS_AMB_DIVERS_APR_HOURS': 'EMER_DEPT_HR_DIVERSION_APR',
 'EMS_AMB_DIVERS_MAY_HOURS': 'EMER_DEPT_HR_DIVERSION_MAY',
 'EMS_AMB_DIVERS_JUN_HOURS': 'EMER_DEPT_HR_DIVERSION_JUN',
 'EMS_AMB_DIVERS_JUL_HOURS': 'EMER_DEPT_HR_DIVERSION_JUL',
 'EMS_AMB_DIVERS_AUG_HOURS': 'EMER_DEPT_HR_DIVERSION_AUG',
 'EMS_AMB_DIVERS_SEP_HOURS': 'EMER_DEPT_HR_DIVERSION_SEP',
 'EMS_AMB_DIVERS_OCT_HOURS': 'EMER_DEPT_HR_DIVERSION_OCT',
 'EMS_AMB_DIVERS_NOV_HOURS': 'EMER_DEPT_HR_DIVERSION_NOV',
 'EMS_AMB_DIVERS_DEC_HOURS': 'EMER_DEPT_HR_DIVERSION_DEC',
 'EMS_AMB_DIVERS_TOTL_HOURS': 'EMER_DEPT_HR_DIVERSION_TOT',
 'OP_RM_MIN_IP': 'INPATIENT_SURG_OPER_RM_MINS',
 'SURG_IP': 'INPATIENT_SURG_OPER',
 'SURG_OP': 'OUTPATIENT_SURG_OPER',
 'OP_RM_MIN_OP': 'OUTPATIENT_SURG_OPER_RM_MINS',
 'OP_RM_IP_ONLY': 'INPAT_OPER_RM',
 'OP_RM_OP_ONLY': 'OUTPAT_OPER_RM',
 'OP_RM_IP_AND_OP': 'INPAT_OUTPAT_OPER_RM',
 'OP_RM_TOTL': 'OPER_RM_TOT',
 'AMB_SURG_PROG': 'OFFER_AMBULATORY_SURG_PROG',
 'BIRTHS_LIVE_TOTL': 'LIVE_BIRTHS_TOT',
 'BIRTHS_LIVE_<5LBS_8OZ': 'LIVE_BIRTHS_LT_2500GM',
 'BIRTHS_LIVE_<3LBS_5OZ': 'LIVE_BIRTHS_LT_1500GM',
 'ABC_OP_PROG': 'OFFER_ALTERNATE_BIRTH_PROG',
 'ABC_LDR': 'ALTERNATE_SETTING_LDR',
 'ABC_LDRP': 'ALTERNATE_SETTING_LDRP',
 'BIRTHS_LIVE_ABC': 'LIVE_BIRTHS_IN_ALTERNATIVE_SETTING',
 'BIRTHS_LIVE_C_SEC': 'LIVE_BIRTHS_C_SECTION',
 'LICENSURE_CVSURG_SVCS': 'LIC_CARDIOLOGY_CARDIOVASCULAR_SURG_SERVICES',
 'CVSURG_LIC_OP_RM': 'CARDIOVASCULAR_OPER_RM',
 'CVSURG_WITH_ECBPASS_PED': 'CARDIOVASCULAR_SURG_OPER_PEDIATRIC_BYPASS_USED',
 'CVSURG_WITHOUT_ECBPASS_PED': 'CARDIOVASCULAR_SURG_OPER_PEDIATRIC_BYPASS_NOT_USED',
 'CVSURG_WITHOUT_ECBPASS_ADLT': 'CARDIOVASCULAR_SURG_OPER_ADULT_BYPASS_NOT_USED',
 'CVSURG_WITH_ECBPASS_ADLT': 'CARDIOVASCULAR_SURG_OPER_ADULT_BYPASS_USED',
 'CVSURG_WITHOUT_ECBPASS_TOTL': 'CARDIOVASCULAR_SURG_OPER_BYPASS_NOT_USED_TOT',
 'CVSURG_WITH_ECBPASS_TOTL': 'CARDIOVASCULAR_SURG_OPER_BYPASS_USED_TOT',
 'CVSURG_CABG_TOTL': 'CORONARY_ARTERY_BYPASS_GRAFT_SURG',
 'CATH_CARD_RM': 'CARDIAC_CATHETERIZATION_LAB_RM',
 'CATH_IP_PED_THER_VIS': 'CARD_CATH_PED_IP_THER_VST',
 'CATH_IP_PED_DX_VIS': 'CARD_CATH_PED_IP_DIAG_VST',
 'CATH_OP_PED_THER_VIS': 'CARD_CATH_PED_OP_THER_VST',
 'CATH_OP_PED_DX_VIS': 'CARD_CATH_PED_OP_DIAG_VST',
 'CATH_IP_ADLT_DX_VIS': 'CARDIAC_CATHETERIZATION_ADULT_INPAT_DIAGNOSTIC_VISITS',
 'CATH_IP_ADLT_THER_VIS': 'CARDIAC_CATHETERIZATION_ADULT_INPAT_THERAPEUTIC_VISITS',
 'CATH_OP_ADLT_THER_VIS': 'CARDIAC_CATHETERIZATION_ADULT_OUTPAT_THERAPEUTIC_VISITS',
 'CATH_OP_ADLT_DX_VIS': 'CARDIAC_CATHETERIZATION_ADULT_OUTPAT_DIAGNOSTIC_VISITS',
 'CATH_THER_VIS_TOTL': 'CARDIAC_CATHETERIZATION_THERAPEUTIC_VISITS_TOT',
 'CATH_DX_VIS_TOTL': 'CARDIAC_CATHETERIZATION_DIAGNOSTIC_VISITS_TOT',
 'CARDIAC_CATH_DX_PROC': 'DIAGNOSTIC_CARDIAC_CATH_PROC',
 'MYOCARDIAL_BIOPSY': 'MYOCARDIAL_BIOPSY',
 'PACEMKR_PERM_IMPL': 'PERM_PACEMAKER_IMPLANT',
 'OTH_PERM_PACEMAKER_PROC': 'OTH_PERM_PACEMAKER_PROC',
 'ICD_IMPLANTATION': 'IMPLANABLE_CARDIO_DEFIB_IMPLANTATION',
 'OTH_ICD_PROCEDURES': 'OTH_PROCEDURES',
 'PCI_WITH_STENT': 'PCI_WITH_STENT',
 'PCI_WITHOUT_STENT': 'PCI_WO_STENT',
 'ATHERECTOMY_PTCRA': 'ATHERECTOMY',
 'THROMBO_AGT': 'THROMBOLYTIC_AGENTS',
 'PTBV': 'PTBV',
 'DX_EP_STUDY': 'DIAGNOSTIC_ELECTROPHYSIOLOGY_EP',
 'CATHETER_ABLATION': 'CATHETER_ABLATION',
 'PERIPHERAL_VASCULAR_ANGIOGRAPHY': 'PERIPHERAL_VASCULAR_ANGIOGRAPHY',
 'PERIPHERAL_VASCULAR_INTERVENTIONAL': 'PERIPHERAL_VASCULAR_INTERVENTIONAL',
 'CAROTID_STENTING': 'CAROTID_STENTING',
 'INTRA_AORTIC_BALLOON_PUMP_INS': 'INTRA_AORTIC_BALLOON_PUMP_INSERTION',
 'CATHETER_BASED_VENTRICULAR_INS': 'CATHETER_BASED_VENTRICULAR_ASSIST_DEVICE_INSERTION',
 'OTHR_CATH_PROC': 'ALL_OTHER_CATHETERIZATION_PROC',
 'CATH_TOTL': 'CATHETERIZATION_PROC_TOT',
 'EQUIP_ACQ_OVER_500K': 'FAC_ACQUIRE_EQUIP_OVER_500K',
 'EQUIP_01_ACQUI_DT': 'DT_AQUIRE_EQUIP_01',
 'EQUIP_01_VALUE': 'EQUIP_VAL_01',
 'EQUIP_01_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_01',
 'EQUIP_01_DESCRIP': 'DESC_EQUIP_01',
 'EQUIP_02_DESCRIP': 'DESC_EQUIP_02',
 'EQUIP_02_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_02',
 'EQUIP_02_ACQUI_DT': 'DT_AQUIRE_EQUIP_02',
 'EQUIP_02_VALUE': 'EQUIP_VAL_02',
 'EQUIP_03_VALUE': 'EQUIP_VAL_03',
 'EQUIP_03_ACQUI_DT': 'DT_AQUIRE_EQUIP_03',
 'EQUIP_03_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_03',
 'EQUIP_03_DESCRIP': 'DESC_EQUIP_03',
 'EQUIP_04_DESCRIP': 'DESC_EQUIP_04',
 'EQUIP_04_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_04',
 'EQUIP_04_ACQUI_DT': 'DT_AQUIRE_EQUIP_04',
 'EQUIP_04_VALUE': 'EQUIP_VAL_04',
 'EQUIP_05_VALUE': 'EQUIP_VAL_05',
 'EQUIP_05_ACQUI_DT': 'DT_AQUIRE_EQUIP_05',
 'EQUIP_05_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_05',
 'EQUIP_05_DESCRIP': 'DESC_EQUIP_05',
 'EQUIP_06_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_06',
 'EQUIP_06_ACQUI_DT': 'DT_AQUIRE_EQUIP_06',
 'EQUIP_06_DESCRIP': 'DESC_EQUIP_06',
 'EQUIP_06_VALUE': 'EQUIP_VAL_06',
 'EQUIP_07_VALUE': 'EQUIP_VAL_07',
 'EQUIP_07_ACQUI_DT': 'DT_AQUIRE_EQUIP_07',
 'EQUIP_07_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_07',
 'EQUIP_07_DESCRIP': 'DESC_EQUIP_07',
 'EQUIP_08_DESCRIP': 'DESC_EQUIP_08',
 'EQUIP_08_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_08',
 'EQUIP_08_ACQUI_DT': 'DT_AQUIRE_EQUIP_08',
 'EQUIP_08_VALUE': 'EQUIP_VAL_08',
 'EQUIP_09_VALUE': 'EQUIP_VAL_09',
 'EQUIP_09_ACQUI_DT': 'DT_AQUIRE_EQUIP_09',
 'EQUIP_09_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_09',
 'EQUIP_09_DESCRIP': 'DESC_EQUIP_09',
 'EQUIP_10_DESCRIP': 'DESC_EQUIP_10',
 'EQUIP_10_ACQUI_MEANS': 'MEANS_FOR_ACQUISITION_10',
 'EQUIP_10_ACQUI_DT': 'DT_AQUIRE_EQUIP_10',
 'EQUIP_10_VALUE': 'EQUIP_VAL_10',
 'CAP_EXP_OVER_1MIL': 'PROJ_OVER_1M',
 'PROJ_01_PROJTD_CAP_EXP': 'PROJ_EXPENDITURES_01',
 'PROJ_01_OSHPD_PROJ_NO': 'OSHPD_PROJ_NO_01',
 'PROJ_01_DESCRIP_CAP_EXP': 'DESC_PROJ_01',
 'PROJ_02_DESCRIP_CAP_EXP': 'DESC_PROJ_02',
 'PROJ_02_OSHPD_PROJ_NO': 'OSHPD_PROJ_NO_02',
 'PROJ_02_PROJTD_CAP_EXP': 'PROJ_EXPENDITURES_02',
 'PROJ_03_OSHPD_PROJ_NO': 'OSHPD_PROJ_NO_03',
 'PROJ_03_PROJTD_CAP_EXP': 'PROJ_EXPENDITURES_03',
 'PROJ_03_DESCRIP_CAP_EXP': 'DESC_PROJ_03',
 'PROJ_04_DESCRIP_CAP_EXP': 'DESC_PROJ_04',
 'PROJ_04_OSHPD_PROJ_NO': 'OSHPD_PROJ_NO_04',
 'PROJ_04_PROJTD_CAP_EXP': 'PROJ_EXPENDITURES_04',
 'PROJ_05_OSHPD_PROJ_NO': 'OSHPD_PROJ_NO_05',
 'PROJ_05_PROJTD_CAP_EXP': 'PROJ_EXPENDITURES_05',
 'PROJ_05_DESCRIP_CAP_EXP': 'DESC_PROJ_05'}

We can now define a new version of the read_hospital_data function that uses the crosswalk to change the column names when year < 2018. You can use the dict .get method, which tries to get the value for the key in its first argument and returns its second argument if that key isn’t in the dict. Let’s also change function to exclude columns with ALOS in the name, because they have no equivalent in the pre-2018 files:

def read_hospital_data(path, year):
    # Read the 2nd sheet of the file.
    sheet = pl.read_excel(path, sheet_id=2)
    
    # Remove the first 4 and last row.
    sheet = sheet[4:-1, :]

    # Fix pre-2018 column names.
    if year < 2018:
        sheet.columns = [cwalk.get(c, c) for c in sheet.columns]

    # Select only a few columns of interest.
    facility_cols = pl.col(
        "FAC_NAME", "FAC_CITY", "FAC_ZIP", "FAC_OPERATED_THIS_YR",
        "FACILITY_LEVEL", "TEACH_HOSP", "COUNTY", "PRIN_SERVICE_TYPE",
    )

    sheet = sheet.select(
        pl.lit(year).alias("year"),
        facility_cols,
        pl.col("^TOT.*$", "^.*RESPIRATORY.*$")
            .exclude("^.*ALOS.*$")
            .cast(pl.Float64),
    )

    # Rename the columns to lowercase.
    sheet = sheet.rename(str.lower)

    return sheet

Now we can test the function on all of the files. This time, we’ll check that

hosps = []

for path in paths:
    year = get_hospital_year(path)
    hosp = read_hospital_data(path, year)
    hosps.append(hosp)

len(hosps)
8

You can use the pl.concat function to concatenate, or stack, a list of data frames:

hosps = pl.concat(hosps)
hosps.head()
shape: (5, 18)
yearfac_namefac_cityfac_zipfac_operated_this_yrfacility_levelteach_hospcountyprin_service_typetot_lic_bedstot_lic_bed_daystot_dischargestot_cen_daysacute_respiratory_care_lic_bedsacute_respiratory_care_lic_bed_daysacute_respiratory_care_dischargesacute_respiratory_care_intra_transfersacute_respiratory_care_cen_days
i32strstrstrstrstrstrstrstrf64f64f64f64f64f64f64f64f64
2021"ALAMEDA HOSPITAL""ALAMEDA""94501""Yes""Parent Facility""No""Alameda""General Medical / Surgical"247.090155.02431.066473.00.00.0nullnullnull
2021"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94705""Yes""Parent Facility""No""Alameda""General Medical / Surgical"339.0123735.011344.053710.00.00.0nullnullnull
2021"UCSF BENIOFF CHILDREN'S HOSPIT…"OAKLAND""94609""Yes""Parent Facility""No""Alameda""Pediatric"215.078475.07256.038381.00.00.0nullnullnull
2021"FAIRMONT HOSPITAL""SAN LEANDRO""94578""Yes""Consolidated Facility""No""Alameda""Long-Term Care (SN / IC)"109.039785.0179.038652.00.00.0nullnullnull
2021"ALTA BATES SUMMIT MEDICAL CENT…"BERKELEY""94704""Yes""Consolidated Facility""No""Alameda""Psychiatric"68.024820.02236.019950.00.00.0nullnullnull

We’ve finally got all of the data in a single data frame!

To begin to address whether hospital utilization changed in 2020, let’s make a bar plot of total census-days:

from plotnine import *

(
    ggplot(hosps) +
    aes(x = "year", weight = "tot_cen_days") +
    geom_bar()
)

According to the plot, total census-days was slightly lower in 2020 than in 2019. This is a bit surprising, but it’s possible that California hospitals typically operate close to maximum capacity and were not able to substantially increase the number of beds in 2020 in response to the COVID-19 pandemic. You can use other columns in the data set, such as tot_lic_beds or tot_lic_bed_days, to check this.

Let’s also look at census-days for acute respiratory care:

(
    ggplot(hosps) +
    aes(x = "year", weight = "acute_respiratory_care_cen_days") +
    geom_bar()
)

In this plot, there’s a clear uptick in census-days in 2020, and then an interesting decrease to below 2019 levels in the years following. Again, you could use other columns in the data set to investigate this further. We’ll end this case study here, having accomplished the difficult task of reading the data and the much easier task of doing a cursory preliminary analysis of the data.

4.5. Exercises#

4.5.1. Exercise 1#

Try writing a function is_leap that detects leap years. The input to your function should be an integer year (or a series of years), and the output should be a Boolean value. A year is a leap year if either of these conditions is true:

  • It is divisible by 4 and not 100

  • It is divisible by 400

That means the years 2004 and 2000 are leap years, but the year 2200 is not.

Hint: The modulo operator % returns the remainder after divding a number, so for example 4 % 3 returns 1.

Here’s a few test cases for your function:

is_leap(400)
is_leap(1997)