COVID-19 in Prisons

# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from otter import Notebook

In California, during the early months of the COVID-19 pandemic, the coronavirus spread most quickly through state prisons. Close quarters between roommates, poor living conditions, and uncertainty about how the virus was transmitted made outbreaks extremely difficult to contain. The data we’ll be analyzing comes from California’s open data portal. Feel free to read more about the data on the CDCR’s website.

# load the data
covid_in_prisons = pd.read_csv('https://cal-icor.github.io/textbook.data/ucb/eth-std-21ac/covid19dashboard.csv')
covid_in_prisons

About the Data

Before we jump into any data analysis, it’s important to familiarize ourselves with the dataset.

Recall that the ‘prisons’ dataset was a Table. Here, the covid_in_prisons dataset is a DataFrame. (If you’d like to know more, read through the documentation of DataFrames.) It looks just like a table did and has very similar features, but slightly different ways to walk through them.
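As a quick orientation, a few attributes cover most of what you need to inspect a DataFrame. The sketch below uses a tiny made-up DataFrame (the institution names and numbers are illustrative assumptions, not CDCR values) to show how to check its size, its column names, and its first rows.

```python
import pandas as pd

# Toy data for illustration only; these are NOT real CDCR values.
toy = pd.DataFrame({
    "InstitutionName": ["Prison A", "Prison A", "Prison B"],
    "Date": ["2020-03-10", "2020-03-11", "2020-03-10"],
    "TotalConfirmed": [0, 2, 1],
})

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # the column labels
first_rows = toy.head(2)          # the first two rows

print(n_rows, n_cols)
print(column_names)
print(first_rows)
```

The same three calls should work on covid_in_prisons once it has been loaded.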

QUESTION 1 : Scroll through the prisons dataset above. What do you notice? What do the rows and columns mean?

Replace this text with your response!

In this notebook, we are focusing less on the technicalities of code writing and more on the common techniques that are used to manipulate data that we have already seen.

Below, display the different column names.

covid_in_prisons.columns

Each row represents a prison or detention center on a particular day. The data starts March 10, 2020 and ends January 7, 2022. Below, display the different prisons and detention centers represented in the data using the unique() method, which is called on a single column of the DataFrame.

institution_names = covid_in_prisons['InstitutionName'].unique()
institution_names

Based on the following cell’s code, we can see that there are 35 institutions represented in the covid_in_prisons DataFrame.

# len() is a function that returns the length of an array/list (in this case, institution_names)
len(institution_names)
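The two steps above (collect the distinct values, then count them) can also be done in one call. The sketch below uses a small made-up column of names, which is an assumption for illustration, to show that len() of unique() and nunique() agree.

```python
import pandas as pd

# Toy column of institution names (made up for illustration).
names = pd.Series(["Prison A", "Prison B", "Prison A", "Prison C"])

unique_names = names.unique()        # distinct values, in order of first appearance
count_via_len = len(unique_names)    # count them with len(), as in the cell above
count_via_nunique = names.nunique()  # same count in a single step

print(list(unique_names), count_via_len, count_via_nunique)
```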

To see how many rows/records there are for any institution in our DataFrame, we count the rows whose ‘InstitutionName’ column matches the given institution name.

Written in code, this is sum(covid_in_prisons['InstitutionName'] == 'Example Prison').

In the below cell, we’ve found the total number of rows/records from Avenal State Prison that exist in the covid_in_prisons DataFrame.

sum(covid_in_prisons['InstitutionName'] == 'Avenal State Prison (ASP)')
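If it is unclear why summing a comparison counts rows, the sketch below breaks the expression into two steps on a made-up three-row DataFrame (the names are illustrative assumptions). The comparison produces one True/False value per row, and each True counts as 1 when summed.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"InstitutionName": ["Prison A", "Prison B", "Prison A"]})

mask = toy["InstitutionName"] == "Prison A"  # one True/False per row
n_matching = int(mask.sum())                 # True counts as 1, False as 0

print(list(mask))
print(n_matching)
```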

QUESTION 2 : Replicate the code from the cell above to find the number of rows/records from San Quentin State Prison. (Feel free to copy & paste the above cell.)

*Note: For your code to work, the prison name needs to be written identically to how it is represented in institution_names (including spaces, capitalization, and parentheses). It is written in institution_names as: ‘San Quentin State Prison (SQ)’*

... # Delete this comment and write your line of code here
# (copy and paste the cell above then modify the prison name!)

Your answer should again be 736!

We can actually see that there are the same number (736) of records/rows for each Institution in our DataFrame by running the following cell. (There will be more discussion on how this doesn’t imply that the data from each Institution is collected in the same ways.)

covid_in_prisons.groupby('InstitutionName').size()
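Rather than reading through the output by eye, one way to confirm that every group has the same size is to count the distinct group sizes. This is a minimal sketch on made-up data (names and dates are assumptions): if only one distinct size exists, all institutions have the same number of rows.

```python
import pandas as pd

# Toy data for illustration only: two institutions, two dates each.
toy = pd.DataFrame({
    "InstitutionName": ["Prison A", "Prison B", "Prison A", "Prison B"],
    "Date": ["2020-03-10", "2020-03-10", "2020-03-11", "2020-03-11"],
})

sizes = toy.groupby("InstitutionName").size()  # rows per institution
all_equal = sizes.nunique() == 1               # True if every group has the same size

print(sizes)
print(all_equal)
```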

Now, let’s start our EDA (exploratory data analysis).

Exploratory Data Analysis (EDA)

IMPORTANT NOTE: the columns TotalConfirmed and TotalDeaths in the prisons dataset are cumulative.

For example, if we take row 9992 in the covid_in_prisons dataset and look at the number in the ‘TotalConfirmed’ column, 111 represents the TOTAL number of confirmed cases at Ironwood State Prison (ISP) up until 10/13/20. It’s important to recognize that there were not 111 new cases at ISP on this day, but rather 111 total confirmed cases at ISP up until 10/13/20.

# iloc selects the row at integer position 9992 (positions start at 0) and shows the data from each column for that row.
covid_in_prisons.iloc[9992]
# Read the notes about what this data means above!
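To make the difference between cumulative totals and new cases concrete, the sketch below recovers day-over-day new cases from a cumulative column using diff(). The three rows are made-up numbers for a single hypothetical institution, not values from the dataset.

```python
import pandas as pd

# Toy cumulative counts for ONE hypothetical institution, sorted by date.
toy = pd.DataFrame({
    "Date": ["2020-10-11", "2020-10-12", "2020-10-13"],
    "TotalConfirmed": [10, 14, 21],
})

# diff() subtracts the previous row, turning running totals into daily changes.
# The first row has no previous row, so its value is NaN.
toy["NewCases"] = toy["TotalConfirmed"].diff()

print(toy)
```

On the real data this would need to be applied separately per institution (for example after filtering to one institution and sorting by date), since rows from different prisons are interleaved.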

Below, let’s isolate the data from the last day in the data set. This way, we can see the total number of confirmed cases along with the total number of deaths from each prison/detention center. We’ll isolate data from 1/6/22.

jan6_22 = covid_in_prisons[covid_in_prisons['Date'] == '2022-01-06']
jan6_22 = jan6_22.sort_values('TotalConfirmed', ascending=False)
jan6_22
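The date in the cell above is typed in by hand. As a hedged alternative, the latest date can be looked up from the data itself. This sketch assumes, as in the cell above, that dates are stored as 'YYYY-MM-DD' strings, which sort correctly as text. The rows are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "InstitutionName": ["Prison A", "Prison A", "Prison B", "Prison B"],
    "Date": ["2022-01-05", "2022-01-06", "2022-01-05", "2022-01-06"],
    "TotalConfirmed": [5, 9, 3, 4],
})

last_date = toy["Date"].max()  # 'YYYY-MM-DD' strings sort chronologically
latest = (
    toy[toy["Date"] == last_date]
    .sort_values("TotalConfirmed", ascending=False)
)

print(last_date)
print(latest)
```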

According to the table above, the center with the highest number of confirmed cases is Avenal State Prison (ASP). The center with the lowest number of confirmed cases is Pelican Bay State Prison (PBSP). Note that we have no data on how often each prison tests inmates. Some centers may test more frequently than others, which can raise their case counts because an individual can test positive multiple times.

QUESTION 3: The table above is sorted in decreasing order according to the “TotalConfirmed” column. Do you recognize any prisons that are at the top of the table?

Replace this text with your response!

Next, we will create a bar chart to visualize the 8 institutions with the highest total confirmed cases.

top8 = jan6_22[:8].reset_index()
sns.barplot(data=top8, y="InstitutionName", x='TotalConfirmed')
plt.xlabel('Total Confirmed Cases')
plt.ylabel('Institution Name')
plt.title('Total Confirmed Cases for Top 8 Institutions')

QUESTION 4: What are some potential issues in drawing conclusions based solely on totals in data?

Hint: Consider how the populations of each prison may relate to the number of positive cases.

Replace this text with your response!

Totals are useful because they report exact counts. However (especially if you’re stuck on Question 4), consider this: in a bar chart like the one above, a prison with 200 positive cases would appear as a very small bar. Yet a prison with 200 positive cases and a population of 250 has a severe outbreak, while 200 positive cases in a population of 2,000 is far less severe.
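The comparison above can be worked out directly. The sketch below divides cases by population for the two hypothetical prisons described in the text. The helper name cases_per_person is made up for this example, and because one person can test positive more than once, the result is only a rough proxy for the share of people infected.

```python
def cases_per_person(cases: int, population: int) -> float:
    """Positive cases divided by population (a rough proxy, not an exact infection rate)."""
    return cases / population

small_prison = cases_per_person(200, 250)   # 200 cases among 250 people
large_prison = cases_per_person(200, 2000)  # 200 cases among 2,000 people

print(f"{small_prison:.0%} vs {large_prison:.0%}")  # prints "80% vs 10%"
```

The totals are identical, but relative to population the smaller prison is hit eight times harder.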

For the rest of this notebook we are going to use the totals. We noted this just to remind you to always consider factors of the data that can be misleading and to acknowledge what is and is not represented in the covid_in_prisons DataFrame.

Next, let’s analyze the total number of new cases in the 14 days before each collection record (from the ‘NewInTheLast14Days’ column) across all institutions.

per14Days = covid_in_prisons.groupby('Date')['NewInTheLast14Days'].sum().to_frame().reset_index()
per14Days
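To see what the group-and-sum step does, here is a minimal sketch on four made-up rows (two hypothetical institutions on two dates). Rows sharing a date are collapsed into one row whose value is the sum across institutions.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Date": ["2020-03-10", "2020-03-10", "2020-03-11", "2020-03-11"],
    "InstitutionName": ["Prison A", "Prison B", "Prison A", "Prison B"],
    "NewInTheLast14Days": [1, 2, 3, 5],
})

# One row per date, holding the sum over all institutions on that date.
per_date = toy.groupby("Date")["NewInTheLast14Days"].sum().reset_index()

print(per_date)
```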

Next, create a line plot below to visualize the table above.

sns.lineplot(data=per14Days, x='Date', y='NewInTheLast14Days')
plt.xticks(['2020-03-10', '2020-06-10', '2020-09-10', '2020-12-10', '2021-03-10', '2021-06-10', '2021-09-10', '2021-12-10', '2022-03-10'], rotation=90)
plt.ylabel('New Cases in the Last 14 Days')
plt.title('New Cases Per 14 Day Period')
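Peaks can also be located numerically instead of by eye. This sketch, on made-up values, converts the date strings to real dates with pd.to_datetime and then finds the date on which the column is largest. The numbers and dates are assumptions for illustration and say nothing about where the real peaks fall.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Date": ["2020-03-10", "2020-03-11", "2020-03-12"],
    "NewInTheLast14Days": [40, 90, 10],
})

toy["Date"] = pd.to_datetime(toy["Date"])    # strings -> datetime values
peak_index = toy["NewInTheLast14Days"].idxmax()  # row label of the largest value
peak_date = toy.loc[peak_index, "Date"]

print(peak_date.date())
```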

QUESTION 5: What do you notice about the above line plot? With which months and years do the two main peaks seem to coincide? Optional: Any ideas about what was going on during these times in regards to COVID-19?

Replace this text with your response!

Below, calculate the death rate for each institution (from Jan 6, 2022).

# Since the data is cumulative, the jan6_22 DataFrame already holds each institution's total deaths.
# Dividing by 736 (the number of rows, one per recorded day, for each institution) gives the average deaths per recorded day.
rates = jan6_22['TotalDeaths']/736

# Create a new column in the jan6_22 DataFrame 
jan6_22['DeathRates'] = rates

# Sort the DeathRates column in descending order.
jan6_22.sort_values('DeathRates', ascending=False)
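Because every institution is divided by the same number (736), the ranking above is the same as ranking by total deaths. A different definition of "death rate", shown here only as a hedged alternative and not as the notebook's method, divides deaths by confirmed cases. The sketch uses made-up numbers for two hypothetical institutions, and the column name DeathsPerCase is an assumption.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "InstitutionName": ["Prison A", "Prison B"],
    "TotalConfirmed": [1000, 200],
    "TotalDeaths": [10, 8],
})

# Alternative definition: deaths per confirmed case.
toy["DeathsPerCase"] = toy["TotalDeaths"] / toy["TotalConfirmed"]
ranked = toy.sort_values("DeathsPerCase", ascending=False)

print(ranked)
```

In this toy example Prison A has more deaths in total, but Prison B ranks first once deaths are compared with the number of cases.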

QUESTION 6: Name the three institutions with the highest COVID-19 death rates based on the DataFrame above. Read through the comments of the cell above for an explanation of how the DataFrame above was created.

Replace this text with your response!

Below we plot line graphs that show the Positive COVID Cases and COVID Deaths over time at the two prisons with the highest death rate in our data set (California Institution for Men (CIM) and San Quentin State Prison (SQ)).

# Run this code to create line plots ... feel free to ignore this code or take a look if interested!
CIM_only = covid_in_prisons[covid_in_prisons['InstitutionName'] == 'California Institution for Men (CIM)'].sort_values('Date')
sns.lineplot(x = 'Date', y='TotalDeaths', data=CIM_only, label = 'Total COVID Deaths at CIM')
sns.lineplot(x = 'Date', y='TotalConfirmed', data=CIM_only, label = 'Total Positive COVID Cases at CIM')
SQ_only = covid_in_prisons[covid_in_prisons['InstitutionName'] == 'San Quentin State Prison (SQ)'].sort_values('Date')
sns.lineplot(x = 'Date', y='TotalDeaths', data=SQ_only, label = 'Total COVID Deaths at SQ')
sns.lineplot(x = 'Date', y='TotalConfirmed', data=SQ_only, label = 'Total Positive COVID Cases at SQ')
plt.title('COVID Cases and COVID Deaths at California Institution for Men (CIM) and San Quentin State Prison (SQ)')
plt.ylabel('Number of People')
plt.xticks(['2020-03-10', '2020-06-10', '2020-09-10', '2020-12-10', '2021-03-10', '2021-06-10', '2021-09-10', '2021-12-10', '2022-03-10'], rotation=90);

Note that the Total COVID Deaths lines for both institutions are on the plot even though you likely can’t tell them apart; they overlap.

QUESTION 7: Which institution seems to have had a major spike in Total Positive COVID Cases in 2020? Approximately what month(s) did this take place in?

Replace this text with your response!

San Quentin State Prison in 2020

We will now look at San Quentin State Prison in particular during 2020, the primary pandemic year. We will plot 3 different lines on one graph to compare the San Quentin State Prison’s population, cumulative positive COVID cases, and cumulative deaths caused by COVID.

SQ_Covid_2020 = pd.read_csv("./data/SQ_2020.csv") #importing cleaned data about San Quentin SP for 2020
prisons_df = pd.read_csv("./data/monthly_cdcr_upto2022.csv") #importing prisons table from Lecture 1 as a DataFrame
SQ_pop_df = prisons_df[prisons_df['institution_name']== 'SQ (SAN QUENTIN SP)']
SQ_pop_2020 = SQ_pop_df[SQ_pop_df['year'] == 2020]
# Run this cell, it isn't supposed to return anything since there are only assignment statements
sns.lineplot(x='month', y='population_felons', data=SQ_pop_2020, label='SQ Population');
sns.lineplot(x = 'Date', y='Total Deaths', data=SQ_Covid_2020, label = 'Total COVID Deaths at SQ');
sns.lineplot(x = 'Date', y='Total Confirmed', data=SQ_Covid_2020, label = 'Total Positive COVID Cases at SQ');
plt.xlabel('Month (as a number, e.g., 2 = February 2020)')
plt.ylabel('Number of People')
plt.title('San Quentin State Prison in 2020')
plt.ylim(0,5000);
# Don't worry if a red message pops up with your graph - it is a warning, but not an error.
# All you need for analysis is the graph
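To relate the population line to the case line numerically, the two tables can be joined on the month. The sketch below is a minimal example on made-up numbers. It reuses the column names from the cell above and assumes that the 'Date' column of the San Quentin file holds month numbers, as the x-axis label suggests. The column cases_per_100 is a name invented for this example.

```python
import pandas as pd

# Toy data for illustration only; NOT real San Quentin values.
pop = pd.DataFrame({"month": [3, 4, 5], "population_felons": [4000, 3900, 3800]})
cases = pd.DataFrame({"Date": [3, 4, 5], "Total Confirmed": [0, 39, 380]})

# Line up population and cumulative cases for the same month.
merged = pop.merge(cases, left_on="month", right_on="Date")

# Cumulative confirmed cases per 100 people in the population that month.
merged["cases_per_100"] = 100 * merged["Total Confirmed"] / merged["population_felons"]

print(merged[["month", "population_felons", "Total Confirmed", "cases_per_100"]])
```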

QUESTION 8: Based on this graph, assign a range of months to the following:

(Replace each ... below with your response.)

Few to no COVID cases in the following month range: ...

Rapidly increasing number of COVID cases in the following month range: ...

High but plateauing/minimally changing number of COVID cases in the following month range: ...

As you may have noticed, the graph shows a decline in San Quentin State Prison’s population of about 1,000 people from the start to the end of 2020. The San Quentin State Prison website from CDCR (the source of the data) addressed this decrease in population with this statement: “To reduce crowding and promote better physical distancing, San Quentin State Prison’s inmate population has been reduced from 4,051 in March [2020] to 3,129 on Aug. 12 [2020]. This reduction has been accomplished through the suspension of intake from county jails, expedited releases and natural releases from the prison.”

QUESTION 9: What are your opinions on this quotation? What are the consequences (positive and/or negative) of these reactions to the COVID outbreak in San Quentin State Prison? There is no right answer; this question asks you to consider the quotation and the context of the data to formulate an opinion of your own.

Replace this text with your response!

QUESTION 10: Reflection time! Please write a minimum of 4 sentences about what you’ve come to realize from this COVID_in_Prisons Homework Notebook.

If you need any ideas for what to brainstorm, consider these topics:

Replace this text with your response!

Thank you for taking the time to complete the homework notebook! If you are struggling, remember to check out the help resources listed at the top of this notebook to get all of your questions answered. We hope you enjoyed this notebook and are so glad you took a step out of your comfort zone to view this material from a new perspective.

TO SUBMIT YOUR HOMEWORK:

1. Run all your cells by clicking “Cell” > “Run All”. (Should take no more than ~10 seconds)

2. Save the notebook by clicking “File” > “Save and checkpoint”.

3. Run the next cell!

#This may take a few extra seconds.
from otter.export import export_notebook
from IPython.display import display, HTML
export_notebook("covid-prisons.ipynb", filtering=True, pagebreaks=True)
display(HTML("Save this notebook, then click <a href='covid-prisons.pdf' download>here</a> to open the pdf."))

If the cell above returns an error or clicking the link that is returned doesn’t work for you, don’t worry! Simply right click and choose “Save Link As...” to save a copy of your pdf onto your computer. Then, complete Step 4: Submit completed Lecture 1 notebook to your corresponding bCourses assignment.


Notebook developed by: Skye Pickett and Caitlin Yee

Data Science Modules: http://data.berkeley.edu/education/modules