Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Oh Rats!: A Data Science Dive Into Invasive Species and Island Sustainability

Table of Contents

Oh Rats!: A Data Science Dive Into Invasive Species and Island Sustainability

Import Libraries and Their Aliases

Here we will import many of the industry-standard libraries including NumPy, Pandas, MatPlotLib, and Plotly. Feel free to find these packages on the Python Package Index!

For your reference, we’ll break down one line of the large cell below. In this line:

import pandas as pd

We import the pandas library. Essentially, we’re telling python that we want to use a specific set of functions that has been named “pandas”. We then use the as keyword to specify that we want to use some other name to refer to “pandas”; in this context we told python we want to call it “pd”.

Lets say we want to use the pandas.notnull() function (this checks if any values in an array or list are None). Because we imported “pandas” as “pd”, we could now call pd.notnull()

Note: We name pandas as “pd” because it is the standard name in the industry, however we could have said:

import pandas as harrystyles

Which would’ve allowed us to run Harry_Styles.notnull(). This would be confusing for a reader, so its considered best practice to use standard names such as “pd” for “pandas” or “np” for “numpy”.

The cell below runs what we just talked about. Feel free to play around and try things out!

import pandas as pd 
import pandas as harrystyles
print("Harry Styles is a Panda?", harrystyles == pd)
truth = [1, 2, 3, None, 4, harrystyles,
         "Harry Styles", None, None, 
         5 * 2 - 1 / 1e-20, 1]
print(harrystyles.notnull(truth))
print(pd.notnull(truth))

# these lines delete the two imports we just made
del harrystyles
del pd

Now, lets get to work! Here we’ll import the actual modules we’ll need for this notebook!

#necessary to implement `download_data()` [see below]
import requests
from pathlib import Path
import time
#standard data analysis libraries
import numpy as np 
import pandas as pd 
#imports for displaying, rendering, and saving plots and visualizations
import plotly
import plotly.express as px
from IPython.display import *
import plotly.io as pio
import ipywidgets as widgets
from ipywidgets import *
#for dependency if not on datahub
# >>>%pip install openpyxl

Function Fun

download_data

The download_data function will get data from data_url on the Internet and saves it so we can use it for analysis!

This is a good practice to get into so that your future research or projects are reproducible. Essentially, if you gave someone else a copy of this notebook, they should be able to run all of the cells and get the same result. So, if we have a cell that downloads the data when run, people don’t even need access to the data! This is also great because if the data changes or is updated, we don’t have to update the notebook, it will do so automatically (woahhh how cool)!!

Please don’t worry about understanding this function, it’s rather complex, even for those with a moderate amount of data science experience. You can just use this function and trust it works.

def download_data(data_url, 
                  file):
    file_path = Path(file)
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    with file_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
    return file_path
#treat this cell as magic, just run it
funcs = download_data("https://tinyurl.com/island-functions",
                      "funcs.py")
from funcs import *

show

The function .show() below will allow us to freely examine this data! You don’t need to understand how it works, just read the following properties:

CodeResult
your_table.show(True)shows the entire DataFrame
your_table.show(columns = ['x', 'y'])shows the first 20 rows of columns x and y
your_table.show(rows = 18)shows the first 18 rows
your_table.show(rows = -18)shows the last 18 rows
your_table.show(start = 25, rows = 18)shows the first 18 rows that come after row 25

In the above table, your_table represents whatever table you are working in

Examples of .show()

Here, we’re using a small dataset to show what show can do (pun-intended). Don’t worry about understanding the data, just look at the resulting tables

airports = download_data("https://raw.githubusercontent.com/vega/vega-datasets/next/data/airports.csv", "airports.csv")
print(airports)
airports = pd.read_csv(airports)
airports.show(rows = 3, 
              title = ["First 3 rows", ["b"]])
airports.show(rows =3, columns = ['name', 'iata'],
             title = ['First 3 rows with columns "name" and "iata" (in that order)', ["b"]])
airports.show(rows = 3, start = 101,
             title = ['3 rows starting with 101', ["b"]])

Import and Tidy-Up the Data

Here, we’ll use the function above (download_data), to get the data we want.

The data source we’ll be looking at is a Microsoft Excel File, or .xls for short, which some of you may know a little bit about. One advantage that an excel file has over other data storage methods (such as .csv files) is that it can have separate “sheets”, which are each are their own table/data frame.

To ensure we can work with this data, we read the data which we call island_data twice, each time with the pd.read_excel function (oh how creative computer scientists are at naming things).

The first time we’ll load the 0th page, and then the 1st (in non-data-science-terms the 1st and 2nd sheets), using the sheet_name argument.

We call these seedlings and trees for reasons that will soon become evident.

island_data = "https://cal-icor.github.io/textbook.data/ucb/espm-109/island.xlsx"
seedlings = pd.read_excel(island_data, sheet_name = 0, engine="openpyxl")
trees = pd.read_excel(island_data, sheet_name = 1, engine="openpyxl")

Now, lets explore this data a bit! Below we’ll look at what we just downloaded!

Seedlings

The seedlings DataFrame has 275 rows and 5 columns (as noted at the bottom left of the data) Each row represents one collection/examination. The columns are:

Seedlings Columns
TitleMeaningType
Transect_IDa uniqe ID for each samplestr
P. grandis Count Seedingscount of grandis seedingsint
C.nucifera Count Seedingscount of nucifera seedingsint
YearThe year which the sample was takenint
Forest TypeThe type of forest the same was taken instr
seedlings.show(rows = 3)

Now we want to clean the data up. We won’t use the Transect_ID column, so lets drop it.

While we’re at it lets shorten some of the column names. Here’s a list of what we’ll change:

New Seedlings Column Labels
OldNew
Transect_IDREMOVED
P. grandis Count SeedingsPisona Count
C.nucifera Count SeedingsCoco Count
YearYear
Forest TypeForest

Below, we’ll use the .drop method to drop the first column. We’ll then set the table’s column labels (seedlings.columns) equal to a list of new labels. You can view the result of each operation by running the cell.

if 'Transect_ID' in seedlings.columns:
    # this if statement allows you to run this cell
    # multiple times without encountering an error
    seedlings = seedlings.drop(columns="Transect_ID")
seedlings.show(rows = 3, 
               title = 'Removed <code>Transect_ID</code>')
seedlings.columns = ['Pisona Count', "Coco Count", "Year", "Forest"]
seedlings.show(rows = 3, title = 'Relabeled Columns')

Trees

The second table we’ll look at is called trees. It has 245 rows and 4 columns (again as shown in the same location). Each row here represents a count of certain trees, per a set area. The columns and their meanings are in the table below

Trees Columns
TitleMeaningType
TreeIDa uniqe ID for each samplestr
YearThe year which the sample was takenint
Count Seedings (0 - 2m)count of grandis seedingsint
SPECIESThe species of tree representedstr

For our purposes, the column TreeID is irrelevant, so we can drop it in the same fashion as before. Lets also relabel the columns as such:

New Trees Column Labels
OldNew
TreeIDREMOVED
YearYear
Count Seedings (0 - 2m)Count
SPECIESTree
if "TreeID" in trees.columns:
    # this if statement allows you to run this cell
    # multiple times without encountering an error
    trees = trees.drop(columns = "TreeID")
trees.show(rows = 3, title = "Removed <code>TreeID</code> Column")
trees.columns = ["Year", "Count", "Tree"]
trees.show(rows = 3, title = "Relabeled Columns")

Next Steps

Now we’ve got our cleaned tables! Check them out below

seedlings.show(rows = 3, title = "Seedlings")
trees.show(rows = 3, title = "Trees")

rats_present

Hmmm...it looks like we’re missing something.

We still don’t know when the rats were eradicated; good thing we can figure that out! We know that by the year 2011, rats were eradicated from the island. Lets use that to write a function to add to our tables!

def add_rats(table):
    def rats_present(x):
        if x > 2010:
            return "No"
        else:
            return "Yes"
    if "Rats" in table.columns:
        show("Rats already exists!")
    else:
        table["Rats"] = table["Year"].map(rats_present)
        show("Rats added")
    return table
for data in [seedlings, trees]:
    data = add_rats(data)

first

Don’t worry about the implementation of this function, but it returns the first value of an iterable (or list-like) object.

def first(series):
    return series.iloc[0]

The Figures

Here are some figures that we got from the original research paper (where we got the data from). Later, you’ll replicate them and you can decide who’s you like more!

image.png image-2.png image-3.png

Function Fun 2: Agregating and Plotting

Agregating Functions

tree_type_table

This function takes in a DataFrame and a string, and does all of the following, each on one line:

def tree_type_table(table: pd.DataFrame, name: str):
    tree = (table.query(f"Tree == '{name}'")
              .groupby("Year")
              .agg({"Count": "sum", "Rats":first})
              .reset_index())
    return tree

triple_seedling_table

This function is a bit more complicated, so we’ll go though it quickly; if you’re interested, you’re more than welcome to try to figure it out, but it is more advanced than the majority of this notebook. triple_seedlings_table does the following:

def triple_seedling_table(table: pd.DataFrame, genus: str):
    main = (table[[f'{genus} Count','Year', 'Forest', 'Rats']]
                     .rename(columns = {f'{genus} Count': "Count"}))
    
    forest = (main.drop(columns="Rats")
                   .query("Year == 2016"))
    
    rats = (main.drop(columns="Forest")
                .groupby("Year")
                .agg({"Count": "sum", "Rats": first})
                .reset_index())
    return main, forest, rats

Plotting Functions

plot_rats

You don’t need to worry about the implementation of this function, but it will give you a nicely formatted graph given a Table/DataFrame. This function should ideally have one column named 'Year', one named 'Count', and one named 'Rats'. If it has these three functions you can simply call the following:

plot_rats(data_frame)

If your data frame doesn’t have all of these, you’ll need to call the function as so:

plot_rats(data = data_frame, x = x axis, y = y axis, color = group column)

where all of the values right of the = are filled in by you. You may also optionally include the argument:

title = "Your Title"

The implementation of the function can be found below, and if you’re interested, you can read the documentation of the pandas and plotly modules which make this function work.

def plot_rats(data, x = None, y = None, color = None, title = None):
    cols = pd.Series(['Year', 'Count', 'Rats'])
    present = cols.isin(data.columns).all()
    if present:
        fig = px.bar(data_frame = data, x='Year', y="Count", color = 'Rats',
                       color_discrete_sequence =['#a63932', '#003262'], 
                       title = title) #, width = 1000, height = 500)
    else:
        fig = px.bar(data_frame = data, x=x, y=y, color = color,
                       color_discrete_sequence =['#a63932', '#003262'], 
                       title = title) #, width = 1000, height = 500)
    fig.update_xaxes(type = 'category')
    fig.show()

plot_forest

This function functions in a very similar fashion. It takes in a dataframe, and outputs an interactive bar graph!

def plot_forest(df, title = None):
    grouped = df.groupby(["Year", "Forest"]).sum().reset_index().drop(columns="Year")
    plot = px.bar(data_frame = grouped, x = "Forest", y = "Count", color = "Forest", 
      color_discrete_sequence = ["#823e2b",'#85683c', "#3e853c"], title = title) #, width = 1000, height = 500
    plot.show()

Results

The long awaited end has arrived! Theres no more work for you to do in this section. Simply run the cell below to see what all of your hard work has created. Go you!

Agregating Results

Play Around a bit with the cell below after running it to view the tables you just made!

@interact(Function = widgets.Dropdown(options = ["tree_type", "seedlings"]))
def the_tables(Function):
    if Function == "tree_type":
        @interact(Tree = widgets.Dropdown(options = ['Barr', 'Calo', 'Cord', 'Guet', 'Hern']))
        def show_me_tree_type(Tree):
            tree_type_table(trees, Tree).show()
    else:
        @interact(Tree = widgets.Dropdown(options = ['Pisona', "Coco"]),
                  Rows = widgets.BoundedIntText(value = 10, max = 30, min = 1))
        def show_me_seedlings(Tree, Rows):
            a, b, c = triple_seedling_table(seedlings, Tree)
            for df in [a, b]:
                df.show(rows = Rows)

Aggregate the Tables

This cell is just nicely formatted and supplies all the tables from the data, using only your functions!

(barrs, calos, cords, guets, herns,
(pisonia, pisona_forest, pisona_rats), 
(coco, coco_forest, coco_rats))          = (tree_type_table(trees, 'Barr'), 
                                            tree_type_table(trees, 'Calo'), 
                                            tree_type_table(trees, 'Cord'), 
                                            tree_type_table(trees, 'Guet'), 
                                            tree_type_table(trees, 'Hern'), 
                                            triple_seedling_table(seedlings, "Pisona"), 
                                            triple_seedling_table(seedlings, "Coco"))
tree_dict = {"Plot Forests": {"Pisona": pisona_forest, "Coco": coco_forest}, 
             "Plot Rats": {"Barr": barrs, "Calo": calos, "Cord": cords, "Guet": guets, "Hern": herns, "Pisona": pisona_rats, "Coco": coco_rats}}

Plotting Results

Use the cell below to view some of the graphs you just made!

@interact(show_all = widgets.ToggleButton(value=False, description='Show All Tables', icon = "eye", button_style = "success"), 
          Function = widgets.Dropdown(options = ["Plot Forests", "Plot Rats"]))
def the_plots(show_all, Function):
    if show_all:
        frst_lbl, sdlgs_lbl  = iter(['Cocos', "Pisonas"]), iter(["Barrs", "Calos", "Cords", "Guets", "Herns", "Pisonas", "Cocos"])
        for data in (coco_forest, pisona_forest):
            plot_forest(data, title = next(frst_lbl))
        for data in (barrs, calos, cords, guets, herns, pisona_rats, coco_rats):
            plot_rats(data, title = next(sdlgs_lbl))
    elif Function == "Plot Rats":
        @interact(Tree = widgets.Dropdown(options = ['Barr', 'Calo', 'Cord', 'Guet', 'Hern']))
        def show_me_rats(Tree):
            plot_rats(tree_dict[Function][Tree], title = Tree)
    else:
        @interact(Tree = widgets.Dropdown(options = ['Pisona', "Coco"]))
        def show_me_forest(Tree):
            plot_forest(tree_dict[Function][Tree], title = Tree)