In this homework we will review different methods of testing hypotheses. These methods depend on the kinds of data we are working with: categorical or numeric. To start, let’s consider when we have quantities of categorical variables.
Chi-Squared Statistic
Before we begin, we need to import several libraries:
from datascience import Table
import scipy.stats as stats
import numpy as np
Before we begin, let’s review: what is the purpose of calculating the chi-squared statistic?
YOUR ANSWER HERE
In this example, we’re going to use the relationship between union membership and 2016 presidential vote choice. Below, we have a table representing the number of voters in some poll who voted for Clinton or Trump, separated by union membership status.
clinton_trump_union = Table().with_columns([
'Candidate', ['Clinton', 'Trump', 'Column Total'],
'Not Union', [1068, 1019, 2087],
'Union', [218, 157, 375],
'Row Total', [1286, 1176, 2462]
])
clinton_trump_union
What are your initial impressions of how the data are distributed?
TYPE YOUR ANSWER HERE
In order to determine if there is a relationship between union membership and presidential vote choice, we need to compare the real data to data generated under the null hypothesis. Under the null hypothesis, the two variables are unrelated, so each cell’s expected count is its column total multiplied by the candidate’s share of the overall vote. Therefore, we need a table where each cell value is
$$E_{ij} = \text{column total}_j \times \frac{\text{row total}_i}{\text{grand total}}$$
The first step is to calculate what percentage of the vote each candidate received overall (expressed as a proportion):
clinton_percent = 1286/2462
trump_percent = ...
clinton_percent, trump_percent
Now we can create lists representing the expected values for the “Not Union” and “Union” columns. As we saw above, each expected value equals the column total multiplied by the percent of the overall vote that the candidate received.
NOTE: In order to keep the Column Total row, each list will need to have the column total at the end of the list, i.e., [clinton_expected, trump_expected, column_total]
not_union_expected = [2087*clinton_percent, 2087*trump_percent, 2087]
union_expected = ...
not_union_expected, union_expected
Now that we have the expected values for each column, we can create a table to visualize the expectations for vote counts if there is no relationship.
clinton_trump_expected = Table().with_columns([
'Candidate', ['Clinton', 'Trump', 'Column Total'],
'Not Union', not_union_expected,
'Union', union_expected,
'Row Total', [1286, 1176, 2462]
])
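clinton_trump_expected
How do we calculate the chi-squared statistic? Recall the formula:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is each observed count, E is the corresponding expected count, and the sum runs over every cell of the table (excluding the totals).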
Using our coding knowledge, we know we can solve this using lists of the expected and observed values. First, let’s create observed_votes and expected_votes, lists where the first value is the value of Clinton votes for non-union voters, the second is the value of Trump votes for non-union voters, etc.
NOTE: We need to wrap these lists in np.array in order to perform the calculations outlined above. This is accomplished with np.array(list).
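To see why the wrapping matters, here is a minimal sketch using made-up counts (the numbers and variable names are hypothetical, not taken from the poll). Plain Python lists cannot be subtracted from one another, while NumPy arrays are subtracted position by position.

```python
import numpy as np

# Hypothetical counts, not taken from the poll above.
observed_list = [10, 20, 30]
expected_list = [12, 18, 30]

# Plain lists do not support element-wise subtraction.
try:
    observed_list - expected_list
    list_subtraction_works = True
except TypeError:
    list_subtraction_works = False

# Arrays subtract position by position.
observed = np.array(observed_list)
expected = np.array(expected_list)
differences = observed - expected   # array([-2,  2,  0])
squared = differences ** 2          # array([4, 4, 0])
```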
observed_votes = np.array([1068, 1019, 218, 157])
expected_votes = ...
observed_votes, expected_votes
Now that we have the values as arrays, we can calculate the statistic. Some important reminders:
np.array() arrays support element-wise arithmetic: we can add or subtract two arrays of the same length just as we would single numbers.
To square a value, we use **. For example, 3**2 = 9.
To add all values, use sum(values). For example, sum([1,2,3]) = 6.
NOTE: The correct value is 6.172092136005482, but we want you to calculate this using Python
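As a rough sketch of the whole calculation, the steps above can be run end to end on a small made-up table. The counts below are hypothetical, and using np.outer to build the expected counts is one possible shortcut rather than the approach this homework asks for.

```python
import numpy as np

# Hypothetical 2x2 table of counts (rows: outcome, columns: group).
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)      # [40, 60]
column_totals = observed.sum(axis=0)   # [50, 50]
grand_total = observed.sum()           # 100

# Expected count for each cell: row total * column total / grand total.
expected = np.outer(row_totals, column_totals) / grand_total

# Sum of (observed - expected)^2 / expected over all four cells.
chi_squared_example = ((observed - expected) ** 2 / expected).sum()
```

For this made-up table every cell differs from its expected count by 10, so the statistic works out to 100/20 + 100/20 + 100/30 + 100/30.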
chi_squared = ...
chi_squared
We can now look up that value in a chi-squared table. You can find such a table online or in your textbook. To determine the p-value for the relationship between union membership and vote choice, we must first calculate the degrees of freedom. How many degrees of freedom does this relationship have?
YOUR ANSWER HERE
Now that we have the degrees of freedom, we can use the table to calculate the p-value.
What is the p-value?
What does that tell us about the relationship between the two variables?
Can we establish a causal connection?
YOUR ANSWERS HERE
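A table lookup can be cross-checked in code. The sketch below assumes a hypothetical 3x3 table and a made-up statistic of 9.49, and uses the chi-squared survival function from scipy.stats (imported at the top of this homework as stats). Degrees of freedom count only the category rows and columns, not the totals.

```python
from scipy import stats

# Hypothetical example: a chi-squared statistic of 9.49 from a 3x3 table.
n_rows, n_columns = 3, 3
degrees_of_freedom = (n_rows - 1) * (n_columns - 1)  # 4

example_statistic = 9.49

# Survival function: P(chi-squared >= statistic) if the null hypothesis is true.
p_value = stats.chi2.sf(example_statistic, degrees_of_freedom)
```

With these made-up inputs the p-value lands very close to 0.05, which matches the 5% critical value usually printed for 4 degrees of freedom.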
Difference of Means
When we want to compare a continuous, numeric variable across different categories, e.g., sales by product, we use a difference of means test. This allows us to compare the different categories, and determine the likelihood that the distributions of data from each category are from the same underlying distribution (the null hypothesis). In the example for this homework, you will be testing whether party control of the legislature has an impact on government longevity. Specifically, we want to see if there is a relationship between whether the minority party is in control (a categorical variable) and the length of time that the government lasts (a continuous variable).
To start, we need to load in our data. To do that, we use the Table.read_table() method:
govt_table = Table.read_table('https://cal-icor.github.io/textbook.data/ucb/pols-3/govts.csv')
govt_table
We need to split this table into two parts: the rows where the party in control was the minority (mingov = 1), and the rows where the party in control was not the minority (mingov = 0). To do this, we use the table.where() selector. Below, we create the minority party government table.
minority_table = govt_table.where('mingov', 1) #selects the rows where mingov = 1
minority_table
Now, repeat the above but for when the party in control was not the minority:
majority_table = ... #selects the rows where mingov = 0
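majority_table
Great! Now that we have the data separated by our categorical variable, we can calculate the relevant t-statistic:
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{SE(Y_1, Y_2)}$$
In our case, this becomes:
$$t = \frac{\bar{Y}_{\text{minority}} - \bar{Y}_{\text{majority}}}{SE(Y_{\text{minority}}, Y_{\text{majority}})}$$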
To perform this calculation, we need the mean and standard deviation of the two distributions of government longevity. These can be calculated using np.std() and np.mean(). Below, we have derived these measurements for the majority-party-controlled government. It is your responsibility to repeat this for the governments run by minority parties.
majority_time = majority_table.column('govttime')
majority_std = np.std(majority_time)
majority_mean = np.mean(majority_time)
majority_time, majority_std, majority_mean
minority_time = ...
minority_std = ...
minority_mean = ...
minority_time, minority_std, minority_mean
Great! We next need to calculate the difference of the means:
mean_difference = ...
mean_difference
Next, calculate the standard error of the difference of means, denoted SE(Y_1, Y_2). To find the relevant n_i, use len(time_list) for each list of times.
se_minority_majority = ...
se_minority_majority
Now that we have these two values, divide the mean difference by the standard error of the difference between the means to produce the t-statistic.
t_stat = ...
t_stat
We can find the corresponding p-value using a t-table, much as we used the chi-squared table above; you can find one in your textbook or online.
Review:
What is our null hypothesis?
At what significance level can we reject the null hypothesis?
What does this rejection mean? What knowledge do we now have?
Can we establish a causal relationship?
YOUR ANSWER HERE
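The difference-of-means steps in this section can be sketched on made-up data. Everything below is hypothetical: the two groups are invented, and the standard error uses the unpooled form sqrt(s1^2/n1 + s2^2/n2), which is an assumption on our part. If your textbook uses a pooled standard error, substitute that formula instead.

```python
import numpy as np

# Hypothetical durations for two groups (not the government data).
group_1 = np.array([10, 12, 14, 16])
group_2 = np.array([8, 9, 10, 11, 12])

mean_diff = np.mean(group_1) - np.mean(group_2)

# ASSUMPTION: unpooled standard error, sqrt(s1^2/n1 + s2^2/n2).
# np.std defaults to the population formula (ddof=0), as in the cells above.
se_diff = np.sqrt(np.std(group_1) ** 2 / len(group_1)
                  + np.std(group_2) ** 2 / len(group_2))

t_example = mean_diff / se_diff
```

Here the group means are 13 and 10, the variances are 5 and 2, and so the standard error is sqrt(5/4 + 2/5).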
Continuous Variables: Covariance
When dealing with two continuous variables, we want to measure how much change (variation) in one variable coincides with variation in another (thus, covariation). In this example, we will consider the relationship between incumbent vote percentage and GDP growth. In the table below, the “VOTE” column represents the incumbent vote share and the “GROWTH” column represents GDP growth.
inc_gdp_table = Table.read_table('https://cal-icor.github.io/textbook.data/ucb/pols-3/fair.csv')
inc_gdp_table
Let’s set two variables, votes and growths, using table.column('column_name'). Additionally, let’s set a value n equal to the number of observations, or the length of one of the columns. This can be determined with len(list_name).
votes = inc_gdp_table.column('VOTE')
growths = ...
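n = len(votes)
In order to calculate the covariance between two variables, we use the formula
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$
In our case, this becomes:
$$\text{cov}(\text{growth}, \text{vote}) = \frac{\sum_{i=1}^{n} (\text{growth}_i - \overline{\text{growth}})(\text{vote}_i - \overline{\text{vote}})}{n}$$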
First, let’s calculate the means for each variable:
votes_mean = np.mean(votes)
growth_mean = ...
votes_mean, growth_mean
Now, let’s calculate the covariance. Remember, we can subtract a value from each member of an array by writing array_name - value.
covariance_growth_vote = sum((growths-growth_mean)*(votes-votes_mean))/n
covariance_growth_vote
Now that we have the covariance between the two variables, we can calculate Pearson’s r, a value representing the linear correlation between two variables. This is done with the following formula:
$$r = \frac{\text{cov}(X, Y)}{\sqrt{\text{cov}(X, X)\,\text{cov}(Y, Y)}}$$
To start, let’s create variables representing the covariance of each variable with itself. This follows the same formula as the covariance above, except that each variable’s differences from its mean are multiplied by themselves rather than by the other variable’s differences. This is equivalent to finding the variance of the variable.
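As a quick check of that equivalence, the sketch below uses a made-up array (hypothetical values, not the election data) and compares the covariance of a variable with itself against NumPy's np.var, which also divides by n by default.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])  # hypothetical data
n_values = len(values)
values_mean = np.mean(values)

# Covariance of a variable with itself, dividing by n.
self_covariance = sum((values - values_mean) * (values - values_mean)) / n_values

# np.var uses the same divide-by-n definition unless ddof is changed.
variance_numpy = np.var(values)
```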
covariance_growth_growth = ...
covariance_vote_vote = ...
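covariance_growth_growth, covariance_vote_vote
Now that we have these values, we can find the value for r, the correlation coefficient. This will be of the form
$$r = \frac{\text{cov}(\text{growth}, \text{vote})}{\sqrt{\text{cov}(\text{growth}, \text{growth})\,\text{cov}(\text{vote}, \text{vote})}}$$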
r = ...
r
In order to determine if this sample r value is statistically significant, we need to calculate a t-statistic, which is found thus:
$$t_r = r \sqrt{\frac{n - 2}{1 - r^2}}$$
In the cell below, calculate a t-statistic.
t_r = r * np.sqrt((n - 2) / (1 - r**2))  # t-statistic for r, with n - 2 degrees of freedom
t_r
Now that we have the t-statistic, we can determine the p-value, the likelihood that the r value we found could have happened by chance.
What does the p-value represent?
What can we determine about the relationship between the two variables?
Can we declare a causal relationship?
TYPE YOUR ANSWER HERE
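The covariance, correlation, and t-statistic steps in this section can be sketched together on made-up data. The arrays below are hypothetical, the t-statistic assumes the standard form t = r * sqrt((n - 2) / (1 - r^2)), and np.corrcoef is used only as an independent cross-check of r.

```python
import math
import numpy as np

# Hypothetical paired observations (not the election data).
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n_obs = len(x)

# Covariances, dividing by n.
cov_xy = sum((x - x.mean()) * (y - y.mean())) / n_obs
cov_xx = sum((x - x.mean()) ** 2) / n_obs
cov_yy = sum((y - y.mean()) ** 2) / n_obs

r_example = cov_xy / math.sqrt(cov_xx * cov_yy)

# Cross-check against NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

# ASSUMPTION: t-statistic for r with n - 2 degrees of freedom.
t_example = r_example * math.sqrt((n_obs - 2) / (1 - r_example ** 2))
```

Because the n in the numerator and denominator of r cancels, the result matches np.corrcoef even though that function divides by n - 1 internally.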
Written HW
Question 1: True or False, with a one sentence justification for each:
Only a very small (<5) percentage of measurements can be more than 2 standard deviations from the mean of the normal distribution.
TYPE YOUR ANSWER HERE
Only a very small (<5) percentage of measurements can be more than 2 standard deviations from the mean of any distribution.
TYPE YOUR ANSWER HERE
If we would reject a null hypothesis at the 5% level, we would also reject it at the 1% level.
TYPE YOUR ANSWER HERE
If we would reject a null hypothesis at the 1% level, we would also reject it at the 5% level.
TYPE YOUR ANSWER HERE
The p-value, which is the Type I error rate, is chosen by the investigator before a hypothesis test is conducted.
TYPE YOUR ANSWER HERE
Question 2: For the following questions, use the below scenario:
A recent poll asked people whether they supported passing a constitutional amendment
to ban burning of the national flag using a 1 to 100 scale. “1” means that they do not
support passing the amendment at all and “100” means they support it completely. The
sample was random (from the population of US adults) and included 986 people, 400 of
whom were women, and 586 were men. The mean score for women was 80.6, while the
mean score for men was 77.6.
Is the difference in means test statistically significant at the 5% level? Assume the standard deviation for both men and women is 3.5.
Use the coding cell below to perform any calculations you may need to determine this.
## YOUR ANSWER HERE
gender_flag_tstat = ...
gender_flag_tstat
Is the difference of means statistically significant?
YOUR ANSWER HERE
If the standard deviation for men is 1.4 and the standard deviation for women is 1.2, is the difference in means statistically significant at the 5% level?
Use the coding cell below to perform any calculations you may need to determine this.
## YOUR ANSWER HERE
gender_flag_tstat_small_stddev = ...
gender_flag_tstat_small_stddev
Is the difference of means statistically significant?
YOUR ANSWER HERE
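When a question gives only summary statistics, the same t-statistic can be computed without the raw data. The helper below is a hypothetical sketch: the function name is ours, the inputs are made up and deliberately differ from the poll above, the standard error is the unpooled form, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math

def t_from_summary(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Hypothetical helper: t-statistic from group summaries (unpooled SE)."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Made-up summaries: means 52 and 50, standard deviation 10, 100 people per group.
t_example = t_from_summary(52.0, 10.0, 100, 50.0, 10.0, 100)

# Two-sided p-value from the normal approximation (ASSUMPTION: large samples).
p_example = math.erfc(abs(t_example) / math.sqrt(2))
```

With these made-up inputs the t-statistic is about 1.41 and the two-sided p-value is about 0.157, so this hypothetical difference would not be significant at the 5% level.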
Question 3: The following table is based on a random sample conducted of high school seniors and their parents by Jennings and Niemi, in which they explore the party identification of parents and their children.
student_id_table = Table().with_columns(
'Parent Party ID', ['Democrat', 'Independent', 'Republican'],
'Democrat', [604, 130, 63],
'Independent', [245, 235, 180],
'Republican', [67, 76, 252]
)
student_id_table
What is the percentage of students who share the same party identification as their parents? Show your work in the cell below.
## YOUR ANSWER HERE
sample_size = sum(student_id_table.column('Democrat'))+sum(student_id_table.column('Independent'))+sum(student_id_table.column('Republican'))
democrat_same = 604
independent_same = ...
republican_same = ...
same_percent = ...
same_percent
What percentage of Democrat parents have Republican children?
## YOUR ANSWER HERE
dem_parent_rep_child = .../sum([604,245,67])
dem_parent_rep_child
Based on these data, can we say if the relationship is causal? Explain your answer.
TYPE YOUR ANSWER HERE
Suppose you were exploring the hypothesis that there is a relationship between parents’ and children’s party identification. Would we be correct in inferring that such a relationship also exists in the population? Explain your answer.
TYPE YOUR ANSWER HERE
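Shares like the ones in this question can also be read off a cross-tab stored as a NumPy array. The sketch below uses a hypothetical 2x2 table (rows are parent categories, columns are child categories); the counts are made up and the variable names are ours.

```python
import numpy as np

# Hypothetical cross-tab: rows are parent categories, columns are child categories.
crosstab = np.array([[30, 10],
                     [5, 55]])

total = crosstab.sum()  # 100

# Matching parent/child pairs sit on the diagonal.
share_matching = np.trace(crosstab) / total  # 0.85

# Share of first-row parents whose child falls in the second column.
share_row0_col1 = crosstab[0, 1] / crosstab[0].sum()  # 0.25
```

Note that the second share divides by the row total rather than the grand total, because it is conditional on the parent's category.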
Question 4: Tens of thousands of people die each year on America’s highways. How can the number of fatalities be reduced? In this question we explore trends in traffic fatalities across the 50 states and potential solutions to the problem. Consider the table below, indicating summary statistics of the number of traffic deaths per million miles driven across all 50 states, in 1985 and 1992. (Assume that these observations represent random samples of all years.)
traffic_table = Table().with_columns(
'Variable', ['traffic_1985', 'traffic_1992'],
'Observations', [50,50],
'Mean', [2.694, 1.844],
'Standard Deviation', [.6079104, .449108],
'Minimum Value', [1.9, 1],
'Maximum Value', [4.4, 2.7]
)
traffic_table
What is the average change across states between 1985 and 1992 in traffic deaths per million miles driven?
## YOUR ANSWER HERE
mean_traffic_death_diff = ...
mean_traffic_death_diff
How likely is it that we would observe the average change found in part a) if the true difference in the population were actually zero? Show your work in the cell below.
## YOUR ANSWER HERE
traffic_year_diff_mean_tstat = ...
traffic_year_diff_mean_tstat
Can we reasonably claim that there is a difference in traffic fatalities between 1985 and 1992? (Again, the samples can be treated as random.) Explain.
YOUR ANSWER HERE
Saving Your Notebook
Now that you’ve finished the homework, we need to save it! To do this, click File > Download as > PDF via Chrome.