Hypothesis Testing

using computer simulation. Based on examples from the infer package. Code for Quiz 13.

Load the R package we will use.

Question: t-test

set.seed(123)
hr  <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv", 
                col_types = "fddfff") 
skim(hr)
Table 1: Data summary
Name hr
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 4
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 fem: 260, mal: 240
evaluation 0 1 FALSE 4 bad: 153, fai: 142, goo: 106, ver: 99
salary 0 1 FALSE 6 lev: 93, lev: 92, lev: 91, lev: 84
status 0 1 FALSE 3 fir: 185, pro: 162, ok: 153

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 40.60 11.58 20.2 30.37 41.00 50.82 59.9 ▇▇▇▇▇
hours 0 1 49.32 13.13 35.0 37.55 45.25 58.45 79.7 ▇▂▃▂▂

The mean hours worked per week is: 49.3

hr  %>% 
  specify(response = hours)
Response: hours (numeric)
# A tibble: 500 x 1
   hours
   <dbl>
 1  36.5
 2  55.8
 3  35  
 4  52  
 5  35.1
 6  36.3
 7  40.1
 8  42.7
 9  66.6
10  35.5
# … with 490 more rows

hypothesize that the average hours worked is 48

hr  %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
   hours
   <dbl>
 1  36.5
 2  55.8
 3  35  
 4  52  
 5  35.1
 6  36.3
 7  40.1
 8  42.7
 9  66.6
10  35.5
# … with 490 more rows

generate 1000 replicates representing the null hypothesis

hr %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  generate(reps = 1000, type = "bootstrap") 
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups:   replicate [1,000]
   replicate hours
       <int> <dbl>
 1         1  33.7
 2         1  34.9
 3         1  46.6
 4         1  33.8
 5         1  61.2
 6         1  34.7
 7         1  37.9
 8         1  39.0
 9         1  62.8
10         1  50.9
# … with 499,990 more rows

The output has 500,000 rows


‘Calculate’ the distribution of statistics from the generated data - Assign the output null_t_distribution - Display null_t_distribution

null_t_distribution  <- hr  %>% 
  specify(response = age)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  generate(reps = 1000, type = "bootstrap")  %>% 
  calculate(stat = "mean")

null_t_distribution
# A tibble: 1,000 x 2
   replicate  stat
 *     <int> <dbl>
 1         1  48.4
 2         2  47.6
 3         3  48.7
 4         4  47.9
 5         5  47.4
 6         6  48.2
 7         7  47.5
 8         8  48.4
 9         9  48.5
10        10  48.5
# … with 990 more rows

‘visualize’ the simulated null distribution

visualize(null_t_distribution)


‘calculate’ the statistic from your observed data

observed_t_statistic <- hr  %>%
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>%
  calculate(stat = "t")

observed_t_statistic
# A tibble: 1 x 1
   stat
  <dbl>
1  2.25

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  get_p_value(obs_stat = observed_t_statistic , direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1       0

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")

If the p-value < 0.05? yes (yes/no)
Does your analysis support the null hypothesis that the true mean number of hours worked was 48? no (yes/no)

Question: 2 samples t-test

‘hr_1_tidy.csv’ is the name of your data subset

hr_2 <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv", 
                col_types = "fddfff") 

‘Q: Is the average number of hours worked the same for both genders in hr_2?’

hr_2 %>% 
  group_by(gender)  %>% 
  skim()
Table 2: Data summary
Name Piped data
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 3
numeric 2
________________________
Group variables gender

Variable type: factor

skim_variable gender n_missing complete_rate ordered n_unique top_counts
evaluation female 0 1 FALSE 4 fai: 81, bad: 71, ver: 57, goo: 51
evaluation male 0 1 FALSE 4 bad: 82, fai: 61, goo: 55, ver: 42
salary female 0 1 FALSE 6 lev: 54, lev: 50, lev: 44, lev: 41
salary male 0 1 FALSE 6 lev: 52, lev: 47, lev: 46, lev: 39
status female 0 1 FALSE 3 fir: 96, pro: 87, ok: 77
status male 0 1 FALSE 3 fir: 89, ok: 76, pro: 75

Variable type: numeric

skim_variable gender n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age female 0 1 41.78 11.50 20.5 32.15 42.35 51.62 59.9 ▆▅▇▆▇
age male 0 1 39.32 11.55 20.2 28.70 38.55 49.52 59.7 ▇▇▆▇▆
hours female 0 1 50.32 13.23 35.0 38.38 47.80 60.40 79.7 ▇▃▃▂▂
hours male 0 1 48.24 12.95 35.0 37.00 42.40 57.00 78.1 ▇▂▂▁▂

Use ‘geom_boxplot’ to plot distributions of hours worked by gender

hr_2 %>% 
  ggplot(aes(x = gender, y = hours)) + 
  geom_boxplot()


‘specify’ the variables of interest are ‘hours’ and ‘gender’

hr_2 %>% 
  specify(response = hours, explanatory = gender)
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  36.5 female
 2  55.8 female
 3  35   male  
 4  52   female
 5  35.1 male  
 6  36.3 female
 7  40.1 female
 8  42.7 female
 9  66.6 male  
10  35.5 male  
# … with 490 more rows

‘hypothesize’ that the number of hours worked and gender are independent

hr_2  %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours gender
   <dbl> <fct> 
 1  36.5 female
 2  55.8 female
 3  35   male  
 4  52   female
 5  35.1 male  
 6  36.3 female
 7  40.1 female
 8  42.7 female
 9  66.6 male  
10  35.5 male  
# … with 490 more rows

generate 1000 replicates representing the null hypothesis

hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours gender replicate
   <dbl> <fct>      <int>
 1  36.4 female         1
 2  35.8 female         1
 3  35.6 male           1
 4  39.6 female         1
 5  35.8 male           1
 6  55.8 female         1
 7  63.8 female         1
 8  40.3 female         1
 9  56.5 male           1
10  50.1 male           1
# … with 499,990 more rows

‘calculate’ the distribution of statistics from the generated data

 null_distribution_2_sample_permute <- hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  calculate(stat = "t", order = c("female", "male"))

null_distribution_2_sample_permute
# A tibble: 1,000 x 2
   replicate   stat
 *     <int>  <dbl>
 1         1 -0.208
 2         2 -0.328
 3         3 -2.28 
 4         4  0.528
 5         5  1.60 
 6         6  0.795
 7         7  1.24 
 8         8 -3.31 
 9         9  0.517
10        10  0.949
# … with 990 more rows

visualize the simulated null distribution

visualize(null_distribution_2_sample_permute)


‘calculate’ the statistic from your observed data

observed_t_2_sample_stat <- hr_2 %>%
  specify(response = hours, explanatory = gender)  %>% 
  calculate(stat = "t", order = c("female", "male"))

observed_t_2_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1  1.78

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  get_p_value(obs_stat = observed_t_2_sample_stat , direction = "two-sided")
# A tibble: 1 x 1
  p_value
    <dbl>
1       0

‘shade_p_value’ on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")


If the p-value < 0.05? no (yes/no)

Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? yes (yes/no)


Question: ANOVA

hr_3_tidy.csv is the name of your data subset

hr_anova <- read_csv("https://estanny.com/static/week13/data/hr_3_tidy.csv", 
                col_types = "fddfff") 

Q: Is the average number of hours worked the same for all three status (fired, ok and promoted) ?

use ‘skim’ to summarize the data in ‘hr_anova’ by ‘status’

hr_anova %>% 
  group_by(status)  %>% 
  skim()
Table 3: Data summary
Name Piped data
Number of rows 500
Number of columns 6
_______________________
Column type frequency:
factor 3
numeric 2
________________________
Group variables status

Variable type: factor

skim_variable status n_missing complete_rate ordered n_unique top_counts
gender promoted 0 1 FALSE 2 fem: 91, mal: 81
gender fired 0 1 FALSE 2 mal: 98, fem: 98
gender ok 0 1 FALSE 2 mal: 68, fem: 64
evaluation promoted 0 1 FALSE 4 goo: 79, ver: 52, fai: 21, bad: 20
evaluation fired 0 1 FALSE 4 bad: 77, fai: 64, ver: 30, goo: 25
evaluation ok 0 1 FALSE 4 fai: 53, bad: 51, goo: 18, ver: 10
salary promoted 0 1 FALSE 6 lev: 42, lev: 37, lev: 36, lev: 28
salary fired 0 1 FALSE 6 lev: 59, lev: 40, lev: 39, lev: 25
salary ok 0 1 FALSE 6 lev: 33, lev: 29, lev: 28, lev: 23

Variable type: numeric

skim_variable status n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age promoted 0 1 40.22 11.11 20.1 31.67 41.00 48.82 59.7 ▆▇▇▇▇
age fired 0 1 38.95 11.23 20.0 29.82 38.80 48.75 59.9 ▇▆▇▇▅
age ok 0 1 39.03 11.77 20.0 28.28 38.75 49.92 59.7 ▇▇▆▇▆
hours promoted 0 1 59.29 12.53 35.0 49.90 58.65 70.35 79.9 ▅▆▇▆▇
hours fired 0 1 42.37 9.15 35.0 36.20 39.20 43.80 79.6 ▇▁▁▁▁
hours ok 0 1 47.99 11.55 35.0 37.45 45.75 55.23 75.7 ▇▃▃▂▂

Use ‘geom_boxplot’ to plot distributions of hours worked by status

hr_anova %>% 
  ggplot(aes(x = status, y = hours)) + 
  geom_boxplot()


‘specify’ the variables of interest are ‘hours’ and ‘status’

hr_anova %>% 
  specify(response = hours, explanatory = status)
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  49.6 promoted
 2  39.2 fired   
 3  63.2 promoted
 4  42.2 promoted
 5  54.7 promoted
 6  54.3 fired   
 7  37.3 fired   
 8  45.6 promoted
 9  35.1 fired   
10  53   promoted
# … with 490 more rows

‘hypothesize’ that the number of hours worked and status are independent

hr_anova  %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
   hours status  
   <dbl> <fct>   
 1  49.6 promoted
 2  39.2 fired   
 3  63.2 promoted
 4  42.2 promoted
 5  54.7 promoted
 6  54.3 fired   
 7  37.3 fired   
 8  45.6 promoted
 9  35.1 fired   
10  53   promoted
# … with 490 more rows

‘generate’ 1000 replicates representing the null hypothesis

hr_anova %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups:   replicate [1,000]
   hours status   replicate
   <dbl> <fct>        <int>
 1  38.4 promoted         1
 2  41   fired            1
 3  66.1 promoted         1
 4  46.1 promoted         1
 5  65.1 promoted         1
 6  43.7 fired            1
 7  35.1 fired            1
 8  40.9 promoted         1
 9  38.4 fired            1
10  67.7 promoted         1
# … with 499,990 more rows

The output has 500000 rows


*‘calculate’ the distribution of statistics from the generated data**

null_distribution_anova  <- hr_anova %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  calculate(stat = "F")

null_distribution_anova
# A tibble: 1,000 x 2
   replicate     stat
 *     <int>    <dbl>
 1         1 0.00878 
 2         2 0.359   
 3         3 1.16    
 4         4 0.00524 
 5         5 2.25    
 6         6 0.259   
 7         7 2.61    
 8         8 0.0457  
 9         9 0.0830  
10        10 0.000339
# … with 990 more rows

‘visualize’ the simulated null distribution

visualize(null_distribution_anova)


‘calculate’ the statistic from your observed data

observed_f_sample_stat  <- hr_anova %>%
  specify(response = hours, explanatory = status)  %>% 
  calculate(stat = "F")

observed_f_sample_stat
# A tibble: 1 x 1
   stat
  <dbl>
1  110.

get_p_value from the simulated null distribution and the observed statistic

null_distribution_anova  %>% 
  get_p_value(obs_stat = observed_f_sample_stat , direction = "greater")
# A tibble: 1 x 1
  p_value
    <dbl>
1       0

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")


If the p-value < 0.05? yes (yes/no)

Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? no (yes/no)

ggsave(filename = "preview.png", 
       path = here::here("_posts", "2021-05-12-hypothesis-testing"))