Hypothesis Testing with Z-Test

Introduction
By their nature, statistical tests answer questions like:
- What is the probability that the observed difference is due to pure chance? (Q1)

For an actionable insight, a yes/no answer is often preferred. In abstract terms, we seek to answer another question:
- Based on the observations, can we reject the null hypothesis $H_0$? (Q2)

where the null hypothesis states that nothing beyond chance is going on. A concrete version of Q2 could be:
- Does the new website design improve the subscription rate by a given margin or more?

In this case, the null hypothesis states that it does not.
Before jumping into experiments, a couple of decisions need to be made. These are decisions about the tolerance we have for random chance messing with the test and leading us to wrong conclusions. They have to be set before the start of the test to avoid bias in interpreting the results. There are two types of error, so there are two thresholds to agree upon.
| Decision about Null Hypothesis | Null Hypothesis Actually is True | Null Hypothesis Actually is False |
|---|---|---|
| Don't Reject | Correct decision (true negative), probability = $1 - \alpha$ | Type 2 Error (false negative), probability = $\beta$ |
| Reject | Type 1 Error (false positive), probability = $\alpha$ | Correct decision (true positive), probability = $1 - \beta$ |
The first one is the probability threshold $\alpha$ for the Type 1 error: rejecting the null hypothesis when it is actually true. It is also known as the significance level, or sensitivity, of the test. The second one is the probability threshold $\beta$ for the Type 2 error: failing to reject the null hypothesis when it is actually false. The complement $1 - \beta$ is called the statistical power of the test.
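In terms of conditional probabilities, the two thresholds read:

$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}), \qquad \beta = P(\text{do not reject } H_0 \mid H_0 \text{ is false}).$$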
Example problem
To show how the Z-test works in practice, let's quantify the effect of a treatment on the probability of a certain outcome. Consider a toy example:

“Does the new website design improve the subscription rate by at least $\delta$ percentage points?”

We seek to answer the above question. In the simplest case, we assume that:
- For the experiment, customers are selected at random from a cohort that represents the target audience
- Customers make decisions independently of each other
- Customer traffic is randomly split into two groups, A and B, depending on which version of the website, new or old, they were presented with
- Each of the two groups has a sufficiently large number of customers (for the Central Limit Theorem to work)
Experiment setup
Under these assumptions, the probability distribution of the number of subscriptions in each group is given by the binomial distribution, with probabilities of a customer subscribing equal to $p_A$ and $p_B$ for groups A and B, respectively.
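Explicitly, for a group of $N$ customers with subscription probability $p$, the probability of observing $k$ subscriptions is

$$P(k) = \binom{N}{k} p^k (1 - p)^{N - k}.$$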
Step 1: Null Hypothesis
The null hypothesis states that $p_B - p_A < \delta$, so that the new website design has less than the desired effect on the subscription probability. In the language of mathematics, the null hypothesis $H_0$ and the alternative hypothesis $H_1$ read:

$$H_0\colon\; p_B - p_A < \delta, \qquad H_1\colon\; p_B - p_A \geq \delta.$$
Step 2: Sensitivity and Power
Next, we need to select values of sensitivity and power that we are comfortable with. Usually, $\alpha = 0.05$ and a power of $1 - \beta = 0.8$ are used.
These values come from the classical literature and aim to strike a balance between detecting the effect of the treatment when it is present and reducing the cost of collecting the data. For a more detailed discussion, see p. 17 in Statistical Power Analysis for the Behavioral Sciences and pp. 54-55 in The Essential Guide to Effect Sizes.
Step 3: Test statistic
If the number of customers $N_A$ and $N_B$ in the two groups is sufficiently large, the sampling distribution of the difference in subscription rates is approximately normal, and we can use the test statistic

$$Z = \frac{\hat{p}_B - \hat{p}_A - \delta}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}},$$

where $\hat{p}_A$ and $\hat{p}_B$ are the observed subscription rates in groups A and B, and $\hat{p}$ is the pooled subscription rate across both groups. Under the null hypothesis, $Z$ approximately follows the standard normal distribution. If the p-value for the test statistic is at or below $\alpha$, the null hypothesis is rejected.
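As a quick illustration, here is a minimal sketch of this computation with hypothetical observed counts (all numbers below are assumptions for illustration, not values from the experiment); the statsmodels `ztest` used later computes an equivalent statistic with a pooled variance estimate:

```python
import numpy as np
from scipy.stats import norm

# hypothetical observed data: 5000 customers per group,
# 500 and 610 subscriptions in groups A and B, respectively
N_A, N_B = 5000, 5000
k_A, k_B = 500, 610
delta = 0.02  # hypothesized minimal improvement (assumption)

p_hat_A, p_hat_B = k_A / N_A, k_B / N_B
p_hat = (k_A + k_B) / (N_A + N_B)  # pooled subscription rate
se = np.sqrt(p_hat * (1 - p_hat) * (1 / N_A + 1 / N_B))

Z = (p_hat_B - p_hat_A - delta) / se
p_value = 1 - norm.cdf(Z)  # one-sided ("larger") alternative
print(f"Z = {Z:.2f}, p-value = {p_value:.4f}")
```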
Step 4: Confidence Interval for test statistic
Based on the distribution of the test statistic, the critical value is

$$Z_c = \Phi^{-1}(1 - \alpha) \approx 1.64$$

for $\alpha = 0.05$ in a one-sided test. The corresponding cumulative distribution function value for this critical value is $\Phi(Z_c) = 1 - \alpha = 0.95$. The null hypothesis is rejected whenever the observed test statistic exceeds $Z_c$, which is equivalent to the p-value falling below $\alpha$.
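This value is easy to verify with scipy:

```python
from scipy.stats import norm

ALPHA = 0.05
Zc = norm.ppf(1 - ALPHA)  # inverse CDF gives the one-sided critical value
print(round(Zc, 2))       # 1.64
print(norm.cdf(Zc))       # 0.95, i.e. 1 - ALPHA
```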
Step 5: Effect Size and Sample Size
Finally, let's estimate the minimal sample size required to achieve the desired statistical power. In addition to sensitivity and power, the effect size needs to be evaluated, which requires a priori information and/or assumptions. For an equal number of customers in both groups, with the assumption that this number is sufficiently large (so that the sampling distribution is well approximated by a normal one), Cohen's D can be used to approximate the effect size:

$$D = \frac{p_B - p_A}{\sqrt{\bar{p}(1 - \bar{p})}},$$

where $\bar{p} = (p_A + p_B)/2$ is the average of the subscription probabilities in the two groups. A good estimate for $p_A$ comes from historical data, while $p_B$ has to be guessed from a priori considerations. Substituting the parameters, we find the value of the effect size $D$.
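As a sketch, with a hypothetical baseline rate of 10% and an a priori estimate of 13% under the new design (both numbers are assumptions, not values from the original experiment), the effect size works out as:

```python
import numpy as np

p_A = 0.10  # baseline subscription rate, e.g. from historical data (assumption)
p_B = 0.13  # a priori estimate of the rate under the new design (assumption)

p_bar = (p_A + p_B) / 2
D = (p_B - p_A) / np.sqrt(p_bar * (1 - p_bar))
print(D)  # effect size fed into the power calculation below
```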
Estimating required number of test participants
With all the pieces of the puzzle in place, we can proceed to estimating the sample size from the effect size, sensitivity, and power. The Python `statsmodels` package comes in handy for this task:
```python
from statsmodels.stats.power import NormalIndPower

ALPHA = 0.05  # significance level, threshold for the Type 1 error
BETA = 0.2    # threshold for the Type 2 error; power = 1 - BETA

PowerAnalysis = NormalIndPower()
# solve_power returns nobs1: the number of customers per group (ratio=1)
N_per_group = PowerAnalysis.solve_power(
    effect_size=D, alpha=ALPHA, power=1 - BETA, alternative="larger", ratio=1
)
```
The calculation returns the required number of customers in each group.
Under the hood, the above code involves numerically solving this equation for $n$, the number of customers per group:

$$\beta = \Phi\left(Z_c - D\sqrt{n/2}\right),$$

where the function $\Phi$ is the cumulative distribution function of the standard normal distribution. Plugging the solution back in as a sanity check:
```python
import numpy as np
from scipy.stats import norm

Zc = 1.64  # critical value for ALPHA = 0.05, one-sided
norm.cdf(Zc - D * np.sqrt(N_per_group / 2))
```
returns approximately $\beta = 0.2$, as expected.
Code for Experiments
Here is the notebook with the whole code. Each experiment consists of simulating the observed data and running the Z-test on these data.
```python
from statsmodels.stats.weightstats import ztest
import numpy as np

def run_ztest(p0, p1, N_samples, value):
    # generate data: 1 if a customer subscribed, 0 otherwise
    data_0 = np.random.binomial(1, p0, N_samples)
    data_1 = np.random.binomial(1, p1, N_samples)
    # run Z-test for the one-sided alternative
    result = ztest(data_1, data_0, alternative="larger", value=value)
    return result
```
The above code generates two sets of data, one for each of the two groups. `ztest` returns two values: the test statistic $Z$ and the corresponding p-value. The argument `value` is set to the threshold value of the effect, $\delta$, while `p0` and `p1` are the true, unknown probabilities of subscription.
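For example, a single simulated experiment might look like this (the parameter values are hypothetical, chosen to match the sketch in Step 5):

```python
np.random.seed(42)  # make the simulated experiment reproducible

# true rates of 10% and 13%, tested against a 2-percentage-point threshold
z_stat, p_value = run_ztest(p0=0.10, p1=0.13, N_samples=5000, value=0.02)
print(f"Z = {z_stat:.2f}, p-value = {p_value:.4f}")
```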
Experiments
The true subscription rate due to the new website design either exceeds or falls below the threshold. Consider the former case, when it exceeds the threshold and the null hypothesis has to be rejected; the analysis is the same for the latter case. Here, let's focus on the Type 2 error, when the test shows that the new website design does not achieve the goal even though it does.
There are three possible scenarios for the true value of the subscription probability in group B compared to the a priori estimate used for the effect size. In each of these scenarios, the chances of making the correct decision on whether or not to reject the null hypothesis differ. Let's consider each of these possibilities; a simulation sketch quantifying them follows below.
Actual subscription probability matches the a priori estimate. The estimate for $p_B$ used in the power calculation is good, and the test rejects the null hypothesis at a rate close to the planned power of $1 - \beta$.

Actual subscription probability is higher than the a priori estimate, because the new website version works better than expected. The estimate for $p_B$ is conservative, and the test rejects the null hypothesis more often than planned.

Actual subscription probability is lower than the a priori estimate, while still above the threshold. The estimate for $p_B$ is too optimistic: the test is underpowered, and the rate of Type 2 errors is substantially higher than the agreed-upon $\beta$.
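To put numbers behind these scenarios, one can estimate the empirical rejection rate by repeating the simulated experiment many times; the sketch below reuses `run_ztest` with hypothetical `p1` values for the three cases:

```python
def rejection_rate(p0, p1, N_samples, value, alpha=0.05, n_trials=2000):
    # fraction of simulated experiments in which the null hypothesis is rejected
    rejected = 0
    for _ in range(n_trials):
        _, p_value = run_ztest(p0, p1, N_samples, value)
        rejected += p_value <= alpha
    return rejected / n_trials

# p1 below, at, and above a hypothetical a priori estimate of 13%,
# all exceeding the 12% threshold (10% baseline + 2 percentage points)
for p1 in (0.125, 0.130, 0.135):
    print(p1, rejection_rate(p0=0.10, p1=p1, N_samples=5000, value=0.02))
```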
Even though in all of the above cases the new website design achieved the goal, our chance of picking it with the Z-test was dramatically different. One way to address this is to be more conservative when estimating the effect size, but that might incur extra costs in practice. Another is to explore different test setups, which is my future quest.