A/B Testing Introduction

Posted by VitoDH Blog on December 28, 2019

A/B Testing - Lesson 1 - Introduction

1. Overview

What

A general methodology used online when you want to test a new product or a new feature.

How

Split your users into two sets:

  • Control set: sees your existing product
  • Experiment set: sees the new version

How do these two sets respond differently?

Useful when

  • You want to climb to the peak of the current mountain (i.e., incrementally optimize an existing product)

e.g.

  • Movie recommendation site
  • Changing backend page-load time and measuring the effect on results

Not useful when

  • Not when you are choosing which mountain to climb (i.e., testing out entirely new experiences)
  • A/B tests are short-term, so effects that take a long time to show up are hard to measure
  • They can’t tell you if you’re missing something

e.g.

  • Online shopping service: is the set of items complete? (A/B testing can’t tell you what’s missing)
  • New premium service: it takes too long to see whether customers come back
  • Updating the brand and logo

2. Example

Audacity

  • Create online finance courses
  • User flow / Customer funnel
    • Homepage visits
    • Exploring the site
    • Create account
    • Complete a course
  • Experiment
    • Hypothesis: Changing the “start now” button from orange to pink will increase how many students explore Audacity’s courses

3. Choose a metric

Following the audacity example,

  • Total number of courses completed
    • Takes too much time to accumulate; not practical
  • Number of clicks on a specific button
    • Raw click counts aren’t comparable because the old and new versions may receive different numbers of page views
  • CTR (Click-Through Rate): number of clicks on the button / number of page views
    • Can be inflated, e.g., when the page loads slowly and an impatient visitor clicks 5 times
    • Best for measuring the usability of the button itself
  • Click-through probability: unique visitors who click / unique visitors who view the page
    • Use when deciding whether users progress from the first level of the funnel (homepage visits) to the second (exploring the site)

Updated hypothesis: Changing the “start now” button from orange to pink will increase the click-through probability of the button
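
To make the difference between the two metrics concrete, here is a minimal sketch on toy data (the log format and the numbers are made up for illustration):

```python
# Each tuple is one page view: (user_id, clicks on the "start now" button).
# Toy data: u1 loaded the page twice and once clicked 5 times impatiently.
page_views = [
    ("u1", 0), ("u1", 5),
    ("u2", 1),
    ("u3", 0),
    ("u4", 2),
]

total_clicks = sum(clicks for _, clicks in page_views)
ctr = total_clicks / len(page_views)  # counts every click, even impatient repeats

viewers = {uid for uid, _ in page_views}
clickers = {uid for uid, clicks in page_views if clicks > 0}
ctp = len(clickers) / len(viewers)    # each user counted at most once

print(f"CTR = {ctr:.2f}")  # 8 clicks / 5 page views      = 1.60
print(f"CTP = {ctp:.2f}")  # 3 clickers / 4 unique viewers = 0.75
```

Note how the impatient clicks inflate CTR but leave CTP untouched, which is why the hypothesis is stated in terms of click-through probability.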

4. Review statistics

Which distribution?

  • Binomial distribution
    • Click: success; no click: failure
    • Mean: $p$
    • Std of $\hat p$: $\sqrt{\frac{p(1-p)}{N}}$
    • $\hat p=\frac{X}{N}$
    • When can we use it?
      • 2 types of outcomes
      • Independent events
        • Clicks on a search results page are not independent, because people who don’t find what they want will search again with slightly different words
        • Student completion of a course after 2 months: can be assumed independent (exceptions: a student creates 2 accounts, or takes courses with friends)
      • Identically distributed (the same $p$ for every event)
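
A quick simulation, with a made-up $p$ and $N$, sanity-checks these formulas:

```python
import math
import random

random.seed(42)
p, N = 0.1, 10_000                     # true click probability, page views (made up)
X = sum(random.random() < p for _ in range(N))  # number of successes (clicks)

p_hat = X / N                          # estimate: p_hat = X / N
std = math.sqrt(p * (1 - p) / N)       # std of p_hat from the formula

print(f"p_hat = {p_hat:.4f} (true p = {p}, std of p_hat ≈ {std:.4f})")
```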

Confidence Interval

If we theoretically repeat the experiment over and over again, we would expect the interval we construct around our sample mean to cover the true value in the population 95% of the time

  • To use the normal approximation, check $N\cdot \hat p > 5$ and $N\cdot (1-\hat p) > 5$
  • Margin: $m=z\cdot\sqrt{\frac{\hat p(1-\hat p)}{N}}$
    • $\hat p$ closer to 0 or 1: std will be smaller, distribution will be tighter, confidence interval will be smaller
    • Large $N$: Std will be smaller, distribution will be tighter, confidence interval will be smaller
    • When considering 95% confidence interval, z-score is 1.96
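
A short sketch that applies the normality check and computes the interval (the click counts are made up):

```python
import math

X, N = 100, 1000            # clicks and page views (hypothetical numbers)
p_hat = X / N

# The normal approximation is reasonable only when both checks pass.
assert N * p_hat > 5 and N * (1 - p_hat) > 5

z = 1.96                    # z-score for a 95% confidence interval
m = z * math.sqrt(p_hat * (1 - p_hat) / N)   # margin of error
print(f"p_hat = {p_hat:.3f}, 95% CI = [{p_hat - m:.3f}, {p_hat + m:.3f}]")
# p_hat = 0.100, 95% CI = [0.081, 0.119]
```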

Hypothesis Testing

We want to calculate $P(\text{results due to chance})$. We have two proportions: $p_{control}$ and $p_{exp}$ (experiment)

  • Null hypothesis: $p_{control} = p_{exp}$
  • Alternative hypothesis: $p_{control} \neq p_{exp}$
  • Measure $\hat p_{control}$ and $\hat p_{exp}$, calculate $P(\hat p_{exp}-\hat p_{control} \mid H_0)$, and reject the null if this p-value is below 0.05

Compare two samples

This quantitative test tells us whether the difference we observed is likely to have occurred by chance, or whether it would be extremely unlikely if the two sides were actually the same

  • Variables we have: $X_{control},X_{exp},N_{control},N_{exp}$
  • $\hat p_{pool}=\frac{X_{control}+X_{exp}}{N_{control}+N_{exp}}$
  • $SE_{pool}=\sqrt{\hat p_{pool}(1-\hat p_{pool})\cdot(\frac{1}{N_{control}}+\frac{1}{N_{exp}})}$
  • $\hat d=\hat p_{exp}-\hat p_{control}$
  • $H_0: d=0$; under $H_0$, $\hat d\sim N(0,SE_{pool}^2)$
  • If $\hat d>1.96\cdot SE_{pool}$ or $\hat d<-1.96\cdot SE_{pool}$, reject null
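
Here is the whole pooled test as a sketch, with hypothetical click and page-view counts for the two groups:

```python
import math

X_cont, N_cont = 974, 10072    # control: clicks, page views (hypothetical)
X_exp,  N_exp  = 1242, 9886    # experiment: clicks, page views (hypothetical)

p_pool = (X_cont + X_exp) / (N_cont + N_exp)
SE_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))

d_hat = X_exp / N_exp - X_cont / N_cont   # observed difference in CTP

if abs(d_hat) > 1.96 * SE_pool:
    print(f"|d_hat| = {abs(d_hat):.4f} > 1.96*SE = {1.96 * SE_pool:.4f}: reject the null")
else:
    print(f"|d_hat| = {abs(d_hat):.4f}: cannot reject the null")
```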

Make sure that the result is both practically significant (from a business viewpoint, e.g., a 2% change in the click-through probability) and statistically significant. Set the statistical significance bar lower than the practical significance bar.

5. Design

How many page views do we need?

  • $\alpha=P(\text{reject null}\mid\text{null true})$
  • $\beta=P(\text{fail to reject null}\mid\text{null false})$
  • Small sample
    • Low $\alpha$: you are unlikely to launch a bad experiment
    • High $\beta$: you are likely to fail to launch an experiment that actually did have a difference you care about
  • Large sample
    • Same $\alpha$
    • Low $\beta$
  • Sensitivity=$1-\beta$, often $80\%$
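
A small simulation makes the tradeoff visible. It reuses the pooled z-test from section 4 with made-up click probabilities: under the null ($p=0.10$ in both groups) the rejection rate estimates $\alpha$, and under a real 2% difference the non-rejection rate estimates $\beta$:

```python
import math
import random

def reject_null(N, p_cont, p_exp, z=1.96):
    """Simulate one experiment with N users per group; apply the pooled z-test."""
    x_cont = sum(random.random() < p_cont for _ in range(N))
    x_exp = sum(random.random() < p_exp for _ in range(N))
    p_pool = (x_cont + x_exp) / (2 * N)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (2 / N))
    d_hat = x_exp / N - x_cont / N
    return abs(d_hat) > z * se_pool

random.seed(0)
runs = 2000
for N in (500, 5000):
    alpha = sum(reject_null(N, 0.10, 0.10) for _ in range(runs)) / runs   # null true
    beta = 1 - sum(reject_null(N, 0.10, 0.12) for _ in range(runs)) / runs  # null false
    print(f"N = {N}: alpha ≈ {alpha:.3f}, beta ≈ {beta:.3f}")
```

With the larger sample, $\alpha$ stays near 0.05 while $\beta$ drops sharply.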

Relationship between change and sample size

| Change | Effect on page views needed |
| --- | --- |
| Higher CTP in control (still < 0.5) | Increase (std is larger, so more views are needed to bring it back to the original level) |
| Increased practical significance level ($d_{min}$) | Decrease (larger changes are easier to detect) |
| Increased confidence level ($1-\alpha$) | Increase (we reject the null less often but need to keep the sensitivity the same) |
| Higher sensitivity ($1-\beta$) | Increase (narrow the distribution) |
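
The course doesn’t derive a formula for the required number of page views; the sketch below uses the standard closed-form approximation for a two-proportion z-test, so the exact numbers are illustrative, but the directions match the table:

```python
from statistics import NormalDist

def sample_size(p_cont, d_min, alpha=0.05, power=0.80):
    """Approximate page views per group to detect a CTP change of d_min."""
    p_exp = p_cont + d_min
    p_bar = (p_cont + p_exp) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for power = 0.80
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_cont * (1 - p_cont) + p_exp * (1 - p_exp)) ** 0.5) ** 2
    return num / d_min ** 2

print(f"baseline:           {sample_size(0.10, 0.02):.0f}")
print(f"higher control CTP: {sample_size(0.20, 0.02):.0f}")              # more views
print(f"larger d_min:       {sample_size(0.10, 0.04):.0f}")              # fewer views
print(f"higher confidence:  {sample_size(0.10, 0.02, alpha=0.01):.0f}")  # more views
print(f"higher sensitivity: {sample_size(0.10, 0.02, power=0.90):.0f}")  # more views
```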

6. Analyze

  • The whole confidence interval lies above $d_{min}$: launch the new version
  • Confidence interval falls within $[-d_{min},d_{min}]$: don’t launch, because the change is not practically significant
  • Confidence interval overlaps an endpoint of $[-d_{min},d_{min}]$: run an additional test. If you don’t have time for that, talk to the decision-makers and use other methods besides data
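
These three rules are easy to encode. A minimal sketch, with made-up confidence intervals and $d_{min}=0.02$:

```python
def launch_decision(ci_low, ci_high, d_min):
    """Map a confidence interval for the effect to a launch recommendation."""
    if ci_low > d_min:
        return "launch: the whole CI clears the practical significance bar"
    if -d_min <= ci_low and ci_high <= d_min:
        return "do not launch: the effect is not practically significant"
    return "run an additional test (or fall back on judgment and other evidence)"

print(launch_decision(0.025, 0.045, d_min=0.02))   # launch
print(launch_decision(-0.010, 0.015, d_min=0.02))  # do not launch
print(launch_decision(0.005, 0.030, d_min=0.02))   # ambiguous: test again
```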