
Posted by VitoDH Blog on January 3, 2020

AB Testing - Lesson 5 - Analyzing Results

  • Sanity Checks
  • Single Metric
  • Multiple Metrics
  • Gotchas

1. Sanity Checks

Reasons for Sanity Check (example)

  • Something goes wrong in the experiment diversion and the experiment and control aren’t comparable
  • You set up the filter differently in the experiment and the control
  • Your data capture doesn't record the events you want to analyze in your experiment

To do this, we should check

  • Population sizing metrics based on your unit of diversion to make sure the population in experiment and control are comparable
  • The invariant metrics: metrics that shouldn’t change when you run your experiment

Choosing invariant metrics

Example 1

  Candidate invariants: # signed-in users, # cookies, # events, CTR on "Start Now", Time to complete

  • Case 1: Change the order of courses in the course list (unit of diversion: user-id)
    • Good invariants: # signed-in users, # cookies, # events, CTR on "Start Now" (but not time to complete)
  • Case 2: Change infrastructure to reduce load time (unit of diversion: event)
    • Good invariants: # signed-in users, # cookies, # events, CTR on "Start Now" (but not time to complete)
  • For case 1,
    • # signed-in users is a good population sizing metric
    • # cookies and # events are also good population sizing metrics, unless users in the experiment group tend to clear their cookies more often than users in the control group
    • The "Start Now" button shows before the course list, so its CTR should remain the same
    • Putting easier courses first might reduce the time to complete, so it is not a good invariant
  • For case 2,
    • # events is a good population sizing metric since it's the unit of diversion
    • # signed-in users and # cookies are randomly assigned along with the events, so they are also good population sizing metrics
    • People press "Start Now" before viewing the videos, so its CTR should be unaffected
    • Time to complete is difficult to track with event-based diversion (users might be assigned to control and experiment multiple times)

Example 2

Change location of sign-in button to appear on every page

Unit of diversion: cookie

Which metrics should make good invariants?

  • # events, # cookies, and # users are good population sizing metrics
  • CTR on "Start Now": No, because the sign-in button now appearing on the home page can affect the CTR on "Start Now"
  • Probability of enrolling: No, because users enroll after signing in, so the change can affect enrollment
  • Sign-in rate: No, this is exactly what we're trying to change
  • Video load time: Yes, this is a back-end metric that the change shouldn't affect

Checking invariants

Example 1

Run experiment for 2 weeks

Unit of diversion: cookie

If you notice that # total control is not equal to # total experiment, how would you figure out whether this difference is within expectations?

Given: Each cookie is randomly assigned to the control or experiment group with probability 0.5

Answer:

This is like flipping a fair coin: if heads = assigned to control, the number of heads follows a binomial distribution

  • Compute standard deviation of binomial with probability 0.5 of success
    • SD = sqrt(0.5*0.5/(# total control+# total experiment))
  • Multiply by z-score to get margin of error
  • Compute confidence interval around 0.5
  • Check whether the observed fraction of successes falls within the interval
  • If the observed fraction is not in the interval, one good check is to look at the day-by-day data: compute the fraction of cookies in the control group for each day to figure out whether it's an overall problem or a problem on a specific day (a sketch of the overall check follows this list)
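
Below is a minimal sketch of this sanity check in Python. The cookie counts are hypothetical placeholders, not numbers from the course, and scipy is assumed to be available.

```python
import math
from scipy.stats import norm

# Hypothetical totals for the two-week run; replace with your own counts.
control_total = 64454
experiment_total = 61818

total = control_total + experiment_total
observed_fraction = control_total / total

# Under the null, each cookie lands in control with p = 0.5,
# so the observed fraction has SD = sqrt(0.5 * 0.5 / total).
p = 0.5
sd = math.sqrt(p * (1 - p) / total)

# 95% confidence interval around 0.5
z = norm.ppf(0.975)          # ~1.96
margin = z * sd
ci_lower, ci_upper = p - margin, p + margin

print(f"Observed fraction in control: {observed_fraction:.4f}")
print(f"95% CI under the null: ({ci_lower:.4f}, {ci_upper:.4f})")
print("Sanity check passed" if ci_lower <= observed_fraction <= ci_upper
      else "Sanity check failed: check the day-by-day data")
```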

What to do?

  • Talk to the engineers
  • Try slicing (e.g., by platform) to see if one particular slice looks weird
  • Check the age of cookies: does one group have more new cookies?

Common scenarios where experiment and control don't match up

  • Data capture
    • If you're capturing a new experience the user is undergoing, the capture may miss some of those events, so the event shows up rarely in the experiment but more often in the control
  • Experiment setup
    • Maybe you set up a filter for experiment only but not for the control
  • Infrastructure

2. Single Metric

Example

Change color and placement of “Start Now” button

Metric: Click-through rate

Unit of diversion: cookie

Check the videos to see the details.

Usually we can run two tests:

  • A binomial test using the normal approximation, applied to the overall (empirical) CTR difference
  • A sign test on the day-by-day differences (a nonparametric test)
  • If the two tests' results disagree, we need to dig deeper (e.g., a weekday vs. weekend effect); both tests are sketched below
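
A minimal sketch of both tests in Python, using hypothetical daily (clicks, pageviews) pairs; the data below are assumptions for illustration, not values from the course.

```python
import math
from scipy.stats import norm, binomtest

# Hypothetical daily (clicks, pageviews) for each group over one week.
control    = [(196, 2029), (200, 1991), (187, 2014), (205, 2045), (211, 1987), (190, 2003), (202, 2021)]
experiment = [(214, 2002), (220, 1989), (199, 2027), (230, 2014), (225, 1998), (207, 2019), (218, 2033)]

# --- Test 1: normal approximation to the binomial on the pooled CTR difference ---
x_c, n_c = map(sum, zip(*control))
x_e, n_e = map(sum, zip(*experiment))
p_pool = (x_c + x_e) / (n_c + n_e)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_e))
diff = x_e / n_e - x_c / n_c
margin = norm.ppf(0.975) * se_pool
print(f"CTR difference: {diff:.4f}, 95% CI: ({diff - margin:.4f}, {diff + margin:.4f})")

# --- Test 2: sign test on which group wins each day ---
wins = sum((xe / ne) > (xc / nc) for (xc, nc), (xe, ne) in zip(control, experiment))
result = binomtest(wins, n=len(control), p=0.5, alternative="two-sided")
print(f"Sign test: experiment higher on {wins}/{len(control)} days, p-value = {result.pvalue:.4f}")
```

If the confidence interval excludes zero but the sign test is not significant (or vice versa), slice the data by day of week or platform before drawing a conclusion.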

Example

Simpson’s paradox in CTR

  • Looking at the aggregate result, the experiment group has a lower CTR
  • But when the data is broken down into new users and experienced users, the experiment has a higher CTR in each group
  • Possible reason: the change might affect new users and experienced users differently (a numerical illustration follows)
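
A small numerical illustration of the paradox, with made-up click/pageview counts chosen only so that the experiment wins within each user segment yet loses in aggregate (because it happens to contain a much larger share of low-CTR new users):

```python
# Hypothetical (clicks, pageviews) per user segment; illustrative only.
data = {
    "control":    {"new": (50, 1000),  "experienced": (990, 9000)},
    "experiment": {"new": (540, 9000), "experienced": (120, 1000)},
}

for group, segments in data.items():
    clicks = sum(c for c, _ in segments.values())
    views = sum(v for _, v in segments.values())
    per_segment = ", ".join(f"{name}: {c / v:.1%}" for name, (c, v) in segments.items())
    print(f"{group}: aggregate CTR = {clicks / views:.1%} ({per_segment})")

# control: aggregate CTR = 10.4% (new: 5.0%, experienced: 11.0%)
# experiment: aggregate CTR = 6.6% (new: 6.0%, experienced: 12.0%)
```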

3. Multiple Metrics

Example

Prompt students to contact coach more frequently

Metrics:

  • Probability that student signs up for coaching
  • How early students sign up for coaching
  • Average price paid per student

If Audacity tracks all three metrics and does three separate significance tests (alpha = 0.05), what is the probability that at least one metric will show a significant difference even if there is no true difference?

Assuming the three tests are independent, P(at least one false positive) = 1 − (1 − 0.05)^3 ≈ 0.143. In reality the metrics are related rather than independent, so this is an overestimate of the probability of a false positive.

  • The probability of any false positive increases as you increase the number of metrics
    • Solution: use a higher confidence level (a stricter alpha) for each individual metric; both methods below are sketched in code after this list
    • Method 1: Assume independence and solve for the per-metric alpha that keeps the overall alpha at the desired level
    • Method 2: Bonferroni correction (test each metric at alpha divided by the number of metrics)
      • Simple
      • Makes no assumptions about how the metrics are related
      • Conservative
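
A short sketch of the numbers involved, reusing the three-metric example above; the per-metric alphas come from the standard formulas, not from the course:

```python
n_metrics = 3
alpha_overall = 0.05

# With independent tests, the chance of at least one false positive grows quickly.
p_any_false_positive = 1 - (1 - alpha_overall) ** n_metrics
print(f"P(at least one false positive) = {p_any_false_positive:.3f}")    # ~0.143

# Method 1 (assuming independence): solve 1 - (1 - a)^n = alpha_overall for a.
alpha_independent = 1 - (1 - alpha_overall) ** (1 / n_metrics)
print(f"Per-metric alpha under independence = {alpha_independent:.4f}")  # ~0.0170

# Method 2 (Bonferroni): test each metric at alpha / n, no independence assumption.
alpha_bonferroni = alpha_overall / n_metrics
print(f"Bonferroni per-metric alpha = {alpha_bonferroni:.4f}")           # ~0.0167
```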

Example

Update description on course list

Check the videos

Different strategies:

  • Control the probability that any metric shows a false positive
    • Family-wise error rate (FWER)
  • Control the false discovery rate (FDR)
    • FDR = E[# false positives / # rejections]
    • Suppose you have 200 metrics and cap the FDR at 0.05: of the metrics flagged as significant, you're okay with about 5% being false positives on average (e.g., if 100 metrics are flagged, roughly 5 false positives and 95 true positives)
    • If you want to detect significant changes across a large number of metrics, cap the FDR instead of the FWER (a comparison of the two is sketched after this list)
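
A minimal comparison of the two strategies in Python, using the multipletests helper from statsmodels on hypothetical p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values, one per metric.
p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.500, 0.800])

# Control FWER with Bonferroni vs. control FDR with Benjamini-Hochberg.
reject_fwer, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr,  _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni (FWER) rejections:", int(reject_fwer.sum()))         # stricter, fewer discoveries
print("Benjamini-Hochberg (FDR) rejections:", int(reject_fdr.sum()))   # more discoveries, FDR capped at 5%
```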

4. Conclusion

If you get a significant result, ask yourself

  • Do you understand the change?
    • What if one test (or metric) is significant and the other is not?
    • What if one shows a positive change while the other shows a negative change?
  • Do you want to launch the change?
    • Do I have statistically significant and practically significant results that justify launching the change?
    • Do I understand what the change has done to the user experience?
    • Is it worth it?

Gotchas

  • When launching, remove all the filters and do a gradual ramp-up, so you can see incidental changes to the unaffected users who weren't included in the original experiment