Find important customer segments in A/B tests

When A/B testing at uSwitch, our goal is often to find which of two designs (or ‘variants’) has the higher performance metric among the users who see it.

For example, we might look at the success rate (proportion of users who perform a specific action) of users who saw design A and compare it to the success rate for users who saw design B, then pick the design with the higher success rate to roll out on the site.

These overall metrics can hide big differences within individual customer segments, so we often break them out by categorical variables like browser, device type, traffic source or date.
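As a rough sketch of what such a breakdown looks like in dplyr (the data frame test_data and its columns here are hypothetical; the actual format we use is described below):

library(dplyr)

# Success rate by device type and variant, assuming one row per user in test_data
test_data %>%
  group_by(device_type, variant_id) %>%
  summarise(users = n(),
            success_rate = mean(success == 'success'),
            .groups = 'drop')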

A single variable value (e.g., device_type = 'mobile') might mark a group of users with a large difference in the performance metric between the two designs. Differences like this often need investigating: they can uncover unexpected interaction differences and technical bugs present in one design but not the other.

At uSwitch, we have a method and an R function which helps us find these important customer segments quickly and reliably. Before we get to that, we need to explore the question:

What are the most important customer segments to investigate?

We assume here that we are running an A/B test on a website with an equal split of traffic going to each variant. Each observation in the test has a success variable which marks whether the observation was associated with a successful action.

We’ll take ‘segment’ to mean a single value of a category we see in the dataset. An example of a segment would be the 'Chrome' value of the category 'browser'.

We’ll take ‘most important to investigate’ to mean ‘most evidence for a difference in success rate’. We’ll use the p-value from a chi-squared test for this, with lower p-values indicating stronger evidence for a difference.
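As a quick illustration with made-up counts, the p-value for one segment comes from a chi-squared test on its 2 × 2 variant-by-outcome contingency table:

# Hypothetical counts for a single segment, e.g. browser = 'Chrome'
counts <- matrix(c(180, 2220,   # variant A: successes, failures
                   231, 2219),  # variant B: successes, failures
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c('A', 'B'), c('success', 'fail')))

chisq.test(counts)$p.value  # lower value = stronger evidence for a difference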

The meanings here for importance and segment aren’t the only useful ones; I’ve suggested some more later.

Now that we've made the question less vague, we can move on to:

The Method

We start with data from an A/B test that looks like

uuid  variant_id  success  browser  device_type  traffic_medium
1     A           fail     Chrome   desktop      direct
2     B           success  Firefox  desktop      organic
3     A           success  Safari   mobile       direct
4     A           success  Chrome   desktop      email
5     B           fail     Chrome   desktop      email
6     B           success  Safari   tablet       paid-search

Here 'success' records whether or not the person completed the action we are trying to influence, 'uuid' is a user ID, 'variant_id' is the test bucket they are in, and the remaining fields are categorical variables describing the user.

The defining characteristics of this dataset are:

  • Each row is a unique observation in the test
  • There is a variable for the test bucket the observation is in
  • There are several categorical variables
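
As a minimal sketch, a data frame in this shape (with the values copied from the example table above) can be built like so:

test_data <- data.frame(
  uuid           = 1:6,
  variant_id     = c('A', 'B', 'A', 'A', 'B', 'B'),
  success        = c('fail', 'success', 'success', 'success', 'fail', 'success'),
  browser        = c('Chrome', 'Firefox', 'Safari', 'Chrome', 'Chrome', 'Safari'),
  device_type    = c('desktop', 'desktop', 'mobile', 'desktop', 'desktop', 'tablet'),
  traffic_medium = c('direct', 'organic', 'direct', 'email', 'email', 'paid-search'),
  stringsAsFactors = FALSE)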

Next:

  1. For every category value in the test data, filter the dataset down to include only observations that fall into that category value.
  2. Calculate the p-value for the (Bucket A, Bucket B) × (Success, Not Success) contingency table for those observations.
  3. From these values, create a table which looks like:
Category     Category value   Variant A observations   Variant B observations   p-value   Variant A success rate   Variant B success rate
device type  mobile           2500                     2500                     0.065     6.6%                     8.0%
channel      search           2500                     2500                     0.357     9.0%                     9.8%
device type  desktop          1500                     1500                     0.378     11.3%                    10.3%
device type  tablet           1000                     1000                     0.588     8.8%                     9.6%
channel      email            2500                     2500                     0.755     7.9%                     8.2%

with each row corresponding to exactly one unique category and value pair, along with the associated observation counts, success rates and p-value. This table should be ordered by ascending p-value.

From this, we immediately see the segments of users with the strongest evidence for a difference in success rates. We can balance this against the observation columns to check the relative sizes of the segments, which can affect the order in which we investigate the top few.
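To make steps 1 and 2 concrete, here is a minimal sketch for a single segment, assuming test_data is a full test dataset in the shape described above (with many more rows than the six-row example):

# Step 1: keep only the observations in this segment
segment <- subset(test_data, device_type == 'mobile')

# Step 2: build the (variant) x (success) contingency table and test it
contingency <- table(segment$variant_id, segment$success)
chisq.test(contingency)$p.value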

The Script

The 'prioritise' function here will do this for us:

library(dplyr)     # group_by_, summarise, mutate, transmute, arrange
library(reshape2)  # dcast

prioritise <- function(df, bucket, ss, variable_vec) {
  #df is our data frame.
  #bucket is the name (as a string) of the column for the variant ID in the data frame. This column should only contain the values 'A' and 'B'.
  #ss is the name (as a string) of the column which marks whether or not the observation was associated with a successful action. This column should only contain the values 'success' and 'fail'.
  #variable_vec is the string vector which has the column names of variables to make segments out of.

  #The output has the following variables:
  #var: a categorical variable name
  #var_value: a value of the categorical variable
  #A_observations, B_observations: observations in A and B
  #p_value: p-value as calculated from a chi-squared test (chisq.test)
  #A_SR_pct, B_SR_pct: success rates for A and B in percent, rounded to 1 d.p., as calculated from the samples.

  output <- data.frame()
  vars <- list()

  for(v in variable_vec) {
    vars[[v]] <- data.frame()
    #Cast to one row per value of v, with counts for each (bucket, success) combination as columns
    form <- as.formula(paste('var','+','var_value','~',bucket,'+',ss))
    ov <- group_by_(df, bucket, ss, var_value = v) %>% 
      summarise(count = n()) %>% 
      ungroup() %>% 
      mutate(var = v) %>% 
      dcast(form, value.var = 'count', fill = 0)
    ov_figures <- as.matrix(ov[,c('A_fail', 'A_success', 'B_fail', 'B_success')])
    ov <- transmute(ov, 
                    var,
                    var_value,
                    A_observations = A_success + A_fail,
                    B_observations = B_success + B_fail,
                    p_value = NA, 
                    A_SR_pct = round(100*A_success/(A_success + A_fail),1), 
                    B_SR_pct = round(100*B_success/(B_success + B_fail),1))

    #Chi-squared test on the 2 x 2 (bucket) x (success) table for each value of v
    for(r in 1:nrow(ov_figures)) {
      ov$p_value[r] <- round(chisq.test(matrix(ov_figures[r,], ncol = 2, byrow = TRUE))$p.value, 5)
    }

    ov <- arrange(ov, p_value)
    vars[[v]] <- ov
  }

  #Stack the per-variable tables and order the result by ascending p-value
  for(df1 in vars){
    output <- rbind(output, df1)
  }

  output <- arrange(output, p_value)

  return(output)
}

The following script will simulate some observations in an A/B test and apply prioritise to the results:

#Assume a 50:50 A/B test and only consider two variables: device type and traffic channel
#Simulate data by assuming a fixed chance of success for each channel, device type and bucket combination
#For a given channel and device type combination, this chance is the same across buckets, except for:
#mobile traffic combinations, where bucket B sees a 2 percentage point increase in the success rate
#Assume a 30/20/50 split in observations for desktop, tablet and mobile, 50/50 for email/search and 50/50 for buckets
#Assume observations are split homogeneously

example_data <- data.frame()
sim_inputs <- data.frame(device_type = rep(NA, 12))
sim_inputs$device_type <- rep(c('desktop', 'tablet', 'mobile'), each = 4)
sim_inputs$channel <- rep(rep(c('email', 'search'), each = 2), times = 3)
sim_inputs$bucket <- rep(c('A', 'B'), times = 6)
sim_inputs$proportion <- 0.5*rep(c(0.3*0.5, 0.2*0.5, 0.5*0.5), each = 4)
sim_inputs$sr <- c(rep(c(0.10, 0.12, 0.08, 0.09), each = 2), c(0.06, 0.08, 0.07, 0.09))

observations_in_test <- 10000  
for(i in 1:nrow(sim_inputs)){  
  segment <- sim_inputs[i,]
  total_in_segment <- round(segment$proportion*observations_in_test)
  interim <- data.frame(
    device_type = rep(segment$device_type, total_in_segment), 
    channel = rep(segment$channel, total_in_segment), 
    bucket = rep(segment$bucket, total_in_segment),
    success = rep(NA, total_in_segment))
  interim$success <- ifelse(rbinom(total_in_segment, size = 1, prob = segment$sr) == 1, 'success', 'fail')
  example_data <- rbind(example_data, interim)
}

prioritised_segments <- prioritise(example_data, bucket = 'bucket', ss = 'success', variable_vec = c('device_type', 'channel'))

#Mobile should come out top here:
prioritised_segments

Other interpretations of the question

There are other ways to interpret the question 'What are the most important customer segments to investigate?' given the dataset format.

‘Customer segment’ could be expanded to include combinations of category values (e.g., device_type = 'mobile' and browser = 'Chrome').

‘Most important’ could mean:

  • Largest expected difference in success rates between the two buckets;
  • Largest 5% lower bound on the difference in success rates, treating each bucket’s success rate for that segment as having a uniform prior (i.e., an uninformative Beta(1, 1) prior); a sketch of this calculation follows below.
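
As an illustration of the second idea, here is a minimal sketch assuming Beta(1, 1) priors on each bucket's success rate and Monte Carlo draws from the resulting posteriors; the counts are hypothetical:

# Hypothetical counts for one segment
a_success <- 165; a_total <- 2500   # bucket A
b_success <- 200; b_total <- 2500   # bucket B

# Posteriors under a uniform Beta(1, 1) prior are Beta(1 + successes, 1 + failures)
draws_a <- rbeta(100000, 1 + a_success, 1 + a_total - a_success)
draws_b <- rbeta(100000, 1 + b_success, 1 + b_total - b_success)

# 5% lower bound on the B - A difference in success rates
quantile(draws_b - draws_a, probs = 0.05)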

Final thoughts

There is a danger of data dredging and p-hacking if this method is used incorrectly. The low p-values here should never be presented as evidence of a significant change in the success rate. Instead, the method should be used only to guide investigations.

That being said, the method described in this post has helped us at uSwitch several times by highlighting customers who were experiencing issues with recently launched A/B tests. It has saved us time and effort, and improved the consistency with which we spot issues and unexpected design consequences in tests. Hopefully it can help you in a similar way.