if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (!requireNamespace("fixest", quietly = TRUE)) {
install.packages("fixest")
}
library(dplyr)
library(fixest)
As a former data analyst at an ecommerce company, I regularly conducted A/B testing to determine whether a feature was worth releasing or a product worth launching. In its simplest case, A/B testing comes down to a t-test between two groups. However, we cannot blindly apply the t-test to compare control and experiment variants when the metric of interest is a ratio. The core issue is that while we randomly assign users to control or experiment groups, our unit of analysis for ratio metrics (like clicks per impression or revenue per order) is often at a more granular level (sessions or orders).
In online experiments where we randomize at the user level, a single user can generate multiple sessions or orders. Imagine a user who loves a new feature and interacts with it many times during the experiment: those interactions are not independent events; they are all driven by that single user. Standard statistical procedures, however, assume each data point is independent and identically distributed (iid). Ignoring this within-user correlation can result in incorrect standard errors and potentially flawed decisions about whether to launch a feature or product.
To address this, different statistical approaches have been developed. One method I frequently used was the delta method, based on the paper by Deng, Knoblich, and Lu (2018). This method adjusts the standard errors of the ratio metric to account for the user-level randomization and the more granular unit of analysis. Intuitively, it approximates the variance of the ratio by considering the variances and covariance of the numerator and denominator.
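Concretely, in the notation used in the code below, let $X_i$ and $Y_i$ be the clicks and views of user $i$ in a group of $K$ users. The delta method approximates the variance of that group's click-through rate $\bar{X}/\bar{Y}$ as

$$\operatorname{Var}\!\left(\frac{\bar{X}}{\bar{Y}}\right) \approx \frac{1}{K\,\bar{Y}^{2}}\left(\sigma_{X}^{2} + \frac{\bar{X}^{2}}{\bar{Y}^{2}}\,\sigma_{Y}^{2} - 2\,\frac{\bar{X}}{\bar{Y}}\,\sigma_{XY}\right),$$

where $\sigma_{X}^{2}$, $\sigma_{Y}^{2}$, and $\sigma_{XY}$ are the sample variances and covariance of the user-level clicks and views. This is the quantity computed by the calculate_delta function defined later in this post.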
Interestingly, I encountered a similar problem in my field experiment class. The solution there was to use clustered standard errors. This technique explicitly accounts for the correlation of observations within groups (in our case, users). It adjusts the standard errors by looking at the variation between these clusters rather than treating each individual observation as independent.
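Mechanically, clustered standard errors come from a sandwich variance estimator in which residuals are summed within each cluster before being squared. Ignoring the small-sample corrections that packages typically apply, a common form is

$$\widehat{V}_{\text{cluster}} = (X^{\top}X)^{-1}\left(\sum_{g=1}^{G} X_{g}^{\top}\hat{u}_{g}\hat{u}_{g}^{\top}X_{g}\right)(X^{\top}X)^{-1},$$

where $g$ indexes the $G$ clusters (here, users), $X_{g}$ is the design matrix for cluster $g$, and $\hat{u}_{g}$ is its vector of regression residuals.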
At that moment, I realized that online experimentation and my field experiment class faced the same underlying problem with seemingly different approaches that yielded similar results. A quick web search confirmed this intuition, leading me to another paper by Deng, Lu, and Qin (2021) – the same authors who proposed the delta method – which proves the equivalence of their proposed method and methods based on clustered standard errors.
In this post, I want to demonstrate this equivalence using a practical example and R code. We will generate simulated A/B test data and apply both the delta method and clustered standard errors to estimate the difference in click-through rate between a control and an experiment group.
First, let’s load the necessary R libraries and set up our data generating process:
set.seed(10027)
# Simulation parameters: 20,000 users, 100,000 sessions, 20% relative CTR uplift
n_users <- 20000
n_sessions <- 100000
uplift <- 0.2

# Per-user session propensity and treatment assignment
prob <- rlnorm(n_users)
is_treated <- rbinom(n_users, 1, 0.5)

# Per-user click-through rate, session-to-user mapping, and session-level clicks
ctr <- rbeta(n_users, 10, 90) * (1 + is_treated * uplift)
user_sessions <- sample(1:n_users, n_sessions, replace = TRUE, prob = prob)
clicks <- rbinom(n_sessions, 1, ctr[user_sessions])

session_level <- data.frame(
  session_id = 1:n_sessions,
  user_id = user_sessions,
  treatment = ifelse(is_treated[user_sessions] == 0, "Ctrl", "Exp"),
  clicks = clicks
)
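Before analyzing the experiment, it can be helpful to sanity-check the simulated data, for example by confirming that sessions are indeed clustered within users. This snippet is just an optional check (output not shown):

# Sessions per arm and number of distinct users
dplyr::count(session_level, treatment)
dplyr::n_distinct(session_level$user_id)

# Average number of sessions per user (> 1, so sessions are not independent observations)
nrow(session_level) / dplyr::n_distinct(session_level$user_id)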
Now, let’s implement the delta method. We’ll first aggregate our session-level data to the user level to calculate the total clicks and views per user, since the delta method in this context operates on user-level aggregates. We then use the delta method to estimate the variance of the ratio in each group, and from that the difference in click-through rate between the control and experiment groups along with its standard error.
# Aggregate sessions to the user level: total views (sessions) and clicks per user
user_level <- session_level |>
  dplyr::group_by(user_id, treatment, .drop = TRUE) |>
  dplyr::summarise(
    views = n(),
    clicks = sum(clicks)
  )
# Delta-method variance of the ratio metric (clicks / views) for one group
calculate_delta <- function(data, treatment_group) {
  group_data <- data[data$treatment == treatment_group, ]
  clicks <- group_data$clicks
  views <- group_data$views

  K <- length(clicks)
  X_bar <- mean(clicks)
  Y_bar <- mean(views)

  X_var <- var(clicks)
  Y_var <- var(views)
  XY_cov <- cov(clicks, views)

  (1 / K) * (1 / Y_bar^2) * (X_var + Y_var * (X_bar / Y_bar)^2 - 2 * (X_bar / Y_bar) * XY_cov)
}
# Difference in CTR between groups, with delta-method standard error and p-value
calculate_statistics <- function(data, ctrl_group, exp_group) {
  delta_ctrl <- calculate_delta(data, ctrl_group)
  delta_exp <- calculate_delta(data, exp_group)

  estimate <- sum(data$clicks[data$treatment == exp_group]) / sum(data$views[data$treatment == exp_group]) -
    sum(data$clicks[data$treatment == ctrl_group]) / sum(data$views[data$treatment == ctrl_group])
  std_error <- sqrt(delta_ctrl + delta_exp)
  t_value <- estimate / std_error
  p_value <- 2 * (1 - pnorm(abs(t_value)))

  data.frame(
    estimate = estimate,
    std_error = std_error,
    t_value = t_value,
    p_value = p_value
  )
}

calculate_statistics(user_level, "Ctrl", "Exp")
estimate std_error t_value p_value
1 0.01871455 0.002118153 8.835316 0
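Because the estimator is asymptotically normal, a 95% confidence interval follows directly from the estimate and its delta-method standard error. This is a small optional addition, not part of the output above:

# Normal-approximation 95% confidence interval for the CTR difference
res <- calculate_statistics(user_level, "Ctrl", "Exp")
c(
  lower = res$estimate - qnorm(0.975) * res$std_error,
  upper = res$estimate + qnorm(0.975) * res$std_error
)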
Now, let’s analyze the session-level data using clustered standard errors, clustering at the user level. We’ll use the feols
function from the {fixest} package, which provides a convenient way to estimate linear models with clustered standard errors.
fixest::feols(clicks ~ treatment, cluster = ~ user_id, data = session_level)
OLS estimation, Dep. Var.: clicks
Observations: 100,000
Standard-errors: Clustered (user_id)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.099233 0.001416 70.08886 < 2.2e-16 ***
treatmentExp 0.018715 0.002118 8.83553 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.310971 Adj. R2: 8.946e-4
Looking at the output of both methods, you should observe that the estimated treatment effect and, crucially, its standard error are remarkably similar across approaches. The estimated effect (around 0.0187) and the standard error (around 0.0021) are virtually identical between the delta method and the clustered standard errors approach.
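One way to see this at a glance is to put the two standard errors side by side; the object names below are just for illustration:

# Compare the delta-method SE with the clustered SE for the treatment coefficient
fit_cluster <- fixest::feols(clicks ~ treatment, cluster = ~ user_id, data = session_level)
delta_result <- calculate_statistics(user_level, "Ctrl", "Exp")

c(
  delta_se = delta_result$std_error,
  cluster_se = sqrt(diag(vcov(fit_cluster)))[["treatmentExp"]]
)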
The consistent results from both the delta method and clustered standard errors highlight an important point: different statistical communities might arrive at equivalent solutions for the same underlying problem using different terminology and techniques. The delta method, often employed in the online experimentation world, and clustered standard errors, a common tool in econometrics and field experiments, both effectively address the issue of within-group correlation in randomized experiments.