SQLrom

Two-Way Fixed Effects (TWFE) models are a workhorse in causal inference, but applying them to interaction effects — such as product color × search query — introduces subtle pitfalls that can silently bias your estimates. This article walks through how to set up TWFE for interaction effects in an e-commerce search setting, where the pitfalls lurk, and how to build a more robust estimation strategy.

Why Product Color × Search Query?

In fashion e-commerce, color is a first-order signal. A user searching for "linen shirt" behaves very differently from one searching for "black linen shirt" — the latter has an explicit color intent, while the former is open to discovery. If you want to measure how surfacing color-matched products affects conversion, you need to estimate a treatment effect that varies across both the product dimension (color) and the query dimension (search term).

This is exactly the kind of heterogeneous treatment effect that TWFE is often asked to handle. The estimand of interest is the incremental conversion lift from showing color-relevant results, conditional on a given query type.

The Standard TWFE Setup

Model Specification

A standard TWFE specification for this problem looks like:

Y_{iqt} = \alpha_i + \alpha_t + \beta \cdot \text{Treat}_{iqt} + \gamma \cdot (\text{Color}_{i} \times \text{QueryType}_{q}) + \varepsilon_{iqt}

Y: outcome (e.g., click-through rate, conversion rate)
α_i: item fixed effect (absorbs time-invariant product characteristics)
α_t: time fixed effect (absorbs common shocks across all items)
Treat_{iqt}: indicator for whether item i was shown in a color-matched position for query q at time t
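Color_i × QueryType_q: interaction term capturing baseline outcome differences by color-query combination; for the treatment effect itself to vary by cell, Treat_{iqt} must also be interacted with this term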
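
As a minimal sketch of how β is estimated, the snippet below simulates a balanced panel with simultaneous rollout and applies the two-way within transformation. It assumes each "unit" is an item-query pair, drops the γ interaction term for brevity, and uses made-up values (`true_beta`, panel sizes, noise scale) that are illustrative only.

```python
import numpy as np

def two_way_demean(x):
    """Remove unit (row) and period (column) means from a balanced units x periods array."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

rng = np.random.default_rng(0)
n_units, n_periods, true_beta = 200, 12, 0.5   # hypothetical sizes and effect

unit_fe = rng.normal(0.0, 1.0, size=(n_units, 1))     # alpha_i
time_fe = rng.normal(0.0, 1.0, size=(1, n_periods))   # alpha_t

# Simultaneous rollout: the first half of the units is treated from period 6 onward.
treated_unit = np.arange(n_units)[:, None] < n_units // 2
post = np.arange(n_periods)[None, :] >= 6
treat = (treated_unit & post).astype(float)

noise = rng.normal(0.0, 0.1, size=(n_units, n_periods))
y = unit_fe + time_fe + true_beta * treat + noise

# On a balanced panel, OLS on two-way demeaned data equals the TWFE estimate.
y_dm, d_dm = two_way_demean(y), two_way_demean(treat)
beta_hat = (d_dm * y_dm).sum() / (d_dm ** 2).sum()
print(f"beta_hat = {beta_hat:.3f} (simulated truth: {true_beta})")
```

The closed-form demeaning only works for balanced panels; with missing item-query-period cells you would fall back to dummy variables or an iterative demeaning routine.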

This setup is clean in theory. In practice, however, treatment assignment often doesn't happen simultaneously across all items — it rolls out in waves, which triggers the staggered adoption problem.

The Staggered Adoption Bias

What Goes Wrong

When color-matching is introduced to different query buckets at different times (e.g., color queries in week 1, pattern queries in week 3, material queries in week 6), TWFE implicitly uses already-treated units as controls for later-treated units. If treatment effects differ across cohorts or change with time since treatment, which they almost certainly do in a fashion context, these comparisons turn β into a weighted average of cohort-period effects in which some weights can be negative, producing biased and potentially sign-reversed estimates.
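
A small noise-free example makes the problem concrete. The sketch below assumes two hypothetical cohorts (adopting in periods 3 and 7, no never-treated group) whose lift grows by one unit per period of exposure, then compares the TWFE coefficient with the plain average effect over treated cells.

```python
import numpy as np

periods = np.arange(10)
adoption = np.array([3, 7])   # early and late cohort (hypothetical timing)

# Rows are cohorts, columns are periods.
treat = (periods[None, :] >= adoption[:, None]).astype(float)
exposure = periods[None, :] - adoption[:, None] + 1
effect = np.where(treat == 1, 1.0 * exposure, 0.0)   # lift grows with time since treatment
y = effect   # untreated outcome set to 0; unit and time effects would be absorbed anyway

def two_way_demean(x):
    """Remove row (unit) and column (period) means from a balanced array."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

d_dm, y_dm = two_way_demean(treat), two_way_demean(y)
twfe_beta = (d_dm * y_dm).sum() / (d_dm ** 2).sum()
true_att = effect[treat == 1].mean()

print(f"TWFE beta = {twfe_beta:.2f}, average effect on treated cells = {true_att:.2f}")
```

In this toy design TWFE reports 0.50 while the average effect over treated cells is 3.40: once the late cohort adopts, the early cohort's still-growing lift is differenced out as if it were a trend in the control group.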

A Diagnostic: Event Study Plot

Before trusting any TWFE result, run an event study. Decompose β into pre- and post-treatment period effects:

Y_{iqt} = \alpha_i + \alpha_t + \sum_{k \neq -1} \delta_k \cdot \mathbf{1}[t - G_i = k] + \varepsilon_{iqt}

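where G_i is the period in which item i is first treated (its cohort; if rollout happens by query bucket, define the cohort at the item-query level), k indexes periods relative to treatment, and k = -1 is the omitted reference period. Pre-period coefficients close to zero are consistent with parallel trends but do not prove it. Clear drift in the pre-periods is a red flag that treated and comparison units were already diverging before rollout. Under staggered timing with heterogeneous effects, the δ_k themselves can pick up effects from other relative periods, so treat this plot as a first diagnostic rather than a final estimate.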
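
The event-study regression can be sketched with plain dummy variables. The example assumes a simulated panel with two treated cohorts and a never-treated group (marked with -1), noise-free outcomes, and an effect path that depends only on k; all sizes and effect values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_periods = 10
cohort = np.repeat([4, 7, -1], 10)   # adoption period per unit; -1 = never treated
n_units = cohort.size

# Long format: one row per (unit, period).
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
g = cohort[unit]
treated = g >= 0
rel = np.where(treated, time - g, 0)   # k = t - G_i (placeholder 0 for never-treated rows)

def true_delta(k):
    """Simulated dynamic effect: zero before treatment, growing afterwards."""
    return np.where(k >= 0, 1.0 + 0.5 * k, 0.0)

y = (rng.normal(size=n_units)[unit] + rng.normal(size=n_periods)[time]
     + np.where(treated, true_delta(rel), 0.0))

# Relative-period dummies, omitting k = -1 as the reference.
ks = [int(k) for k in np.unique(rel[treated]) if k != -1]
event_dummies = np.column_stack([(treated & (rel == k)).astype(float) for k in ks])
unit_dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
time_dummies = (time[:, None] == np.arange(1, n_periods)[None, :]).astype(float)  # drop t = 0

X = np.hstack([event_dummies, unit_dummies, time_dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = dict(zip(ks, coef[: len(ks)]))

for k in sorted(delta_hat):
    print(f"k = {k:+d}: delta_hat = {delta_hat[k]: .3f}")
```

Because the simulated effect path is the same for every cohort, the pre-period coefficients come out at zero here. The never-treated group matters: without it, the full set of relative-period dummies is collinear with the unit and time effects.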

Robust Alternatives

Callaway–Sant'Anna (2021)

The Callaway–Sant'Anna estimator computes cohort-specific ATTs by comparing each cohort only to units that are never treated or not yet treated (clean controls). These cohort-period ATTs can then be aggregated into an overall ATT, a calendar-time ATT, or a dynamic ATT. For the color × query interaction setting, you'd estimate a separate CS-DiD per (color group, query type) cell and then aggregate.
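
The core comparison can be sketched without any DiD library. The code below is a simplified, unconditional version (no covariates, no inference) on a simulated panel; the helper `att_gt`, the cohort timing, and the effect sizes are all hypothetical and not the API of any package.

```python
import numpy as np

rng = np.random.default_rng(2)
n_periods = 10
cohort = np.repeat([3, 7, -1], 20)   # adoption period per unit; -1 = never treated
n_units = cohort.size
t = np.arange(n_periods)

treat = (cohort[:, None] >= 0) & (t[None, :] >= cohort[:, None])
exposure = t[None, :] - cohort[:, None] + 1
effect = np.where(treat, 1.0 * exposure, 0.0)   # simulated lift grows with exposure
y = rng.normal(size=(n_units, 1)) + rng.normal(size=(1, n_periods)) + effect

def att_gt(y, cohort, g, period):
    """2x2 DiD for cohort g at `period`, relative to g - 1, using clean controls only."""
    base = g - 1
    in_cohort = cohort == g
    clean = (cohort < 0) | (cohort > period)   # never treated, or not yet treated at `period`
    change = y[:, period] - y[:, base]
    return change[in_cohort].mean() - change[clean].mean()

results = {(g, p): att_gt(y, cohort, g, p)
           for g in (3, 7) for p in range(g, n_periods)}

# Aggregate to an overall ATT, weighting each cell by its cohort size (equal here).
weights = {key: float((cohort == key[0]).sum()) for key in results}
overall_att = sum(results[k] * weights[k] for k in results) / sum(weights.values())
print(f"ATT(3, 5) = {results[(3, 5)]:.2f}, overall ATT = {overall_att:.2f}")
```

On the same effect path that misled TWFE in a two-cohort design, the clean-control comparison recovers each cell's effect and an overall ATT of 3.40. In the color × query setting, this would be run separately per (color group, query type) cell.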

Sun–Abraham (2021)

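Sun–Abraham show that TWFE event-study coefficients are contaminated combinations of cohort-specific effects, and propose an interaction-weighted estimator that removes the negative-weight problem. It is easy to implement because it stays within the OLS framework: interact relative-period indicators with cohort dummies, using never-treated (or last-treated) units as the reference, then average the cohort-specific coefficients with cohort-share weights. Use cluster-robust SEs at the unit level rather than plain heteroskedasticity-robust ones.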
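
A rough sketch of the interaction-weighted idea, assuming a simulated panel with a never-treated reference group and cohort-specific effect slopes. The cohort sizes, slopes, and the helper `iw_estimate` are hypothetical, and standard errors are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n_periods = 10
cohort = np.repeat([3, 7, -1], [30, 10, 20])   # unequal cohort sizes; -1 = never treated
n_units = cohort.size
slope = {3: 1.0, 7: 0.2}                       # hypothetical cohort-specific effect growth

unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
g = cohort[unit]
rel = time - g

effect = np.zeros(unit.size)
for c, s in slope.items():
    on = (g == c) & (rel >= 0)
    effect[on] = s * (rel[on] + 1)
y = rng.normal(size=n_units)[unit] + rng.normal(size=n_periods)[time] + effect

# One dummy per (cohort, relative period), omitting k = -1 within each cohort.
cells = [(c, k) for c in slope for k in range(-c, n_periods - c) if k != -1]
cell_dummies = np.column_stack([((g == c) & (rel == k)).astype(float) for c, k in cells])
unit_dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
time_dummies = (time[:, None] == np.arange(1, n_periods)[None, :]).astype(float)

X = np.hstack([cell_dummies, unit_dummies, time_dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta = dict(zip(cells, coef[: len(cells)]))

def iw_estimate(k):
    """Average cohort-specific estimates at relative period k, weighted by cohort size."""
    present = [c for c in slope if (c, k) in delta]
    sizes = np.array([(cohort == c).sum() for c in present], dtype=float)
    return float(np.dot(sizes / sizes.sum(), [delta[(c, k)] for c in present]))

print(f"IW estimate at k = 0: {iw_estimate(0):.2f}")
```

The weights are the shares of each cohort among the cohorts observed at relative period k, so the aggregate is a proper convex combination of cohort-specific effects.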

Which to Use?

CS-DiD: preferred when you want interpretable cohort-level ATTs and have sufficient sample size per cohort-query cell
Sun–Abraham: preferred when staggered timing is mild and you want a single aggregate estimate with minimal implementation overhead
Standard TWFE: acceptable only when treatment timing is simultaneous or you've verified homogeneous treatment effects across cohorts

Interaction Effects: Practical Considerations

Defining Color and Query Groups

Color intent in queries can be operationalized in several ways:

Explicit color query: query contains a color token ("black dress", "ivory blouse")
Implicit color query: query doesn't mention color but the category is highly color-sensitive (e.g., "formal wear")
Color-agnostic query: category where color rarely drives purchase decision (e.g., "compression socks")
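
One way to operationalize these buckets is a simple rule-based tagger. The sketch below assumes a hand-maintained color vocabulary and a list of color-sensitive categories; both sets, and the function name `classify_query`, are hypothetical placeholders rather than a recommended taxonomy.

```python
# Hypothetical vocabularies; in practice these would come from catalog attributes.
COLOR_TOKENS = {"black", "white", "ivory", "red", "navy", "beige", "chartreuse"}
COLOR_SENSITIVE_CATEGORIES = {"formal wear", "dress", "blouse"}

def classify_query(query: str, category: str) -> str:
    """Assign a query to a color-intent bucket using token and category lookups."""
    tokens = set(query.lower().split())
    if tokens & COLOR_TOKENS:
        return "explicit_color"
    if category.lower() in COLOR_SENSITIVE_CATEGORIES:
        return "implicit_color"
    return "color_agnostic"

for query, category in [("black dress", "dress"),
                        ("formal wear", "formal wear"),
                        ("compression socks", "socks")]:
    print(query, "->", classify_query(query, category))
```

Whitespace tokenization misses multi-word colors and misspellings, so a production version would likely need a proper query parser or a learned classifier.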

The effect you estimate will differ substantially across these groups, so pooling them into a single TWFE coefficient β obscures the true heterogeneity.

Handling Sparse Cells

Product color × query type interactions can produce very sparse cells. For example, "chartreuse" products matched to "unique color top" queries may have only a handful of impressions per day. When cells are sparse, each interaction coefficient is identified from very few observations once the fixed effects are absorbed, and its variance inflates sharply. Consider:

Collapsing rare colors into an "other" category
Aggregating to daily or weekly item-query-level panels instead of session-level
Using partial pooling (hierarchical models) instead of full fixed effects for color groups
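
The first option can be sketched in a few lines. The threshold of 1,000 impressions and the function name `collapse_rare_colors` are assumptions for illustration; the right cutoff depends on your traffic.

```python
from collections import Counter

def collapse_rare_colors(colors, impressions, min_impressions=1000):
    """Relabel colors whose total impressions fall below the threshold as 'other'."""
    totals = Counter()
    for color, n in zip(colors, impressions):
        totals[color] += n
    return [c if totals[c] >= min_impressions else "other" for c in colors]

colors = ["black", "black", "chartreuse", "navy"]
impressions = [900, 400, 12, 1500]
print(collapse_rare_colors(colors, impressions))
```

Totals are computed per color across all rows, so a color stays intact as long as its combined traffic clears the threshold, even if individual rows are small.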

Clustering and Standard Errors

In this panel setting, errors are likely correlated within item (same product across time) and potentially within query (same search term across time). A standard approach is to cluster at the item level. If query-level correlation is also a concern, two-way clustering at both item and query simultaneously is more conservative but appropriate.
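
To see why clustering matters, the sketch below computes one-way cluster-robust (sandwich) standard errors by hand and compares them with naive OLS standard errors. It assumes a pooled regression without fixed effects, a treatment that is constant within item, and no small-sample correction; the data are simulated.

```python
import numpy as np

def cluster_robust_se(X, resid, clusters):
    """Sandwich standard errors clustered on `clusters` (no small-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(clusters):
        rows = clusters == c
        score = X[rows].T @ resid[rows]   # sum of x_i * u_i within the cluster
        meat += np.outer(score, score)
    return np.sqrt(np.diag(bread @ meat @ bread))

rng = np.random.default_rng(4)
n_items, n_periods = 40, 20
item = np.repeat(np.arange(n_items), n_periods)
treat = (item % 2 == 0).astype(float)   # treatment fixed within item
item_shock = rng.normal(size=n_items)[item]   # persistent item-level error component
y = 0.5 * treat + item_shock + rng.normal(scale=0.5, size=item.size)

X = np.column_stack([np.ones(item.size), treat])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

naive_se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * resid.var(ddof=X.shape[1]))
clustered_se = cluster_robust_se(X, resid, item)
print(f"naive SE = {naive_se[1]:.3f}, item-clustered SE = {clustered_se[1]:.3f}")
```

With a strong item-level error component, the naive SE treats 800 rows as independent while the clustered SE effectively counts 40 items, so the clustered value is several times larger.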

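Cluster-robust SEs rely on the number of clusters being large. With many items, item-level clustering is usually fine even when the panel is short. If the clustering dimension has only a few groups (for example, a handful of query buckets or rollout cohorts), standard asymptotic cluster-robust SEs may be poorly calibrated; consider a wild cluster bootstrap as a finite-sample correction.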
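
A wild cluster bootstrap can be sketched as follows. This version imposes the null hypothesis, uses Rademacher weights drawn once per cluster, and bootstraps the cluster-robust t-statistic; the simulated data (eight clusters) and the function names are assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cluster_t(X, y, clusters, col):
    """t-statistic for coefficient `col` using one-way cluster-robust variance."""
    beta = ols(X, y)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(clusters):
        rows = clusters == c
        score = X[rows].T @ resid[rows]
        meat += np.outer(score, score)
    se = np.sqrt((bread @ meat @ bread)[col, col])
    return beta[col] / se

def wild_cluster_pvalue(X, y, clusters, col, n_boot=999, seed=0):
    """Two-sided bootstrap p-value for H0: beta[col] = 0."""
    rng = np.random.default_rng(seed)
    t_obs = cluster_t(X, y, clusters, col)
    X_r = np.delete(X, col, axis=1)   # restricted model imposes beta[col] = 0
    fitted_r = X_r @ ols(X_r, y)
    resid_r = y - fitted_r
    ids, idx = np.unique(clusters, return_inverse=True)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        flips = rng.choice([-1.0, 1.0], size=ids.size)[idx]   # one sign per cluster
        t_boot[b] = cluster_t(X, fitted_r + flips * resid_r, clusters, col)
    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))

rng = np.random.default_rng(5)
n_clusters, n_per = 8, 25
cluster = np.repeat(np.arange(n_clusters), n_per)
treat = (cluster < n_clusters // 2).astype(float)
y = (3.0 * treat + rng.normal(scale=0.3, size=n_clusters)[cluster]
     + rng.normal(scale=0.5, size=cluster.size))
X = np.column_stack([np.ones(cluster.size), treat])

p_value = wild_cluster_pvalue(X, y, cluster, col=1)
print(f"wild cluster bootstrap p-value: {p_value:.3f}")
```

With only eight clusters there are just 256 distinct sign patterns, so the p-value is coarse; with very few clusters, other weight distributions are sometimes suggested.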

Putting It Together: A Practical Workflow

Step 1: Classify queries into color-intent buckets. Tag items by primary color attribute.
Step 2: Construct an item × query × time panel with impression, click, and conversion counts.
Step 3: Run an event study using standard TWFE to check pre-trends. Flag if pre-periods show drift.
Step 4: If treatment timing is staggered, switch to Sun–Abraham or Callaway–Sant'Anna. Estimate cohort-specific ATTs per (color group, query type) cell.
Step 5: Aggregate to an overall interaction ATT with appropriate weights. Report with two-way clustered SEs or wild bootstrap CIs.
Step 6: Sanity check: does the direction and magnitude of the color × query interaction make business sense? A large positive effect for explicit-color queries and near-zero for color-agnostic queries is what you'd expect.

Conclusion

TWFE is a powerful tool for estimating causal effects in panel data, but it demands care when treatment timing is staggered and when the estimand involves interaction effects. In the product color × search query setting, the biggest risks are cohort contamination from staggered rollout and inflated variance from sparse interaction cells. Pairing TWFE with a robust DiD estimator (CS-DiD or Sun–Abraham) and running pre-trend diagnostics first will keep your estimates interpretable and defensible.

In practice, even a well-specified TWFE is just the starting point — layering in query-level controls (e.g., query volume, session position) and product-level covariates (price tier, inventory level) will further sharpen your estimates and reduce omitted variable bias in observational settings.