What Are Random Effects and Fixed Effects?

When designing a study, we often aim to isolate the independent variables from factors of no interest so that we can observe their true effects on the dependent variables. For example, let’s say we would like to study the effect of using GitHub Copilot (independent variable) on developer productivity (dependent variable). One approach is to measure how much time developers spend using Copilot and how quickly they complete coding tasks. At first glance, we may observe a strong positive correlation: more Copilot usage, faster task completion.

However, other factors can also influence how quickly developers finish their work. For example, Company A might have faster CI/CD pipelines or deal with smaller, simpler tasks, while Company B may require lengthy code reviews or handle more complex and time-consuming tasks. If we don’t account for these organizational differences, we might mistakenly conclude that Copilot is less effective for developers in Company B, when in fact it’s the environment, not Copilot, that slows them down.

These kinds of group-level variations, such as differences across teams, companies, or projects, are typically modeled as fixed effects or random effects.

Fixed effects are variables of interest, where each group is treated as its own category using dummy (one-hot) coding. Each group receives its own coefficient, and because the group-level mean differences are absorbed by the dummy variables, we assume the remaining variance is similar across groups, i.e., homoscedastic.

\[y_i = \beta_0 + \beta_1 x_i + \gamma_1 D_{1i} + \gamma_2 D_{2i} + \cdots + \varepsilon_i\]

where \(D_{1i}, D_{2i}, \dots\) are dummy variables indicating whether observation \(i\) belongs to group 1, group 2, and so on, and \(\gamma_1, \gamma_2, \dots\) are the fixed-effect coefficients for the corresponding groups.
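
To make this concrete, here is a minimal sketch in R using a hypothetical toy dataset (the names df_toy, x, group, and y are made up for illustration). R’s formula interface performs the dummy coding automatically when group is a factor:

# Hypothetical toy data: numeric predictor x, three groups with different baselines
set.seed(1)
df_toy <- data.frame(
  x     = rnorm(60),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
df_toy$y <- 2 + 0.5 * df_toy$x +
  c(A = 0, B = 1, C = -1)[as.character(df_toy$group)] +  # true group offsets
  rnorm(60)

# Fixed effects: each group gets its own dummy coefficient (gamma_1, gamma_2, ...)
fixed_fit <- lm(y ~ x + group, data = df_toy)
summary(fixed_fit)  # coefficients groupB and groupC are the fixed group effects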

Random effects, on the other hand, are typically not variables of interest. We assume each group is drawn from a broader population, so each group effect lies somewhere within the probability distribution of that population. As such, heterogeneity across groups is built into the model.

\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij} \]

where \(u_j\) is the random effect of group \(j\) for observation \(i\), drawn from a distribution, typically a normal distribution \(\mathcal{N}(0, \sigma_u^2)\).
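
In R, the lme4 package fits this kind of model; the term (1 | group) asks for one intercept deviation \(u_j\) per group, drawn from a shared normal distribution. A sketch, reusing the hypothetical df_toy from above:

library(lme4)

# Random intercept per group: u_j ~ N(0, sigma_u^2)
random_fit <- lmer(y ~ x + (1 | group), data = df_toy)
summary(random_fit)  # the "Random effects" block reports the estimated variance of u_j
ranef(random_fit)    # predicted deviations u_j for each group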

Think Carefully Before Adding Fixed or Random Effects

Simply inserting these effects into your model without thinking carefully about what kinds of variation they actually capture can mislead your analysis.

I recently worked on a project analyzing the environmental impacts of AI models, in which I studied how certain architectural features (number of parameters, amount of compute, dataset size, and training time) and hardware choices (hardware type and quantity) affect energy use during training. I found that Training_time, Hardware_quantity, and Hardware_type significantly affected energy usage. The relationship can be roughly modeled as:

\[ \text{energy} = \text{Training\_time} + \text{Hardware\_quantity} + \text{Hardware\_type} \]

Since I suspected there might be differences between organizations, for example in coding style, code structure, or algorithm preferences, I believed that including Organization as a random effect would help account for these unobserved potential differences. To test this assumption, I compared the results of two models, one with and one without Organization, to see which was a better fit. In both models, the dependent variable Energy was extremely right-skewed, so I applied a log transformation to stabilize its variance. I used a generalized linear model (GLM) here because the distribution of my data was not normal.
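
The transformation step itself is not shown in the snippets below; a one-line sketch, assuming the raw energy column is named Energy (the actual column name in my dataset may differ):

# Log-transform the right-skewed dependent variable (column name illustrative)
df$log_Energy <- log(df$Energy)
hist(df$log_Energy)  # quick visual check that the skew is reduced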

# Baseline GLM without group effects (gaussian family by default)
glm <- glm(
  log_Energy ~ Training_time_hour +
               Hardware_quantity +
               Training_hardware,
  data = df)
summary(glm)

# Mixed model with a random intercept per Organization (requires lme4)
library(lme4)
glm_random_effects <- glmer(
  log_Energy ~ Training_time_hour +
               Hardware_quantity +
               Training_hardware +
               (1 | Organization),  # Random effect
  data = df)
summary(glm_random_effects)
AIC(glm_random_effects)
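
One caveat worth flagging: calling glmer() without a family argument falls back to a Gaussian linear mixed model fit by REML (the summary below confirms this, reporting an lmerMod fit), and a REML-based AIC is not strictly comparable to the ML-based AIC of glm(). A safer comparison refits the mixed model by maximum likelihood; a sketch:

# Refit by maximum likelihood so the AICs are on the same scale
lmm_ml <- lmer(
  log_Energy ~ Training_time_hour +
               Hardware_quantity +
               Training_hardware +
               (1 | Organization),
  data = df, REML = FALSE)
AIC(glm, lmm_ml)  # side-by-side comparison of both models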

The GLM without Organization produced an AIC of 312.55, with Training_time, Hardware_quantity, and certain hardware types statistically significant.

> summary(glm)

Call:
glm(formula = log_Energy ~ Training_time_hour + Hardware_quantity + 
    Training_hardware, data = df)

Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                     7.134e+00  1.393e+00   5.123 5.07e-06 ***
Training_time_hour                              1.509e-03  2.548e-04   5.922 3.08e-07 ***
Hardware_quantity                               3.674e-04  9.957e-05   3.690 0.000563 ***
Training_hardwareGoogle TPU v3                  1.887e+00  1.508e+00   1.251 0.216956    
Training_hardwareGoogle TPU v4                  3.270e+00  1.591e+00   2.055 0.045247 *  
Training_hardwareHuawei Ascend 910              2.702e+00  2.485e+00   1.087 0.282287    
Training_hardwareNVIDIA A100                    2.528e+00  1.511e+00   1.674 0.100562    
Training_hardwareNVIDIA A100 SXM4 40 GB         3.103e+00  1.750e+00   1.773 0.082409 .  
Training_hardwareNVIDIA A100 SXM4 80 GB         3.866e+00  1.745e+00   2.216 0.031366 *  
Training_hardwareNVIDIA GeForce GTX 285        -4.077e+00  2.412e+00  -1.690 0.097336 .  
Training_hardwareNVIDIA GeForce GTX TITAN X    -9.706e-01  1.969e+00  -0.493 0.624318    
Training_hardwareNVIDIA GTX Titan Black        -8.423e-01  2.415e+00  -0.349 0.728781    
Training_hardwareNVIDIA H100 SXM5 80GB          3.600e+00  1.864e+00   1.931 0.059248 .  
Training_hardwareNVIDIA P100                   -1.663e+00  1.899e+00  -0.876 0.385436    
Training_hardwareNVIDIA Quadro P600            -1.970e+00  2.419e+00  -0.814 0.419398    
Training_hardwareNVIDIA Quadro RTX 4000        -1.367e+00  2.424e+00  -0.564 0.575293    
Training_hardwareNVIDIA Quadro RTX 5000        -2.309e+00  2.418e+00  -0.955 0.344354    
Training_hardwareNVIDIA Tesla K80               1.761e+00  1.988e+00   0.886 0.380116    
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB   3.415e+00  1.833e+00   1.863 0.068501 .  
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB  3.698e+00  2.413e+00   1.532 0.131852    
Training_hardwareNVIDIA V100                   -3.638e-01  1.582e+00  -0.230 0.819087    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 3.877685)

    Null deviance: 901.45  on 69  degrees of freedom
Residual deviance: 190.01  on 49  degrees of freedom
AIC: 312.55

Number of Fisher Scoring iterations: 2

On the other hand, the model with Organization produced an AIC of 300.38, much lower than the previous model, indicating a better fit. However, when I took a closer look, I noticed a significant issue: the statistical significance of the other variables had gone away, as if Organization had taken it from them!

> summary(glm_random_effects)
Linear mixed model fit by REML ['lmerMod']
Formula: log_Energy ~ Training_time_hour + Hardware_quantity + Training_hardware +  
    (1 | Organization)
   Data: df

REML criterion at convergence: 254.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.65549 -0.24100  0.01125  0.26555  1.51828 

Random effects:
 Groups       Name        Variance Std.Dev.
 Organization (Intercept) 3.775    1.943   
 Residual                 1.118    1.057   
Number of obs: 70, groups:  Organization, 44

Fixed effects:
                                                 Estimate Std. Error t value
(Intercept)                                     6.132e+00  1.170e+00   5.243
Training_time_hour                              1.354e-03  2.111e-04   6.411
Hardware_quantity                               3.477e-04  7.035e-05   4.942
Training_hardwareGoogle TPU v3                  2.949e+00  1.069e+00   2.758
Training_hardwareGoogle TPU v4                  2.863e+00  1.081e+00   2.648
Training_hardwareHuawei Ascend 910              4.086e+00  2.534e+00   1.613
Training_hardwareNVIDIA A100                    3.959e+00  1.299e+00   3.047
Training_hardwareNVIDIA A100 SXM4 40 GB         3.728e+00  1.551e+00   2.404
Training_hardwareNVIDIA A100 SXM4 80 GB         4.950e+00  1.478e+00   3.349
Training_hardwareNVIDIA GeForce GTX 285        -3.068e+00  2.502e+00  -1.226
Training_hardwareNVIDIA GeForce GTX TITAN X     4.503e-02  1.952e+00   0.023
Training_hardwareNVIDIA GTX Titan Black         2.375e-01  2.500e+00   0.095
Training_hardwareNVIDIA H100 SXM5 80GB          4.197e+00  1.552e+00   2.704
Training_hardwareNVIDIA P100                   -1.132e+00  1.512e+00  -0.749
Training_hardwareNVIDIA Quadro P600            -1.351e+00  1.904e+00  -0.710
Training_hardwareNVIDIA Quadro RTX 4000        -2.167e-01  2.503e+00  -0.087
Training_hardwareNVIDIA Quadro RTX 5000        -1.203e+00  2.501e+00  -0.481
Training_hardwareNVIDIA Tesla K80               1.559e+00  1.445e+00   1.079
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB   3.751e+00  1.536e+00   2.443
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB  3.487e+00  1.761e+00   1.980
Training_hardwareNVIDIA V100                    7.019e-01  1.434e+00   0.489

Correlation matrix not shown by default, as p = 21 > 12.
Use print(x, correlation=TRUE)  or
    vcov(x)        if you need it

fit warnings:
Some predictor variables are on very different scales: consider rescaling
> AIC(glm_random_effects)
[1] 300.3767
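
As an aside, the fit warning about predictors on very different scales is worth acting on; a minimal sketch of standardizing the continuous predictors before refitting (the _z names are my own):

# Standardize the continuous predictors flagged by the fit warning;
# the model is equivalent, but the optimizer is better conditioned
df$Training_time_z <- as.numeric(scale(df$Training_time_hour))
df$Hardware_qty_z  <- as.numeric(scale(df$Hardware_quantity))
lmm_rescaled <- lmer(
  log_Energy ~ Training_time_z + Hardware_qty_z +
               Training_hardware + (1 | Organization),
  data = df)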

Thinking it over carefully, this made a lot of sense. Certain organizations may consistently prefer specific types of hardware, and larger organizations may be able to afford more expensive hardware and the resources to train bigger AI models. In other words, the random effects likely overlapped with the variation in our independent variables and absorbed a large portion of what we were trying to study.
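
One quick way to check for this kind of overlap is to look at how strongly Organization aligns with the predictors, for instance whether each organization sticks to a single hardware type; a sketch, assuming the same df:

# For each organization, what share of its models use its most common hardware?
# Values near 1 mean Organization and Training_hardware overlap heavily.
hw_by_org <- table(df$Organization, df$Training_hardware)
summary(apply(prop.table(hw_by_org, 1), 1, max))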

This highlights an important point: while random and fixed effects are useful tools for controlling unwanted group-level differences, they can also unintentionally capture the underlying variation of our independent variables. We should carefully consider what these effects truly represent before blindly introducing them into our models and hoping they will absorb all the noise.


