Skip to content

Original data set cannot be safely accessed when the estimation is performed in a loop #340

@lrberge

Description

@lrberge

Example

library(fixest)

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    res_all[[i]] = feols(y ~ x1 + x2, base[base$species == s, ])
}

# Have a look at the mean of the dependent variable:
etable(res_all, fitstat = "my")
#>                            model 1            model 2            model 3
#> Dependent Var.:                  y                  y                  y
#>                                                                         
#> (Intercept)      2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1              0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                 0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Dep. Var. mean              6.5880             6.5880             6.5880

What happens

In fixest many computations are delayed, occurring after the estimation and only at the user's request.
This is true for computing the standard-errors (when clustering w.r.t. a variable not used in the estimation) and for several fit statistics.

For post-computation to work properly, the original data, the one used for the estimation, needs to be accessed.
Of course, one solution would be to store the original data in the estimation object. This would be safe but would come at an exorbitant memory price.
Currently, the data is accessed by using the same data call as in the estimation: it is just an access to a data set currently stored in memory. Following the example, base[base$species == s, ] is evaluated to get the data.

In general, this is fine. Now comes the loop problem. When information have to be computed after the loop, like in the example the mean of the dependent variable, for all three models the data is accessed with base[base$species == s, ], leading to erroneously use the same data set. Hence leading to wrong results.

Solutions

VCOV

This problem affects the VCOV if clustering (or any other VCOV using an extra variable) has to be performed ex-post.
The solution is then to use the argument vcov at estimation time.

Note that you will not be able to navigate through different VCOVs (other than standard and heteroskedasticity-robust) ex post. If needed, you will have to store the object with the different VCOVs within the loop, while the right data is accessible.

fit statistics

Currently there is no way to store the fit statistics at estimation time.
There is a major overhaul of the fit statistics mechanism under way. Once the new fit-stats are implemented, that will be possible and easy (basically just using fitstat = TRUE will do).

In the short run, if you want to use data-dependent fit-stats ex post, you need to do it manually. FWIW, here's an example:

# A) create a custom fit stat
my_meandep = function(x){
    # x: fixest object
    if("mean_dep" %in% names(x)) x$mean_dep else NA
}

# B) register it
fitstat_register("meandep", my_meandep, "Mean of the Dep. Var.")

# C) within the loop, save the requested information 

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    
    # C') => we create the slot mean_dep
    mod = feols(y ~ x1 + x2, base[base$species == s, ])
    mod$mean_dep = fitstat(mod, "my", simplify = TRUE)
    res_all[[i]] = mod
}

# D) use it
etable(res_all, fitstat = ~my + meandep)
#>                                  model 1            model 2            model 3
#> Dependent Var.:                        y                  y                  y
#>                                                                               
#> (Intercept)            2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1                    0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                       0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _____________________ __________________ __________________ __________________
#> S.E. type                            IID                IID                IID
#> Dep. Var. mean                    6.5880             6.5880             6.5880
#> Mean of the Dep. Var.             5.0060             5.9360             6.5880

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions