adding weights to regressions with up to 2 FE dimensions #12
jamesbrandecon wants to merge 5 commits into main
Conversation
@grantmcdermott if you can, it'd help to confirm that weighted regs match feols on your end. Next task is to figure out whether I can refactor the dbplyr stuff to avoid writing temp tables, which I think will be hard but probably necessary. This is also the second step of #11, at least for TWFE; HDFE will take more work.
Awesome. I think the next step is to set up a test suite and CI so that we can iterate more systematically and securely. I've got a couple of things on today, but will try to do that when I get a sec. Then we can merge into this branch and add weighted tests too before diving into the code.
I agree! It was already unwieldy to make sure I hadn't broken anything. Happy to defer to you on test structure/content.
Tests and CI added. If you merge from main locally and then push, they should auto-propagate. (You can also add additional tests as part of this PR. They need to be saved as
Thanks! Any attachment to the current test structure? Locally, I'm simulating one small-ish dataset and then running regs with many different options and comparing to feols as the source of truth. So
No attachment, but it does need to follow the
Just a quick note: the CI will fail because you've got new undeclared dependencies
To clarify: just add (It doesn't need to be in
I'll try harder on testing later today -- sorry for the CI-breaking emails. Need to learn local testing in R!
No worries! I actually don't get the breaking emails (since it's your PR). The workflow is quite nice once you get used to it. I might add a Makefile when I get a chance, but two tips for local testing in the meantime:
Thanks! Now I get the flow. I hadn't realized we weren't importing dplyr yet -- is that something you want to avoid? I was relying heavily on dplyr/dbplyr and pipes to get alternating projections to work. |
I'd prefer not to add it as a dependency if we can avoid it. At the same time, I obviously don't want to make our lives harder than they need to be. Regardless, maybe it's a good idea to touch base over a call at this point? The weights functionality sounds very useful regardless of where we land on HDFE; I don't want to lose that through too much integration with the alternating projections functionality yet...

Pure SQL alternating projections

Speaking of which, here is my largely vibe-coded "pure SQL" implementation of alternating projections. Right now, this is just a standalone function that does the AP (using temp tables) and then calls
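For anyone skimming the thread, here is a minimal sketch of what the temp-table flavour of AP can look like. It's written against SQLite via Python's sqlite3 purely for illustration; all names (sql_demean, _demean, _means) are hypothetical, and the actual implementation in this repo is in R/SQL and may differ:

```python
import sqlite3

def sql_demean(conn, table, value_col, fe_cols, tol=1e-8, max_iter=100):
    """Alternating projections in SQL: repeatedly subtract group means
    for each fixed-effect column until the residual stops changing."""
    cur = conn.cursor()
    # Work on a temp copy so the source table is untouched.
    cur.execute("DROP TABLE IF EXISTS _demean")
    cur.execute(
        "CREATE TEMP TABLE _demean AS SELECT rowid AS rid, "
        + value_col + " AS v, " + ", ".join(fe_cols) + " FROM " + table
    )
    for _ in range(max_iter):
        before = cur.execute("SELECT SUM(v * v) FROM _demean").fetchone()[0]
        for fe in fe_cols:
            # Materialise this FE's group means in a temp table first,
            # so we never read from the table we are updating.
            cur.execute("DROP TABLE IF EXISTS _means")
            cur.execute(
                "CREATE TEMP TABLE _means AS SELECT " + fe
                + " AS g, AVG(v) AS gmean FROM _demean GROUP BY " + fe
            )
            cur.execute(
                "UPDATE _demean SET v = v - "
                "(SELECT gmean FROM _means WHERE _means.g = _demean." + fe + ")"
            )
        after = cur.execute("SELECT SUM(v * v) FROM _demean").fetchone()[0]
        if abs(before - after) < tol:
            break
    return cur.execute("SELECT rid, v FROM _demean ORDER BY rid").fetchall()
```

The temp-table round trip per FE per iteration is exactly the overhead the comments above are worried about; a dialect with `UPDATE ... FROM` or window functions could fold each pass into a single statement.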
Ok yeah let's talk live! We can email to schedule. Fwiw, three quick points:
@grantmcdermott for now, can you unblock me on adding dplyr as a dependency and let me build this stuff out (weights and AP)? Always able to go back and take other approaches if I make it modular.
126e86d to d1a1092 (Compare)
d1a1092 to f18f12b (Compare)
@grantmcdermott I am still making sure the speed is OK, but I would be interested in any comments you have here. I needed weights for a couple of the things I want to add next, so I went back and had Codex undo all the dplyr stuff. Hopefully that lets us finally merge this.
Thanks @jamesbrandecon. Realistically, I won't be able to review until after the weekend (at least). Quick question, though: what do you mean by "alternating projections for demean strategy" in the commits above? Is this full-blown, FWL-style iterative AP? Or something else?
@grantmcdermott If I am thinking correctly (I independently came to this conclusion in the previous version above, which I mostly discarded here), demeaning with num_FEs > 2 is best done via alternating projections. I've admittedly never looked deeply into AP algorithms, so maybe I'm missing something, but I just loop over each FE dimension and subtract out the group mean, repeating until convergence.
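That loop is the standard within-transformation sweep. An in-memory sketch of the same idea, in Python/numpy purely for illustration (the function name and arguments here are made up; fe_ids is a list of integer-coded group arrays, one per FE dimension):

```python
import numpy as np

def demean_ap(y, fe_ids, tol=1e-10, max_iter=1000):
    """Alternating projections: sweep over each fixed-effect dimension,
    subtracting group means, until the vector stops changing."""
    v = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        prev = v.copy()
        for ids in fe_ids:  # one pass per FE dimension
            # Group means via bincount, then subtract each obs's group mean.
            counts = np.bincount(ids)
            sums = np.bincount(ids, weights=v)
            v -= (sums / counts)[ids]
        if np.max(np.abs(v - prev)) < tol:
            break
    return v
```

With a single FE this converges in one pass (ordinary within-demeaning); with two or more FEs the outer loop is what does the work, and balanced designs converge almost immediately.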
Hi @jamesbrandecon,
I'm finally getting around to reviewing this. There are a lot of file changes (although most of these are cosmetic), and the main dbreg.R file has a lot of changes too (which makes me a bit nervous).
However, reading through, I'm realising that much of the main diff is inflated. Specifically, the rebase appears to have inlined several shared helpers (parse_vcov_args, setup_db_connection, parse_regression_formula) that previously (and still?) live in utils.R. Can we revert to calling the helpers and just add the new weight-validation logic on top? It would make the PR much easier to review and keep dbreg() / dbbinsreg() sharing the same code paths.
vcov = vcov_parsed$vcov_type
cluster = vcov_parsed$cluster_var
# Parse vcov: can be string or formula (for clustering)
Why replace the previous convenience function? This whole chunk used to be much shorter:
# Parse vcov/cluster arguments using shared helper
vcov_parsed = parse_vcov_args(vcov, cluster)
vcov = vcov_parsed$vcov_type
cluster = vcov_parsed$cluster_var
term_labels = fml_parsed$term_labels
has_interactions = fml_parsed$has_interactions
fe = fml_parsed$fe
own_conn = FALSE
Same issue for this chunk as the previous one: why inline all of the code instead of using the previous convenience function(s)? The old code was:
# Set up database connection and data source using shared helper
db_setup = setup_db_connection(conn, table, data, path)
conn = db_setup$conn
own_conn = db_setup$own_conn
from_statement = db_setup$from_statement
# Parse formula using shared helper
fml_parsed = parse_regression_formula(fml)
fml = fml_parsed$fml
yvar = fml_parsed$yvar
xvars = fml_parsed$xvars
term_labels = fml_parsed$term_labels
has_interactions = fml_parsed$has_interactions
fe = fml_parsed$fe
First step of #11
I am running local tests of every combination of no FEs, 1 FE, weights, and no weights, and the output matches feols exactly.
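For anyone wondering what the weighted case actually changes: only the projection step, where each plain group mean is replaced by a weighted group mean; that's what makes the demeaned WLS line up with feols-style observation weights. An illustrative single-pass sketch in Python/numpy (the name wdemean and its signature are hypothetical, not the package's API):

```python
import numpy as np

def wdemean(y, w, ids):
    """One weighted-demeaning pass: subtract the weighted group mean
    of y within each level of a single fixed effect."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    wsum = np.bincount(ids, weights=w)        # sum of weights per group
    wysum = np.bincount(ids, weights=w * y)   # weighted sum of y per group
    return y - (wysum / wsum)[ids]            # subtract weighted group mean
```

In SQL the same step is `AVG(v)` swapped for `SUM(w * v) / SUM(w)` in the group-by; the alternating-projections outer loop is unchanged.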