A graph-based, estimator-agnostic cross-fitting engine for
semiparametric estimation (e.g. double/debiased machine learning) and
related meta-learners. crossfit makes the cross-fitting schedule
explicit and auditable, supports DAGs of nuisance learners, and is
well-suited for simulation studies and benchmarking grids.
The package lets you define:
- a target functional (e.g. ATE, risk, regression error),
- a graph of nuisance models (propensity scores, regressions, etc.),
- how many folds each node trains on (train_fold),
- how many folds the target evaluates on (eval_fold),
and then runs a cross-fitting schedule with configurable aggregation over panels and repetitions.
You can install the released version from CRAN or the development version from GitHub:
# Install the released version from CRAN
install.packages("crossfit")
# Install the development version from GitHub
install.packages("remotes")
remotes::install_github("EtiennePeyrot/crossfit-R", build_vignettes = TRUE)
Then load it as usual:
library(crossfit)
crossfit is designed for settings where:
- you care about a low-dimensional target (ATE, a coefficient, a risk, …),
- the target depends on high-dimensional nuisance functions estimated by ML.
The engine:
- enforces out-of-sample use of nuisances via K-fold cross-fitting,
- executes an explicit schedule over folds, panels and repetitions (useful for auditing and benchmarking),
- includes reuse-aware caching (avoiding redundant refits) and failure isolation for large experiment grids,
- supports an arbitrary DAG of nuisances (not just one or two),
- lets each node choose its own train_fold (how many folds it trains on),
- lets the target choose its eval_fold (how many folds it evaluates on),
- supports several fold allocation schemes: "independence", "overlap", "disjoint",
- has two modes:
  - mode = "estimate" → returns a numeric estimate of the target,
  - mode = "predict" → returns a cross-fitted prediction function.
Internally, the graph is normalized into a set of instances with structural signatures, so that identical models can share fits and be cached efficiently.
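To make "out-of-sample use of nuisances" concrete, here is a minimal base R sketch of plain K-fold cross-fitting, written without crossfit. It only illustrates the general idea and is not the package's internal schedule; the helper name cross_fit_mse and its arguments are hypothetical.

```r
# Hypothetical helper (not part of crossfit): plain K-fold cross-fitting
# of a regression nuisance, evaluated by out-of-fold mean squared error.
cross_fit_mse <- function(data, K = 4) {
  n <- nrow(data)
  # Random, balanced assignment of each row to one of K folds
  fold_id <- sample(rep(seq_len(K), length.out = n))
  panel_mse <- numeric(K)
  for (k in seq_len(K)) {
    train <- data[fold_id != k, , drop = FALSE]
    test  <- data[fold_id == k, , drop = FALSE]
    # The nuisance is fitted on the training folds only ...
    fit <- lm(y ~ x, data = train)
    # ... and predicted on the held-out fold, so predictions are out-of-sample
    m_hat <- as.numeric(predict(fit, newdata = test))
    panel_mse[k] <- mean((test$y - m_hat)^2)
  }
  # Aggregate the K panel-wise values into a single estimate
  list(panels = panel_mse, estimate = mean(panel_mse))
}

set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- toy$x + rnorm(200)
out <- cross_fit_mse(toy, K = 4)
out$estimate
```

Every row is predicted by a model that never saw it during fitting, which is the property the engine enforces for each nuisance node.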
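Here is a minimal example on a simple regression problem. We define a
nuisance $m(x) = E[Y \mid X = x]$, fitted by linear regression, and a target that computes the cross-fitted mean squared error of $m(x)$: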
library(crossfit)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)
# 1) Nuisance: regression m(x) = E[Y | X]
nuis_y <- create_nuisance(
fit = function(data, ...) lm(y ~ x, data = data),
predict = function(model, data, ...) {
as.numeric(predict(model, newdata = data))
}
)
# 2) Target: cross-fitted MSE of m(x)
target_mse <- function(data, nuis_y, ...) {
mean((data$y - nuis_y)^2)
}
# 3) Method: use 4 folds, 3 repetitions, DML-style "independence" allocation
method <- create_method(
target = target_mse,
list_nuisance = list(nuis_y = nuis_y),
folds = 4,
repeats = 3,
eval_fold = 1,
mode = "estimate",
fold_allocation = "independence",
aggregate_panels = mean_estimate,
aggregate_repeats = mean_estimate
)
res <- crossfit(data, method)
str(res$estimates)
res$estimates[[1]]
The crossfit() call:
- builds the nuisance / target graph,
- runs K-fold cross-fitting for repeats repetitions,
- aggregates over panels and repetitions using mean_estimate(),
- returns a list with:
  - estimates – one entry per method (here just one),
  - per_method – panel-wise and repetition-wise values and errors,
  - repeats_done – number of successful repetitions per method,
  - K, K_required, methods, plan – diagnostics and internals.
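The two aggregation levels can be pictured with plain base R. This sketch assumes panels are aggregated first within each repetition and repetitions second; the numbers and variable names are made up for illustration, and the package's own aggregators may operate on richer objects than bare numeric vectors.

```r
# Hypothetical panel-wise values: one numeric vector per repetition,
# one entry per panel within that repetition.
panel_values <- list(
  rep1 = c(1.02, 0.95, 1.10, 0.98),
  rep2 = c(0.99, 1.05, 0.97, 1.01),
  rep3 = c(1.08, 0.93, 1.00, 1.04)
)

# Step 1: aggregate over panels within each repetition
per_repeat <- vapply(panel_values, mean, numeric(1))

# Step 2: aggregate over repetitions
estimate_mean <- mean(per_repeat)
# Using the median at the second level gives a more outlier-robust summary
estimate_median <- median(per_repeat)

per_repeat
estimate_mean
estimate_median
```

Choosing a median instead of a mean at either level is the kind of decision that aggregate_panels and aggregate_repeats expose.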
You can run several methods in parallel, sharing some or all nuisances. For example, we can estimate both:
- the cross-fitted MSE of $m(x)$,
- the cross-fitted mean of $m(x)$,
in a single call:
target_mean <- function(data, nuis_y) {
mean(nuis_y)
}
m_mse <- create_method(
target = target_mse,
list_nuisance = list(nuis_y = nuis_y),
folds = 4,
repeats = 3,
eval_fold = 1,
mode = "estimate",
fold_allocation = "independence",
aggregate_panels = mean_estimate,
aggregate_repeats = mean_estimate
)
m_mean <- create_method(
target = target_mean,
list_nuisance = list(nuis_y = nuis_y),
folds = 4,
repeats = 3,
eval_fold = 1,
mode = "estimate",
fold_allocation = "overlap",
aggregate_panels = mean_estimate,
aggregate_repeats = mean_estimate
)
cf_multi <- crossfit_multi(
data = data,
methods = list(mse = m_mse, mean = m_mean),
aggregate_panels = mean_estimate,
aggregate_repeats = mean_estimate
)
cf_multi$estimates
The two methods share the fitted nuisance models whenever their structure and training folds coincide, which can save a lot of computation when you compare multiple learners or targets.
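The sharing idea can be sketched as a cache keyed by a signature. The version below is a simplified illustration in base R, not crossfit's internal mechanism: the helper fit_cached, the counter n_fits, and the key format (model formula plus sorted training fold ids) are all assumptions made for this example.

```r
# Hypothetical cache (not crossfit's internal one): fitted models are keyed
# by a signature built from the model formula and the training fold ids.
cache <- new.env()
n_fits <- 0

fit_cached <- function(formula, data, fold_id, train_folds) {
  key <- paste(paste(deparse(formula), collapse = ""),
               paste(sort(train_folds), collapse = ","),
               sep = " | ")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    n_fits <<- n_fits + 1  # count actual model fits
    train <- data[fold_id %in% train_folds, , drop = FALSE]
    assign(key, lm(formula, data = train), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100)
fold_id <- sample(rep(1:4, length.out = 100))

fit_a <- fit_cached(y ~ x, toy, fold_id, train_folds = c(1, 2, 3))  # fitted
fit_b <- fit_cached(y ~ x, toy, fold_id, train_folds = c(3, 2, 1))  # reused
fit_c <- fit_cached(y ~ x, toy, fold_id, train_folds = c(2, 3, 4))  # new fit
n_fits
```

Two requests with the same model and the same training folds trigger a single fit, while a different set of training folds forces a new one.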
In "predict" mode, the engine returns a prediction function
instead of a numeric estimate. This is useful if you want a cross-fitted
predictor you can re-use on new data.
Here we build a cross-fitted regression function:
library(crossfit)
set.seed(1)
# Toy nonlinear regression problem
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)
# Two simple nuisances: linear and quadratic regressions
nuis_lin <- create_nuisance(
fit = function(data, ...) lm(y ~ x, data = data),
predict = function(model, data, ...) {
as.numeric(predict(model, newdata = data))
},
train_fold = 2
)
nuis_quad <- create_nuisance(
fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
predict = function(model, data, ...) {
as.numeric(predict(model, newdata = data))
},
train_fold = 2
)
# Target in "predict" mode: ensemble of the two nuisances
target_ensemble <- function(data, m_lin, m_quad, ...) {
0.5 * m_lin + 0.5 * m_quad
}
method_ens <- create_method(
target = target_ensemble,
list_nuisance = list(m_lin = nuis_lin,
m_quad = nuis_quad),
folds = 4,
repeats = 3,
eval_fold = 0, # no eval window in predict mode
mode = "predict",
fold_allocation = "independence"
)
res <- crossfit_multi(
data = data,
methods = list(ensemble = method_ens),
aggregate_panels = mean_predictor,
aggregate_repeats = mean_predictor
)
# Cross-fitted ensemble predictor on new data
f_hat <- res$estimates$ensemble
newdata <- data.frame(x = seq(-2, 2, length.out = 5))
cbind(x = newdata$x, y_hat = f_hat(newdata))
Here:
- each repetition builds cross-fitted predictors,
- mean_predictor() aggregates the list of predictors into a single ensemble,
- f_hat(newdata) gives cross-fitted predictions on future data.
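Aggregating a list of prediction functions into one function can be sketched with a closure. The helper below, average_predictors, is hypothetical and is not the package's mean_predictor(); it only shows the general pattern of averaging predictions pointwise, using toy predictors in place of real cross-fitted ones.

```r
# Hypothetical helper (not the package's mean_predictor()): combine a list of
# prediction functions into one function that averages their predictions.
average_predictors <- function(predictors) {
  force(predictors)
  function(newdata) {
    # One column of predictions per predictor, then average row-wise
    preds <- vapply(predictors, function(f) as.numeric(f(newdata)),
                    numeric(nrow(newdata)))
    rowMeans(matrix(preds, nrow = nrow(newdata)))
  }
}

# Two toy predictors standing in for per-repetition cross-fitted predictors
f1 <- function(newdata) 1 + 2 * newdata$x
f2 <- function(newdata) 3 + 2 * newdata$x

f_bar <- average_predictors(list(f1, f2))
f_bar(data.frame(x = c(0, 1)))
```

The result is itself a function of newdata, so it can be stored and applied later, just like f_hat in the example above.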
- create_nuisance()
  Define a nuisance node via fit / predict, train_fold, and optional dependency mappings (fit_deps, pred_deps).
- create_method()
  Define a method:
  - target function,
  - nuisance list,
  - folds, repeats, mode ("estimate" or "predict"), eval_fold, fold_allocation,
  - optional aggregate_panels, aggregate_repeats.
- crossfit()
  Run cross-fitting for a single method.
- crossfit_multi()
  Run cross-fitting for several methods in parallel, with shared nuisances and shared K-fold splits.
- Aggregators:
  - mean_estimate(), median_estimate() – combine numeric panel / repetition results.
  - mean_predictor(), median_predictor() – combine lists of prediction functions when mode = "predict".
See:
?crossfit
?crossfit_multi
?create_method
?create_nuisance
You can find a more detailed introduction in the package vignette:
browseVignettes("crossfit")
# or directly:
vignette("crossfit-intro", package = "crossfit")
If you encounter a bug or have a feature request, please open an issue at: https://github.com/EtiennePeyrot/crossfit-R/issues.
crossfit is free software released under the GPL-3 license.