Machine-learning-debiased efficient estimation under generalized data-fusion alignments
Sourced from the work of Ellen Sandra Graham, Marco Carone, Andrea Rotnitzky
§ Problem Statement
Setup
Let $Z$ be a latent/full-data random element with law $P_0$ on a measurable space $(\mathcal{Z}, \mathcal{A})$, where $P_0$ belongs to a semiparametric model $\mathcal{M}$. The estimand is a finite-dimensional parameter $\psi_0 := \Psi(P_0) \in \mathbb{R}^d$, with $\Psi$ pathwise differentiable at the true law $P_0$.
This setup follows Graham et al. (2024).
Data are collected from $K$ independent sources. For source $k \in \{1, \dots, K\}$, we observe $n_k$ i.i.d. data $O_{k,1}, \dots, O_{k,n_k}$ in $\mathcal{O}_k$ from an observed-data law $Q_{k,0}$, with $n := \sum_{k=1}^K n_k$ and $n_k/n \to \rho_k \in (0,1)$. Each $Q_{k,0}$ is induced by $(P_0, \eta_{k,0})$, where $\eta_{k,0}$ denotes source-specific nuisance features (e.g., observation/coarsening mechanisms and other non-target components). Assume the sources satisfy a prespecified set of alignment restrictions consisting of: (i) conditional alignment constraints equating selected conditional distributions under $Q_{k,0}$ to the corresponding conditionals of $P_0$, and (ii) marginal alignment constraints equating selected marginals under $Q_{k,0}$ to the corresponding marginals of $P_0$. These alignments may come from different factorizations of $P_0$ (not necessarily one common factorization), and together they identify $\psi_0$.
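For concreteness, one minimal instance of such alignments (a hypothetical two-source example in our own notation, not drawn from the paper) has one source informing the conditional law of an outcome and the other the marginal law of the covariates:

```latex
% Hypothetical example: full data Z = (X, Y), estimand \psi_0 = E_{P_0}[Y].
\begin{align*}
&\text{Source 1 observes } O_1 = (X, Y),
  && \text{conditional alignment: } Q_{1,0}(Y \mid X) = P_0(Y \mid X); \\
&\text{Source 2 observes } O_2 = X,
  && \text{marginal alignment: } Q_{2,0}(X) = P_0(X); \\
&\text{identification: }
  \psi_0 = \mathbb{E}_{Q_{2,0}}\!\bigl[\,\mathbb{E}_{Q_{1,0}}(Y \mid X)\,\bigr].
\end{align*}
```

Here the two constraints come from the factorization $P_0(X, Y) = P_0(X)\, P_0(Y \mid X)$, with each source pinned to a different factor.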
Let $\mathcal{Q}$ be the observed-data model induced by all $(P, \eta_1, \dots, \eta_K)$ with $P \in \mathcal{M}$ satisfying the alignments. For $k = 1, \dots, K$, let $\varphi_{k,0}$ denote the source-indexed efficient influence function contribution for source $k$ at the truth, with $\mathbb{E}_{Q_{k,0}}[\varphi_{k,0}(O_k)] = 0$, and define
$$\Sigma_0 := \sum_{k=1}^K \rho_k \, \mathrm{Var}_{Q_{k,0}}\{\varphi_{k,0}(O_k)\}.$$
(Equivalent pooled-data notation uses $O = (S, O_S)$ with source indicator $S \in \{1, \dots, K\}$, $\Pr(S = k) = \rho_k$, and EIF $\varphi_0(O) = \varphi_{S,0}(O_S)$, so that $\Sigma_0 = \mathrm{Var}\{\varphi_0(O)\}$.)
Unsolved Problem
Develop a general machine-learning-based estimator $\hat{\psi}_n$ (e.g., cross-fitted one-step/TMLE using flexible nuisance estimators) and general, verifiable high-level conditions under which
$$\hat{\psi}_n - \psi_0 = \frac{1}{n} \sum_{i=1}^n \varphi_0(O_i) + o_P(n^{-1/2}),$$
so that $\hat{\psi}_n$ is regular asymptotically linear and $\sqrt{n}(\hat{\psi}_n - \psi_0)$ is asymptotically normal with covariance $\Sigma_0 = \mathrm{Var}\{\varphi_0(O)\}$ (and hence efficient when the expansion uses the canonical gradient). The desired theory should handle arbitrary $K$, arbitrary pathwise differentiable $\Psi$, and arbitrary generalized alignment structures, while requiring only weak empirical-process assumptions (ideally Donsker-free via sample splitting/cross-fitting) and nuisance-rate conditions sufficient to make second-order remainders $o_P(n^{-1/2})$.
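As a sketch of the kind of estimator sought, the following toy implementation is a cross-fitted one-step estimator for a hypothetical two-source design (assumptions ours, not the paper's: source 1 supplies $(X, Y)$ under the conditional alignment $Q_1(Y \mid X) = P(Y \mid X)$, source 2 supplies $X$ under the marginal alignment $Q_2(X) = P(X)$, and simple parametric nuisance fits stand in for flexible machine learning):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data-fusion setup: psi_0 = E_{Q2}[ E_{Q1}(Y | X) ].
# Source 1: (X, Y) with a covariate-shifted X law but the shared conditional.
# Source 2: X drawn from the target population.
n1, n2 = 2000, 2000
X1 = rng.normal(1.0, 1.0, n1)            # source-1 covariates (shifted law)
Y1 = 2.0 * X1 + rng.normal(0, 1, n1)     # outcome from the aligned conditional
X2 = rng.normal(0.0, 1.0, n2)            # target-population covariates

def cross_fit_one_step(X1, Y1, X2, n_folds=5):
    """Cross-fitted one-step estimator of psi = E_{Q2}[ E_{Q1}(Y | X) ].

    Per fold: fit nuisances on training folds, then evaluate the plug-in
    plus the influence-function correction on the held-out folds, where the
    correction reweights source-1 residuals by w(x) = q2(x) / q1(x).
    """
    n1, n2 = len(X1), len(X2)
    folds1 = np.array_split(rng.permutation(n1), n_folds)
    folds2 = np.array_split(rng.permutation(n2), n_folds)
    contrib = []
    for f1, f2 in zip(folds1, folds2):
        tr1 = np.setdiff1d(np.arange(n1), f1)
        tr2 = np.setdiff1d(np.arange(n2), f2)
        # Nuisance 1: outcome regression mu(x) = E(Y | X = x), fit on source 1.
        b1, b0 = np.polyfit(X1[tr1], Y1[tr1], 1)
        mu = lambda x: b0 + b1 * x
        # Nuisance 2: density ratio w(x) = q2(x) / q1(x) via Gaussian fits.
        m1, s1 = X1[tr1].mean(), X1[tr1].std()
        m2, s2 = X2[tr2].mean(), X2[tr2].std()
        w = lambda x: (s1 / s2) * np.exp(-(x - m2) ** 2 / (2 * s2 ** 2)
                                         + (x - m1) ** 2 / (2 * s1 ** 2))
        # Plug-in term on source 2 plus EIF correction on source 1,
        # both evaluated on held-out folds only.
        psi_plug = mu(X2[f2]).mean()
        correction = (w(X1[f1]) * (Y1[f1] - mu(X1[f1]))).mean()
        contrib.append(psi_plug + correction)
    return float(np.mean(contrib))

psi_hat = cross_fit_one_step(X1, Y1, X2)
print(psi_hat)  # should lie near the truth E_{Q2}[2X] = 0
```

The template — cross-fit nuisances, evaluate plug-in plus EIF correction on held-out folds, average — is exactly the scheme for which the open problem asks general, verifiable conditions covering arbitrary alignment structures rather than this one hand-picked design.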
See Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data (https://arxiv.org/abs/2409.09973) for the influence-function characterization that motivates this question.
§ Discussion
§ Significance & Implications
Direct textual support: the abstract of Graham et al. (2024) says the framework "paves the way" for machine-learning-debiased estimation and mentions challenges for efficient inference. Inference: this indicates a remaining methodological step from influence-function characterization to broadly applicable ML estimators with full efficiency guarantees.
§ Known Partial Results
Graham et al. (2024): Direct textual support: the paper develops a general characterization of regular asymptotically linear influence functions and an efficiency characterization under generalized alignments. Inference: these results provide key ingredients for estimator construction, but a fully general cross-fitted ML estimation theory with verifiable high-level conditions and efficiency guarantees across arbitrary alignment structures is not established there.
§ References
Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data
Ellen Sandra Graham, Marco Carone, Andrea Rotnitzky (2024)
arXiv preprint
📍 Discussion portion of the arXiv v3 manuscript (section numbering/pagination may vary across versions), where the authors state the work "paves the way" for ML-debiased estimation and discuss remaining challenges for efficient inference.
Primary source motivating this problem.