Unsolved

Machine-learning debiased efficient estimation under generalized data-fusion alignments

Sourced from the work of Ellen Sandra Graham, Marco Carone, Andrea Rotnitzky

§ Problem Statement

Setup

Let $X$ be a latent/full-data random element with law $P$ on a measurable space $(\mathcal X,\mathcal A)$, where $P$ belongs to a semiparametric model $\mathcal P$. The estimand is a finite-dimensional parameter $\psi(P)\in\mathbb R^d$, with $\psi:\mathcal P\to\mathbb R^d$ pathwise differentiable at the true law $P_0$.
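(For example, an illustrative case not specific to the source: if $X=Y$ is a real-valued outcome and $\mathcal P$ is nonparametric, the mean $\psi(P)=\mathbb E_P[Y]$ is pathwise differentiable with canonical gradient $y\mapsto y-\psi(P)$; the problem below concerns arbitrary such $\psi$.)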

This setup follows Graham et al. (2024).

Data are collected from $K\ge 2$ independent sources. For source $k$, we observe i.i.d. data $O^{(k)}_1,\dots,O^{(k)}_{n_k}$ in $(\mathcal O_k,\mathcal F_k)$ from an observed-data law $Q_k$, with $n=\sum_{k=1}^K n_k$ and $n_k/n\to\pi_k\in(0,1)$. Each $Q_k$ is induced by $(P,\eta_k)$, where $\eta_k$ denotes source-specific nuisance features (e.g., observation/coarsening mechanisms and other non-target components). Assume the sources satisfy a prespecified set of alignment restrictions consisting of: (i) conditional alignment constraints equating selected conditional distributions under $Q_k$ to corresponding conditionals of $P$, and (ii) marginal alignment constraints equating selected marginals under $Q_k$ to corresponding marginals of $P$. These alignments may come from different factorizations of $P$ (not necessarily one common factorization), and together they identify $\psi(P)$.
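As a concrete illustration (a hypothetical instance, not an example taken from the source): take $K=2$, full data $X=(Z,Y)$, and estimand $\psi(P)=\mathbb E_P[Y]$. Source 1 observes $O^{(1)}=(Z,Y)$ but with a covariate-shifted $Z$-marginal, contributing the conditional alignment $Q_1(Y\in\cdot\mid Z)=P(Y\in\cdot\mid Z)$; source 2 observes $O^{(2)}=Z$ only, contributing the marginal alignment $Q_2(Z\in\cdot)=P(Z\in\cdot)$. Then

$$\psi(P)=\mathbb E_P[Y]=\int \mathbb E_P[Y\mid Z=z]\,\mathrm dP(z)$$

is identified by combining the two alignments, even though neither source identifies it alone.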

Let $\mathcal Q$ be the observed-data model induced by all $(P,\eta_1,\dots,\eta_K)$ satisfying the alignments. For $\psi$, let $\varphi_{\mathrm{eff},k}\in L_2^0(Q_k)$ denote the source-indexed efficient influence function contribution for source $k$ at the truth, with $\mathbb E[\varphi_{\mathrm{eff},k}(O^{(k)})]=0$, and define

$$\Sigma_{\mathrm{eff}} := \sum_{k=1}^K \pi_k\,\mathbb E\!\left[\varphi_{\mathrm{eff},k}(O^{(k)})\,\varphi_{\mathrm{eff},k}(O^{(k)})^\top\right].$$

(Equivalent pooled-data notation uses $W=(S,O)$ with source indicator $S\in\{1,\dots,K\}$ and EIF $\phi_{\mathrm{eff}}(W)$.)
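A natural plug-in variance estimator, assuming estimates $\widehat\varphi_{\mathrm{eff},k}$ of the EIF contributions are available (an assumption made for illustration, not a construction given in the source), replaces $\pi_k$ by $n_k/n$ and the expectation by a sample average:

$$\widehat\Sigma_{\mathrm{eff}} = \sum_{k=1}^K \frac{n_k}{n}\cdot\frac{1}{n_k}\sum_{i=1}^{n_k}\widehat\varphi_{\mathrm{eff},k}(O_i^{(k)})\,\widehat\varphi_{\mathrm{eff},k}(O_i^{(k)})^\top,$$

which supports Wald-type intervals $\hat\psi_j \pm z_{1-\alpha/2}\sqrt{(\widehat\Sigma_{\mathrm{eff}})_{jj}/n}$ whenever the asymptotic expansion below holds.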

Unsolved Problem

Develop a general machine-learning-based estimator $\hat\psi$ (e.g., cross-fitted one-step/TMLE using flexible nuisance estimators) and general, verifiable high-level conditions under which

$$\sqrt n\,\{\hat\psi-\psi(P_0)\} = \sum_{k=1}^K \sqrt{\frac{n_k}{n}}\left\{\frac{1}{\sqrt{n_k}}\sum_{i=1}^{n_k}\varphi_{\mathrm{eff},k}\!\left(O_i^{(k)}\right)\right\}+o_p(1),$$

so that $\hat\psi$ is regular, asymptotically linear, and asymptotically normal with covariance $\Sigma_{\mathrm{eff}}$ (and hence efficient when the expansion uses the canonical gradient). The desired theory should handle arbitrary $K$, arbitrary pathwise differentiable $\psi$, and arbitrary generalized alignment structures, while requiring only weak empirical-process assumptions (ideally Donsker-free via sample splitting/cross-fitting) and nuisance-rate conditions sufficient to make second-order remainders $o_p(n^{-1/2})$.
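To fix ideas, here is a minimal sketch of the kind of cross-fitted one-step estimator the problem asks for. All interfaces are hypothetical assumptions: `fit_nuisances`, `plugin`, and `eif` stand in for user-supplied machine-learning nuisance fits, an initial plug-in estimate, and an evaluator of the estimated EIF contributions $\widehat\varphi_{\mathrm{eff},k}$; this is not a construction from Graham et al. (2024).

```python
import numpy as np

def cross_fitted_one_step(sources, fit_nuisances, plugin, eif,
                          n_folds=5, seed=0):
    """Cross-fitted one-step estimator sketch for K-source data fusion.

    sources       : list of K numpy arrays; sources[k] stacks the n_k
                    observations O_i^{(k)} from source k (rows = draws).
    fit_nuisances : callable(train_list) -> nuis; fits flexible ML nuisance
                    estimators on the K training subsets (hypothetical).
    plugin        : callable(nuis) -> (d,) array; initial plug-in estimate.
    eif           : callable(nuis, k, obs) -> (len(obs), d) array of
                    estimated source-k EIF contributions (hypothetical).
    """
    rng = np.random.default_rng(seed)
    K = len(sources)
    n = sum(len(O) for O in sources)
    # Split each source independently so every fold sees all K sources.
    folds = [rng.integers(0, n_folds, size=len(O)) for O in sources]

    estimates = []
    phi_heldout = [[] for _ in range(K)]
    for j in range(n_folds):
        train = [O[f != j] for O, f in zip(sources, folds)]
        nuis = fit_nuisances(train)
        psi0 = np.asarray(plugin(nuis), dtype=float)
        # One-step correction: held-out EIF means, weighted by n_k/n.
        corr = np.zeros_like(psi0)
        for k, (O, f) in enumerate(zip(sources, folds)):
            phi = np.asarray(eif(nuis, k, O[f == j]), dtype=float)
            phi_heldout[k].append(phi)
            corr += (len(O) / n) * phi.mean(axis=0)
        estimates.append(psi0 + corr)
    psi_hat = np.mean(estimates, axis=0)

    # Plug-in estimate of Sigma_eff in second-moment form.
    d = psi_hat.size
    Sigma = np.zeros((d, d))
    for k in range(K):
        phi_k = np.vstack(phi_heldout[k])  # all held-out EIF evaluations
        Sigma += (len(sources[k]) / n) * (phi_k.T @ phi_k) / len(phi_k)
    se = np.sqrt(np.diag(Sigma) / n)
    return psi_hat, se
```

Under the sought theory, the cross-fitting removes Donsker-type conditions, and the returned standard errors are valid provided the second-order remainder is $o_p(n^{-1/2})$; establishing this across arbitrary alignment structures is precisely the open question.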

See Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data (https://arxiv.org/abs/2409.09973) for the influence-function characterization that motivates this question.


§ Significance & Implications

Direct textual support: the abstract of Graham et al. (2024) says the framework "paves the way" for machine-learning-debiased estimation and mentions challenges for efficient inference. Inference: this indicates a remaining methodological step from influence-function characterization to broadly applicable ML estimators with full efficiency guarantees.

§ Known Partial Results

  • Graham et al. (2024): Direct textual support: the paper develops a general characterization of regular asymptotically linear influence functions and an efficiency characterization under generalized alignments. Inference: these results provide key ingredients for estimator construction, but a fully general cross-fitted ML estimation theory with verifiable high-level conditions and efficiency guarantees across arbitrary alignment structures is not established there.

§ References

[1] Ellen Sandra Graham, Marco Carone, Andrea Rotnitzky (2024). Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data. arXiv preprint, arXiv:2409.09973.

📍 Discussion portion of the arXiv v3 manuscript (section numbering/pagination may vary across versions), where the authors state the work "paves the way" for ML-debiased estimation and discuss remaining challenges for efficient inference. Primary source motivating this problem.
