Document details
Identifier

oai:arXiv.org:2410.11227

Subject
Statistics - Machine Learning; Computer Science - Machine Learning; Electrical Engineering and Systems...
Author
Zhang, Thomas T.; Lee, Bruce D.; Ziemann, Ingvar; Pappas, George J.; Matni, Nikolai
Category

Computer Science

Year

2024

Date indexed

23/10/2024

Keywords
machine learning, source tasks, target task, risk, dependent data

Abstract

A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources.

However, many practical domains involve data that are non-identically distributed across sources and statistically dependent within each source, violating vital assumptions in existing theoretical studies.

Toward addressing these issues, we establish statistical guarantees for learning general $\textit{nonlinear}$ representations from multiple data sources that admit different input distributions and possibly dependent data.

Specifically, we study the sample complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal F \times \mathcal G$, where the $f_\star^{(t)}$ are task-specific linear functions and $g_\star$ is a shared nonlinear representation.

A representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$.

We show that when $N \gtrsim C_{\mathrm{dep}} (\mathrm{dim}(\mathcal F) + \mathrm{C}(\mathcal G)/T)$, the excess risk of $\hat f^{(0)} \circ \hat g$ on the target task decays as $\nu_{\mathrm{div}} \big(\frac{\mathrm{dim}(\mathcal F)}{N'} + \frac{\mathrm{C}(\mathcal G)}{N T} \big)$, where $C_{\mathrm{dep}}$ denotes the effect of data dependency, $\nu_{\mathrm{div}}$ denotes an (estimable) measure of $\textit{task-diversity}$ between the source and target tasks, and $\mathrm C(\mathcal G)$ denotes the complexity of the representation class $\mathcal G$.

In particular, our analysis reveals that as the number of tasks $T$ increases, both the sample requirement and the risk bound converge to those of $r$-dimensional regression as if $g_\star$ had been given, and that the effect of dependency enters only the sample requirement, leaving the risk bound matching the iid setting.
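The two-stage procedure described in the abstract (estimate a shared representation $\hat g$ on pooled source data, then fit a linear head $\hat f^{(0)}$ on target samples passed through $\hat g$) can be sketched as follows. Everything here is an illustrative assumption, not the paper's construction: the representation class is a small finite set of tanh feature maps, the dimensions and noise level are arbitrary, and the ERM step simply enumerates candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 10, 3                 # input and representation dimensions (illustrative)
T, N, N_prime = 5, 200, 50   # source tasks, samples per source, target samples

# Ground-truth shared nonlinear representation g_* (a tanh feature map).
W_star = rng.normal(size=(r, d))
def g(W, X):
    return np.tanh(X @ W.T)

# Task-specific linear heads f_*^{(t)}: index 0 is the target, 1..T are sources.
heads = rng.normal(size=(T + 1, r))
def sample_task(t, n):
    X = rng.normal(size=(n, d))
    y = g(W_star, X) @ heads[t] + 0.1 * rng.normal(size=n)
    return X, y

source = [sample_task(t, N) for t in range(1, T + 1)]

# Stage 1: ERM over a finite candidate class standing in for G -- pick the
# representation whose features, with per-task least-squares heads, minimise
# the pooled empirical risk across the T source tasks.
candidates = [rng.normal(size=(r, d)) for _ in range(20)] + [W_star]
def pooled_risk(W):
    risk = 0.0
    for X, y in source:
        Phi = g(W, X)
        f_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        risk += np.mean((Phi @ f_hat - y) ** 2)
    return risk / T

W_hat = min(candidates, key=pooled_risk)

# Stage 2: fine-tune a linear head on N' target samples pushed through g_hat.
X0, y0 = sample_task(0, N_prime)
f0_hat, *_ = np.linalg.lstsq(g(W_hat, X0), y0, rcond=None)

# Evaluate the risk of f0_hat ∘ g_hat on fresh target data; this is small
# when Stage 1 recovers a representation close to g_*.
Xte, yte = sample_task(0, 2000)
mse = np.mean((g(W_hat, Xte) @ f0_hat - yte) ** 2)
print(f"target test MSE: {mse:.4f}")
```

The sketch mirrors the abstract's message: the target head is fit over only $r$ parameters, so once the pooled source data pin down the representation, fine-tuning behaves like low-dimensional regression.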

Comment: Appeared at ICML 2024

Zhang, Thomas T.; Lee, Bruce D.; Ziemann, Ingvar; Pappas, George J.; Matni, Nikolai (2024). Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples.
