A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI

Junwei Deng, Yuzheng Hu, Pingbang Hu, Ting-Wei Li, Shixuan Liu, Jiachen T. Wang, Dan Ley, Qirun Dai, Benhao Huang, Jin Huang, Cathy Jiao, Hoang Anh Just, Yijun Pan, Jingyan Shen, Yiwen Tu, Weiyi Wang, Xinhe Wang, Shichang Zhang, Shiyuan Zhang, Ruoxi Jia, Himabindu Lakkaraju, Hao Peng, Weijing Tang, Chenyan Xiong, Jieyu Zhao, Hanghang Tong, Han Zhao, and Jiaqi W. Ma

Preprint. Published online August 29, 2025. SSRN ID: 5451054. DOI: 10.2139/ssrn.5451054.

Abstract

Training data is the fuel of modern artificial intelligence (AI), fundamentally shaping the capabilities, limitations, and biases of AI systems. The emergence of large-scale generative models has elevated the importance of understanding how data influences their behaviors, bringing the field of data attribution to the forefront. This survey provides a comprehensive overview of data attribution, covering its methods, applications, and evaluation protocols, with a particular emphasis on the challenges and opportunities arising in the era of generative AI. We start by introducing a conceptual framework for attribution centered on three core questions: what to attribute (model behaviors), attribute to what (training entities), and how to attribute (influence measures). Within this framework, we systematically review major attribution approaches, including those based on influence functions, weighted marginal contributions, training dynamics, and simulators. We then examine key applications of data attribution, such as data selection, fact tracing, adversarial attacks and defenses, and the emerging data economy. Finally, we critically assess common evaluation criteria, including the quality of counterfactual predictions, utility in downstream tasks, and computational efficiency. We conclude with a forward-looking perspective on the future of data attribution, highlighting key open challenges and promising directions for future research.

Citation

Junwei Deng, Yuzheng Hu, Pingbang Hu, Ting-Wei Li, Shixuan Liu, Jiachen T. Wang, Dan Ley, Qirun Dai, Benhao Huang, Jin Huang, Cathy Jiao, Hoang Anh Just, Yijun Pan, Jingyan Shen, Yiwen Tu, Weiyi Wang, Xinhe Wang, Shichang Zhang, Shiyuan Zhang, Ruoxi Jia, Himabindu Lakkaraju, Hao Peng, Weijing Tang, Chenyan Xiong, Jieyu Zhao, Hanghang Tong, Han Zhao, and Jiaqi W. Ma. 2025. A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054. DOI: 10.2139/ssrn.5451054.

BibTeX

@misc{deng2025surveydataattribution,
  title = {A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI},
  author = {Deng, Junwei and Hu, Yuzheng and Hu, Pingbang and Li, Ting-Wei and Liu, Shixuan and Wang, Jiachen T. and Ley, Dan and Dai, Qirun and Huang, Benhao and Huang, Jin and Jiao, Cathy and Just, Hoang Anh and Pan, Yijun and Shen, Jingyan and Tu, Yiwen and Wang, Weiyi and Wang, Xinhe and Zhang, Shichang and Zhang, Shiyuan and Jia, Ruoxi and Lakkaraju, Himabindu and Peng, Hao and Tang, Weijing and Xiong, Chenyan and Zhao, Jieyu and Tong, Hanghang and Zhao, Han and Ma, Jiaqi W.},
  year = {2025},
  doi = {10.2139/ssrn.5451054},
  url = {https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054},
  note = {Available at SSRN}
}