MIT Connection Science - Data Provenance Initiative

About

The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models. As a first step, we've traced more than 2,000 popular text-to-text fine-tuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata so that researchers can explore them using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.

Team

Shayne Longpre

MIT PhD; Lead Researcher, Data Provenance for AI

Sandy Pentland

Director of MIT Connection Science, Professor, MIT

Tobin South

Fulbright PhD Scholar, MIT

Robert Mahari

JD-PhD at Harvard Law School & MIT
