M4

The M4 dataset consists of human-written texts drawn from several data sources; in the English subset these include Wikipedia, Reddit, and arXiv. The human-written texts are paired with texts generated by several LLMs, including text-davinci-003 (henceforth GPT-3.5), GPT-4, and ChatGPT.

Cite this as

Paul Jeha, Michael Bohlke-Schneider, Pedro Mercado, Shubham Kapoor, Rajbir Singh Nirwan, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2024). Dataset: M4. https://doi.org/10.57702/6akfxnfi

DOI retrieved: December 16, 2024

Additional Info

Field Value
Created December 16, 2024
Last update December 16, 2024
Defined In https://doi.org/10.48550/arXiv.2401.03946
Citation
  • https://doi.org/10.48550/arXiv.2401.01124
  • https://doi.org/10.48550/arXiv.2108.00981
  • https://doi.org/10.48550/arXiv.2405.17964
  • https://doi.org/10.48550/arXiv.2406.11073
Author Paul Jeha
More Authors
  • Michael Bohlke-Schneider
  • Pedro Mercado
  • Shubham Kapoor
  • Rajbir Singh Nirwan
  • Valentin Flunkert
  • Jan Gasthaus
  • Tim Januschowski
Homepage https://archive.ics.uci.edu/ml/datasets/M4