M4

The M4 dataset consists of human-written texts drawn from several data sources; the English subset includes Wikipedia, Reddit, and arXiv. It pairs these human-written texts with texts generated by several LLMs, including text-davinci-003 (henceforth GPT-3.5), GPT-4, and ChatGPT.

Data and Resources

This dataset has no data

Cite this as

Paul Jeha, Michael Bohlke-Schneider, Pedro Mercado, Shubham Kapoor, Rajbir Singh Nirwan, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2024). Dataset: M4. https://doi.org/10.57702/6akfxnfi

Private DOI: This DOI is not yet resolvable. It is available for use in manuscripts and will be published when the dataset is made public.

Additional Info

Field         Value
Created       December 16, 2024
Last update   December 16, 2024
Defined In    https://doi.org/10.48550/arXiv.2401.03946
Citation
  • https://doi.org/10.48550/arXiv.2401.01124
  • https://doi.org/10.48550/arXiv.2108.00981
  • https://doi.org/10.48550/arXiv.2405.17964
  • https://doi.org/10.48550/arXiv.2406.11073
Author        Paul Jeha
More Authors
  • Michael Bohlke-Schneider
  • Pedro Mercado
  • Shubham Kapoor
  • Rajbir Singh Nirwan
  • Valentin Flunkert
  • Jan Gasthaus
  • Tim Januschowski
Homepage      https://archive.ics.uci.edu/ml/datasets/M4