M4

The M4 dataset consists of human-written texts from several data sources; the English subset includes Wikipedia, Reddit, and arXiv. These human-written texts are paired with texts generated by several LLMs, including text-davinci-003 (GPT-3.5), GPT-4, and ChatGPT.

Data and Resources

This dataset has no data

Cite this as

Paul Jeha, Michael Bohlke-Schneider, Pedro Mercado, Shubham Kapoor, Rajbir Singh Nirwan, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2024). Dataset: M4. https://doi.org/10.57702/6akfxnfi

Private DOI: This DOI is not yet resolvable. It is available for use in manuscripts and will be published when the dataset is made public.

Additional Info

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.2401.03946
Citation	https://doi.org/10.48550/arXiv.2401.01124 https://doi.org/10.48550/arXiv.2108.00981 https://doi.org/10.48550/arXiv.2405.17964 https://doi.org/10.48550/arXiv.2406.11073
Author	Paul Jeha
More Authors	Michael Bohlke-Schneider, Pedro Mercado, Shubham Kapoor, Rajbir Singh Nirwan, Valentin Flunkert, Jan Gasthaus, Tim Januschowski
Homepage	https://archive.ics.uci.edu/ml/datasets/M4