M4

doi:10.57702/6akfxnfi

The M4 dataset consists of human-written texts drawn from several data sources; its English subset includes Wikipedia, Reddit, and arXiv. Each human-written text is paired with texts generated by several LLMs, including text-davinci-003 (henceforth GPT-3.5), GPT-4, and ChatGPT.

Data and Resources

Original Metadata JSON
The JSON representation of the dataset with its distributions, based on DCAT.

Cite this as

Paul Jeha, Michael Bohlke-Schneider, Pedro Mercado, Shubham Kapoor, Rajbir Singh Nirwan, Valentin Flunkert, Jan Gasthaus, Tim Januschowski (2024). Dataset: M4. https://doi.org/10.57702/6akfxnfi

DOI retrieved: December 16, 2024

Additional Info

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.2401.03946
Citation	https://doi.org/10.48550/arXiv.2401.01124 https://doi.org/10.48550/arXiv.2108.00981 https://doi.org/10.48550/arXiv.2405.17964 https://doi.org/10.48550/arXiv.2406.11073
Author	Paul Jeha
More Authors	Michael Bohlke-Schneider Pedro Mercado Shubham Kapoor Rajbir Singh Nirwan Valentin Flunkert Jan Gasthaus Tim Januschowski
Homepage	https://archive.ics.uci.edu/ml/datasets/M4