M4

doi:10.57702/6akfxnfi

The M4 dataset consists of human-written texts drawn from several data sources; the English subset includes Wikipedia, Reddit, and arXiv. It pairs these human-written texts with texts generated by several LLMs, including text-davinci-003 (GPT-3.5), GPT-4, and ChatGPT.

BibTeX:

@dataset{Paul_Jeha_and_Michael_Bohlke-Schneider_and_Pedro_Mercado_and_Shubham_Kapoor_and_Rajbir_Singh_Nirwan_and_Valentin_Flunkert_and_Jan_Gasthaus_and_Tim_Januschowski_2024,
    abstract = {The M4 dataset consists of human-written texts from several data sources, including Wikipedia, Reddit, and arXiv in the English subset of the dataset. It pairs the human-written texts with texts generated by several LLMs, including text-davinci-003 (henceforth GPT-3.5), GPT-4, and ChatGPT.},
    author = {Paul Jeha and Michael Bohlke-Schneider and Pedro Mercado and Shubham Kapoor and Rajbir Singh Nirwan and Valentin Flunkert and Jan Gasthaus and Tim Januschowski},
    doi = {10.57702/6akfxnfi},
    institution = {No Organization},
    keyword = {'Detection', 'Forecasting', 'LLMs', 'M4', 'Machine-Generated Text', 'Machine-generated text detection', 'Multidomain', 'Multilingual', 'Multimodel', 'Synthetic Data', 'Text Generation', 'Time Series', 'forecasting', 'time series'},
    month = {dec},
    publisher = {TIB},
    title = {M4},
    url = {https://service.tib.eu/ldmservice/dataset/m4},
    year = {2024}
}