Gpt4all-J

The dataset used to train the TeenyTinyLlama model pair is a concatenation of open-source Brazilian Portuguese corpora, including Wikipedia, CulturaX, OSCAR, Common Crawl, and ROOTS, filtered to exclude samples classified above a pre-defined toxicity threshold.
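Conceptually, this concatenate-then-filter step could be sketched with the Hugging Face datasets library. This is a minimal illustration only: the dataset identifiers and configs, the score_toxicity placeholder, and the 0.5 threshold are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch, not the authors' pipeline: dataset IDs, configs, the
# score_toxicity() placeholder, and the 0.5 threshold are all assumptions.
from datasets import load_dataset, concatenate_datasets

TOXICITY_THRESHOLD = 0.5  # stand-in for the paper's pre-defined threshold


def score_toxicity(text: str) -> float:
    """Placeholder for whichever toxicity classifier the authors used."""
    return 0.0


# Two of the named Brazilian Portuguese sources, reduced to a shared "text" column
# so they can be concatenated.
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.pt", split="train").select_columns(["text"])
culturax = load_dataset("uonlp/CulturaX", "pt", split="train").select_columns(["text"])

corpus = concatenate_datasets([wikipedia, culturax])

# Keep only samples at or below the toxicity threshold.
corpus = corpus.filter(lambda ex: score_toxicity(ex["text"]) <= TOXICITY_THRESHOLD)
```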


Cite this as

Nicholas Kluge CorrĂȘa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira (2025). Dataset: Gpt4all-J. https://doi.org/10.57702/7ik90fqt

DOI retrieved: January 3, 2025

Additional Info

Field Value
Created January 3, 2025
Last update January 3, 2025
Defined In https://doi.org/10.1016/j.mlwa.2024.100558
Author Nicholas Kluge CorrĂȘa
More Authors Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira
Homepage https://huggingface.co/datasets/pablo-moreira/Gpt4all-J-prompt-generations-pt
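The dataset at the homepage above can be pulled directly from the Hugging Face Hub. A minimal loading sketch follows; the split name and column layout are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: the "train" split and the column layout are assumptions;
# verify them on the dataset card at the homepage URL above.
from datasets import load_dataset

ds = load_dataset("pablo-moreira/Gpt4all-J-prompt-generations-pt", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # first example
```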