Gpt4all-J

The dataset used to train the TeenyTinyLlama model pair is a concatenation of open-source Brazilian Portuguese corpora, including Wikipedia, CulturaX, OSCAR, Common Crawl, and ROOTS, filtered to exclude samples classified above a pre-defined toxicity threshold.
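Conceptually, this concatenate-then-filter step could be sketched with the Hugging Face datasets library. This is a minimal illustration only: the dataset identifiers and configs, the score_toxicity placeholder, and the 0.5 threshold are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch, not the authors' pipeline: dataset IDs, configs, the
# score_toxicity() placeholder, and the 0.5 threshold are all assumptions.
from datasets import load_dataset, concatenate_datasets

TOXICITY_THRESHOLD = 0.5  # stand-in for the paper's pre-defined threshold


def score_toxicity(text: str) -> float:
    """Placeholder for whichever toxicity classifier the authors used."""
    return 0.0


# Two of the named Brazilian Portuguese sources, reduced to a shared "text" column
# so they can be concatenated.
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.pt", split="train").select_columns(["text"])
culturax = load_dataset("uonlp/CulturaX", "pt", split="train").select_columns(["text"])

corpus = concatenate_datasets([wikipedia, culturax])

# Keep only samples at or below the toxicity threshold.
corpus = corpus.filter(lambda ex: score_toxicity(ex["text"]) <= TOXICITY_THRESHOLD)
```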


Cite this as

Nicholas Kluge CorrĂȘa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira (2025). Dataset: Gpt4all-J. https://doi.org/10.57702/7ik90fqt

DOI retrieved: January 3, 2025

Additional Info

Field Value
Created January 3, 2025
Last update January 3, 2025
Defined In https://doi.org/10.1016/j.mlwa.2024.100558
Author Nicholas Kluge CorrĂȘa
More Authors Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira
Homepage https://huggingface.co/datasets/pablo-moreira/Gpt4all-J-prompt-generations-pt
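The dataset at the homepage above can be pulled directly from the Hugging Face Hub. A minimal loading sketch follows; the split name and column layout are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: the "train" split and the column layout are assumptions;
# verify them on the dataset card at the homepage URL above.
from datasets import load_dataset

ds = load_dataset("pablo-moreira/Gpt4all-J-prompt-generations-pt", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # first example
```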