Instruct-PTBR

The dataset used for training the TeenyTinyLlama pair consists of a concatenation of open-source Brazilian Portuguese datasets, including Wikipedia, CulturaX, OSCAR, Common Crawl, and ROOTS. The dataset is filtered to exclude samples classified above a pre-defined toxicity threshold.
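The two steps described above (concatenating sources, then dropping samples above a toxicity threshold) can be illustrated with a minimal sketch. The record layout, the `toxicity` field, the threshold value of 0.5, and the `filter_toxic` helper are all assumptions made for illustration; the source does not specify the classifier, the score scale, or the actual cutoff.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator

# Hypothetical cutoff: the actual threshold is not stated in the source.
TOXICITY_THRESHOLD = 0.5


def filter_toxic(samples: Iterable[Dict], threshold: float = TOXICITY_THRESHOLD) -> Iterator[Dict]:
    """Yield only samples whose (assumed) toxicity score is not above the threshold."""
    for sample in samples:
        if sample["toxicity"] <= threshold:
            yield sample


# Toy stand-ins for two of the source corpora; real data would be loaded from disk.
wikipedia = [
    {"text": "O Brasil é o maior país da América do Sul.", "toxicity": 0.01},
]
web_crawl = [
    {"text": "Exemplo de texto ofensivo.", "toxicity": 0.92},
    {"text": "Receita de pão de queijo.", "toxicity": 0.03},
]

# Concatenate the sources, then apply the filter.
kept = list(filter_toxic(chain(wikipedia, web_crawl)))
```

In this toy run, `kept` retains the two low-scoring samples and drops the one scored at 0.92.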

Cite this as

Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira (2025). Dataset: Instruct-PTBR. https://doi.org/10.57702/n8ugpykn

DOI retrieved: January 3, 2025

Additional Info

Field Value
Created January 3, 2025
Last update January 3, 2025
Defined In https://doi.org/10.1016/j.mlwa.2024.100558
Author Nicholas Kluge Corrêa
More Authors
Sophia Falk
Shiza Fatimah
Aniket Sen
Nythamar de Oliveira
Homepage https://huggingface.co/datasets/cnmoro/Instruct-PTBR-ENUS-11M