Instruct-PTBR

doi:10.57702/n8ugpykn

The dataset used for training the TeenyTinyLlama pair consists of a concatenation of open-source Brazilian Portuguese datasets, including Wikipedia, CulturaX, OSCAR, Common Crawl, and ROOTS. The dataset is filtered to exclude samples classified above a pre-defined toxicity threshold.
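The concatenate-then-filter step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `concat_and_filter`, the `toxicity` field, the cutoff of 0.5, and the sample records are all assumptions made for illustration, since the record only states that the threshold is pre-defined.

```python
from typing import Dict, Iterable, List

# Hypothetical cutoff: the source only says the threshold is "pre-defined".
TOXICITY_THRESHOLD = 0.5


def concat_and_filter(
    sources: Iterable[Iterable[Dict]], threshold: float = TOXICITY_THRESHOLD
) -> List[Dict]:
    """Concatenate all sources, dropping samples scored above the threshold."""
    kept = []
    for source in sources:
        for sample in source:
            # Exclude only samples classified above the threshold.
            if sample["toxicity"] <= threshold:
                kept.append(sample)
    return kept


# Toy stand-ins for two of the corpora; texts and scores are invented.
wikipedia = [
    {"text": "O Brasil é o maior país da América do Sul.", "toxicity": 0.01},
    {"text": "exemplo ofensivo", "toxicity": 0.93},
]
oscar = [
    {"text": "Receita de pão de queijo.", "toxicity": 0.03},
    {"text": "comentário limítrofe", "toxicity": 0.50},
]

filtered = concat_and_filter([wikipedia, oscar])
print(len(filtered))
```

A sample whose score equals the threshold is kept here, because the description excludes only samples classified above it.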

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset with its distributions based on DCAT.

Cite this as

Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira (2025). Dataset: Instruct-PTBR. https://doi.org/10.57702/n8ugpykn

DOI retrieved: January 3, 2025

Additional Info

Field	Value
Created	January 3, 2025
Last update	January 3, 2025
Defined In	https://doi.org/10.1016/j.mlwa.2024.100558
Author	Nicholas Kluge Corrêa
More Authors	Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira
Homepage	https://huggingface.co/datasets/cnmoro/Instruct-PTBR-ENUS-11M