Instruct-PTBR

The dataset used to train the TeenyTinyLlama pair of models is a concatenation of open-source Brazilian Portuguese datasets, including Wikipedia, CulturaX, OSCAR, Common Crawl, and ROOTS. Samples classified above a pre-defined toxicity threshold are excluded.
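
The concatenate-then-filter step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `toxicity` field, the `build_corpus` helper, the toy samples, and the 0.5 threshold are all hypothetical, chosen only to show the idea that every source is merged into one stream and any sample scoring above the threshold is dropped.

```python
from itertools import chain
from typing import Any, Dict, Iterable, List

Sample = Dict[str, Any]


def build_corpus(sources: Iterable[Iterable[Sample]], max_toxicity: float) -> List[Sample]:
    """Concatenate all sources and keep samples scored at or below the threshold.

    Assumes each sample already carries a precomputed `toxicity` score
    (hypothetical field name; the real classifier and threshold are not
    specified in this card).
    """
    merged = chain.from_iterable(sources)  # concatenation, source order preserved
    return [sample for sample in merged if sample["toxicity"] <= max_toxicity]


# Toy stand-ins for two of the source datasets (illustrative data only).
wikipedia = [{"text": "Brasília é a capital do Brasil.", "toxicity": 0.01}]
web_crawl = [
    {"text": "Receita de pão de queijo.", "toxicity": 0.02},
    {"text": "<offensive sample>", "toxicity": 0.93},
]

corpus = build_corpus([wikipedia, web_crawl], max_toxicity=0.5)
print(len(corpus))
```

In this toy run the sample scored 0.93 is removed and the other two are kept in their original order. With real data the score would come from whatever toxicity classifier the dataset builders used.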

BibTeX: