The BC5CDR dataset consists of 1,500 PubMed articles, which have been split into a training set (500), a development set (500), and a test set (500). The dataset contains 15,935...
The OSCAR corpus is a multilingual web corpus used for pre-training large generative language models. It is a document-oriented corpus comparable in size and...