S2ORC

A collection of 81.1 million scholarly publications in English from various academic fields, used to pre-train a language model.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Lo, K., Wang, L. L., Neumann, M., Kinney, R., Weld, D. S. (2024). Dataset: S2ORC. https://doi.org/10.57702/g2wuqc2w

DOI retrieved: December 16, 2024

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.2212.03869
Citation	https://doi.org/10.48550/arXiv.2303.14334; https://doi.org/10.48550/arXiv.2307.12996; https://doi.org/10.48550/arXiv.2401.01089; https://doi.org/10.18653/v1/2023.emnlp-main.822
Author	Lo, K.
More Authors	Wang, L. L.; Neumann, M.; Kinney, R.; Weld, D. S.
Homepage	https://s2orc.org/