STEM-NER-60k
A Large-scale Dataset of STEM Science as PROCESS, METHOD, MATERIAL, and DATA Named Entities
This repository hosts data from a follow-up study to the following publications:
D'Souza, J., Hoppe, A., Brack, A., Jaradeh, M., Auer, S., & Ewerth, R. (2020). The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2192–2203). European Language Resources Association.
Brack, A., D'Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020). Domain-Independent Extraction of Scientific Concepts from Research Articles. In Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12035. Springer, Cham. https://doi.org/10.1007/978-3-030-45439-5_17
Supporting dataset link: https://data.uni-hannover.de/dataset/stem-ecr-v1-0
Description
Roughly 60,000 titles and abstracts of scholarly articles under the redistributable CC-BY license were downloaded from Elsevier. The articles span the 10 most prolific STEM domains on Elsevier, viz., Agriculture, Astronomy, Biology, Chemistry, Computer Science, Earth Science, Engineering, Material Science, and Mathematics.
The STEM NER system reported in the publication above was applied to these articles, producing an automatically extracted dataset of four entity types, viz., Process, Method, Material, and Data.
What does this repository contain?
Aggregated lists of Process, Method, Material, and Data entities with respective occurrence counts extracted from 59,984 scholarly publications organized per the 10 STEM domains considered.
Additionally, the list of Elsevier CC-BY articles used in this study is provided in the raw-data directory of the repository.
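To illustrate what "aggregated lists with occurrence counts" means, here is a minimal Python sketch of how per-article entity mentions could be rolled up into per-domain, per-type counts. It is not the pipeline used to build this dataset: the function name aggregate_entities, the (domain, entity_type, text) record shape, the lowercasing of entity strings, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# The four entity types named in this README.
ENTITY_TYPES = ("Process", "Method", "Material", "Data")


def aggregate_entities(records):
    """Count entity occurrences per (domain, entity_type) pair.

    `records` is a hypothetical iterable of (domain, entity_type, text)
    tuples; the actual file layout of this repository may differ.
    """
    counts = defaultdict(Counter)
    for domain, entity_type, text in records:
        if entity_type not in ENTITY_TYPES:
            continue  # skip anything outside the four known types
        # Assumption: mentions are merged case-insensitively.
        counts[(domain, entity_type)][text.strip().lower()] += 1
    return counts


# Made-up sample records, not taken from the dataset.
sample = [
    ("Chemistry", "Material", "water"),
    ("Chemistry", "Material", "Water"),
    ("Chemistry", "Process", "oxidation"),
    ("Biology", "Material", "water"),
]

agg = aggregate_entities(sample)
for (domain, entity_type), counter in sorted(agg.items()):
    print(domain, entity_type, counter.most_common())
```

Keying the counts by (domain, entity_type) keeps each domain's list separate, which mirrors the per-domain organization described above.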
Useful Links
BibTex: