Web2Text: Deep Structured Boilerplate Removal

doi:10.57702/phtx73hb

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from these documents is essential for the performance of derived applications. To this end, we introduce Web2Text, a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content.
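
The idea of labeling all text blocks collectively, rather than one at a time, can be illustrated with a small decoding sketch. This is not the authors' implementation: it assumes that each block already has a per-label score and that each pair of adjacent labels has a transition score, and it uses standard Viterbi decoding to pick the best joint labeling. The function name `viterbi_labels`, the score layout, and all numbers are hypothetical.

```python
from typing import List, Sequence

LABELS = ("boilerplate", "content")


def viterbi_labels(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[float]],
) -> List[str]:
    """Return the highest-scoring joint labeling of a sequence of text blocks.

    unary[i][k]    : score of block i taking label k (hypothetical input)
    pairwise[j][k] : score of label j being followed by label k (hypothetical input)
    """
    if not unary:
        return []
    n_labels = len(LABELS)
    best = list(unary[0])          # best score of any path ending in each label
    back: List[List[int]] = []     # back-pointers, one list per transition
    for scores in unary[1:]:
        prev_choice, new_best = [], []
        for k in range(n_labels):
            # Best previous label if the current block takes label k.
            j_best = max(range(n_labels), key=lambda j: best[j] + pairwise[j][k])
            prev_choice.append(j_best)
            new_best.append(best[j_best] + pairwise[j_best][k] + scores[k])
        back.append(prev_choice)
        best = new_best
    # Follow back-pointers from the best final label.
    k = max(range(n_labels), key=lambda j: best[j])
    path = [k]
    for prev_choice in reversed(back):
        k = prev_choice[k]
        path.append(k)
    path.reverse()
    return [LABELS[k] for k in path]


# Middle block looks weakly like boilerplate on its own, neighbors look like content.
unary = [[0.0, 2.0], [0.3, 0.0], [0.0, 2.0]]
smooth = [[0.5, -0.5], [-0.5, 0.5]]   # rewards neighboring blocks sharing a label
independent = [[0.0, 0.0], [0.0, 0.0]]

print(viterbi_labels(unary, independent))
print(viterbi_labels(unary, smooth))
```

With the zero transition scores, each block is labeled on its own and the middle block comes out as boilerplate. With the smoothing transition scores, the joint decoding labels all three blocks as content, which is the practical effect of classifying blocks collectively.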

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset with its distributions, based on DCAT.

Cite this as

Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff (2024). Dataset: Web2Text: Deep Structured Boilerplate Removal. https://doi.org/10.57702/phtx73hb

DOI retrieved: December 16, 2024

Additional Info

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.1801.02607
Author	Thijs Vogels
More Authors	Octavian-Eugen Ganea, Carsten Eickhoff
Homepage	https://arxiv.org/abs/1704.07813