Web2Text: Deep Structured Boilerplate Removal

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content.
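As a rough illustration of what "collectively" classifying text blocks can mean, the sketch below labels a sequence of blocks jointly rather than one at a time. It assumes each block already has a per-label score and that adjacent labels have a transition score, and it uses Viterbi decoding as one standard way to find the best joint labeling. The function name, the score values, and the choice of Viterbi are assumptions made for this example; the description above does not specify them.

```python
from typing import List, Sequence

LABELS = ("boilerplate", "content")


def viterbi_label_blocks(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[float]],
) -> List[str]:
    """Return the highest-scoring joint labeling of a sequence of text blocks.

    unary[i][k]    : hypothetical score for block i taking label k
    pairwise[j][k] : hypothetical score for label j being followed by label k
    """
    if not unary:
        return []
    n_labels = len(LABELS)
    best = list(unary[0])  # best score of any path ending in each label
    backptrs = []          # backptrs[t][k]: best previous label for label k
    for scores in unary[1:]:
        new_best, ptrs = [], []
        for k in range(n_labels):
            prev = max(range(n_labels), key=lambda j: best[j] + pairwise[j][k])
            new_best.append(best[prev] + pairwise[prev][k] + scores[k])
            ptrs.append(prev)
        best = new_best
        backptrs.append(ptrs)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda k: best[k])
    path = [last]
    for ptrs in reversed(backptrs):
        last = ptrs[last]
        path.append(last)
    path.reverse()
    return [LABELS[k] for k in path]


# Made-up scores: the middle block slightly prefers "content" in isolation,
# but its neighbours and the transition scores favour a consistent labeling.
example_unary = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
example_pairwise = [[0.5, 0.0], [0.0, 0.5]]
print(viterbi_label_blocks(example_unary, example_pairwise))
```

With these made-up scores, labeling each block independently would mark the middle block as content, while the joint decoding keeps it consistent with its neighbours. That is the kind of context effect a sequence labeling formulation is meant to capture.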

Cite this as

Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff (2024). Dataset: Web2Text: Deep Structured Boilerplate Removal. https://doi.org/10.57702/phtx73hb

DOI retrieved: December 16, 2024

Additional Info

Field Value
Created December 16, 2024
Last update December 16, 2024
Defined In https://doi.org/10.48550/arXiv.1801.02607
Author Thijs Vogels
More Authors Octavian-Eugen Ganea, Carsten Eickhoff
Homepage https://arxiv.org/abs/1704.07813