News400 Dataset

doi:10.25835/0084897

Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

This repository contains the News400 dataset introduced in the paper:

Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, and Ralph Ewerth. 2020. Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/3372278.3390670

Content

news400.tar.gz:
- dataset.jsonl containing:
  - Web links to the news texts
  - Web links to the news image
  - Outputs of the named entity recognition and disambiguation (NERD) approach
  - Untampered and tampered entities
- <entity>.jsonl file for each entity type, containing the following information for each entity:
  - Wikidata ID
  - Wikidata label
  - Meta information used for tampering
  - Web links to all reference images crawled from Google, Bing, and Wikidata
- Splits for testing and validation
news400_features.tar.gz:
- Visual features of the news images for persons, locations, and scenes
- Visual features of the reference images for persons, locations, and scenes
news400_wordembeddings.tar.gz: Word embeddings of all nouns in the news texts
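
As a rough illustration of how the archive contents listed above might be read, the following minimal Python sketch iterates over the records of a .jsonl file stored inside a .tar.gz archive without unpacking it to disk. The function name iter_jsonl_from_tar is hypothetical and not part of the published source code. The sketch assumes only that each non-empty line holds one JSON object; it makes no assumption about the directory layout inside the archive or about the field names of the records, which should be inspected after loading.

```python
import io
import json
import posixpath
import tarfile


def iter_jsonl_from_tar(archive_path, member_name="dataset.jsonl"):
    """Yield one parsed JSON object per non-empty line of a .jsonl archive member.

    The first regular file whose base name equals `member_name` is used,
    regardless of the directory it sits in (the layout is an assumption).
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if member.isfile() and posixpath.basename(member.name) == member_name:
                raw = archive.extractfile(member)
                # Decode the binary stream line by line as UTF-8 text.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
                return
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")


# Hypothetical usage, assuming news400.tar.gz is in the working directory:
#   first = next(iter_jsonl_from_tar("news400.tar.gz"))
#   print(sorted(first.keys()))  # inspect which fields a record provides
```

The same function could be pointed at one of the per-entity-type files by passing its file name as member_name.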

Source Code

The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency

BibTeX:

@dataset{Eric_Müller-Budack_and__Jonas_Theiner_and__Sebastian_Diering_and__Maximilian_Idahl_and__Ralph_Ewerth_2020,
    author = {Eric Müller-Budack and Jonas Theiner and Sebastian Diering and Maximilian Idahl and Ralph Ewerth},
    doi = {10.25835/0084897},
    institution = {TIB},
    keyword = {Cross-modal consistency, cross-modal entity verification, deep learning, image repurposing detection, multimodal retrieval},
    month = {jun},
    publisher = {LUIS},
    title = {News400 Dataset},
    url = {https://service.tib.eu/ldmservice/vdataset/luh-news400},
    year = {2020}
}