TamperedNews & News400 (IJMIR'21 Update)

doi:10.25835/lzcs481w

Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

This repository contains the TamperedNews and News400 datasets introduced in the paper:

Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, Sherzod Hakimov, and Ralph Ewerth. "Multimodal news analytics using measures of cross-modal entity and context consistency". In: International Journal of Multimedia Information Retrieval 10.2 (2021), Springer, pp. 111–125. DOI: https://doi.org/10.1007/s13735-021-00207-4

Content

For both datasets, TamperedNews and News400, we provide:

- *dataset*.tar.gz containing the *dataset*.jsonl with:
    - Web links to the news texts
    - Web links to the news image
    - Outputs of the named entity recognition and disambiguation (NERD) approach
    - Untampered and tampered entities
- *dataset*_features.tar.gz with visual features for events, locations, and persons
- news400_wordembeddings.tar.gz: word embeddings of all nouns in the news texts of the News400 dataset

Please note that the word embeddings of the TamperedNews dataset (tamperednews_wordembeddings.tar.gz) were already provided in the first version (https://data.uni-hannover.de/dataset/tamperednews).
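Because each dataset archive bundles a JSON Lines file, its records can be streamed without unpacking the archive to disk. The following is a minimal sketch, assuming the archive contains at least one member whose name ends in `.jsonl`, with one JSON object per line. The function name `iter_jsonl_from_targz` is illustrative and not part of the official download scripts.

```python
import json
import tarfile
from typing import Any, Dict, Iterator


def iter_jsonl_from_targz(archive_path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of every .jsonl member in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and any file that is not JSON Lines.
            if not (member.isfile() and member.name.endswith(".jsonl")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                line = raw_line.decode("utf-8").strip()
                if line:  # tolerate blank lines
                    yield json.loads(line)
```

For instance, `next(iter_jsonl_from_targz("news400.tar.gz"))` would return the first record, assuming the archive was downloaded under that (hypothetical) local file name. The keys of each record are defined by the dataset files themselves and are not assumed here.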

For all entities detected in both datasets, we provide:

- entities.tar.gz containing an *entity_type*.jsonl for all entity types (events, locations, and persons) with:
    - Wikidata ID
    - Wikidata label
    - Meta information used for tampering
    - Web links to all reference images crawled from Google, Bing, and Wikidata
- entities_features.tar.gz containing the visual features of the reference images for all entities
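Since every entity record carries a Wikidata ID, a lookup table keyed by that ID may be convenient when matching entities against the news records. The following is a minimal sketch, assuming the records have already been parsed into dictionaries. The helper `index_records` and the key name `wd_id` in the usage note are hypothetical; the actual key names are defined by the `*entity_type*.jsonl` files.

```python
from typing import Any, Dict, Iterable


def index_records(records: Iterable[Dict[str, Any]], key: str) -> Dict[str, Dict[str, Any]]:
    """Build a lookup table from the value stored under `key` to the full record."""
    index: Dict[str, Dict[str, Any]] = {}
    for record in records:
        identifier = record.get(key)
        if identifier is None:
            continue  # skip records that lack the chosen key
        # Keep the first record seen for each identifier.
        index.setdefault(identifier, record)
    return index
```

A call such as `index_records(entity_records, "wd_id")` would then allow constant-time lookups by identifier, with `wd_id` replaced by whatever key the entity files actually use.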

Source Code

The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency

BibTeX:

@dataset{Eric_Müller-Budack_and__Jonas_Theiner_and__Sebastian_Diering_and__Maximilian_Idahl_and__Sherzod_Hakimov_and__Ralph_Ewerth_2022,
    abstract = {# Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

This repository contains the *TamperedNews* and *News400* datasets introduced in the paper:

> Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, Sherzod Hakimov, and Ralph Ewerth. "Multimodal news analytics using measures of cross-modal entity and context consistency". In: _International Journal of Multimedia Information Retrieval_ 10.2 (2021), Springer, pp. 111–125. DOI: https://doi.org/10.1007/s13735-021-00207-4

## Content

For both datasets, *TamperedNews* and *News400*, we provide:

- ```*dataset*.tar.gz``` containing the ```*dataset*.jsonl``` with
    - Web links to the news texts
    - Web links to the news image
    - Outputs of the named entity recognition and disambiguation (NERD) approach
    - Untampered and tampered entities
- ```*dataset*_features.tar.gz``` with visual features for events, locations, and persons
- ```news400_wordembeddings.tar.gz```: Word embeddings of all nouns in the news texts of the News400 dataset

Please note that the word embeddings of the *TamperedNews* dataset (```tamperednews_wordembeddings.tar.gz```) were already provided in the first version ([Link](https://data.uni-hannover.de/dataset/tamperednews)).

For all entities detected in both datasets, we provide:

- ```entities.tar.gz``` containing an ```*entity_type*.jsonl``` for all entity types (events, locations, and persons) with:
    - Wikidata ID
    - Wikidata label
    - Meta information used for tampering
    - Web links to all reference images crawled from Google, Bing, and Wikidata
- ```entities_features.tar.gz``` containing the visual features of the reference images for all entities

## Source Code

The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency

},
    author = {Eric Müller-Budack and  Jonas Theiner and  Sebastian Diering and  Maximilian Idahl and  Sherzod Hakimov and  Ralph Ewerth},
    doi = {10.25835/lzcs481w},
    institution = {TIB},
    keyword = {'Cross-modal consistency', 'cross-modal entity verification', 'deep learning', 'image repurposing detection', 'multimedia retrieval'},
    month = {may},
    publisher = {LUIS},
    title = {TamperedNews & News400 (IJMIR'21 Update)},
    url = {https://service.tib.eu/ldmservice/vdataset/luh-tamperednews-news400-ijmir21},
    year = {2022}
}