Devil in the Number: Towards Robust Multi-modality Data Filter

doi:10.57702/wrxz766y

This record describes the web-scale dataset of text-image pairs used in the paper to train a vision-language model. The authors propose a novel filter that removes redundant information, such as numbers and bracketed content.
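To make the filtering idea concrete, here is a minimal sketch of a caption cleaner that strips bracketed content and numbers. It is not the authors' implementation: it assumes the filter operates on the text side of each pair, and the function name `clean_caption` and the regular expressions are hypothetical choices made for illustration.

```python
import re

# Hypothetical patterns for illustration; the paper's actual rules may differ.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")
NUMBERS = re.compile(r"\d+(?:[.,]\d+)*")
WHITESPACE = re.compile(r"\s+")


def clean_caption(caption: str) -> str:
    """Remove bracketed spans and numbers from a caption (illustrative only)."""
    text = caption
    # Repeat the substitution so nested brackets are removed inside-out.
    while True:
        stripped = BRACKETED.sub(" ", text)
        if stripped == text:
            break
        text = stripped
    # Drop digit runs, including simple decimal or thousands separators.
    text = NUMBERS.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_caption("Red sneakers (size 42) [SKU 10293] on sale"))
```

Under these assumptions, a caption such as "Red sneakers (size 42) [SKU 10293] on sale" reduces to "Red sneakers on sale". A real pipeline would likely apply a step like this to every caption before any similarity-based filtering, but that ordering is also an assumption here.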

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Cite this as

Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang (2024). Dataset: Devil in the Number: Towards Robust Multi-modality Data Filter. https://doi.org/10.57702/wrxz766y

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.48550/arXiv.2309.13770
Author	Yichen Xu
More Authors	Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang
Homepage	https://arxiv.org/abs/2304.14108