Show-and-Tell

Visual language grounding is widely studied with modern neural networks, which typically adopt an encoder-decoder framework consisting of a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for caption generation.
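The sketch below illustrates this encoder-decoder pattern: a CNN backbone produces an image embedding, which seeds an RNN that emits caption tokens. This is a minimal illustration in PyTorch, not the exact architecture behind this dataset; the backbone choice (ResNet-50), layer sizes, and vocabulary size are all illustrative assumptions.

```python
# Minimal CNN-RNN captioning sketch (assumed PyTorch; sizes are illustrative).
import torch
import torch.nn as nn
from torchvision import models


class EncoderCNN(nn.Module):
    """CNN encoder: image -> fixed-size embedding."""

    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.resnet50(weights=None)  # assumption: ResNet-50 backbone
        # Drop the classification head; keep the pooled feature extractor.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)  # (batch, 2048)
        return self.fc(feats)                     # (batch, embed_size)


class DecoderRNN(nn.Module):
    """RNN decoder: image embedding + caption tokens -> vocabulary logits."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_emb: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first "token" of the sequence,
        # so the LSTM conditions every caption word on the image.
        tokens = self.embed(captions)                        # (batch, T, embed)
        inputs = torch.cat([image_emb.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                               # (batch, T+1, vocab)


# Usage with dummy tensors (shapes only; no real images or vocabulary):
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 15))
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 16, 10000])
```

At training time, the logits would be scored against ground-truth caption tokens with cross-entropy; at inference, tokens are sampled or decoded greedily one step at a time.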

Cite this as

Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh (2024). Dataset: Show-and-Tell. https://doi.org/10.57702/krqufe4w

DOI retrieved: December 17, 2024

Additional Info

Field         Value
Created       December 17, 2024
Last update   December 17, 2024
Defined In    https://doi.org/10.48550/arXiv.1712.02051
Author        Hongge Chen
More Authors  Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh
Homepage      https://arxiv.org/abs/1506.05981