DejaVu: Disambiguation evaluation dataset for English-JApanese machine translation on VisUal information

English-Japanese caption dataset including word sense ambiguity for evaluation of multimodal machine translation.

Dataset Overview

  • Captions: English captions and their Japanese translations, covering 500 target words and 6 caption templates. Every target word has multiple senses, and the captions are designed so that their limited context is not sufficient to select the correct one. Translating correctly into Japanese therefore requires referring to the corresponding image.
  • Images: 500 images referenced during translation, obtained from ImageNet and Flickr. Each image file is named after the WordNet ID corresponding to the target word in the caption.
  • Format: Images are stored in the images/ directory and captions in the captions/ directory. index.txt maps captions to images by listing the image IDs in the same order as the captions.

Dataset Structure

The dataset is structured as follows:

- dataset/
  - captions/
    - en/
      - template1.en
      - template2.en
      - ...
    - ja/
      - template1-1.ja
      - template1-2.ja
      - ...
  - images/
    - 10055410.jpg
    - 10149867.jpg
    - ...
  - index.txt
- README.md
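
The layout above suggests that each caption can be paired with its image through index.txt. The following is a minimal sketch of such a loader, not an official script from this repository. It assumes that index.txt holds one image ID per line, that each caption file holds one caption per line in the same order, and that images are named `<id>.jpg`. The names `Example`, `read_lines`, and `load_examples` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Example(NamedTuple):
    """One evaluation item: an image plus a source/reference caption pair."""
    image_path: Path
    source: str      # English caption
    reference: str   # Japanese translation


def read_lines(path: Path) -> list[str]:
    """Read a UTF-8 text file into a list of lines without newlines."""
    return path.read_text(encoding="utf-8").splitlines()


def load_examples(root: Path, en_name: str, ja_name: str) -> Iterator[Example]:
    """Pair captions with images, assuming line-aligned files (see lead-in)."""
    image_ids = read_lines(root / "index.txt")
    sources = read_lines(root / "captions" / "en" / en_name)
    references = read_lines(root / "captions" / "ja" / ja_name)

    # The alignment is an assumption, so fail loudly if the counts differ.
    if not (len(image_ids) == len(sources) == len(references)):
        raise ValueError(
            f"line counts differ: index={len(image_ids)}, "
            f"en={len(sources)}, ja={len(references)}"
        )

    for image_id, src, ref in zip(image_ids, sources, references):
        # Assumed naming scheme: images/<id>.jpg
        yield Example(root / "images" / f"{image_id.strip()}.jpg", src, ref)
```

For example, `load_examples(Path("DejaVu/dataset"), "template1.en", "template1-1.ja")` would yield one `Example` per caption line, provided the files follow the assumed format. If index.txt uses a different layout, only `load_examples` needs to change.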

Usage

Clone the repository:

git clone git@github.com:tmu-nlp/DejaVu.git

License

This dataset is made available under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

Citation

If you use this dataset in your research, please cite it as follows:

@misc{DejaVu,
  author = {Ayako Sato},
  title = {DejaVu},
  year = {2024},
  howpublished = {\url{https://github.com/tmu-nlp/DejaVu}}
}
