The 2021 competition is part of the FGVC^8 workshop at CVPR.
There is a total of 10,000 species in the dataset. The full training dataset contains nearly 2.7M images. To make the dataset more accessible we have also created a "mini" training dataset with 50 examples per species for a total of 500K images. Each species has 10 validation images. There are a total of 500,000 test images.
Super Category | Species Count | Train Images | Train Mini Images | Val Images | Test Images |
---|---|---|---|---|---|
Plants | 4,271 | 1,148,702 | 213,550 | 42,710 | x |
Insects | 2,526 | 663,682 | 126,300 | 25,260 | x |
Birds | 1,486 | 414,847 | 74,300 | 14,860 | x |
Fungi | 341 | 90,048 | 17,050 | 3,410 | x |
Reptiles | 313 | 86,830 | 15,650 | 3,130 | x |
Mammals | 246 | 68,917 | 12,300 | 2,460 | x |
Ray-finned Fishes | 183 | 45,166 | 9,150 | 1,830 | x |
Amphibians | 170 | 46,252 | 8,500 | 1,700 | x |
Mollusks | 169 | 44,670 | 8,450 | 1,690 | x |
Arachnids | 153 | 40,687 | 7,650 | 1,530 | x |
Animalia | 142 | 37,042 | 7,100 | 1,420 | x |
Total | 10,000 | 2,686,843 | 500,000 | 100,000 | 500,000 |
We are using top-1 error as the evaluation metric for the Kaggle competition. For each image , an algorithm will produce one label and we will compare this label to the ground truth label for the image , computing the error as:
The overall error score for an algorithm is the average error over all test images:
A top-1 metric is unforgiving and will allow us to test the limits of image classification performance. We acknowledge that there are problems with a top-1 metric, which is why we traditionally have used a top-K metric. Some cases where a top-K metric is more appropriate include:
- Images where multiple species are present (e.g. a bee on a flower) and therefore it is ambiguous to decide which species an algorithm should report.
- Images where the ground truth label is noisy. A top-k metric would smooth over this noise.
- Species that are visually identical, and separated geographically.
For (1.) we assume truly ambiguous cases to be rare and assume the subject of the photo is the target species to identify (see here for a video from iNaturalist describing "How to Take Identifiable Photos"). For (2.) this is an unfortunate reality of large scale datasets. For (3.) we are providing competitors with location and date information to help disambiguate these cases, should they wish to use it.
Note that while the Kaggle competition leaderboard is computed using top-1 error, competitors are required to submit 5 labels per image so that the organizers can study top-K performance. We will present the analysis at the FGVC8 workshop. See the Submission Format section below.
We made a few modifications to the competition this year. Similar to the 2017 competition, we are releasing the species names immediately, instead of obfuscating them. Our reason for obfuscating them in 2018 and 2019 was to make it difficult for competitors to scrape the web (or iNaturalist itself) for additional images. Because we are releasing 2.7M training images and the dataset doesn't necessarily focus on the long tail problem we feel that we can release the species names without worry. This does not mean that scraping is allowed. Please do not scrape for additional data, especially from iNaturalist or GBIF. Having the species names also makes interpreting validation results easier when examining confusion matrices and accuracy statistics.
We are also releasing location and date information for each image in the form of latitude
, longitude
, location_uncertainty
, and date
values. We have retroactively added this information to the 2017 and 2018 datasets, but this year competitors are able to utilize this information when building models. We hope this motivates competitors to devise interesting solutions to this large scale problem.
Participants are welcome to use previous iNaturalist competition datasets (2017, 2018, 2019) as an additional data source. However we do not provide a category or image mappings between the datasets and there is certainly overlap between the species and images included in the various datasets. Besides using previous iNaturalist competition datasets, participants are forbidden from collecting additional natural world data for the 2021 competition. Weights from models trained on ImageNet, COCO, and previous iNaturalist competition datasets may be used to initialize models. Models pretrained on other datasets are not allowed. Please specify any and all external data used for training when uploading results.
The general rule is that participants should only use the provided training and validation images (with the exception of the allowed pretrained models) to train a model to classify the test images. We do not want participants crawling the web in search of additional data for the target categories. Participants should be in the mindset that this is the only data available for these categories.
Participants are allowed to collect additional annotations (e.g. bounding boxes, keypoints) on the provided training and validation sets, however they should not annotate the test images. Teams should specify that they collected additional annotations when submitting results.
We follow the annotation format of the COCO dataset and add additional fields. The annotations are stored in the JSON format and are organized as follows:
{
"info" : info,
"images" : [image],
"categories" : [category],
"annotations" : [annotation],
"licenses" : [license]
}
info{
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : str,
}
image{
"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"license" : int,
"rights_holder" : str,
"date": str,
"latitude": float,
"longitude": float,
"location_uncertainty": int,
}
category{
"id" : int,
"name" : str,
"common_name" : str,
"supercategory" : str,
"kingdom" : str,
"phylum" : str,
"class" : str,
"order" : str,
"family" : str,
"genus" : str,
"specific_epithet" : str,
"image_dir_name" : str,
}
annotation{
"id" : int,
"image_id" : int,
"category_id" : int
}
license{
"id" : int,
"name" : str,
"url" : str
}
- It is possible for the
latitude
,longitude
, andlocation_uncertainty
fields to benull
when the values were missing from the respective iNaturalist observation (due to the location being private or the species being sensitive). location_uncertainty
is measured in meters, and may contain bogus values such as-80
and106807033
. We decided to provide the raw values that we obtained from iNaturalist instead of sanitized values.- The
test.json
file does not have theannotations
field. You can only evaluate on the test set by uploading your submission to the Kaggle competition.
The submission format for the Kaggle competition is a csv file with the following format:
Id,Predicted
1,9279 608 1220 6626 8696
2,3821 4951 4833 7040 6913
3,4112 4061 7453 7540 2348
The Id
column corresponds to the test image id. The Predicted
column corresponds to 5 category ids, separated by spaces. You should have one row for each test image, and the 5 category labels should be sorted in order of decreasing confidence (the most confident category is the first category label). The Kaggle leaderboard will be computed using top-1 error (see the Evaluation section above), and the competition organizers will study top-K error using the remaining predictions and present the results at the FGVC8 workshop.
Your submission file must have 5 category predictions per image or the Kaggle evaluation will fail.
By downloading this dataset you agree to the following terms:
- You will abide by the iNaturalist Terms of Service.
- You will use the data only for non-commercial research and educational purposes.
- You will NOT distribute the dataset images.
- Cornell University makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the data and shall defend and indemnify Cornell University, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.
The dataset files are available through the AWS Open Data Program:
- Train Images [224GB]
- s3://ml-inat-competition-datasets/2021/train.tar.gz
- Running
md5sum train.tar.gz
should producee0526d53c7f7b2e3167b2b43bb2690ed
- Images have a max dimension of 500px and have been converted to JPEG format
- Untaring the images creates a directory structure like
train/category/image.jpg
. This may take a while.
- Train Annotations [221MB]
- s3://ml-inat-competition-datasets/2021/train.json.tar.gz
- Running
md5sum train.json.tar.gz
should produce38a7bb733f7a09214d44293460ec0021
- Train Mini Images [42GB]
- s3://ml-inat-competition-datasets/2021/train_mini.tar.gz
- Running
md5sum train_mini.tar.gz
should producedb6ed8330e634445efc8fec83ae81442
- Images have a max dimension of 500px and have been converted to JPEG format
- Untaring the images creates a directory structure like
train_mini/category/image.jpg
. This may take a while.
- Train Mini Annotations [45MB]
- s3://ml-inat-competition-datasets/2021/train_mini.json.tar.gz
- Running
md5sum train_mini.json.tar.gz
should produce395a35be3651d86dc3b0d365b8ea5f92
- Validation Images [8.4GB]
- s3://ml-inat-competition-datasets/2021/val.tar.gz
- Running
md5sum val.tar.gz
should producef6f6e0e242e3d4c9569ba56400938afc
- Images have a max dimension of 500px and have been converted to JPEG format
- Untaring the images creates a directory structure like
val/category/image.jpg
. This may take a while.
- Validation Annotations [9.4MB]
- s3://ml-inat-competition-datasets/2021/val.json.tar.gz
- Running
md5sum val.json.tar.gz
should produce4d761e0f6a86cc63e8f7afc91f6a8f0b
- Test Images [43GB]
- s3://ml-inat-competition-datasets/2021/public_test.tar.gz
- Running
md5sum public_test.tar.gz
should produce7124b949fe79bfa7f7019a15ef3dbd06
- Images have a max dimension of 500px and have been converted to JPEG format
- Untaring the images creates a directory structure like
public_test/image.jpg
.
- Test Info [21MB]
- s3://ml-inat-competition-datasets/2021/public_test.json.tar.gz
- Running
md5sum public_test.json.tar.gz
should produce7a9413db55c6fa452824469cc7dd9d3d