Caltech-UCSD Birds-200-2011 (CUB-200-2011)

Bird classification dataset with 11,788 images and 200 bird classes. It also contains part annotations, visual attributes, and bounding boxes.
[URL] [Paper]


Updated bird classification dataset containing 48,562 images of North American birds from 555 bird classes. It also includes part annotations, bounding boxes, and expert-validated labels.
[URL] [Paper]

Pasadena Urban Trees

This dataset includes dense aerial and street-view imagery for 30,000 trees in Pasadena, California, each labeled with geo-location and tree species.
[URL] [Paper]

iNaturalist Species Classification

The iNat datasets contain images of visually similar species, captured in a wide variety of situations, from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists from the online platform iNaturalist. There are currently four different variants of the iNat dataset: iNat2017, iNat2018, iNat2019, and iNat2021.
[URL] [Paper]
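The class imbalance mentioned above is often the first thing worth measuring before training on data like this. The following is a minimal sketch, assuming only that per-image class labels are available as a Python list. The function name `imbalance_summary` and the toy labels are hypothetical and do not come from any iNat release.

```python
from collections import Counter


def imbalance_summary(labels):
    """Summarize class imbalance for a list of per-image class labels."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "num_classes": len(counts),
        "largest_class": largest,
        "smallest_class": smallest,
        # Ratio of the most frequent class to the least frequent class.
        "imbalance_ratio": largest / smallest,
    }


# Toy labels for illustration only (hypothetical, not real iNat data).
toy_labels = ["sparrow"] * 8 + ["heron"] * 3 + ["kinglet"] * 1
summary = imbalance_summary(toy_labels)
print(summary)
```

A large `imbalance_ratio` suggests that per-class metrics, rather than overall accuracy alone, may give a fairer picture of model performance.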


iWildCam

The iWildCam datasets contain diverse sets of camera trap imagery, with a focus on the challenges of applying computer vision to ecological monitoring with static sensors. These challenges include generalizing models to new sensor deployments or novel species, and incorporating data from alternate image capture domains such as community science or satellite imagery. There are currently four different variants of the iWildCam dataset: iWildCam 2018, iWildCam 2019, iWildCam 2020, and iWildCam 2021.
[URL] [Paper]
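Generalization to new sensor deployments, as described for iWildCam above, is commonly evaluated by keeping the camera locations in the test split disjoint from those in the training split. The sketch below illustrates that idea on toy records. The function `split_by_location`, the record fields `image` and `location`, and the default `test_fraction` are all assumptions made for illustration, not the official iWildCam split procedure or file format.

```python
import random


def split_by_location(records, test_fraction=0.25, seed=0):
    """Split records so that no camera location appears in both splits."""
    locations = sorted({r["location"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(locations)
    # Hold out at least one location for testing.
    n_test = max(1, round(len(locations) * test_fraction))
    test_locations = set(locations[:n_test])
    train = [r for r in records if r["location"] not in test_locations]
    test = [r for r in records if r["location"] in test_locations]
    return train, test


# Toy records for illustration only: 12 images spread over 4 hypothetical cameras.
toy_records = [
    {"image": f"img_{i}.jpg", "location": f"cam_{i % 4}"} for i in range(12)
]
train, test = split_by_location(toy_records)
print(len(train), len(test))
```

Splitting by location rather than by image avoids leaking background and viewpoint cues from a camera into the evaluation set.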

Caltech Camera Traps

This dataset contains 243,100 camera trap images from 140 camera locations in the Southwestern United States. It includes labels for 21 animal categories (plus empty), along with 66,000 bounding box annotations.
[URL] [Paper]

GeoLifeCLEF 2020

A collection of 1.9 million species observations paired with high-resolution remote sensing imagery, land cover data, and altitude, in addition to traditional low-resolution climate and soil variables.
[URL] [Paper]


Fashionpedia

Dataset for instance segmentation and attribute localization. Fashionpedia consists of: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes, and their relationships; and (2) a dataset of everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.
[URL] [Paper]


A new suite of challenging natural world visual benchmark tasks motivated by real-world image understanding use cases. The tasks are validated by experts and span a diverse range of visual concepts including behavior, age, health, and more.
[URL] [Paper]


Sapsucker Woods 60 (SSW60)

Sapsucker Woods 60 (SSW60) is a dataset for advancing research on audiovisual fine-grained categorization. It covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets, along with new expert-curated audio and video data.
[URL] [Paper]

Caltech Fish Counting (CFC)

The Caltech Fish Counting Dataset (CFC) is a large-scale dataset for detecting, tracking, and counting fish in sonar videos. It provides a rich source of data for advancing low signal-to-noise computer vision applications and tackling domain generalization for multiple-object tracking and counting. The dataset contains over half a million annotations in over 1,500 videos sourced from seven different sonar cameras.
[URL] [Paper]