Overview

The Consented Activities of People (CAP) dataset is a fine-grained activity dataset for visual AI research, curated using the Visym Collector platform. It contains annotated videos of consented people performing fine-grained activity classes. Videos are recorded on mobile devices around the world from a third-person viewpoint looking down on the scene from above, and show subjects performing everyday activities. Each video is annotated with bounding box tracks around the primary actor and with temporal start/end frames for each activity instance, and is distributed in vipy JSON format. An interactive visualization and a video summary are available for review below.

The CAP dataset was collected with the following goals:

Atomic. Activities are at most 3 seconds long and are visually grounded (i.e. each activity should be unambiguously determined from the pixels).
Fine-grained. Activities are selected so that classes differ only subtly, which makes activity representation and discrimination critical for performance, rather than scene context or object detection. The label space of fine-grained activities is tree-structured by design.
Person centered. All activities are collected from handheld or stabilized mobile devices at a fixed security perspective (e.g. looking down on a scene from above) and include a single consented person as the primary subject. Subjects are tasked with performing specific atomic activities, person/object interactions, or person/person interactions.
Around the house. The collection involves objects, locations and activities that most collectors have easy access to and can easily perform without practice.
Non-overlapping. All activities are performed independently; no activity is performed jointly with, or at the same time as, another activity (e.g. a subject will not perform the “person uses cell phone” activity while performing the “person takes off hat” activity).
Ethical. All videos are collected with informed consent for how the videos will be shared and used. Non-consented subjects have their faces blurred.
Worldwide. Videos are collected from 780 collectors in 33 countries.
Large-scale. We provide an open and easily downloaded training/validation set suitable for pre-training.
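
The "Atomic" and "Non-overlapping" goals above can be illustrated with a minimal sketch. This is not the vipy annotation schema: the `ActivityInstance` record, its field names, the inclusive end-frame convention, and the 30 frames per second used in the example are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ActivityInstance:
    """Hypothetical record for one annotated activity (NOT the vipy schema)."""
    label: str
    start_frame: int  # first frame of the activity (assumed inclusive)
    end_frame: int    # last frame of the activity (assumed inclusive)


def is_atomic(activity: ActivityInstance, fps: float, max_seconds: float = 3.0) -> bool:
    """Return True if the activity lasts at most max_seconds at the given frame rate."""
    num_frames = activity.end_frame - activity.start_frame + 1
    return num_frames / fps <= max_seconds


def overlapping_pairs(
    activities: List[ActivityInstance],
) -> List[Tuple[ActivityInstance, ActivityInstance]]:
    """Return every pair of activities whose frame intervals share at least one frame."""
    ordered = sorted(activities, key=lambda a: a.start_frame)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Sorted by start frame: once one starts after `first` ends, all later ones do too.
            if second.start_frame > first.end_frame:
                break
            pairs.append((first, second))
    return pairs


# Example with made-up frame numbers and an assumed frame rate of 30 fps.
clip = [
    ActivityInstance("person takes off hat", 0, 59),
    ActivityInstance("person uses cell phone", 90, 170),
]
all_atomic = all(is_atomic(a, fps=30.0) for a in clip)
no_overlap = not overlapping_pairs(clip)
```

A check like this could be run over the activity instances of a video after loading its annotations, with the actual frame rate read from the video rather than assumed.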

This dataset is associated with the:

First Workshop on Fine-grained Activity Detection at WACV’23 in January 2023
Second Workshop on Fine-grained Activity Detection at ICCV’23 in October 2023

Explorer

The dataset explorer shows a 4% sample of the CAP dataset, tightly cropped spatially around the actor and cropped temporally around the fine-grained activity being performed. The full dataset includes the larger spatiotemporal context in each video around the activity, and the complete set of activity labels. This open-source visualization tool can be sorted by category or color and shown in full screen.

Visualization

This video visualization shows a sample of 40 activities from each of 28 collectors, tightly cropped around the actor. We also provide visualizations of a random sample of the full-context videos available in the training/validation set, and of 5Hz background-stabilized videos.

Summary

This summary shows statistics for the entire CAP dataset, which includes the activity classification and activity detection subsets as well as the sequestered test sets. The public training/validation sets for specific tasks will be smaller than these totals.

Download

cap_detection_handheld_val.tar.gz (0.9 GB) MD5:72f58e69582c17dd366d3c7e85cf0da8 (05May23)
- Validation set for handheld activity detection in untrimmed clips for the second fine-grained activity detection challenge
- Getting started using the activity detection validation set
cap_classification_clip.tar.gz (288 GB) MD5:54315e2ce204f0dbbe298490a63b5b3b (02Mar22)
- Tight temporal clip training/validation set for handheld activity classification
cap_classification_pad.tar.gz (386 GB) MD5:fbdc75e6ef10b874ddda20ee9765a710 (02Mar22)
- Temporally padded (>4s) training/validation set for handheld activity classification
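
Since each archive is listed with an MD5 digest, a download can be checked before unpacking. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are illustrative, not part of any CAP or vipy tooling, and the sketch assumes the file keeps its original name on disk.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the download list above.
EXPECTED_MD5 = {
    "cap_detection_handheld_val.tar.gz": "72f58e69582c17dd366d3c7e85cf0da8",
    "cap_classification_clip.tar.gz": "54315e2ce204f0dbbe298490a63b5b3b",
    "cap_classification_pad.tar.gz": "fbdc75e6ef10b874ddda20ee9765a710",
}


def md5_of_file(path, chunk_bytes=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its file name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no MD5 digest listed for {path.name}")
    return md5_of_file(path) == expected
```

For example, `verify_download("cap_detection_handheld_val.tar.gz")` would return `True` only for an intact copy of the validation archive. Chunked reading matters here because the classification archives are hundreds of gigabytes.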

License

Creative Commons Attribution 4.0 International (CC BY 4.0). Every subject in this dataset has consented to having their personally identifiable information shared publicly for the purpose of advancing computer vision research. Non-consented subjects have their faces blurred out.

Reference

Jeffrey Byrne (Visym Labs), Greg Castanon (STR), Zhongheng Li (STR) and Gil Ettinger (STR)
“Fine-grained Activities of People Worldwide”, arXiv:2207.05182, 2022

@article{Byrne2023Fine,
   title = "Fine-grained Activities of People Worldwide",
   author = "J. Byrne and G. Castanon and Z. Li and G. Ettinger",
   journal = "Winter Conference on Applications of Computer Vision (WACV'23)",
   year = 2023
}

Acknowledgement

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00344. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.

We thank the AWS Open Data Sponsorship Program for supporting the storage and distribution of this dataset.

Contact

Visym Labs <info@visym.com>