Datasets
In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
Competitive Robot Table Tennis: Initial Ball States
The dataset comprises an array of initial table tennis ball states, recorded just after the ball was hit, each of the form: [pos_x, pos_y, pos_z, vel_x, vel_y, vel_z, w_vel_x, w_vel_y, w_vel_z].
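As a minimal sketch of how one such nine-element state might be unpacked, the Python below groups the fields into three triples. The names `BallState` and `parse_ball_state` are hypothetical and not part of the release, the reading of `w_vel_*` as angular velocity is an assumption, and the sample values are made up for illustration; units and the on-disk file format are not stated here, so none are assumed.

```python
from typing import NamedTuple, Sequence, Tuple

Vec3 = Tuple[float, float, float]


class BallState(NamedTuple):
    """One initial ball state, split into three 3-vectors (hypothetical type)."""
    position: Vec3          # pos_x, pos_y, pos_z
    velocity: Vec3          # vel_x, vel_y, vel_z
    angular_velocity: Vec3  # w_vel_x, w_vel_y, w_vel_z (assumed to be spin)


def parse_ball_state(row: Sequence[float]) -> BallState:
    """Split a flat 9-element row into position, velocity and w_vel triples."""
    if len(row) != 9:
        raise ValueError(f"expected 9 values, got {len(row)}")
    values = tuple(float(v) for v in row)
    return BallState(values[0:3], values[3:6], values[6:9])


# Illustrative values only; they are not taken from the dataset.
example = parse_ball_state([0.1, -1.2, 0.3, 0.5, 6.0, 1.5, 10.0, -20.0, 0.0])
print(example.velocity)
```

Keeping the three triples separate makes it harder to mix up linear and angular components when the states are later fed to a simulator or a policy.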
Topics API DP Synthetic Data Release
The dataset contains realistic traces of Chrome's Topics API outputs obtained using differential privacy. Full details are available in our publication "Differentially Private Synthetic Data Release for Topics API Outputs", Travis Dick et al., Proceedings of KDD 2025.
ScreenQA
The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It should be used to train and evaluate models capable of screen content understanding via question answering.
HowToDIV
The HowToDIV dataset consists of dialogues, instructions, and video steps for procedural task assistance across diverse domains in cooking, mechanics, and planting. Each session includes a multi-turn conversation in which an expert teaches a user how to perform a task while perceiving the user's surroundings.
Cultural Familiarity Annotations
This dataset consists of prompts and AI-generated text for 10 countries (China, Ethiopia, Greece, Indonesia, Iran, Mexico, South Korea, Spain, the United Kingdom, and the United States), accompanied by human annotations from each country assessing the cultural familiarity of the generations.
SCIN Crowdsourced Dermatology Dataset
The SCIN dataset contains 10,000 images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels, as well as estimated Fitzpatrick skin type and Monk Skin Tone.
Net-NTLMv1 Rainbow Tables
Rainbow tables generated for the Net-NTLMv1 authentication protocol with the challenge 1122334455667788, to aid in key recovery.
BamTwoogle
The BamTwoogle dataset accompanies the paper "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent" (https://arxiv.org/abs/2312.10003). It was written to be a complementary, slightly more challenging sequel to the Bamboogle dataset.
ScreenQA Short
The dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It should be used to train and evaluate models capable of screen content understanding via question answering.
Adversarial Nibbler Round 1 Dataset
This dataset contains results from round 1 of the Adversarial Nibbler challenge. The data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. It also includes all prompts submitted and all prompts attempted.
Google Data Center Power Trace 2019
Power utilization of power domains in Google data centers during May 2019.
Screen Annotation
The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations describe the UI elements present on the screen: their type, location, OCR text and a short description.
CF-TriviaQA
The CF-TriviaQA dataset accompanies the paper "Hallucination Augmented Recitations for Language Models" (https://arxiv.org/abs/2311.07424). It is a counterfactual open-book QA dataset generated from the TriviaQA dataset using the HAR approach, with the purpose of improving attribution in LLMs.
Upwelling irradiance from GOES-16
Machine-learned models that estimate wideband irradiance from 2 km narrow-band radiances, using co-aligned satellite imagery as training data. They can be used to make satellite-driven estimates of contrail warming.
LibriTTS-R
LibriTTS-R is a sound-quality-improved version of the LibriTTS corpus (http://www.openslr.org/60/), a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate. To improve sound quality, a speech restoration model, Miipher, was used.