Google Wake Words and Voice Commands in US English

Project Overview:


As a data collection and annotation company, we completed a project to build a dataset of audio clips featuring the wake words “Hey Google” and “OK Google,” each followed by a voice command in US English. The dataset is designed to support voice recognition systems and voice assistant technologies that use Google’s voice infrastructure.


The project involved gathering audio recordings from US English speakers, covering a range of accents and scenarios. Each recording was annotated to mark the wake word and the voice command that follows it.



  • Voice Assistant Users: We partnered with Google Assistant users who voluntarily provided audio clips of themselves saying “Hey Google” or “OK Google,” followed by voice commands in a variety of contexts.
  • Voice Actors: We enlisted professional voice actors to produce scripted recordings of wake words and voice commands, adding diversity to the dataset and giving us tighter control over its content.
  • Public Domain Recordings: We incorporated publicly available audio clips featuring the targeted wake words and voice commands in US English.

Data Collection Metrics

  • Total Audio Clips Collected: 100,000
  • User Contributions: 60,000
  • Voice Actor Recordings: 20,000
  • Public Domain Extracts: 20,000
  • Total Audio Clips Annotated: 100,000

Annotation Process


  1. Wake Word and Command Annotation: We precisely identified the start and end points of the “Hey Google” or “OK Google” wake words and the subsequent voice commands in each audio clip.
  2. Speaker Demographics: We gathered metadata on each speaker’s demographics, such as age, accent, and gender.
  3. Recording Conditions: We documented various recording settings, including background noise and acoustic environments.
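
To make the three annotation layers above concrete, here is a minimal sketch of what a single annotated clip record could look like. The field names, the source labels, and the example values are illustrative assumptions, not the schema actually used in the project.

```python
from dataclasses import dataclass
from typing import List

# The two wake words named in this case study.
WAKE_WORDS = ("hey google", "ok google")


@dataclass
class Segment:
    text: str       # transcript of the span
    start_s: float  # start offset in seconds from the beginning of the clip
    end_s: float    # end offset in seconds


@dataclass
class ClipAnnotation:
    clip_id: str
    source: str            # assumed labels: "user", "voice_actor", "public_domain"
    wake_word: Segment     # step 1: wake word boundaries
    command: Segment       # step 1: command boundaries
    age_group: str         # step 2: speaker demographics
    accent: str
    gender: str
    background_noise: str  # step 3: recording conditions
    environment: str

    def problems(self) -> List[str]:
        """Return a list of basic consistency problems (empty if none)."""
        found = []
        if self.wake_word.text.lower() not in WAKE_WORDS:
            found.append("unknown wake word")
        for name, seg in (("wake_word", self.wake_word), ("command", self.command)):
            if not 0.0 <= seg.start_s < seg.end_s:
                found.append(f"{name}: invalid time span")
        if self.command.start_s < self.wake_word.end_s:
            found.append("command overlaps wake word")
        return found


# Made-up example record, for illustration only.
example = ClipAnnotation(
    clip_id="clip-000001",
    source="user",
    wake_word=Segment("Hey Google", 0.42, 1.05),
    command=Segment("set a timer for ten minutes", 1.20, 3.10),
    age_group="25-34",
    accent="Southern US",
    gender="female",
    background_noise="low",
    environment="kitchen",
)
print(example.problems())
```

Keeping the wake word and the command as separate time-stamped segments lets a simple automated check catch overlapping or inverted boundaries before a human reviewer sees the clip.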

Annotation Metrics

  • Audio Clips with Annotations: 100,000
  • Speaker Demographic Metadata: 100,000
  • Recording Condition Metadata: 100,000

Quality Assurance


  • Annotation Verification: We implemented a rigorous validation protocol, utilizing both automated tools and human reviewers, to ensure the precision of our wake word and command annotations.
  • User Consent: We guaranteed that all user-contributed audio clips had clear consent for use in our dataset, with all personally identifiable information anonymized.
  • Privacy Compliance: We adhered to stringent privacy standards, encompassing data retention policies and providing opt-out options for our contributors.

QA Metrics

  • Annotation Validation Cases: 10,000 (10% of total dataset)
  • Privacy Audits: Conducted on all 60,000 user-contributed clips
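
As an illustration of how a 10% validation sample could be drawn, the sketch below picks clips at random within each source so the sample mirrors the dataset's composition. Sampling per source, the fixed seed, and the function and label names are assumptions made for this example. The case study does not describe how the 10,000 validation cases were selected.

```python
import random
from collections import defaultdict


def sample_for_validation(clips, fraction=0.10, seed=0):
    """Pick a reproducible random sample of clip IDs from each source.

    clips: iterable of (clip_id, source) pairs.
    Returns a dict mapping source -> list of sampled clip IDs.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    by_source = defaultdict(list)
    for clip_id, source in clips:
        by_source[source].append(clip_id)

    picked = {}
    for source, ids in by_source.items():
        k = round(len(ids) * fraction)
        picked[source] = rng.sample(ids, k)  # sampling without replacement
    return picked


# Placeholder clip IDs matching the collection counts reported above.
clips = (
    [(f"user-{i}", "user") for i in range(60_000)]
    + [(f"actor-{i}", "voice_actor") for i in range(20_000)]
    + [(f"public-{i}", "public_domain") for i in range(20_000)]
)
picked = sample_for_validation(clips)
print({source: len(ids) for source, ids in picked.items()})
```

With the reported counts, this selects 6,000 user clips, 2,000 voice actor clips, and 2,000 public domain clips, for 10,000 validation cases in total.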


The Google Wake Words and Voice Commands Dataset in US English comprises 100,000 annotated audio clips from three source types, with speaker demographic and recording condition metadata for every clip, a 10% annotation validation sample, and privacy audits of all user-contributed clips. It demonstrates our ability to collect and annotate diverse, privacy-compliant speech data for voice recognition, voice assistant, and other machine learning applications.


  • Quality Data Creation
  • Guaranteed TAT
  • ISO 9001:2015, ISO/IEC 27001:2013 Certified
  • HIPAA Compliance
  • GDPR Compliance
  • Compliance and Security
