Our company successfully built a comprehensive dataset of audio clips featuring the “Hey Google” or “OK Google” wake words in US English. This dataset, crucial for improving wake word detection and voice assistant technologies, showcases our expertise in gathering and annotating high-quality data for machine learning models.
Scope
We gathered audio recordings from US English speakers covering a range of accents and recording contexts. Our team annotated each recording with precise wake word markers, demonstrating our capability in handling complex data annotation projects.
Sources
Voice Assistant Users: We collaborated with Google Assistant users who consented to contribute audio clips of themselves saying “Hey Google” or “OK Google” in different contexts.
Voice Actors: We hired professional voice actors to record the wake words under controlled conditions, adding diversity and control to the dataset.
Public Domain Recordings: We extracted publicly available audio recordings containing instances of the “Hey Google” or “OK Google” wake words in US English.
Data Collection Metrics
Total Audio Clips Collected and Annotated: 60,000 clips
User Contributions: 36,000
Voice Actor Recordings: 12,000
Public Domain Extracts: 12,000
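The three source counts above add up to the stated total. As a quick illustration, the minimal Python sketch below checks that sum and derives each source's share of the dataset. The dictionary keys are hypothetical labels chosen for this example, not field names from the actual dataset.

```python
# Clip counts per source, taken from the metrics listed above.
# The key names are illustrative labels, not official field names.
clip_counts = {
    "user_contributions": 36_000,
    "voice_actor_recordings": 12_000,
    "public_domain_extracts": 12_000,
}

total_clips = sum(clip_counts.values())

# Fraction of the dataset that each source contributes.
shares = {source: count / total_clips for source, count in clip_counts.items()}

for source, share in shares.items():
    print(f"{source}: {clip_counts[source]:,} clips ({share:.0%})")
print(f"total: {total_clips:,} clips")
```

User contributions make up 60% of the clips, with voice actor recordings and public domain extracts contributing 20% each.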
Annotation Process
Stages
Wake Word Annotation: We accurately identified and marked the “Hey Google” or “OK Google” wake words in each audio clip.
Speaker Demographics: Our team collected and annotated demographic metadata, including age, accent, and gender, for each speaker.
Recording Conditions: We documented and annotated various recording conditions like background noise and acoustic environments.
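To make the three annotation stages concrete, here is a minimal sketch of how a single annotated clip could be represented. It is an assumption for illustration only: the class name, field names, and example values are hypothetical and do not describe the actual schema used in the project.

```python
import json
from dataclasses import asdict, dataclass

# The two wake phrases covered by the dataset.
WAKE_PHRASES = ("Hey Google", "OK Google")


@dataclass
class ClipAnnotation:
    """Hypothetical record combining the three annotation stages."""

    clip_id: str
    # Stage 1: wake word annotation (phrase plus time boundaries in seconds).
    wake_phrase: str
    wake_start_s: float
    wake_end_s: float
    # Stage 2: speaker demographics.
    speaker_age_group: str
    speaker_accent: str
    speaker_gender: str
    # Stage 3: recording conditions.
    background_noise: str
    acoustic_environment: str


# Example values are invented for illustration.
example = ClipAnnotation(
    clip_id="clip_000001",
    wake_phrase="Hey Google",
    wake_start_s=0.42,
    wake_end_s=1.10,
    speaker_age_group="25-34",
    speaker_accent="Southern US",
    speaker_gender="female",
    background_noise="low",
    acoustic_environment="kitchen",
)

# Serialize to JSON so the record can be stored next to the audio file.
record_json = json.dumps(asdict(example), indent=2)
print(record_json)
```

Keeping all three stages in one record per clip makes it straightforward to confirm that every clip carries all three kinds of metadata.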
Annotation Metrics
Audio Clips with Wake Word Annotations: 60,000
Speaker Demographic Metadata: 60,000
Recording Condition Metadata: 60,000
Quality Assurance
Stages
Annotation Verification: We employed automated tools and human reviewers to ensure the accuracy of wake word annotations.
User Consent: We maintained strict privacy standards, ensuring all user-contributed audio clips had explicit consent for use.
Privacy Compliance: We adhered to privacy regulations, including data retention policies and opt-out options for contributors.
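The automated part of such a verification step might look like the sketch below. This is not the tooling used in the project. It is a hypothetical check, with assumed field names (`wake_phrase`, `wake_start_s`, `wake_end_s`, `source`, `consent_recorded`), that flags records a human reviewer should look at.

```python
# Hypothetical automated pre-check run before human review.
# All field names here are assumptions made for illustration.
WAKE_PHRASES = {"Hey Google", "OK Google"}


def find_annotation_problems(annotation: dict, clip_duration_s: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []

    # The annotated phrase must be one of the two supported wake phrases.
    if annotation.get("wake_phrase") not in WAKE_PHRASES:
        problems.append("unknown wake phrase")

    # Wake word boundaries must exist, be ordered, and lie inside the clip.
    start = annotation.get("wake_start_s")
    end = annotation.get("wake_end_s")
    if start is None or end is None:
        problems.append("missing wake word boundaries")
    elif not (0 <= start < end <= clip_duration_s):
        problems.append("wake word boundaries out of range")

    # User-contributed clips need explicit consent on record.
    if annotation.get("source") == "user" and not annotation.get("consent_recorded", False):
        problems.append("no consent on record")

    return problems


good = {
    "wake_phrase": "OK Google",
    "wake_start_s": 0.3,
    "wake_end_s": 0.9,
    "source": "user",
    "consent_recorded": True,
}
bad = {
    "wake_phrase": "Hey Siri",
    "wake_start_s": 1.5,
    "wake_end_s": 0.9,
    "source": "user",
}

print(find_annotation_problems(good, clip_duration_s=2.0))
print(find_annotation_problems(bad, clip_duration_s=2.0))
```

Records that come back with a non-empty problem list would be routed to human reviewers rather than rejected outright, since the automated checks only catch structural errors and cannot judge whether the boundaries match the audio.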
The Google Wake Words Dataset in US English is a testament to our expertise in data collection and annotation. It serves as an invaluable resource for advancements in voice recognition and natural language processing technologies.