Project Overview


The “Hinglish Media Audio Dataset” project builds a comprehensive audio dataset of Hinglish, the everyday mix of Hindi and English, for speech recognition applications. The dataset is intended for training AI models to understand and process mixed-language speech, which is common in many regions, particularly in India.


  • Diverse Environmental Conditions: The dataset includes recordings from a range of environments, from quiet indoor settings to noisy outdoor locations, so that models learn to process speech accurately under varied acoustic conditions.
  • Variety of Speech Contexts: The dataset captures Hinglish speech in contexts such as informal conversations, media broadcasts, public speeches, and digital communications, so that models can generalize across different speech situations.
  • Audio samples were sourced from a mix of national and regional media outlets, ensuring a representation of both mainstream and niche content.
  • Collaborations with broadcasters and digital platforms were key in acquiring a comprehensive range of Hinglish audio material.
  • The collected data forms a diverse and authentic set of Hinglish audio interactions that captures the nuances of language and cultural expression.

Data Collection Metrics

  • Total Hinglish Recordings: 22,500
  • Volunteers’ Contributions: 15,000
  • Public Domain Datasets: 4,500
  • Professional Recordings: 3,000
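As a quick consistency check, the three source counts above can be summed and compared with the reported total. The short Python sketch below does that and also derives each source's share of the dataset. The dictionary keys are illustrative names chosen for this example, not identifiers from the project.

```python
# Counts taken from the Data Collection Metrics list above.
# The key names are illustrative, not project identifiers.
sources = {
    "volunteer_contributions": 15_000,
    "public_domain_datasets": 4_500,
    "professional_recordings": 3_000,
}
reported_total = 22_500

# The per-source counts should add up to the reported total.
computed_total = sum(sources.values())
assert computed_total == reported_total, "source counts do not add up to the total"

# Share of each source as a percentage of all recordings, rounded to one decimal.
shares = {
    name: round(100 * count / computed_total, 1)
    for name, count in sources.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

By these figures, volunteers contributed roughly two thirds of the recordings, public domain datasets one fifth, and professional recordings the remainder.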

Annotation Process


  1. Language Identification: Labeling each recording with its predominant language (Hindi, English, or Mixed).
  2. Metadata Documentation: Recording details such as the date, time, and context of each speech sample.
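The two annotation steps above could be represented as one record per recording. The following Python sketch is only an illustration under stated assumptions: the project does not publish its schema, so the class names, field names, the `parse_label` helper, and the sample values are all hypothetical. Only the three language labels and the date, time, and context fields come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class LanguageLabel(Enum):
    """The three predominant-language labels named in the annotation process."""
    HINDI = "Hindi"
    ENGLISH = "English"
    MIXED = "Mixed"


@dataclass(frozen=True)
class RecordingAnnotation:
    """One annotation record; all field names are hypothetical."""
    recording_id: str        # assumed identifier format, not from the source
    language: LanguageLabel  # predominant language of the recording
    recorded_at: datetime    # date and time of the speech sample
    context: str             # e.g. "informal conversation", "media broadcast"


def parse_label(raw: str) -> LanguageLabel:
    """Map a free-text label to a LanguageLabel, ignoring case and padding.

    Raises ValueError for anything other than Hindi, English, or Mixed.
    """
    return LanguageLabel(raw.strip().capitalize())


# Illustrative record; the ID, timestamp, and context are made-up sample values.
example = RecordingAnnotation(
    recording_id="rec-00001",
    language=parse_label(" mixed "),
    recorded_at=datetime(2024, 1, 15, 9, 30),
    context="media broadcast",
)
print(example.language.value)
```

Restricting the label to an enumeration means a misspelled or unexpected label fails loudly at annotation time instead of silently entering the dataset.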

Annotation Metrics

  • Recordings with Language Labels: 22,500
  • Metadata Entries: 22,500

Quality Assurance


  • Annotation Review: Expert linguists review the labels for accuracy and consistency.
  • Data Integrity Checks: Filtering out recordings that don’t meet the quality standards.
  • Data Security: Upholding privacy norms and ensuring data confidentiality.

QA Metrics

  • Reviewed Annotations: 2,250 (10% of total)
  • Data Cleansing: Exclusion of inadequate recordings
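The source does not say how the 10% of annotations were chosen for review. One common approach is a reproducible random sample, sketched below in Python. The function name, the seed, and the `rec-00001` style IDs are assumptions made for this example.

```python
import random


def select_for_review(recording_ids, fraction=0.10, seed=0):
    """Return a reproducible random sample of recording IDs for expert review.

    Uniform random sampling is an assumption; the project may have selected
    its review set differently.
    """
    sample_size = round(len(recording_ids) * fraction)
    rng = random.Random(seed)  # fixed seed so the same batch can be regenerated
    return rng.sample(recording_ids, sample_size)


# Hypothetical IDs for the 22,500 annotated recordings.
all_ids = [f"rec-{i:05d}" for i in range(1, 22_501)]
review_batch = select_for_review(all_ids)
print(len(review_batch))  # 2250, matching the reviewed-annotation count above
```

A fixed seed lets reviewers regenerate the same batch later, which helps when audit results need to be traced back to specific recordings.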


The “Hinglish Media Audio Dataset” provides a diverse range of Hinglish audio samples that are annotated and quality-checked, giving speech recognition systems a foundation for understanding and processing mixed-language speech. It is particularly useful for regions where Hinglish is a prevalent mode of communication.

