Malay Media Audio Dataset

Project Overview


The Malay Media Audio Dataset project aims to build an audio dataset for training machine learning models in voice recognition, natural language processing, and media analysis. The dataset focuses specifically on the Malay language and provides a rich source of linguistic data.


The scope covers collecting and annotating Malay-language audio from diverse sources, including media clips, interviews, and other spoken-word recordings. Each audio file is annotated with detailed metadata: speaker identity, speech context, and technical attributes.



  • Movies and TV Shows: Scenes featuring characters speaking in Malay.
  • Interviews: Conversations and interviews with Malay speakers.
  • Online Videos: Social media clips, YouTube videos, and other online content featuring spoken Malay.

Data Collection Metrics

  • Total Audio Recordings: 18,000 recordings
  • Media Clips: 7,000
  • Interviews: 6,000
  • Other Spoken-Word Recordings: 5,000
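
As a quick consistency check, the per-category counts above can be summed and compared with the stated total. The following is a minimal Python sketch; the dictionary keys are illustrative names chosen here, not fields defined by the dataset.

```python
# Counts as stated in the Data Collection Metrics list.
# The key names are hypothetical labels, not part of the dataset itself.
category_counts = {
    "media_clips": 7_000,
    "interviews": 6_000,
    "other_spoken_word": 5_000,
}
stated_total = 18_000

# Sum the categories and compute each category's share of the whole.
computed_total = sum(category_counts.values())
shares = {name: count / computed_total for name, count in category_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("matches stated total:", computed_total == stated_total)
```

The three categories add up to the stated 18,000 recordings, so the breakdown is internally consistent.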

Annotation Process


  1. Speaker Identification: Annotate each audio recording with the identity of the speaker(s) and their role in the media.
  2. Contextual Tagging: Tag each recording with context information like topic, setting, and emotional tone.
  3. Technical Annotation: Include technical data such as audio quality, duration, and background noise levels.
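
The three annotation steps above could be captured in a single record per recording. The sketch below is one possible layout in Python, not the project's actual schema: every class name, field name, and example value is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Speaker:
    """Step 1: who is speaking and in what capacity (hypothetical fields)."""
    speaker_id: str
    role: str  # role in the media, e.g. "host" or "guest"


@dataclass
class RecordingAnnotation:
    """One annotated recording; the layout is illustrative only."""
    recording_id: str
    speakers: List[Speaker]        # step 1: speaker identification
    topic: str                     # step 2: contextual tagging
    setting: str
    emotional_tone: str
    audio_quality: str             # step 3: technical annotation
    duration_seconds: float
    background_noise_level: str

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record looks complete."""
        problems = []
        if not self.speakers:
            problems.append("no speakers annotated")
        if self.duration_seconds <= 0:
            problems.append("duration must be positive")
        for name in ("topic", "setting", "emotional_tone"):
            if not getattr(self, name).strip():
                problems.append(f"missing contextual tag: {name}")
        return problems


# Made-up example record, used only to show the shape of the data.
example = RecordingAnnotation(
    recording_id="rec-0001",
    speakers=[Speaker(speaker_id="spk-01", role="interviewer")],
    topic="education",
    setting="studio",
    emotional_tone="neutral",
    audio_quality="good",
    duration_seconds=42.5,
    background_noise_level="low",
)
print(example.validate())
```

Keeping the speaker, contextual, and technical fields in one record makes it straightforward to check that every recording carries all three kinds of annotation.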

Annotation Metrics

  • Audio Recordings with Speaker and Contextual Labels: 18,000
  • Technical Annotations: 18,000

Quality Assurance


  • Rigorous validation process to ensure the accuracy of annotations.
  • Regular checks for audio quality and clarity.
  • Adherence to data privacy regulations and ethical guidelines.

QA Metrics

  • Audio Quality Checks: 3,000 recordings
  • Annotation Accuracy Review: 2,000 recordings
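
If the QA figures are read as review samples drawn from the full set of 18,000 recordings (an assumption; the source does not say how the reviewed recordings were selected), the coverage works out to roughly 16.7% for audio quality checks and 11.1% for annotation accuracy review. The Python sketch below computes these fractions and shows one way such a sample could be drawn at random; the seed and the integer recording IDs are hypothetical.

```python
import random

total_recordings = 18_000
# QA counts as stated in the QA Metrics list; key names are illustrative.
qa_samples = {
    "audio_quality_checks": 3_000,
    "annotation_accuracy_review": 2_000,
}

# Fraction of the dataset covered by each QA activity.
coverage = {name: n / total_recordings for name, n in qa_samples.items()}
for name, fraction in coverage.items():
    print(f"{name}: {fraction:.1%}")

# One possible selection method: a reproducible simple random sample
# of recording indices, drawn without replacement.
rng = random.Random(0)  # fixed seed is an arbitrary choice for repeatability
sampled_ids = rng.sample(range(total_recordings), qa_samples["audio_quality_checks"])
```

Simple random sampling is only one option; a stratified sample across media clips, interviews, and other recordings would be a reasonable alternative if per-category quality matters.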


The Malay Media Audio Dataset supports the development of machine learning models that require Malay-language audio input. It combines 18,000 recordings from media clips, interviews, and other spoken-word sources with speaker, contextual, and technical annotations, making it suited to researchers and developers working in voice recognition, linguistic analysis, and media studies. The validation and quality checks described above are intended to keep the dataset reliable across these applications.


Quality Data Creation

Guaranteed TAT

ISO 9001:2015, ISO/IEC 27001:2013 Certified

HIPAA Compliance

GDPR Compliance

Compliance and Security

Let's Discuss Your Data Collection Requirements

To get a detailed estimate for your requirements, please reach out to us.
