Canadian French Conversations

Project Overview


The “Canadian French Conversations” project develops a dataset for training automatic speech recognition (ASR) models. The dataset is intended to help ASR systems transcribe spoken Canadian French more accurately, for applications such as voice-activated systems, transcription services, and language learning tools.


This initiative focuses on collecting and annotating spoken Canadian French dialogues from a variety of sources, so that a broad spectrum of dialects, accents, and colloquialisms is represented. The project covers both raw audio collection and detailed transcription, including dialogue annotations and contextual metadata.

  • Dialogue Recordings: Collection of Canadian French conversations from public forums, educational materials, and volunteered contributions.
  • Annotation Experts: Engagement of language experts and native speakers for precise and culturally accurate transcriptions.

Data Collection Metrics

  • Total Conversations Collected: 7,500 dialogues
  • Conversations from Public Forums: 5,500
  • Educational Material Contributions: 2,000
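
As a quick consistency check on the figures above, the per-source counts can be tallied against the reported total. This is a minimal sketch: the counts come from the metrics listed here, while the dictionary layout and the names SOURCE_COUNTS, REPORTED_TOTAL, and source_shares are assumptions made for illustration.

```python
# Counts are taken from the metrics above; the layout and names are assumptions.
SOURCE_COUNTS = {
    "public_forums": 5500,
    "educational_materials": 2000,
}
REPORTED_TOTAL = 7500


def source_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each source's share of all conversations, as a percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The per-source counts should add up to the reported total.
assert sum(SOURCE_COUNTS.values()) == REPORTED_TOTAL

shares = source_shares(SOURCE_COUNTS)
```

Under these figures, public forums account for roughly three quarters of the collected conversations.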

Annotation Process


  1. Verbatim Transcription: Each conversation is transcribed verbatim, capturing the nuances of Canadian French, including regional slang and idiomatic expressions.
  2. Metadata Annotation: Metadata such as speaker information, context, regional dialect indicators, and conversation themes are logged.
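
The two annotation steps above could be captured in a single record per conversation. The sketch below is only an illustration: the project's actual schema is not published here, so every field name, identifier, and example value (including the sample sentence and the dialect label) is a hypothetical choice.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Speaker:
    speaker_id: str        # hypothetical anonymised identifier
    native_speaker: bool   # hypothetical speaker attribute


@dataclass
class ConversationRecord:
    conversation_id: str   # hypothetical identifier format
    source: str            # e.g. "public_forum" or "educational_material"
    transcript: str        # verbatim transcription (step 1)
    context: str           # metadata fields below correspond to step 2
    regional_dialect: str
    themes: list[str] = field(default_factory=list)
    speakers: list[Speaker] = field(default_factory=list)


# Invented example values, not taken from the dataset.
record = ConversationRecord(
    conversation_id="conv_00001",
    source="public_forum",
    transcript="Ben là, y fait frette à matin!",
    context="casual conversation",
    regional_dialect="Québec",
    themes=["weather"],
    speakers=[Speaker(speaker_id="spk_001", native_speaker=True)],
)

# ensure_ascii=False keeps accented characters readable in the output.
record_json = json.dumps(asdict(record), ensure_ascii=False)
```

Keeping the transcript and its metadata in one record makes it straightforward to filter conversations by dialect or theme when assembling training sets.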

Annotation Metrics

  • Conversations with Transcriptions: 7,500
  • Metadata Annotated Conversations: 7,500

Quality Assurance


  • Transcription Review: Engaging a team of native Canadian French speakers and linguists to review and validate transcriptions for accuracy and cultural relevance.
  • Data Quality Control: Stringent measures to remove or correct transcriptions with significant errors or inconsistencies.
  • Data Security: Ensuring compliance with privacy laws and intellectual property rights.

QA Metrics

  • Transcription Validation Cases: 750 (10% of total)
  • Data Cleansing and Error Correction: Rigorous review and editing process.
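
The 10% validation figure can be illustrated with a small sampling sketch. The source does not say how the 750 cases are chosen, so a simple random sample with a fixed seed is assumed here; the function name select_validation_sample and the identifier format are hypothetical.

```python
import random


def select_validation_sample(conversation_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of conversations for review.

    Simple random sampling is an assumption; the project may select
    validation cases differently (for example, stratified by dialect).
    """
    rng = random.Random(seed)
    sample_size = round(len(conversation_ids) * fraction)
    return rng.sample(conversation_ids, sample_size)


# Placeholder identifiers standing in for the 7,500 conversations.
all_ids = [f"conv_{i:05d}" for i in range(7500)]
validation_ids = select_validation_sample(all_ids)
```

With 7,500 conversations and a fraction of 0.10, the sample contains 750 distinct conversations, matching the validation count above. Fixing the seed means reviewers can regenerate the same list later.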


The “Canadian French Conversations” dataset is a resource for developers and researchers working on Canadian French speech recognition. With verbatim transcriptions and metadata for all 7,500 conversations, it supports the development of ASR technology and helps make voice-driven applications more accessible and usable for French-speaking communities in Canada.

