New York English General Conversation Dataset

Project Overview

Objective

Our mission is to curate a comprehensive dataset that enables AI models to engage in meaningful conversations, effectively capturing the essence of everyday dialogues in the New York English dialect. The project, “New York English General Conversation Dataset,” aims to provide a valuable resource for natural language processing and conversational AI research. It’s like having a treasure trove of New York conversations at your fingertips.

Scope

To achieve this goal, we set out to collect and annotate conversational data in the New York English dialect. Our focus is on capturing real-life conversations, including the nuances, slang, and regional variations that make New York English unique. This dataset will empower AI models to understand and generate authentic New York-style conversations.


Sources

  • Street Conversations: Gathering conversations from the lively streets of New York City, where people engage in spontaneous dialogues on a wide range of topics.
  • Neighborhood Cafes: Recording conversations in local cafes, capturing the friendly banter and discussions that happen over a cup of coffee.
  • Public Transportation: Capturing conversations on subways and buses, where commuters exchange thoughts and opinions.

Data Collection Metrics

  • Total Conversations Collected: 50,000 conversations
  • Street Conversations: 25,000
  • Neighborhood Cafes: 15,000
  • Public Transportation: 10,000
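
As a quick consistency check on the figures above, the per-source counts can be summed and turned into shares of the total. This is a minimal sketch: the counts come from the list above, while the names `SOURCE_COUNTS` and `source_shares` are illustrative and not part of the project.

```python
# Per-source conversation counts, as reported in the metrics above.
SOURCE_COUNTS = {
    "Street Conversations": 25_000,
    "Neighborhood Cafes": 15_000,
    "Public Transportation": 10_000,
}


def source_shares(counts):
    """Return each source's share of the total, as a percentage."""
    total = sum(counts.values())
    return {source: 100 * n / total for source, n in counts.items()}


total = sum(SOURCE_COUNTS.values())
shares = source_shares(SOURCE_COUNTS)

for source, share in shares.items():
    print(f"{source}: {share:.0f}% of {total:,} conversations")
```

The three sources add up to the reported total of 50,000, with street conversations making up half of the dataset.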

Annotation Process

Stages

  1. Conversation Transcription: Annotating each conversation by transcribing it into text format, capturing the unique linguistic features of New York English.
  2. Emotion Labeling: Adding emotional context to conversations, identifying sentiments, tones, and attitudes expressed.

Annotation Metrics

  • Conversations Transcribed: 50,000
  • Conversations with Emotion Labels: 50,000
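
To make the two annotation stages concrete, here is one way a single annotated record might be represented. This is only a sketch: the dataset's actual schema is not described here, so the field names (`conversation_id`, `source`, `transcript`, `emotion_labels`), the free-form label strings, and the sample text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Conversation:
    """Hypothetical record layout; not the dataset's published schema."""
    conversation_id: str
    source: str                      # e.g. "Street Conversations"
    transcript: str = ""             # stage 1: transcription
    emotion_labels: List[str] = field(default_factory=list)  # stage 2

    def is_fully_annotated(self) -> bool:
        """True once the record has both a transcript and a label."""
        return bool(self.transcript.strip()) and bool(self.emotion_labels)


def annotation_counts(conversations):
    """Count how many records have completed each annotation stage."""
    return {
        "transcribed": sum(1 for c in conversations if c.transcript.strip()),
        "emotion_labeled": sum(1 for c in conversations if c.emotion_labels),
    }


# Made-up sample records, for illustration only.
records = [
    Conversation("conv-00001", "Neighborhood Cafes",
                 "Lemme get a regular coffee.", ["friendly"]),
    Conversation("conv-00002", "Public Transportation",
                 "Is this the express or the local?"),
]
print(annotation_counts(records))
```

With counts computed this way, the metrics above correspond to both values reaching 50,000, meaning every collected conversation passed through both stages.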

Quality Assurance

Stages

  1. Annotation Verification: Engaged linguistic experts to review and verify the accuracy of transcriptions and emotion labels.
  2. Data Quality Control: Removed conversations with poor audio quality, incomplete transcripts, or irrelevant content.
  3. Data Security: Safeguarded the privacy of individuals involved in the conversations and adhered to all legal and ethical standards.

QA Metrics

  • Annotation Validation Cases: 5,000 (10% of total)
  • Data Cleansing: Eliminated low-quality or irrelevant conversations
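
The 10% validation figure can be illustrated with a simple sampling sketch. How the 5,000 validation cases were actually chosen is not stated, so uniform random sampling is an assumption here, as are the function name `draw_validation_sample`, the ID format, and the fixed seed.

```python
import random


def draw_validation_sample(conversation_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of IDs for expert review.

    Uniform random sampling is an assumption; the selection method
    used in the project is not documented.
    """
    k = round(len(conversation_ids) * fraction)
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(conversation_ids, k)


# Hypothetical IDs standing in for the 50,000 collected conversations.
ids = [f"conv-{i:05d}" for i in range(50_000)]
sample = draw_validation_sample(ids)
print(len(sample))
```

Fixing the seed lets reviewers regenerate the same validation set later, which is useful when disagreements about a label need to be traced back to a specific conversation.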

Conclusion

The “New York English General Conversation Dataset” is a valuable resource for researchers, linguists, and AI enthusiasts. With its extensive collection of authentic New York conversations and detailed annotations, this dataset empowers AI models to engage in conversations that resonate with the unique charm of New York English. It’s a game-changer for developing conversational AI systems, enabling them to speak and understand like true New Yorkers.
