New York English General Conversation Dataset

Project Overview:

Objective

Our mission is to curate a comprehensive dataset that enables AI models to hold natural conversations in the New York English dialect by capturing how everyday dialogue actually sounds there. The project, “New York English General Conversation Dataset,” is intended as a resource for natural language processing and conversational AI research.

Scope

To achieve this goal, we set out to collect and annotate conversational data in the New York English dialect. Our focus is on capturing real-life conversations, including the nuances, slang, and regional variations that make New York English unique. This dataset will empower AI models to understand and generate authentic New York-style conversations.

Sources

Street Conversations: Gathering conversations from the lively streets of New York City, where people engage in spontaneous dialogues on a wide range of topics.
Neighborhood Cafes: Recording conversations in local cafes, capturing the friendly banter and discussions that happen over a cup of coffee.
Public Transportation: Recording conversations on subways and buses, where commuters exchange thoughts and opinions.

Data Collection Metrics

Total Conversations Collected: 50,000 conversations
Street Conversations: 25,000
Neighborhood Cafes: 15,000
Public Transportation: 10,000
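
As a quick sanity check on the figures above, the per-source counts can be summed and turned into shares of the total. This is only an illustrative sketch: the counts are copied from the list above, while the names `SOURCE_COUNTS` and `source_shares` are made up for this example and are not part of any published tooling for the dataset.

```python
# Counts copied from the "Data Collection Metrics" list above.
# The variable and function names are hypothetical, for illustration only.
SOURCE_COUNTS = {
    "Street Conversations": 25_000,
    "Neighborhood Cafes": 15_000,
    "Public Transportation": 10_000,
}


def source_shares(counts):
    """Return each source's fraction of the total number of conversations."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL = sum(SOURCE_COUNTS.values())
SHARES = source_shares(SOURCE_COUNTS)

if __name__ == "__main__":
    print(f"Total conversations: {TOTAL:,}")
    for name, share in SHARES.items():
        print(f"{name}: {share:.0%}")
```

The three sources add up to the stated 50,000 conversations, with street conversations making up half of the collection.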

Annotation Process

Stages

Conversation Transcription: Transcribing each conversation into text while preserving the unique linguistic features of New York English.
Emotion Labeling: Adding emotional context to conversations, identifying sentiments, tones, and attitudes expressed.

Annotation Metrics

Conversations Transcribed: 50,000
Conversations with Emotion Labels: 50,000
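
To make the two annotation stages concrete, one way an annotated conversation might be represented is sketched below. The case study does not publish a schema, so every field name here (`conversation_id`, `source`, `turns`, `emotion_labels`) and the example label values are assumptions chosen for illustration, not the dataset's actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One transcribed speaker turn (hypothetical structure)."""
    speaker: str
    text: str


@dataclass
class Conversation:
    """A conversation after both annotation stages (hypothetical schema)."""
    conversation_id: str
    source: str  # e.g. "street", "cafe", "public_transport" (assumed values)
    turns: List[Turn] = field(default_factory=list)
    emotion_labels: List[str] = field(default_factory=list)

    def is_fully_annotated(self) -> bool:
        """True when the transcript is non-empty and at least one emotion label exists."""
        has_transcript = bool(self.turns) and all(t.text.strip() for t in self.turns)
        return has_transcript and bool(self.emotion_labels)


# Example record with invented content, for illustration only.
example = Conversation(
    conversation_id="conv-000001",
    source="cafe",
    turns=[
        Turn(speaker="A", text="You want a coffee?"),
        Turn(speaker="B", text="Yeah, regular, thanks."),
    ],
    emotion_labels=["friendly"],
)
```

A check like `is_fully_annotated` corresponds to the metrics above, where every transcribed conversation also carries emotion labels.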

Quality Assurance

Stages

Annotation Verification: Engaged linguistic experts to review and verify the accuracy of transcriptions and emotion labels.
Data Quality Control: Removed conversations with poor audio quality, incomplete transcripts, or irrelevant content.
Data Security: Safeguarded the privacy of individuals involved in the conversations and adhered to all legal and ethical standards.

QA Metrics

Annotation Validation Cases: 5,000 (10% of total)
Data Cleansing: Eliminated low-quality or irrelevant conversations
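
The 10% validation figure above implies that a subset of conversations was selected for expert review. The case study does not say how that subset was chosen; the sketch below assumes simple random sampling with a fixed seed, and the function name `pick_validation_sample` and its parameters are hypothetical.

```python
import random


def pick_validation_sample(conversation_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of conversations for expert review.

    Assumes simple random sampling without replacement; the actual
    selection method used in the project is not stated in the source.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    k = round(len(conversation_ids) * fraction)
    return rng.sample(list(conversation_ids), k)


if __name__ == "__main__":
    ids = list(range(50_000))  # stand-in IDs for the 50,000 conversations
    sample = pick_validation_sample(ids)
    print(len(sample))
```

With 50,000 conversations and a fraction of 0.10, this yields the 5,000 validation cases reported above.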

Conclusion

The “New York English General Conversation Dataset” is a resource for researchers, linguists, and AI enthusiasts. With 50,000 authentic New York conversations, each transcribed and labeled for emotion, it supports the development of conversational AI systems that can understand and produce the distinctive features of New York English.
