New York English General Conversation Dataset
Project Overview:
Objective
Our mission is to curate a comprehensive dataset that enables AI models to hold meaningful conversations by capturing everyday dialogue in the New York English dialect. The project, “New York English General Conversation Dataset,” aims to provide a resource for natural language processing and conversational AI research.
Scope
To achieve this goal, we set out to collect and annotate conversational data in the New York English dialect. Our focus is on capturing real-life conversations, including the nuances, slang, and regional variations that make New York English unique. This dataset will empower AI models to understand and generate authentic New York-style conversations.
Sources
- Street Conversations: Gathering conversations from the lively streets of New York City, where people engage in spontaneous dialogues on a wide range of topics.
- Neighborhood Cafes: Recording conversations in local cafes, capturing the friendly banter and discussions that happen over a cup of coffee.
- Public Transportation: Capturing conversations on subways and buses, where commuters exchange thoughts and opinions.
Data Collection Metrics
- Total Conversations Collected: 50,000 conversations
- Street Conversations: 25,000
- Neighborhood Cafes: 15,000
- Public Transportation: 10,000
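The per-source counts above add up to the stated total, which works out to a 50/30/20 split across the three sources. A minimal Python sketch of that check, using only the figures listed here (the names `SOURCE_COUNTS` and `source_shares` are illustrative and not part of the dataset):

```python
# Counts as reported in the Data Collection Metrics list.
SOURCE_COUNTS = {
    "Street Conversations": 25_000,
    "Neighborhood Cafes": 15_000,
    "Public Transportation": 10_000,
}
TOTAL_REPORTED = 50_000


def source_shares(counts):
    """Return each source's share of the total as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# The breakdown should account for every collected conversation.
assert sum(SOURCE_COUNTS.values()) == TOTAL_REPORTED

shares = source_shares(SOURCE_COUNTS)
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")
```

A check like this is cheap to rerun whenever the counts are updated, so the breakdown and the headline total cannot drift apart unnoticed.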
Annotation Process
Stages
- Conversation Transcription: Annotating each conversation by transcribing it into text format, capturing the unique linguistic features of New York English.
- Emotion Labeling: Adding emotional context to conversations, identifying sentiments, tones, and attitudes expressed.
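One way to picture the output of these two stages is a single record per conversation that carries both the transcript and its emotion labels. The sketch below is an assumption for illustration only: the case study does not publish a schema, so the class name, field names, source keys, and placeholder label are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical keys for the three collection sources named in this case study.
SOURCES = ("street", "cafe", "public_transportation")


@dataclass
class AnnotatedConversation:
    """Hypothetical record layout; not the dataset's actual schema."""
    conversation_id: str
    source: str
    transcript: str = ""                                 # filled by stage 1
    emotion_labels: list = field(default_factory=list)   # filled by stage 2

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")

    def is_fully_annotated(self):
        """True once both annotation stages have produced output."""
        return bool(self.transcript.strip()) and bool(self.emotion_labels)


record = AnnotatedConversation("conv-00001", "cafe")
record.transcript = "placeholder transcript text"
record.emotion_labels.append("placeholder_label")
print(record.is_fully_annotated())
```

Keeping both annotations on one record makes it straightforward to count how many conversations have completed each stage.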
Annotation Metrics
- Conversations Transcribed: 50,000
- Conversations with Emotion Labels: 50,000
Quality Assurance
Stages
- Annotation Verification: Engaged linguistic experts to review and verify the accuracy of transcriptions and emotion labels.
- Data Quality Control: Removed conversations with poor audio quality, incomplete transcripts, or irrelevant content.
- Data Security: Safeguarded the privacy of individuals involved in the conversations and adhered to all legal and ethical standards.
QA Metrics
- Annotation Validation Cases: 5,000 (10% of total)
- Data Cleansing: Eliminated low-quality or irrelevant conversations
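The 10% validation figure above can be illustrated with a short sampling sketch. The case study does not say how the 5,000 cases were chosen, so simple random sampling is an assumption here, and the function name, ID format, and seed are hypothetical.

```python
import random


def select_validation_sample(conversation_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of IDs for expert review.

    Assumes simple random sampling; the source does not state the method.
    """
    ids = list(conversation_ids)
    k = round(len(ids) * fraction)
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(ids, k)


# Hypothetical IDs standing in for the 50,000 collected conversations.
all_ids = [f"conv-{i:05d}" for i in range(50_000)]
sample = select_validation_sample(all_ids)
print(len(sample))  # 5000
```

Fixing the seed means reviewers and auditors can reproduce exactly which conversations were selected for verification.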
Conclusion
The “New York English General Conversation Dataset” is a resource for researchers, linguists, and AI practitioners. With 50,000 authentic New York conversations, each transcribed and labeled for emotion, the dataset supports the development of conversational AI systems that understand and generate New York English.