English Scenes Text Dataset
Home » Case Study » English Scenes Text Dataset
Project Overview:
Objective
To develop a dataset encompassing images with English text captured from various scenes. The primary intent is to train and improve Optical Character Recognition (OCR) models, AI-based translation tools, and accessibility software.
Scope
Compile images featuring English text from diverse settings, such as street signs, storefronts, product labels, book covers, and digital screens. Annotations will pinpoint the text and provide accompanying transcriptions.
Sources
- Tourist attractions and public spaces: Engaged in collaborations to create a carefully collected and successfully curated portfolio of popular destinations.
- Collaborations with retailers for product packaging: Established partnerships resulting in a thoughtfully collected and successfully curated assortment of product packaging materials.
- Library archives and bookstores: Utilized materials from library archives and bookstores, contributing to a meticulously collected and diverse range of literary resources.
- User-submitted photographs: Accepted and thoughtfully curated a collection of user-submitted photographs for a comprehensive visual representation.
- Screenshots from digital devices: Ethically collected and successfully curated screenshots from digital devices, ensuring a well-organized and comprehensive dataset.
Data Collection Metrics
- Total Images with Text: 450,000
- Street Signs: 100,000
- Storefronts and Billboards: 90,000
- Product Labels: 80,000
- Book Covers: 70,000
- Digital Screens: 60,000
- Miscellaneous Sources: 50,000
Annotation Process
Stages
- Image Pre-processing: Adjusting for optimal clarity and brightness.
- Text Region Segmentation: Outlining the exact region where the text appears.
- Text Transcription: Transcribing the English text as it appears in the scene.
- Validation: Cross-checking annotations with a combination of human reviewers and preliminary OCR models.
Annotation Metrics
- Total Text Region Annotations: 450,000
- Transcriptions Provided: 450,000
Quality Assurance
Stages
Automated OCR Verification:Â Early-stage OCR models validate the segmented text and transcriptions for consistency.
Peer Review:Â Annotations undergo a second pass by an alternate group of annotators.
Inter-annotator Agreement:Â A selection of images is examined by multiple annotators to ensure high consistency.
QA Metrics
- Annotations Validated using OCR: 225,000 (50% of total images)
- Peer Reviewed Annotations: 135,000 (30% of total images)
- Inconsistencies Identified and Rectified: 9,000 (2% of total images)
Conclusion
The English Scenes Text Dataset serves as a robust foundation for the development and refinement of OCR models and other text-recognition software. By representing a wide range of real-world scenes, the dataset ensures that these models can accurately detect and interpret English text in various contexts. This dataset is pivotal for innovations in language translation, augmented reality, and assistive technologies.
Quality Data Creation
Guaranteed TAT
ISO 9001:2015, ISO/IEC 27001:2013 Certified
HIPAA Compliance
GDPR Compliance
Compliance and Security
Let's Discuss your Data collection Requirement With Us
To get a detailed estimation of requirements please reach us.