OCR Data Collection

Project Overview:

Objective

The objective is to acquire and annotate a diverse dataset for OCR applications in order to improve text recognition across varied contexts. Text samples are collected from different sources, such as printed documents, handwritten notes, and digital text. The samples cover various languages, fonts, and styles to ensure comprehensive coverage.

Scope

We collect a large number of text images from various domains and then annotate them in detail to support accurate character recognition.

Sources

Scanned documents, such as reports, books, and forms
Street signs and billboards
Handwritten notes
Receipts and invoices, primarily printed
Digital displays and screens

Data Collection Metrics

Total Image Samples: 400,000
Scanned Documents: 150,000
Street Signs & Billboards: 50,000
Handwritten Notes: 90,000
Receipts & Invoices: 80,000
Digital Displays: 30,000

Annotation Process

Stages

Bounding Boxes: Drawing rectangles around individual words or characters helps in isolating each element for better analysis.
Transcription: Providing the text content of each identified text region makes that text available in a digital, machine-readable format.
Font & Style Tagging: Marking text based on its font type, size, and style (italic, bold) helps distinguish different parts of the text visually.
Handwriting Classification: Distinguishing between cursive, print, and mixed styles makes it easier to analyze handwritten text accurately.
Quality Labeling: Noting the clarity and quality of text (blurry, clear, distorted) helps in assessing the readability and usability of the text.
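
As an illustration of how the five stages above could be stored together, here is a minimal sketch of a single annotation record. The class name `TextAnnotation`, its field names, the pixel-coordinate box layout, and the `is_valid` helper are all assumptions made for this example; the case study does not describe the actual storage format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Label vocabularies taken from the stage descriptions above.
HANDWRITING_STYLES = {"cursive", "print", "mixed"}
QUALITY_LEVELS = {"clear", "blurry", "distorted"}


@dataclass
class TextAnnotation:
    """One annotated text region (hypothetical schema, for illustration only)."""

    # Bounding box as (left, top, right, bottom) in pixels; assumed layout.
    box: Tuple[int, int, int, int]
    # Transcription of the text inside the box.
    transcription: str
    # Font and style tags; left as None when not tagged.
    font_type: Optional[str] = None
    font_size: Optional[int] = None
    font_style: Optional[str] = None  # e.g. "italic" or "bold"
    # Handwriting class; None for printed or digital text.
    handwriting: Optional[str] = None
    # Quality label for the text region.
    quality: str = "clear"

    def is_valid(self) -> bool:
        """Check that the box has positive area and the labels are known values."""
        left, top, right, bottom = self.box
        has_area = right > left and bottom > top
        has_text = self.transcription.strip() != ""
        known_style = self.handwriting is None or self.handwriting in HANDWRITING_STYLES
        return has_area and has_text and known_style and self.quality in QUALITY_LEVELS


note = TextAnnotation(
    box=(12, 40, 180, 72),
    transcription="Total due",
    handwriting="cursive",
    quality="blurry",
)
print(note.is_valid())
```

Keeping the optional tags as `None` by default reflects the fact that not every region receives every label, for example a printed receipt has no handwriting class.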

Annotation Metrics

Total Annotations: 1,600,000
Bounding Boxes: 1,000,000
Transcriptions: 400,000
Font & Style Tags: 100,000
Handwriting Classifications: 50,000
Quality Labels: 50,000
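
The reported figures can be cross-checked with simple arithmetic. The short script below is only a sanity check on the numbers stated in this case study; the variable names are chosen for this example.

```python
# Image counts per source, as reported under Data Collection Metrics.
image_counts = {
    "scanned_documents": 150_000,
    "street_signs_billboards": 50_000,
    "handwritten_notes": 90_000,
    "receipts_invoices": 80_000,
    "digital_displays": 30_000,
}

# Annotation counts per stage, as reported under Annotation Metrics.
annotation_counts = {
    "bounding_boxes": 1_000_000,
    "transcriptions": 400_000,
    "font_style_tags": 100_000,
    "handwriting_classifications": 50_000,
    "quality_labels": 50_000,
}

total_images = sum(image_counts.values())
total_annotations = sum(annotation_counts.values())

# Averages derived from the totals above.
annotations_per_image = total_annotations / total_images
boxes_per_image = annotation_counts["bounding_boxes"] / total_images

print(total_images)           # 400000
print(total_annotations)      # 1600000
print(annotations_per_image)  # 4.0
print(boxes_per_image)        # 2.5
```

Both totals match the stated figures, which works out to an average of 4 annotations and 2.5 bounding boxes per image.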

Quality Assurance

Stages

Expert Review: OCR specialists review annotations to ensure high-quality output.
Consistency Checks: Automated checks verify transcribed text and bounding box placements.
Inter-annotator Agreement: To maintain uniformity, multiple annotators review overlapping portions of the dataset.
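
The case study does not say how the automated checks or the agreement between annotators are computed. As one possible approach, the sketch below checks that a box lies inside its image and has a non-empty transcription, and compares two annotators' boxes using intersection over union (IoU), a common overlap measure. The function names, the `(left, top, right, bottom)` box layout, and the choice of IoU are assumptions for illustration.

```python
from typing import Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def box_is_consistent(box: Box, image_width: int, image_height: int, text: str) -> bool:
    """Return True if the box has positive area, fits in the image, and has text."""
    left, top, right, bottom = box
    inside_image = 0 <= left < right <= image_width and 0 <= top < bottom <= image_height
    return inside_image and text.strip() != ""


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 1.0 means identical, 0.0 means disjoint."""
    overlap_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two annotators boxed the same word on a 640x480 image.
annotator_1 = (100, 50, 200, 80)
annotator_2 = (105, 50, 205, 80)

print(box_is_consistent(annotator_1, 640, 480, "Receipt"))
print(round(iou(annotator_1, annotator_2), 3))
```

In a workflow like this, pairs of boxes with low IoU would be routed to an expert reviewer; the threshold for "low" would need to be chosen per project.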

QA Metrics

Annotations Reviewed by Specialists: 160,000 (10% of total annotations)
Inconsistencies Detected and Corrected: 32,000 (2% of total annotations)

Conclusion

This OCR dataset, built through careful collection and annotation, is intended to improve text recognition capabilities. Its diversity across sources, languages, fonts, and handwriting styles makes it applicable to a wide range of OCR scenarios and a useful resource for the AI and machine learning community.
