OCR Data Collection

Project Overview


The project set out to acquire and annotate a diverse dataset for OCR applications, with the aim of improving text recognition across varied contexts. This requires text samples from different sources, such as printed documents, handwritten notes, and digital text. The samples should also span multiple languages, fonts, and styles so that the dataset gives comprehensive coverage.


We begin by collecting a large number of textual images from various domains. Next, we perform detailed annotation to achieve accurate character recognition.



  • Scanned documents (e.g., reports, books, forms) and handwritten notes were collected alongside images of street signs and billboards.
  • Printed receipts, invoices, and digital displays and screens were also collected.

Data Collection Metrics

  • Total Image Samples: 400,000
  • Scanned Documents: 150,000
  • Street Signs & Billboards: 50,000
  • Handwritten Notes: 90,000
  • Receipts & Invoices: 80,000
  • Digital Displays: 30,000
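
The breakdown above can be turned into per-category shares with a few lines of Python. This is a minimal sketch: the counts are copied from the list, while the names `SOURCE_COUNTS` and `category_shares` are illustrative and not part of the project.

```python
# Counts per source category, copied from the metrics listed above.
SOURCE_COUNTS = {
    "Scanned Documents": 150_000,
    "Street Signs & Billboards": 50_000,
    "Handwritten Notes": 90_000,
    "Receipts & Invoices": 80_000,
    "Digital Displays": 30_000,
}


def category_shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # The five categories add up to the stated total of 400,000 images.
    print("Total:", sum(SOURCE_COUNTS.values()))
    for name, share in category_shares(SOURCE_COUNTS).items():
        print(f"{name}: {share:.1f}%")
```

Scanned documents make up the largest share and digital displays the smallest, which is worth keeping in mind when sampling training and evaluation splits.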

Annotation Process


  1. Bounding Boxes: Drawing rectangles around individual words or characters helps in isolating each element for better analysis.
  2. Transcription: Typing out the text shown in each marked region gives a machine-readable equivalent of the image text.
  3. Font & Style Tagging: Marking text based on its font type, size, and style (italic, bold) helps distinguish different parts of the text visually.
  4. Handwriting Classification: Distinguishing between cursive, print, and mixed styles makes it easier to analyze handwritten text accurately.
  5. Quality Labeling: Noting the clarity and quality of text (blurry, clear, distorted) helps in assessing the readability and usability of the text.
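
The five annotation types above could be stored together as one record per text region. The sketch below is an assumption about how such a record might look, not the project's actual schema: the class `TextRegion`, its field names, the pixel box convention, and the `validate` helper are all hypothetical. Only the label values (cursive, print, mixed; blurry, clear, distorted; italic, bold) come from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Label values taken from the annotation steps listed above.
HANDWRITING_STYLES = {"cursive", "print", "mixed"}
QUALITY_LABELS = {"clear", "blurry", "distorted"}


@dataclass
class TextRegion:
    """One annotated word or character (hypothetical schema)."""
    # Assumed convention: (left, top, right, bottom) in pixels.
    box: Tuple[int, int, int, int]
    transcription: str
    font_style: List[str] = field(default_factory=list)  # e.g. ["bold", "italic"]
    handwriting: Optional[str] = None  # None for text that is not handwritten
    quality: str = "clear"


def validate(region):
    """Return a list of problems found in one annotated region."""
    problems = []
    left, top, right, bottom = region.box
    if right <= left or bottom <= top:
        problems.append("empty or inverted bounding box")
    if not region.transcription.strip():
        problems.append("missing transcription")
    if region.handwriting is not None and region.handwriting not in HANDWRITING_STYLES:
        problems.append("unknown handwriting style")
    if region.quality not in QUALITY_LABELS:
        problems.append("unknown quality label")
    return problems
```

For example, `validate(TextRegion(box=(10, 20, 110, 50), transcription="TOTAL", font_style=["bold"]))` returns an empty list, while a region with an inverted box or an unrecognized quality label returns one message per problem.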

Annotation Metrics

  • Total Annotations: 1,600,000
  • Bounding Boxes: 1,000,000
  • Transcriptions: 400,000
  • Font & Style Tags: 100,000
  • Handwriting Classifications: 50,000
  • Quality Labels: 50,000

Quality Assurance


Expert Review: OCR specialists review a portion of the annotations to keep output quality high.
Consistency Checks: Automated checks verify transcriptions and bounding-box placement.
Inter-annotator Agreement: To maintain uniformity, multiple annotators label overlapping portions of the dataset, and their results are compared.
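
The source does not say which checks or agreement measures were used. As one plausible sketch, a consistency check can confirm that a box lies inside its image, and agreement between two annotators can be measured with intersection over union (IoU) of their boxes plus a transcription comparison. The function names, the (left, top, right, bottom) box convention, and the 0.5 threshold are assumptions for illustration.

```python
def box_inside_image(box, width, height):
    """Consistency check: the box is non-empty and lies within the image."""
    left, top, right, bottom = box
    return 0 <= left < right <= width and 0 <= top < bottom <= height


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def annotators_agree(first, second, min_iou=0.5):
    """Compare two (box, transcription) pairs for the same text region.

    min_iou is an assumed threshold, not a value from the project.
    """
    same_text = first[1].strip() == second[1].strip()
    return same_text and iou(first[0], second[0]) >= min_iou
```

Regions where `annotators_agree` returns False are natural candidates to route to expert review.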

QA Metrics

  • Annotations Reviewed by Specialists: 160,000 (10% of total annotations)
  • Inconsistencies Detected and Corrected: 32,000 (2% of total annotations)
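
The percentages above follow directly from the 1,600,000 total annotations reported under Annotation Metrics. A short check, with illustrative constant names:

```python
# Figures copied from the Annotation Metrics and QA Metrics lists.
TOTAL_ANNOTATIONS = 1_600_000
REVIEWED_BY_SPECIALISTS = 160_000
INCONSISTENCIES_CORRECTED = 32_000


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


if __name__ == "__main__":
    print(f"Reviewed: {percent(REVIEWED_BY_SPECIALISTS, TOTAL_ANNOTATIONS):.0f}%")
    print(f"Corrected: {percent(INCONSISTENCIES_CORRECTED, TOTAL_ANNOTATIONS):.0f}%")
```

The source does not state whether the corrected inconsistencies were found by the specialist review, the automated checks, or both, so no error rate for the reviewed sample is derived here.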


This OCR dataset, built through systematic collection and annotation, is intended to improve text recognition. Its range of sources makes it applicable across many OCR scenarios and a useful resource for the AI and machine learning community.
