Mandarin OCR Images Data – Images with Transcription

Project Overview:

Objective

Our objective was to develop a comprehensive dataset consisting of OCR images paired with accurate transcriptions in Mandarin. This dataset is essential for training and evaluating OCR and text recognition systems. Consequently, it enables these systems to convert Mandarin text, whether scanned or handwritten, into digital format with high accuracy.

Scope

Collect a diverse set of images containing Mandarin text, transcribe the text into digital format, and ensure high-quality image-text pairs for OCR model training.

Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription

Sources

  • Image Collections: Obtain a variety of image sources containing Mandarin text, including scanned documents, handwritten notes, books, historical documents, and public domain text.
  • Crowdsourcing: Employ crowdsourcing platforms to collect handwritten text samples and transcriptions.
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription

Data Collection Metrics

  • Total OCR Images Collected: 60,000 images
  • Handwritten Samples Gathered: 12,000 samples
  • Transcriptions Provided: Digital transcriptions for all collected images
  • Data Annotated: 72,000 data points annotated

Annotation Process

Stages

  1. Image Selection: Curate a diverse set of images containing Mandarin text. Ensure these images include various fonts, styles, and types of writing, such as simplified, traditional, and calligraphy.
  2. OCR and Transcription: Use OCR software to automatically extract text from the images. Then, review and correct the OCR output to ensure the accuracy and completeness of the transcriptions.
  3. Handwritten Samples: Collect handwritten samples through crowdsourcing platforms. Make sure to include a wide range of handwriting styles to cover different variations.
  4. Transcription Validation: Finally, validate the quality of the transcriptions through manual review and verification by Mandarin-speaking experts.

Annotation Metrics

  • OCR Images with Transcriptions: 50,000 pairs
  • Handwritten Samples: 10,000 samples
  • Transcription Validation Cases: 5,000 (randomly selected for validation)
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription
Mandarin OCR Images Data – Images with Transcription

Quality Assurance

Stages

Transcription Verification: Implement a validation process involving Mandarin-speaking human reviewers to verify the correctness of transcriptions and OCR output.
Privacy Compliance: Ensure compliance with privacy regulations, especially when handling potentially sensitive handwritten documents.
Data Security: Implement data security measures to protect any personal or sensitive information.

QA Metrics

  • Transcription Validation Accuracy: Ensure a high level of accuracy (e.g., 99%+) in transcription validation.
  • Privacy Audits: Ongoing to ensure compliance

Conclusion

The Mandarin OCR Images Data with Transcriptions dataset is a valuable resource for OCR and text recognition research and development in the Mandarin language. This dataset includes diverse images and accurate transcriptions, and it adheres to privacy and security standards. Therefore, it enables the training and evaluation of OCR models for Mandarin text effectively.

quality dataset

Quality Data Creation

Guaranteed TAT​

Guaranteed TAT

ISO 9001:2015, ISO/IEC 27001:2013 Certified​

ISO 9001:2015, ISO/IEC 27001:2013 Certified

HIPAA Compliance​

HIPAA Compliance

GDPR Compliance​

GDPR Compliance

Compliance and Security​

Compliance and Security

Let's Discuss your Data collection Requirement With Us

To get a detailed estimation of requirements please reach us.

Scroll to Top