Chinese Handwritten Composition Datasets

Project Overview

Objective

Our goal was to compile a comprehensive dataset of Chinese handwritten compositions to improve Optical Character Recognition (OCR) for handwritten Chinese. The dataset also supports educators by enabling automated tools for grading and analyzing student compositions.

Scope

We gathered handwritten essays and compositions covering a range of themes and writing styles. Each piece was accompanied by key metadata, such as grade level and writing style, along with a digital text version.
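To make the per-composition metadata concrete, here is a minimal sketch of how one record might be represented. The class name, field names, label values, and file path are all hypothetical; the source only states that each piece carries a grade level, a writing style, and a digital text version.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class CompositionRecord:
    """One handwritten composition plus its metadata (hypothetical schema)."""
    image_path: str      # scanned page image; the path layout is an assumption
    grade_level: str     # assumed labels, e.g. "primary", "middle", "high"
    writing_style: str   # assumed label vocabulary, e.g. "narrative"
    transcription: str   # the digital text version of the handwriting


def to_json_line(record: CompositionRecord) -> str:
    """Serialize a record as one JSON line, keeping Chinese characters readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Illustrative record; the values are made up for the example.
example = CompositionRecord(
    image_path="scans/000001.png",
    grade_level="middle",
    writing_style="narrative",
    transcription="我的家乡在南方的一个小镇。",
)
line = to_json_line(example)
```

Passing `ensure_ascii=False` keeps the transcription as readable Chinese text in the output rather than `\u` escape sequences, which tends to make manual spot checks easier.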

Sources

  • Collaborations with schools across different provinces in China.
  • Public essay competitions emphasizing handwritten submissions.
  • Archival compositions from educational institutions.
  • Crowd-sourced contributions through online platforms.

Data Collection Metrics

  • Total Handwritten Compositions Collected: 275,000
  • Primary School Submissions: 80,000
  • Middle School Essays: 90,000
  • High School Compositions: 60,000
  • University and Adult Contributions: 25,000

Annotation Process

Stages

  1. Image Pre-processing for Enhanced Legibility
  2. Accurate Digital Transcription of Handwritten Content
  3. Detailed Metadata Annotation

Annotation Metrics

  • Total Digital Transcriptions Completed: 275,000
  • Metadata Annotations: 825,000 (three per composition)

Quality Assurance

Stages

  1. Automated OCR Checks
  2. Rigorous Peer Review Process
  3. High Standards of Inter-annotator Agreement
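Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The source does not say which statistic was used, so the sketch below is only one plausible choice; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Two annotators assigning a writing-style label to four compositions.
annotator_1 = ["narrative", "narrative", "expository", "expository"]
annotator_2 = ["narrative", "expository", "expository", "expository"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. A project would typically set its own minimum acceptable value.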

QA Metrics

  • OCR Validated Annotations: 137,500
  • Peer Reviewed Annotations: 82,500
  • Identified and Rectified Inconsistencies: 5,500
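One way an automated OCR check could work is to compare each human transcription against the output of an OCR engine and flag large disagreements for review. The sketch below uses character error rate (edit distance divided by reference length). This is an assumed approach, not the documented one; the function names and the 0.1 threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def needs_review(transcription: str, ocr_output: str, threshold: float = 0.1) -> bool:
    """Flag a transcription when it disagrees with OCR beyond the assumed threshold."""
    return character_error_rate(transcription, ocr_output) > threshold


# One substituted character out of six exceeds the assumed 0.1 threshold.
flagged = needs_review("今天天气很好", "今天天汽很好")
```

Because Python strings are sequences of Unicode code points, each Chinese character counts as one unit here, which is the granularity a character-level comparison needs.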

Conclusion

The Chinese Handwritten Composition Dataset offers a large body of native handwriting that reflects the variation in script across age groups and education levels. Training on this dataset can help OCR systems achieve higher accuracy on Chinese handwriting. Educational tools can also benefit, through automated grading, handwriting analysis, and educational feedback.
