Chinese Bill Datasets: Financial Data Resources

Project Overview

Objective

Develop a comprehensive dataset of images of Chinese bills (receipts and invoices). The dataset is intended to advance Optical Character Recognition (OCR) models tailored to financial documents in China, support expense-tracking apps, and facilitate tax compliance automation.

Scope

Compile images of different types of bills, including restaurant receipts, shopping invoices, utility bills, transportation bills, and more. Each bill is annotated with its category, total amount, date, and, where applicable, itemized details.
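
The annotation fields listed above can be pictured as a single record per bill image. The sketch below is only an illustration: the class names, field names, category strings, and sample values are assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


@dataclass
class LineItem:
    # One row of a bill's itemized details (hypothetical field names).
    description: str
    quantity: int
    unit_price: Decimal


@dataclass
class BillAnnotation:
    # Hypothetical record layout, not the dataset's actual schema.
    image_id: str
    category: str          # e.g. "restaurant_receipt", "utility_bill"
    total_amount: Decimal  # Decimal avoids float rounding on currency
    bill_date: date
    items: list[LineItem] = field(default_factory=list)  # empty if not itemized

    def items_total(self) -> Decimal:
        # Sum of quantity * unit price over all line items.
        return sum((i.quantity * i.unit_price for i in self.items), Decimal("0"))


# Made-up sample record for illustration only.
example = BillAnnotation(
    image_id="bill_000001",
    category="restaurant_receipt",
    total_amount=Decimal("86.00"),
    bill_date=date(2023, 5, 17),
    items=[
        LineItem("Noodles", 2, Decimal("28.00")),
        LineItem("Tea", 3, Decimal("10.00")),
    ],
)
```

Keeping the itemized list optional mirrors the "where applicable" wording: bills without line items simply carry an empty list.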

Sources

  • Collaborations with businesses willing to share anonymized bill copies.
  • Public solicitations for bill contributions, ensuring sensitive data is redacted.
  • Sample bills from various online shopping and utility websites.

Data Collection Metrics

  • Total Bill Images: 300,000
  • Restaurant Receipts: 100,000
  • Shopping Invoices: 80,000
  • Utility Bills: 60,000
  • Transportation Bills: 40,000
  • Miscellaneous Bills: 20,000
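
As a quick arithmetic check, the per-category counts above can be summed against the stated total and turned into shares. This is a minimal sketch; the variable and function names are illustrative, and only the counts come from the list above.

```python
# Counts copied from the Data Collection Metrics list.
CATEGORY_COUNTS = {
    "Restaurant Receipts": 100_000,
    "Shopping Invoices": 80_000,
    "Utility Bills": 60_000,
    "Transportation Bills": 40_000,
    "Miscellaneous Bills": 20_000,
}
TOTAL_BILL_IMAGES = 300_000


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    # Fraction of the dataset that each category represents.
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# The categories should account for every image in the dataset.
assert sum(CATEGORY_COUNTS.values()) == TOTAL_BILL_IMAGES

shares = category_shares(CATEGORY_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The five categories add up to exactly 300,000, with restaurant receipts making up one third of the images and miscellaneous bills about one fifteenth.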

Annotation Process

Stages

  1. Image Pre-processing: Enhancing clarity, brightness, and contrast to improve legibility.
  2. Category Annotation: Labeling the type/category of each bill.
  3. Data Extraction: Annotating key details such as total amount, date, and itemized list.
  4. Validation: Using preliminary OCR models and financial experts to verify the annotations.
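
The validation stage can be sketched as a field-by-field comparison between a human annotation and the output of a preliminary OCR model. Everything here is an assumption made for illustration: the dictionary layout, the field names, the `find_mismatches` helper, and the one-cent tolerance are not taken from the project's tooling.

```python
from decimal import Decimal


def find_mismatches(annotation: dict, ocr_output: dict,
                    amount_tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return the names of fields where the OCR output disagrees with the annotation."""
    mismatches = []

    # Category and date are compared as exact strings.
    for name in ("category", "date"):
        if ocr_output.get(name) != annotation[name]:
            mismatches.append(name)

    # Amounts are compared as Decimal values within a small tolerance;
    # a field the OCR model failed to read counts as a mismatch.
    ocr_amount = ocr_output.get("total_amount")
    if ocr_amount is None:
        mismatches.append("total_amount")
    else:
        diff = abs(Decimal(ocr_amount) - Decimal(annotation["total_amount"]))
        if diff > amount_tolerance:
            mismatches.append("total_amount")

    return mismatches


# Made-up example values.
annotation = {"category": "restaurant_receipt", "total_amount": "86.00", "date": "2023-05-17"}
ocr_output = {"category": "restaurant_receipt", "total_amount": "88.00", "date": "2023-05-17"}

flagged = find_mismatches(annotation, ocr_output)  # fields to send for expert review
```

Bills with a non-empty result would be the natural candidates to route to the financial experts mentioned in stage 4, while fully matching bills could pass without manual review.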

Annotation Metrics

  • Total Category Annotations: 300,000
  • Total Amount Annotations: 300,000
  • Date Annotations: 300,000
  • Itemized Details Annotations (for applicable bills): 220,000

Quality Assurance

  • Automated OCR Verification: Early-stage OCR models extract data from the bills, and the results are checked against the annotations.
  • Peer Review: A second set of annotators inspects a subset of the bills for consistency and accuracy.
  • Inter-annotator Agreement: Selected bills are annotated by multiple reviewers to confirm that data extraction is consistent across annotators.
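
Agreement between two annotators on category labels is often summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The source does not say which statistic the project uses, so the sketch below is one common choice rather than the project's method; the function name and sample labels are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same bills."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of bills given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for six bills annotated by two reviewers.
annotator_1 = ["restaurant", "restaurant", "utility", "shopping", "utility", "shopping"]
annotator_2 = ["restaurant", "shopping", "utility", "shopping", "utility", "shopping"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the reviewers agree on five of six bills and chance agreement is one third, giving a kappa of 0.75. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.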

QA Metrics

  • Annotations Validated using OCR: 150,000 (50% of total bills)
  • Peer Reviewed Annotations: 90,000 (30% of total bills)
  • Inconsistencies Identified and Rectified: 6,000 (2% of total bills)

Conclusion

The Chinese Bill Dataset provides a robust foundation for models and apps targeting financial document recognition and data extraction in China. With its extensive coverage of various bill types and meticulous annotations, this dataset serves as a catalyst for technological innovations in personal finance, business expense management, and regulatory compliance.
