Chinese Bill Datasets
Home » Case Study » Chinese Bill Datasets
Project Overview:
Objective
Developing a comprehensive dataset containing images of various Chinese bills, such as receipts and invoices, is crucial. This dataset aims to accelerate advancements in Optical Character Recognition (OCR) models specifically tailored for financial documents in China. Furthermore, it will assist in the development of expense tracking apps and facilitate tax compliance automation. By including a diverse range of bill types and formats, the dataset will ensure that OCR models can accurately recognize and process text across different document styles. Consequently, this will enhance the efficiency and accuracy of financial data management, making it easier for individuals and businesses to track expenses and comply with tax regulations.
Scope
Compile images of various types of bills, such as restaurant receipts, shopping invoices, utility bills, and more. For each bill, annotate it with its categories, total amount, date, and itemized details when applicable. Furthermore, ensure that each annotation is clear and concise. Additionally, this will facilitate a comprehensive understanding of the billing details.
Sources
- Collaborations with businesses willing to share anonymized bill copies offer numerous advantages. Firstly, they enable comprehensive data analysis without compromising customer privacy. Moreover, these partnerships facilitate the development of more accurate pricing models, which in turn benefit both the businesses and their customers. Furthermore, sharing anonymized bill copies allows for the identification of consumption patterns, thereby enabling companies to optimize their services.
- Public solicitations for bill contributions require careful attention to ensure sensitive data is properly redacted. Consequently, this safeguards the privacy of individuals and protects against potential misuse of information.
- Firstly, the itemized list of purchases includes a variety of electronics and household goods. Subsequently, the total amount due is calculated, incorporating taxes and shipping fees. Moreover, any applicable discounts or promotional offers are deducted, providing a clear breakdown of savings.
Data Collection Metrics
- Total Bill Images: 300,000
- Restaurant Receipts: 100,000
- Shopping Invoices: 80,000
- Utility Bills: 60,000
- Transportation Bills: 40,000
- Miscellaneous Bills: 20,000
Annotation Process
Stages
- Image Pre-processing: Enhancing clarity, brightness, and contrast to improve legibility. Furthermore, adjusting these parameters ensures the text is more readable.
- Category Annotation: Labeling the type/category of each bill. Additionally, this step organizes the bills into relevant groups, facilitating easier data handling.
- Data Extraction: Annotating key details such as the total amount, date, and itemized list. Moreover, this process captures essential information for further analysis.
- Validation: Using preliminary OCR models and financial experts to verify the annotations. Consequently, this ensures the accuracy and reliability of the extracted data.
Annotation Metrics
- Total Category Annotations: 300,000
- Total Amount Annotations: 300,000
- Date Annotations: 300,000
- Itemized Details Annotations (for applicable bills): 220,000
Quality Assurance
Stages
Automated OCR Verification:Â Early-stage OCR models help in validating the extracted data against the annotations.
Peer Review: Subsequently, a secondary set of annotators inspects a subset of the bills for consistency and accuracy.
Inter-annotator Agreement: Finally, Inter-annotator Agreement: Moreover, certain bills are annotated by multiple reviewers to ensure agreement and consistency in data extraction.
QA Metrics
- Annotations Validated using OCR: 150,000 (50% of total bills)
- Peer Reviewed Annotations: 90,000 (30% of total bills)
- Inconsistencies Identified and Rectified: 6,000 (2% of total bills)
Conclusion
The Chinese Bill Dataset provides a robust foundation for models and apps targeting financial document recognition and data extraction in China. With its extensive coverage of various bill types and meticulous annotations, this dataset serves as a catalyst for technological innovations in personal finance, business expense management, and regulatory compliance.
Quality Data Creation
Guaranteed TAT
ISO 9001:2015, ISO/IEC 27001:2013 Certified
HIPAA Compliance
GDPR Compliance
Compliance and Security
Let's Discuss your Data collection Requirement With Us
To get a detailed estimation of requirements please reach us.