Current project
Featured · Research Prototype
DocPipeline
AI-powered document extraction system that turns PDFs and images into structured outputs through OCR, schema-driven extraction, and validation.
Problem
Raw OCR output is not enough for business workflows. Teams still need to interpret fields, normalize formats, and move extracted data into systems they already use.
Architecture
DocPipeline is built as a modular, multi-stage system: document input, pluggable OCR, extraction logic, schema-based shaping, and async export into downstream integrations.
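The stages above can be illustrated with a minimal sketch. It is not DocPipeline's actual code: every name here (OcrEngine, FakeOcr, Field, RECEIPT_SCHEMA, run_pipeline) is hypothetical, the OCR engine is a stand-in that decodes bytes as text, and regex matching takes the place of LLM extraction so the example runs with the standard library alone.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Callable, Protocol


class OcrEngine(Protocol):
    """Pluggable OCR interface (hypothetical): any engine exposing read() fits."""
    def read(self, document: bytes) -> str: ...


class FakeOcr:
    """Stand-in engine for the sketch: treats the input bytes as UTF-8 text."""
    def read(self, document: bytes) -> str:
        return document.decode("utf-8")


@dataclass(frozen=True)
class Field:
    """One schema entry: where to find a value and how to normalize it."""
    name: str
    pattern: str                      # regex with exactly one capture group
    convert: Callable[[str], object]  # normalizes the captured string


# Hypothetical schema for a receipt; a real one would cover more fields.
RECEIPT_SCHEMA = [
    Field("vendor", r"Vendor:\s*(.+)", str.strip),
    Field("total", r"Total:\s*\$?([\d,]+\.\d{2})",
          lambda s: float(s.replace(",", ""))),
]


def extract(text: str, schema: list[Field]) -> dict:
    """Shape raw text into a record; a missing field fails validation."""
    record = {}
    for field in schema:
        match = re.search(field.pattern, text)
        if match is None:
            raise ValueError(f"missing field: {field.name}")
        record[field.name] = field.convert(match.group(1))
    return record


async def export(record: dict, sinks) -> None:
    """Send the record to every downstream sink concurrently."""
    await asyncio.gather(*(sink(record) for sink in sinks))


async def run_pipeline(document: bytes, ocr: OcrEngine,
                       schema: list[Field], sinks) -> dict:
    text = ocr.read(document)        # stage 1: OCR
    record = extract(text, schema)   # stages 2-3: extraction and shaping
    await export(record, sinks)      # stage 4: async export
    return record


if __name__ == "__main__":
    async def print_sink(record: dict) -> None:
        print("exported:", record)

    sample = b"Vendor: Acme Corp\nTotal: $1,234.50\n"
    asyncio.run(run_pipeline(sample, FakeOcr(), RECEIPT_SCHEMA, [print_sink]))
```

Because each stage depends only on a small interface, swapping FakeOcr for a wrapper around a real engine, or a print sink for a webhook call, would not require changes to the other stages.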
Impact
The project demonstrates a cleaner path from documents to usable structured data across receipts, purchase orders, packing slips, utility bills, expense reports, and quotes.
Stack: PaddleOCR, Azure Document Intelligence, LLM Extraction, Schema Validation, Async Workers, Webhooks
Related reading: Learning Log