CSV Converter

Python Pandas PyMuPDF OCR

Developed a high-accuracy PDF-to-CSV conversion tool in Python that extracts every table into its own CSV file, generates a full-document text file, and writes structural metadata files describing each table's outline, dimensions, and formatting. It is designed to stay robust across a wide range of document layouts. The modular extraction pipeline is built on Docling, PyMuPDF, Pandas, and a 3-layer OCR stack.

Overview

The CSV Converter is a document processing tool that extracts tabular data from PDF documents with high accuracy. It handles complex document layouts and converts tables into structured CSV format while preserving formatting and the relationships between data elements.

Key Features

Multi-table extraction from single PDF documents
High-accuracy OCR with 3-layer fallback system
Automatic table detection and boundary recognition
Structural metadata generation (dimensions, formatting)
Full-document text extraction
Robust handling of various document layouts

Technology Stack

Core Libraries

Python - Primary programming language
Pandas - Data manipulation and CSV generation
PyMuPDF - PDF parsing and text extraction
Docling - Document structure analysis

OCR Stack

Tesseract OCR - Primary OCR engine
RapidOCR - Secondary OCR fallback
Custom OCR Layer - Tertiary fallback for edge cases
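
The fallback order listed above can be sketched as a chain of callables tried in turn. This is a minimal illustration, not the project's actual code: the `OcrResult` type, the `run_with_fallback` helper, the 0.80 confidence threshold, and the stub engines are all hypothetical. Real engines (Tesseract, RapidOCR, the custom layer) would need to be wrapped to match the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed to be normalised to the range 0.0-1.0
    engine: str


# Hypothetical engine signature: image bytes in, OcrResult out.
OcrEngine = Callable[[bytes], OcrResult]


def run_with_fallback(
    image: bytes,
    engines: Sequence[OcrEngine],
    min_confidence: float = 0.80,  # assumed threshold, not taken from the project
) -> Optional[OcrResult]:
    """Try each engine in order and return the first result that clears the threshold.

    If no engine clears it, return the highest-confidence result seen,
    or None when every engine failed or no engine was supplied.
    """
    best: Optional[OcrResult] = None
    for engine in engines:
        try:
            result = engine(image)
        except Exception:
            continue  # an engine error simply falls through to the next layer
        if result.confidence >= min_confidence:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    return best


# Stub engines standing in for the three layers (illustrative values only).
def primary_stub(image: bytes) -> OcrResult:
    return OcrResult("T0tal", 0.55, "primary")


def secondary_stub(image: bytes) -> OcrResult:
    return OcrResult("Total", 0.91, "secondary")


def tertiary_stub(image: bytes) -> OcrResult:
    return OcrResult("Total", 0.99, "tertiary")


result = run_with_fallback(b"", [primary_stub, secondary_stub, tertiary_stub])
print(result.engine)  # prints "secondary": the first engine to clear the threshold
```

Returning the best low-confidence result instead of nothing is one possible design choice; a stricter pipeline might flag such pages for manual review instead.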

Implementation Highlights

The conversion pipeline employs a modular architecture that processes documents through multiple stages: initial structure analysis, table detection, OCR processing with fallback layers, and finally CSV generation with metadata preservation.
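
One way to picture the staged, modular design described above is a list of stage functions that each receive and return a shared state object. The sketch below is an assumption about the shape of such a pipeline, not the project's code: `DocumentState`, the stage function names, and `run_pipeline` are hypothetical, and each stage body is a placeholder where Docling, PyMuPDF, the OCR stack, or Pandas would do the real work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class DocumentState:
    """Hypothetical container handed from one stage to the next."""
    path: str
    log: List[str] = field(default_factory=list)
    data: Dict[str, object] = field(default_factory=dict)


Stage = Callable[[DocumentState], DocumentState]


def analyze_structure(state: DocumentState) -> DocumentState:
    state.log.append("structure_analysis")
    state.data["layout"] = {}  # placeholder for document structure
    return state


def detect_tables(state: DocumentState) -> DocumentState:
    state.log.append("table_detection")
    state.data["table_regions"] = []  # placeholder for table bounding boxes
    return state


def run_ocr(state: DocumentState) -> DocumentState:
    state.log.append("ocr")
    state.data["cell_text"] = []  # placeholder for recognised cell text
    return state


def generate_csv(state: DocumentState) -> DocumentState:
    state.log.append("csv_generation")
    state.data["outputs"] = []  # placeholder for written CSV and metadata paths
    return state


DEFAULT_STAGES: Sequence[Stage] = (
    analyze_structure,
    detect_tables,
    run_ocr,
    generate_csv,
)


def run_pipeline(path: str, stages: Sequence[Stage] = DEFAULT_STAGES) -> DocumentState:
    """Run the stages in order, threading one state object through all of them."""
    state = DocumentState(path=path)
    for stage in stages:
        state = stage(state)
    return state


# The path is only recorded here; no file is opened by these placeholder stages.
state = run_pipeline("example.pdf")
print(state.log)
```

Because every stage shares the same signature, a stage can be swapped or tested in isolation by passing a different `stages` sequence.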

The 3-layer OCR stack improves accuracy by trying progressively more specialized OCR engines: when one layer cannot produce a usable result, the next takes over. This lets the system handle documents of varying quality, fonts, and layouts. The metadata output describes each table's structure, so downstream applications can interpret the relationships and formatting of the extracted data.
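
To make the per-table CSV and metadata output more concrete, here is a small sketch that writes one CSV per table plus a single JSON metadata file. It rests on several assumptions: the `export_tables` helper, the dictionary keys (`page`, `bbox`, `rows`), the file names, and the sample table are all made up for illustration. It also uses the standard-library `csv` module to stay dependency-free, whereas the project itself uses Pandas for CSV generation.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence


def export_tables(tables: Sequence[Dict], out_dir: Path) -> Path:
    """Write one CSV per table and a JSON file describing every table.

    Each table is a dict with hypothetical keys: 'page' (int),
    'bbox' (x0, y0, x1, y1) and 'rows' (list of lists, header row first).
    Returns the path of the metadata file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    metadata: List[Dict] = []
    for index, table in enumerate(tables, start=1):
        rows = table["rows"]
        csv_path = out_dir / f"table_{index}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        metadata.append({
            "file": csv_path.name,
            "page": table["page"],
            "bbox": list(table["bbox"]),
            "n_rows": len(rows),  # counts the header row as well
            "n_cols": max((len(row) for row in rows), default=0),
        })
    meta_path = out_dir / "tables_metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path


# Made-up sample table, used only to exercise the helper.
sample_tables = [
    {
        "page": 1,
        "bbox": (72.0, 100.0, 540.0, 220.0),
        "rows": [["Quarter", "Revenue"], ["Q1", "120"], ["Q2", "135"]],
    },
]

out_dir = Path(tempfile.mkdtemp())
meta_path = export_tables(sample_tables, out_dir)
exported = json.loads(meta_path.read_text(encoding="utf-8"))
print(exported[0]["file"], exported[0]["n_rows"], exported[0]["n_cols"])
```

Keeping the metadata in a separate file means the CSVs stay plain data, while position and dimension details remain available to any downstream tool that wants them.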

Use Cases

Financial document processing
Research paper data extraction
Report automation and data migration
Document digitization workflows

GitHub