
CSV Converter

Python Pandas PyMuPDF OCR

Developed a high-accuracy PDF-to-CSV conversion pipeline in Python that extracts each table into its own CSV file, generates a full-document text file, and writes structural metadata files describing table outlines, dimensions, and formatting. It is designed to stay robust across varied document layouts. The extraction pipeline is modular and built on Docling, PyMuPDF, Pandas, and a 3-layer OCR stack.

Overview

The CSV Converter is a document processing tool that extracts tabular data from PDF documents with high accuracy. It handles complex document layouts and converts tables into structured CSV files, while accompanying metadata files record the formatting and the relationships between data elements.

Key Features

  • Multi-table extraction from single PDF documents
  • High-accuracy OCR with 3-layer fallback system
  • Automatic table detection and boundary recognition
  • Structural metadata generation (dimensions, formatting)
  • Full-document text extraction
  • Robust handling of various document layouts

Technology Stack

Core Libraries

  • Python - Primary programming language
  • Pandas - Data manipulation and CSV generation
  • PyMuPDF - PDF parsing and text extraction
  • Docling - Document structure analysis

OCR Stack

  • Tesseract OCR - Primary OCR engine
  • RapidOCR - Secondary OCR fallback
  • Custom OCR Layer - Tertiary fallback for edge cases
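A fallback chain like the one listed above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the engine interface (a callable returning text and a confidence score), the `min_confidence` threshold, and the stand-in engines are all assumptions made for the example. Real engines such as Tesseract or RapidOCR would need to be wrapped to fit this shape.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed interface: an OCR engine is any callable that takes image bytes
# and returns (text, confidence in [0, 1]).
OcrEngine = Callable[[bytes], Tuple[str, float]]
OcrResult = Tuple[Optional[str], str, float]  # (engine name, text, confidence)


def ocr_with_fallback(
    image: bytes,
    engines: Sequence[Tuple[str, OcrEngine]],
    min_confidence: float = 0.80,  # assumed threshold, not from the project
) -> OcrResult:
    """Try engines in order and stop at the first confident result.

    If no engine reaches min_confidence, return the best result seen,
    so the caller still gets some text to work with.
    """
    best: OcrResult = (None, "", 0.0)
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception:
            # A failing engine should not abort the chain; try the next one.
            continue
        if confidence >= min_confidence:
            return name, text, confidence
        if confidence > best[2]:
            best = (name, text, confidence)
    return best


# Stand-in engines for illustration only; they ignore the image.
def primary(image: bytes) -> Tuple[str, float]:
    return "T0tal 1,2O4", 0.55


def secondary(image: bytes) -> Tuple[str, float]:
    return "Total 1,204", 0.91


def tertiary(image: bytes) -> Tuple[str, float]:
    return "Total 1,204", 0.99


result = ocr_with_fallback(
    b"", [("primary", primary), ("secondary", secondary), ("tertiary", tertiary)]
)
print(result)  # ('secondary', 'Total 1,204', 0.91)
```

The tertiary engine is never called here because the secondary result already clears the threshold, which keeps the more expensive layers reserved for the pages that need them.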

Implementation Highlights

The conversion pipeline is modular and processes each document in four stages: initial structure analysis, table detection, OCR processing with fallback layers, and CSV generation with metadata output.
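One way to picture a staged pipeline of this kind is as a list of functions that each take and return a shared document state. The sketch below is hypothetical: the `Document` class, the stage names, and the toy rule that treats lines containing `|` as table rows are assumptions for illustration, and the structure and OCR stages are placeholders rather than calls into Docling, PyMuPDF, or an OCR engine.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, List

Table = List[List[str]]


@dataclass
class Document:
    """State passed from stage to stage (hypothetical structure)."""
    pages: List[str]
    tables: List[Table] = field(default_factory=list)
    csv_outputs: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def analyze_structure(doc: Document) -> Document:
    # Placeholder: a real stage would inspect the page layout here.
    doc.completed.append("structure")
    return doc


def detect_tables(doc: Document) -> Document:
    # Toy rule: a page containing "|" is treated as one pipe-delimited table.
    for page in doc.pages:
        if "|" in page:
            rows = [[cell.strip() for cell in line.split("|")]
                    for line in page.splitlines()]
            doc.tables.append(rows)
    doc.completed.append("tables")
    return doc


def run_ocr(doc: Document) -> Document:
    # Placeholder: pages are already text in this toy example.
    doc.completed.append("ocr")
    return doc


def generate_csv(doc: Document) -> Document:
    # Serialize each detected table to CSV text with the standard library.
    for table in doc.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        doc.csv_outputs.append(buffer.getvalue())
    doc.completed.append("csv")
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Each stage receives the output of the previous one.
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    Document(pages=["Item | Qty\nBolt | 4", "Plain paragraph text."]),
    [analyze_structure, detect_tables, run_ocr, generate_csv],
)
print(doc.completed)       # ['structure', 'tables', 'ocr', 'csv']
print(doc.csv_outputs[0])  # Item,Qty / Bolt,4 on two lines
```

Because every stage shares one signature, a stage can be swapped, reordered, or tested in isolation without touching the others.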

The 3-layer OCR stack improves accuracy by attempting extraction with progressively more specialized OCR engines, which lets the system handle documents of varying quality, fonts, and layouts. The metadata output describes the structure of each table, so downstream applications can interpret the relationships and formatting of the extracted data.
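As a rough illustration of per-table CSV files with a metadata sidecar, the sketch below writes each table to its own file and records its dimensions in a JSON file. The file names, the metadata fields (`file`, `rows`, `columns`, `header`), and the use of the standard `csv` and `json` modules in place of Pandas are all assumptions chosen to keep the example self-contained; the project's actual metadata format is not described here.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List

Table = List[List[str]]


def export_tables(tables: List[Table], out_dir: Path) -> List[Dict]:
    """Write one CSV per table plus a JSON file describing each table."""
    out_dir.mkdir(parents=True, exist_ok=True)
    metadata: List[Dict] = []
    for index, rows in enumerate(tables, start=1):
        csv_path = out_dir / f"table_{index}.csv"  # hypothetical naming scheme
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        metadata.append({
            "file": csv_path.name,
            "rows": len(rows),
            # Use the widest row so ragged tables are still described correctly.
            "columns": max((len(row) for row in rows), default=0),
            "header": rows[0] if rows else [],
        })
    meta_path = out_dir / "tables_metadata.json"  # hypothetical file name
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata


tables = [
    [["Item", "Qty"], ["Bolt", "4"], ["Nut", "12"]],
    [["Year", "Revenue", "Cost"], ["2023", "10", "7"]],
]
out_dir = Path(tempfile.mkdtemp())
metadata = export_tables(tables, out_dir)
print(metadata[0])
# {'file': 'table_1.csv', 'rows': 3, 'columns': 2, 'header': ['Item', 'Qty']}
```

A downstream consumer can read `tables_metadata.json` first to learn how many tables exist and what shape each one has before loading any CSV.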

Use Cases

  • Financial document processing
  • Research paper data extraction
  • Report automation and data migration
  • Document digitization workflows