Endless quantities of old but critically important data are trapped in paper journals and reports across the world. Is it possible to extract this data and make it available to the masses, even when it is highly specialised and of poor quality? That is what I did for reports of caving incidents and accidents dating back 50 years.
The National Speleological Society has published American Caving Accidents since 1967, documenting thousands of incidents. Despite being freely available as PDFs, the reports are unindexed, poorly scanned, and all but impossible to use as a learning tool.
In this talk, I'll show you how I built a system that programmatically processed these documents into a structured, publicly searchable database. You'll see the full pipeline: OCRing degraded, low-quality scans, custom code to untangle multi-column layouts, and LLM processing stages to extract and format the data whilst maintaining 100% accuracy.
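The staged pipeline described above can be sketched as a chain of callables, each feeding its output into the next. This is an illustrative sketch only; the stage names (`ocr`, `decolumnize`) and signatures are hypothetical, not the speaker's actual code.

```python
from typing import Callable

# Each stage takes the document text so far and returns a transformed version.
Stage = Callable[[str], str]

def ocr(raw: str) -> str:
    """Stand-in for an OCR pass over a scanned page."""
    return raw.strip()

def decolumnize(text: str) -> str:
    """Stand-in for reflowing a multi-column layout into reading order."""
    return " ".join(text.split())

def run_pipeline(document: str, stages: list[Stage]) -> str:
    """Apply each stage in order, threading the output through the chain."""
    for stage in stages:
        document = stage(document)
    return document
```

Keeping each stage as an independent, composable function makes it easy to re-run, reorder, or insert steps as the source material demands.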
I'll cover some of the more unusual challenges: handling dates like "Autumn 1996" with a custom model field, building a pluggable processing-step system in Django, and ingesting and normalising large quantities of low-quality data into a relational database.
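A fuzzy date like "Autumn 1996" can be modelled as an inclusive date range rather than a single day. The sketch below shows one plausible representation in plain Python; the season boundaries and the `FuzzyDate`/`parse_fuzzy` names are assumptions for illustration, not the talk's actual custom Django field.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical season boundaries as (month, day) pairs.
SEASONS = {
    "spring": ((3, 1), (5, 31)),
    "summer": ((6, 1), (8, 31)),
    "autumn": ((9, 1), (11, 30)),
    "winter": ((12, 1), (2, 28)),  # winter spans the year boundary
}

@dataclass
class FuzzyDate:
    """A date known only to some range, e.g. a season of a year."""
    start: date
    end: date

def parse_fuzzy(text: str) -> FuzzyDate:
    """Parse strings like 'Autumn 1996' into an inclusive date range."""
    season, year_str = text.lower().split()
    year = int(year_str)
    (sm, sd), (em, ed) = SEASONS[season]
    # If the end month precedes the start month, the range crosses into
    # the next calendar year (winter).
    end_year = year + 1 if em < sm else year
    return FuzzyDate(date(year, sm, sd), date(end_year, em, ed))
```

Storing both endpoints keeps the record queryable (e.g. "all incidents in 1996") without pretending to more precision than the source document offers.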
Have you ever looked at a stack of old documents and thought: "there's valuable data in here, if only someone could extract it"? This talk is about what happens when you actually try.