
A Plan for Turning Soviet Cookbook Pages into Data

· 8 min read
Bogdan Varlamov
Technologist

With Phase 1 of the soviet.recipes project complete (224 DSLR images covering every page), I'm now planning how to extract machine-readable text from those images. This post outlines the plan for comparing five approaches across effort, cost, technical risk, and quality so I can determine the lowest-effort path to a reliable text extraction workflow.

1. The Challenge: Converting 224 Soviet Cookbook Images into Text

I have a full photographic capture of The Book of Tasty and Healthy Food — 224 high-resolution images totaling around 191 MB. Now I need to convert them into machine-readable text to enable search, translation, data extraction, and downstream analysis.

Technical Challenges

The source material has several complexities:

  • Cyrillic text throughout
  • Aged paper and inconsistent contrast
  • Mixed content types: recipes, captions, decorated headings, illustrations
  • Non-uniform typography and complex layouts
  • Photography artifacts such as glare, shadow, and noise
  • Curved text caused by page folds and book curvature

These issues mean simple OCR may not be enough. I need to evaluate a range of modern and traditional methods to find what delivers the best balance of quality and effort.


2. Candidate Approaches for Text Extraction

Below are the five approaches I'll be testing, from manual LLM extraction to fully automated API pipelines.

2.1 Manual Extraction with Multimodal LLMs

Approach: Upload images one-by-one to a commercial multimodal model (ChatGPT, Claude, Gemini, Grok) and manually copy text outputs.

Effort: ~19 hours
Cost: $0–50
Quality: High
Risks: Repetitive, error-prone, not reproducible, slow for iteration.


2.2 Docling Document Processing Pipeline

Approach: Use the Docling document processing framework to automatically convert images to structured text.

Effort: 6–12 hours
Cost: $0
Quality: Medium–High
Risks: Multiple OCR engines are available, but OCR alone may not be enough for this material.

One particularly intriguing advantage of Docling over direct OCR is its handling of reading order. Consider the example below, where the red arrows indicate the reading direction: Docling follows the natural human reading order rather than the line-by-line behavior of simpler OCR engines.

Figure: Docling reading-order processing example.
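
As a first sanity check, a single-page conversion could look roughly like this minimal sketch (the file path is a placeholder, and default pipeline settings are assumed):

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

# Convert one photographed page and export the parsed structure as Markdown.
converter = DocumentConverter()
result = converter.convert(Path("pages/page_001.jpg"))  # placeholder path
print(result.document.export_to_markdown())
```

In practice I expect to need to point Docling's OCR options at a Cyrillic-capable engine, which is part of what this experiment will confirm.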


2.3 Local Vision-Enabled LLM (Qwen3-VL or similar)

Approach: Run a vision-capable LLM locally using Ollama, llama.cpp, or similar, then batch-process all images through a custom script.

Docling also integrates with Vision-Language Models, which means I may be able to reuse the same Docling pipeline and swap the underlying model with minimal code changes.

Effort: 8–14 hours
Cost: $0
Quality: Medium (model-dependent)
Risks: Hardware limitations, slower inference, variable Cyrillic support.
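
A rough sketch of what the batch script could look like with the Ollama Python client (the model name is a placeholder; whichever vision-capable model actually fits my hardware would be swapped in):

```python
from pathlib import Path

import ollama  # pip install ollama; assumes a local Ollama server with a vision model pulled

MODEL = "qwen2.5vl"  # placeholder; whichever vision model fits my hardware
PROMPT = "Transcribe all Russian text on this cookbook page, preserving reading order."

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

for image_path in sorted(Path("pages").glob("*.jpg")):
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT, "images": [str(image_path)]}],
    )
    (out_dir / f"{image_path.stem}.txt").write_text(
        response["message"]["content"], encoding="utf-8"
    )
```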


2.4 Traditional OCR Engines (Tesseract, EasyOCR)

Approach: Build a preprocessing chain (deskew, denoise, dewarp) feeding into a traditional OCR engine.

Effort: 12–25 hours
Cost: $0
Quality: Medium–Low
Risks: Layout handling limited, heavy tuning required, sensitive to image quality.
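
A preprocessing-plus-OCR chain might look roughly like this sketch, assuming OpenCV and pytesseract with the Russian language data installed (dewarping is deliberately omitted here, since it needs more specialized tooling):

```python
import cv2
import pytesseract  # needs the Tesseract binary and the 'rus' language data installed

def extract_text(image_path: str) -> str:
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Denoise, then binarize with Otsu thresholding to flatten aged-paper contrast.
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 1: automatic page segmentation with orientation and script detection.
    return pytesseract.image_to_string(binary, lang="rus", config="--psm 1")

print(extract_text("pages/page_001.jpg"))
```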


2.5 Batch Processing Against SOTA API Models

Approach: Build a batch pipeline locally but submit each image to a commercial OCR-capable LLM API such as OpenAI, Anthropic, or xAI.

The Docling pipeline could be reused here as well, since the text extraction step would simply point to an API endpoint instead of a local model.

Effort: 8–14 hours
Cost: $20–100
Quality: High
Risks: Ongoing costs, rate limits, API changes, vendor lock-in.
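
A minimal version of the API call, sketched with the OpenAI Python client (the model name and prompt are placeholders; Anthropic or xAI clients would follow the same base64-image pattern):

```python
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

def transcribe_page(image_path: Path) -> str:
    b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all Russian text on this page, preserving reading order."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(transcribe_page(Path("pages/page_001.jpg")))
```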


3. Evaluation Framework and Scoring Methodology

Rather than relying only on high-level criteria, I will use a consistent evaluation process that combines manual review, structured scoring, and any quality metadata provided by the tools themselves. Each approach will be tested on the same representative subset of pages to ensure fair comparison.

3.1 Sampling Strategy

To evaluate output quality without reviewing all 224 pages up front:

  • I will randomly sample 5–10 pages for an initial quick check.
  • For approaches that pass the initial test, I will expand to 20–30 randomly selected pages.
  • The sample will include edge cases:
    • curved text from book folds
    • low contrast or aged pages
    • pages with illustrations, captions, or complex layout
    • dense recipe pages with mixed typography

This ensures coverage of both typical and problematic pages.
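
The random part of the sampling is easy to script; a minimal sketch with a fixed seed so every approach sees the same pages (the directory name and counts are placeholders matching the plan above):

```python
import random
from pathlib import Path

random.seed(42)  # fixed seed so every approach is tested on the same pages

all_pages = sorted(Path("pages").glob("*.jpg"))
quick_check = random.sample(all_pages, 10)
remaining = [p for p in all_pages if p not in quick_check]
# The expanded set keeps the initial 10 and adds 20 more; hand-picked edge
# cases (curved text, low contrast, dense layouts) are appended separately.
expanded = quick_check + random.sample(remaining, 20)
```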

3.2 Manual Review Criteria

For each sampled page, I will manually review and score:

  • Character-level accuracy — Are Cyrillic characters recognized correctly?
  • Completeness — Is all visible text captured? Are any blocks missing?
  • Layout and structure — Are headings, sections, recipes, or captions distinguished?
  • Consistency — Does the approach produce similar quality across different page types?
  • Formatting — Does the output preserve natural reading order or scramble segments?

This manual scoring gives an empirical baseline for comparison.

3.3 Automated or Built-In Quality Signals

Some tools provide additional metadata that can inform evaluation:

  • Docling offers internal confidence measures and indicators during parsing and OCR passes.
  • Traditional OCR engines (Tesseract, EasyOCR) may provide per-block confidence values.
  • LLMs (local or API) sometimes expose token-level log probabilities or other uncertainty signals, depending on the model.

When available, I will record these confidence values alongside manual scores to identify systematic weaknesses.
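
For the traditional engines these values are straightforward to collect; for example, pytesseract can return per-word confidences that could be averaged into a page-level score (a sketch, assuming the same 'rus' setup as above):

```python
import pytesseract
from pytesseract import Output

def mean_confidence(image_path: str) -> float:
    data = pytesseract.image_to_data(image_path, lang="rus", output_type=Output.DICT)
    # Tesseract reports -1 for non-word boxes; keep only real word confidences.
    confidences = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confidences) / len(confidences) if confidences else 0.0
```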

3.4 Reproducibility and Iteration Speed

I will also measure practical factors:

  • Reproducibility — Can the workflow be rerun without manual steps?
  • Iteration speed — How quickly can I adjust prompts, preprocessing, or settings and reprocess a sample set?
  • Operational overhead — Rate limits, slow inference, batching constraints, or environment setup complexity.

These factors matter because the soviet.recipes project will likely require multiple passes as I refine extraction quality.

3.5 Aggregation and Scoring

All results (manual scores + confidence metadata) will be logged into a consistent spreadsheet that includes:

  • approach
  • sample images tested
  • per-image accuracy/completeness/structure scores
  • confidence values (if available)
  • notes on failure modes or recurring issues

This evaluation process ensures each approach is tested on the same terms and that the ranking reflects real-world performance rather than theoretical expectations.
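
In practice the "spreadsheet" may simply be a CSV that every test run appends to; a minimal sketch (the column names follow the list above, and the example row is illustrative only):

```python
import csv
from pathlib import Path

LOG = Path("evaluation_log.csv")
FIELDS = ["approach", "image", "accuracy", "completeness", "structure", "confidence", "notes"]

def log_result(row: dict) -> None:
    # Append one scored page per row; write the header only when creating the file.
    is_new = not LOG.exists()
    with LOG.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

log_result({
    "approach": "docling",
    "image": "page_017.jpg",
    "accuracy": 4, "completeness": 5, "structure": 3,
    "confidence": 0.91,
    "notes": "caption under illustration missed",
})
```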


4. Prioritization and Decision Rationale

I’m using a lowest-effort-first strategy, adjusted for cost and expected quality. This reduces wasted effort and helps quickly validate what works.

Key Observations

Manual LLM Extraction

High quality but extremely inefficient. Useful as a baseline, not as a scalable solution.

Docling Opportunity

Zero cost, easy to install, and purpose-built for documents. This is the most attractive low-effort starting point.

Traditional OCR Complexity

Highly reproducible but requires heavy preprocessing. Likely a fallback option.

Local vs API LLMs

Both require similar setup. Local runs are free but weaker; APIs cost money but deliver best-in-class accuracy.

My Planned Order of Experimentation

  1. Docling Framework
  2. Local Vision LLM
  3. SOTA API Backend
  4. Traditional OCR Engines
  5. Manual LLM Upload (last resort)

This progression climbs from lowest effort to highest quality with minimal wasted time.


5. Implementation Strategy and Workflow Reusability

A major benefit of this approach order is that the batch-processing harness I build early on will support every method that follows.

5.1 Shared Batch Workflow Architecture

The core workflow (iterate → process → save → retry on failure) stays identical regardless of engine.
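
Concretely, the harness only needs a function that maps an image path to extracted text; a minimal sketch of that shape with a simple retry wrapper (the plugged-in engine functions are placeholders from the earlier sketches):

```python
import time
from pathlib import Path
from typing import Callable

def run_batch(extract: Callable[[Path], str], pages_dir: str = "pages",
              out_dir: str = "output", retries: int = 2) -> None:
    """Iterate over pages, process each one, save the text, retry on failure."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for page in sorted(Path(pages_dir).glob("*.jpg")):
        target = out / f"{page.stem}.txt"
        if target.exists():
            continue  # already processed, so reruns only pick up missing pages
        for attempt in range(retries + 1):
            try:
                target.write_text(extract(page), encoding="utf-8")
                break
            except Exception as exc:
                if attempt == retries:
                    print(f"giving up on {page.name}: {exc}")
                else:
                    time.sleep(2 ** attempt)  # back off before retrying
```

Any of the engines above would plug in by passing its extraction function, e.g. run_batch(transcribe_page) or run_batch(extract_text).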

5.2 Optional Workflow Frameworks

I may use a lightweight orchestrator if the pipeline becomes more complex:

  • CrewAI Flows
  • LangGraph
  • Node-RED
  • Prefect
  • Apache Airflow (probably overkill)

For now, a simple Python script should be sufficient.

5.3 Validation and Quality-Checking Methodology

To compare outputs consistently:

  • Randomly sample 5–30 pages per engine
  • Manually review accuracy, completeness, and structure
  • Log results in a spreadsheet for side-by-side comparison
  • Repeat when tuning or changing engines

This gives a clear, empirical comparison between approaches.


6. What’s Next: The Execution Plan

Before building the full workflow, I’ll run quick validation tests:

  • Docling Sample Run on 5–10 images
  • SOTA LLM Spot Check using ChatGPT or Claude
  • Edge Case Testing on the hardest pages (dark pages, curved text, unusual layouts)

If Docling shows promise, I’ll build the batch-processing harness on top of it. If not, I’ll pivot to local LLMs or API-based extraction. Each method will get a dedicated blog post documenting results.

Follow Along

Future posts will share benchmarks, sample outputs, lessons learned, and recommendations for anyone attempting similar digitization work.
Check the tag soviet.recipes for updates.