How Ukraine's Largest Court Database Got Vectorized: 33.7M Decisions, One Vector DB

Making public judicial records searchable through semantic search

Imagine a lawyer typing a plain-language question and instantly getting back the five most relevant court rulings, with the exact paragraphs that matter, instead of wading through thousands of keyword hits. That's what semantic search promises. But making it real for 33.7 million court decisions is a different beast.

The EDRSR — the Unified State Register of Court Decisions — has opened Ukraine's entire judicial record to the public. Now a team is working to make all that text actually searchable.

Why This Matters Right Now

As of 2026, court decisions are public data in Ukraine, but they remain difficult to search effectively. A lawyer looking for precedent on a specific issue has to rely on clunky keyword search, which returns thousands of irrelevant hits. That kind of search doesn't understand meaning; it only finds documents containing the words you typed. With semantic search powered by vector embeddings, lawyers can ask what they actually want to know: "Is there case law on recovering bank prepayment fees?" The system finds the five most relevant rulings, pulls out the key paragraphs, and shows how courts reasoned through it.

But before you can do semantic search, you have to vectorize the text. And Ukraine's court system is not a small problem.

The Scale

The register holds decisions going back to 2006. The breakdown:

  • Civil cases (CPC): 33.7M documents — the largest category
  • Criminal cases (CrPC): 12M+
  • Administrative cases (CAS): 14M+
  • Commercial cases (CC): 6M+
  • Misdemeanors (CUaP): 6M+

As of now, the Qdrant vector database holds 44M+ vectors. Civil cases are 42% done (14.3M out of 33.7M processed). Once the civil cohort finishes, the collection will hold roughly 63M+ vectors — two orders of magnitude larger than a typical RAG project, which might have 100K to 1M vectors.

Processing this many documents meant building a pipeline that wouldn't fall over halfway through.

The Technical Stack

The team went with proven, practical choices:

Embedding model: Voyage AI's voyage-3.5, which outputs 1024-dimensional vectors. They tested Voyage 3 Large and OpenAI's text-embedding-3-large but found the quality gain on legal text didn't justify the cost difference. Voyage 3 Large is three times more expensive.

Vector database: Qdrant v1.17, self-hosted in Docker on a dedicated Amazon EC2 instance (r6a.xlarge: 4 CPU, 32 GB RAM, 2 TB gp3 storage). They gave it its own instance because 44M+ points with HNSW indexing were running the production database out of memory and blocking the chat service entirely.

Source of truth: PostgreSQL 15, with tables partitioned by adjudication date. The full court texts live in one table, metadata in another. A JOIN across all partitions touches 30M+ rows, so the pipeline processes one year at a time.
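A minimal sketch of the year-at-a-time approach, assuming the date filter lines up with the partition key so each query touches only one year's partitions. The table and column names (decision_meta, decision_text, adjudication_date, full_text) are hypothetical; the article does not give the schema.

```python
from datetime import date


def year_windows(first_year, last_year):
    """Yield (start, end) date pairs, one per calendar year; end is exclusive."""
    for year in range(first_year, last_year + 1):
        yield date(year, 1, 1), date(year + 1, 1, 1)


# Hypothetical schema: a metadata table and a text table sharing doc_id.
# Filtering on the partition key keeps the JOIN inside one year's partitions.
YEAR_QUERY = """
SELECT m.doc_id, t.full_text
FROM decision_meta AS m
JOIN decision_text AS t USING (doc_id)
WHERE m.adjudication_date >= %s
  AND m.adjudication_date < %s
ORDER BY m.doc_id
"""

for start, end in year_windows(2006, 2008):
    # A real pipeline would pass (start, end) as query parameters here.
    print(f"processing decisions from {start} up to (not including) {end}")
```

Half-open ranges avoid double-counting decisions dated exactly on January 1.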

Pipeline runtime: Python 3.11, asyncio, aiohttp. No heavy frameworks — just straight HTTP calls to Voyage and Qdrant. The whole thing is 440 lines in one file.
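To give a sense of what "straight HTTP calls" can look like, here is a sketch that only builds the two request payloads, one for Voyage's embeddings endpoint and one for Qdrant's point upsert endpoint. The helper names and the collection name are assumptions for illustration; the team's actual code is not shown in the source.

```python
VOYAGE_URL = "https://api.voyageai.com/v1/embeddings"


def voyage_request(texts, api_key, model="voyage-3.5"):
    """Build URL, headers, and JSON body for a POST to the Voyage embeddings API."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "input": list(texts), "input_type": "document"}
    return VOYAGE_URL, headers, body


def qdrant_upsert_request(base_url, collection, points, api_key):
    """Build URL, headers, and JSON body for a PUT that upserts points into Qdrant."""
    url = f"{base_url}/collections/{collection}/points?wait=true"
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, headers, {"points": list(points)}


# "court_decisions" is a hypothetical collection name.
url, headers, body = qdrant_upsert_request(
    "http://localhost:6333",
    "court_decisions",
    [{"id": 1, "vector": [0.0] * 1024, "payload": {"doc_id": 0}}],
    api_key="QDRANT_KEY",
)
```

With aiohttp, each tuple maps directly onto a session.post(...) or session.put(...) call with json=body.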

How They Split the Work

Court decisions are long. An average civil ruling is 8,000 to 12,000 characters. Some reach 200,000. Voyage accepts up to 32,000 tokens per input, but quality degrades on long contexts, and a single vector for an entire decision is of little use for retrieval: it cannot tell the search layer which paragraph is the relevant one.

So the team chunks:

  • Maximum 2,048 characters per chunk
  • 50-word overlap between neighboring chunks (to preserve context at boundaries)
  • Split on paragraph boundaries to keep semantic coherence
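
The three rules above can be sketched as a greedy paragraph packer. This is one plausible reading, not the team's code: it assumes paragraphs are separated by blank lines, counts the overlap toward the 2,048-character budget, and drops the overlap when it would not fit.

```python
import re

MAX_CHARS = 2048
OVERLAP_WORDS = 50


def split_long(paragraph, limit):
    """Break an oversized paragraph into word-aligned pieces of at most `limit` chars.

    A single word longer than `limit` is kept whole (an edge case left unhandled).
    """
    pieces, current = [], ""
    for word in paragraph.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > limit and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text, max_chars=MAX_CHARS, overlap_words=OVERLAP_WORDS):
    """Pack paragraphs into chunks, seeding each new chunk with the previous tail."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        for piece in split_long(para, max_chars):
            candidate = f"{current}\n\n{piece}" if current else piece
            if not current or len(candidate) <= max_chars:
                current = candidate
                continue
            chunks.append(current)
            # Carry the last `overlap_words` words forward to preserve context.
            tail = " ".join(current.split()[-overlap_words:])
            seeded = f"{tail}\n\n{piece}"
            current = seeded if len(seeded) <= max_chars else piece
    if current:
        chunks.append(current)
    return chunks
```

Calling chunk_text(decision_text) returns a list of strings ready to be embedded; the position in the list serves as the chunk index.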

On average, one decision yields 2.7 chunks. Each chunk gets a composite ID in Qdrant (doc_id × 1000 + chunk_index), which lets a single payload filter pull all chunks of one decision.
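The composite ID arithmetic is simple enough to sketch. The payload layout below, with doc_id and chunk_index stored alongside the vector, is an assumption; it is one way a payload filter on doc_id could retrieve every chunk of a decision.

```python
CHUNKS_PER_DOC = 1000  # multiplier from the formula doc_id × 1000 + chunk_index


def point_id(doc_id, chunk_index):
    """Combine a document ID and a chunk index into one integer point ID."""
    if not 0 <= chunk_index < CHUNKS_PER_DOC:
        raise ValueError("chunk_index must be between 0 and 999")
    return doc_id * CHUNKS_PER_DOC + chunk_index


def split_point_id(pid):
    """Recover (doc_id, chunk_index) from a composite point ID."""
    return divmod(pid, CHUNKS_PER_DOC)


def build_point(doc_id, chunk_index, vector, text):
    """Assemble a Qdrant-style point dict; the payload fields are hypothetical."""
    return {
        "id": point_id(doc_id, chunk_index),
        "vector": vector,
        "payload": {"doc_id": doc_id, "chunk_index": chunk_index, "text": text},
    }


point = build_point(12345, 2, [0.0] * 1024, "example chunk text")
```

A 200,000-character decision split into chunks of up to 2,048 characters produces roughly a hundred chunks, so a multiplier of 1,000 leaves comfortable room. Because the ID is deterministic, re-sending the same chunk overwrites the existing point rather than creating a duplicate.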

Speed and Concurrency

Voyage has a rate limit: 2,000 requests per minute per API key. The team uses two keys and round-robins between them, hitting a theoretical 4,000 RPM ceiling.
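Round-robin rotation over two keys needs very little code. A sketch, with a hypothetical KeyRotator class and placeholder key strings:

```python
from itertools import cycle


class KeyRotator:
    """Hand out API keys in strict rotation so load is split evenly."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = cycle(keys)

    def next_key(self):
        return next(self._cycle)


rotator = KeyRotator(["key-A", "key-B"])  # placeholder values
first_four = [rotator.next_key() for _ in range(4)]
```

In a single-threaded asyncio program, next_key() runs without interruption between awaits, so no lock is needed.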

They hold concurrency at 50 in-flight requests and get a steady 63 documents processed per second. That's about 170 requests per minute per key, well under the limit. At concurrency 70 they hit what they describe as a Python GIL (global interpreter lock) wall: the process stalled at 13% CPU, made no progress, and threw no errors. It simply hung. Dropping back to 50 made it run smoothly again.
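One common way to cap in-flight requests in asyncio is a semaphore. The sketch below assumes that approach; the source does not say how the team enforces its limit. fake_request stands in for a real HTTP call.

```python
import asyncio


async def run_bounded(jobs, limit=50):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo(limit, n_jobs):
    """Measure the peak number of simultaneously running fake requests."""
    active = peak = 0

    async def fake_request():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for network latency
        active -= 1
        return "ok"

    results = await run_bounded([fake_request] * n_jobs, limit=limit)
    return peak, len(results)


peak, completed = asyncio.run(demo(limit=5, n_jobs=20))
```

The limit is a single tunable number, which makes experiments like the 50-versus-70 comparison easy to repeat on other hardware.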

For every 100 documents, the pipeline batches up to 500 chunks, sends them to Voyage, collects the embeddings, builds Qdrant points, and upserts them. On error (a 429 rate-limit response or a network timeout), it retries with exponential backoff and jitter, up to 5 times.
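A sketch of exponential backoff with jitter capped at 5 retries. The "full jitter" variant, the base delay, and the 60-second cap are assumptions, and RetryableError is a hypothetical stand-in for a 429 response or a timeout.

```python
import asyncio
import random


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (HTTP 429, timeouts)."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.random() * min(cap, base * 2 ** attempt)


async def with_retries(call, max_retries=5, base=1.0):
    """Await `call()`; on RetryableError, sleep and retry up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            await asyncio.sleep(backoff_delay(attempt, base))
```

Randomizing the delay keeps 50 concurrent workers from retrying in lockstep after a shared rate-limit response.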

The Checkpoint That Saved Weeks

At 33.7M documents, any failure means hours or days of lost work. The team built a checkpoint system: every 1,000 processed documents, the pipeline writes a JSON snapshot with the last document ID, count, tokens used, and timestamp.

On restart, it reads that checkpoint and resumes from WHERE doc_id > last_doc_id. No duplicate work, no restart from zero.
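A sketch of such a checkpoint, assuming the four fields listed above. The field names, the file path, and the table name in the resume query are hypothetical. Writing to a temporary file and renaming it is an added safeguard, not something the source describes.

```python
import json
import os
import time
from pathlib import Path


def save_checkpoint(path, last_doc_id, docs_done, tokens_used):
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    path = Path(path)
    state = {
        "last_doc_id": last_doc_id,
        "docs_done": docs_done,
        "tokens_used": tokens_used,
        "timestamp": time.time(),
    }
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_last_doc_id(path):
    """Return the last processed doc_id, or 0 when no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return 0
    return json.loads(path.read_text())["last_doc_id"]


# Hypothetical resume query; the table name is not given in the source.
RESUME_QUERY = """
SELECT doc_id, full_text
FROM decision_text
WHERE doc_id > %s
ORDER BY doc_id
LIMIT %s
"""
```

Ordering by doc_id is what makes "greater than the last ID" a safe resume point. Work done after the most recent snapshot is simply repeated, which is harmless when point IDs are deterministic.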

This has saved them twice. Once when PostgreSQL ran out of memory (more below). Once when Qdrant restarted and lost its API key from the environment.

A Production Incident: Postgres Out of Memory

At 2.86M documents, PostgreSQL fell into recovery mode. The root cause: a config mismatch.

The database was set to shared_buffers=16GB, but the container memory limit was 12GB. PostgreSQL tried to use more memory than the container allowed, and the OS killed the process.

The fix (PR #1453) bumped the container limit to 24GB and shm_size to 16GB. After restarting, PostgreSQL came up in 4 seconds and stayed stable.

The lesson: PostgreSQL configuration parameters must align with container memory limits. The system runs fine until the first load spike, then fails ungracefully.
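A mismatch like this can be caught before deployment with a small sanity check. The sketch below assumes PostgreSQL-style size strings (kB, MB, GB, TB) for both values; the function names are hypothetical, and a real check would read the values from the actual config files.

```python
UNITS = {"kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def parse_size(value):
    """Convert a size string such as '16GB' into bytes."""
    for unit, factor in UNITS.items():
        if value.endswith(unit):
            return int(value[: -len(unit)]) * factor
    raise ValueError(f"unrecognized size: {value!r}")


def fits_in_container(shared_buffers, container_limit):
    """True when shared_buffers is strictly smaller than the container limit."""
    return parse_size(shared_buffers) < parse_size(container_limit)


def buffer_share(shared_buffers, container_limit):
    """Fraction of the container's memory claimed by shared_buffers."""
    return parse_size(shared_buffers) / parse_size(container_limit)


before = fits_in_container("16GB", "12GB")  # the configuration that failed
after = fits_in_container("16GB", "24GB")   # the configuration after the fix
```

Passing this check is necessary but not sufficient: connections, work_mem, and maintenance operations all need memory on top of shared_buffers.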

They also bumped swap on their dev machine from 8GB to 24GB because heavy Voyage API traffic generates lots of temporary objects in the Python process.

The Bill So Far

One civil document averages 2.7 chunks × 850 tokens = 2,300 tokens. At Voyage's pricing of 6 cents per million tokens, that's 0.014 cents per document — roughly 138 microdollars.
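The per-document arithmetic, worked through in code using only the figures quoted above. The unrounded result is slightly below the article's 138 microdollars because the article rounds 2,295 tokens up to 2,300.

```python
CHUNKS_PER_DOC = 2.7
TOKENS_PER_CHUNK = 850
USD_PER_MILLION_TOKENS = 0.06


def tokens_per_document():
    return CHUNKS_PER_DOC * TOKENS_PER_CHUNK


def cost_per_document():
    """Embedding cost of one average civil decision, in US dollars."""
    return tokens_per_document() * USD_PER_MILLION_TOKENS / 1_000_000


def cost_for(documents):
    """Embedding cost for a given number of average documents, in US dollars."""
    return documents * cost_per_document()


per_doc_usd = cost_per_document()
spent_so_far = cost_for(14_300_000)
print(f"{tokens_per_document():.0f} tokens, ${per_doc_usd:.6f} per document")
print(f"about ${spent_so_far:,.0f} for 14.3M documents")
```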

So far (42% complete):

  • 14.3M documents processed
  • ~$1,980 spent on Voyage API
  • ~63 hours of pipeline runtime

Remaining (58% to go):

  • 19.4M documents
  • Estimated ~$2,680 in Voyage costs
  • Estimated 85 hours (~3.5 days of continuous running)

Total cost for the full civil cohort: approximately $4,660 in API fees.
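The runtime estimate follows from the throughput observed so far. A sketch that derives the rate from the 14.3M documents processed in 63 hours and projects it onto the remaining 19.4M, assuming throughput stays constant:

```python
def observed_rate(docs_done, hours):
    """Average documents per second over the run so far."""
    return docs_done / (hours * 3600)


def hours_remaining(docs_left, docs_per_second):
    """Projected hours needed to process the remaining documents."""
    return docs_left / docs_per_second / 3600


rate = observed_rate(14_300_000, 63)
eta_hours = hours_remaining(19_400_000, rate)
print(f"{rate:.1f} docs/s, {eta_hours:.1f} hours ({eta_hours / 24:.1f} days) remaining")
```

The derived rate matches the 63 documents per second quoted earlier, so the reported figures are internally consistent.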

The dedicated EC2 instance runs about $0.20 per hour on-demand — roughly $145 per month. Cheaper than recovering from an OOM incident on production.

For comparison: the same budget on OpenAI's text-embedding-3-large would vectorize only a quarter of the volume. At this scale, Voyage makes financial sense.

What It Enables

Once the pipeline finishes, the collection will hold 63M+ vectors across all civil cases. A lawyer types a natural-language query — "case law on voiding a sale contract due to seller incapacity" — and the system surfaces the most relevant decisions from the right jurisdiction, with key paragraph extracts and links back to EDRSR.

That's semantic search for the entire Ukrainian civil court system.

Conclusion

Vectorizing 33.7M court decisions is not a small engineering problem. It requires picking the right embedding model for the cost-quality tradeoff, isolating your vector database to avoid starving production, carefully managing concurrency to avoid silent stalls like the one the team attributes to the GIL, and building fault tolerance into a 100+ hour pipeline. Ukraine's judicial record is now being transformed into a searchable knowledge base.

Merits

  • Semantic search over full-text. Lawyers get answers, not keyword hits.
  • Cost-efficient at scale. Voyage AI was 4x cheaper than the alternative for this volume.
  • Fault-tolerant design. Checkpoints let the pipeline survive failures without replay.
  • Practical infrastructure. Qdrant in a separate container, PostgreSQL with correct memory bounds — design decisions that protect production.
  • Documented incident. The Postgres OOM failure became a clear lesson about configuration alignment.

Demerits

  • Long pipeline duration. 85+ hours remaining means several more days of continuous operation; infrastructure can still fail mid-run despite checkpoints.
  • Two orders of magnitude larger than typical. 63M+ vectors is unproven territory; scaling risks remain.
  • Dedicated hardware cost. The r6a.xlarge instance is necessary but adds ongoing operational expense.
  • Embedding model dependency. Quality depends on the embedding model; if Voyage changes pricing or service, the calculus shifts.
  • No mention of query latency. The article doesn't discuss how fast a semantic search actually executes on 63M vectors.

Caution

This article is educational and describes a real project as reported in the source material. Any implementation should verify cost figures with current Voyage pricing, test PostgreSQL configuration parameters in your own environment with your own data, and validate concurrency settings (the 50 concurrent requests that worked here may not suit all systems). The Postgres shared_buffers value must match your actual container memory. Before relying on any specific numbers or approaches, consult the original source and your own infrastructure.

Frequently asked questions

  • What is the EDRSR and why does Ukraine open all court decisions to the public?
  • How do vector embeddings work and why are they better than keyword search for legal documents?
  • Why did the team use Qdrant instead of other vector databases like Pinecone or Milvus?
  • What is a GIL deadlock and why did concurrency 70 cause the pipeline to hang?
  • How does checkpoint-based resume work and how much restart overhead does it add?
  • Why is PostgreSQL configuration alignment with container memory limits so critical?
  • What is the cost difference between Voyage AI and OpenAI embeddings at 33.7M documents?
  • How long does a semantic search query take to execute on a 63M+ vector collection?

Tags

#vectorsearch #qdrant #voyageai #legaltech #ukraine #semanticsearch #ragapplications #scalinginfrastructure
