Training a Massive AI Model on Ukraine's Entire Legal System — Here's What Would Happen

Here's a thought experiment that landed on DEV Community on July 3, 2026: what if you took every Ukrainian court decision, every law, and every legal registry, and fed them all into one giant AI training run on Google Cloud? What would you actually get?

This isn't purely theoretical. One legal tech team already has the data: roughly 2 terabytes of Ukrainian law sitting in their production systems right now. They published a detailed breakdown of what the training would look like, what it would cost, and what kind of legal reasoning the resulting model would develop.

The Dataset: 2 Terabytes of Real Legal Text

The starting point is genuine. The team, which runs a legal database called SecondLayer, has collected:

96.2 million full-text Ukrainian court decisions. Every civil, criminal, commercial, and administrative ruling they can access. That's 1.5 terabytes just for the decisions themselves.
Another 550 gigabytes of legal reference material. Constitutional law, civil codes, criminal codes, procedural codes, tax law. All the legislation that courts cite and apply.
Public registries and structured data. Business registries, debtors lists, ownership records. Plus a bonus: Spanish legal data (court rulings, tax guidance, constitutional court decisions) so the model learns European legal thinking in another language.

Add it up: about 2 terabytes of raw text. After deduplication and cleanup, that shrinks to roughly 800 to 1,000 gigabytes. When converted to "tokens"—the individual chunks an AI model learns from—it becomes 280 to 330 billion tokens.

For context: the original DeepSeek V3 model, one of today's most capable AI systems, trained on 14.8 trillion tokens of general-purpose text, most of it English and Chinese. This Ukrainian legal corpus is about 50 times smaller. But here's the key: it's specialized. Nearly every word matters. No memes, no random blog posts, no noise. Just decades of court rulings and legal doctrine.
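The corpus figures above are easy to sanity-check with a few lines of arithmetic. The sketch below uses only numbers quoted in the article; pairing the low end of the cleaned size with the low end of the token count (and high with high) is an assumption made here for illustration.

```python
# Back-of-the-envelope check of the corpus figures quoted in the article.
GB = 10**9

raw_bytes = 1_500 * GB + 550 * GB            # court decisions + reference material
clean_bytes = (800 * GB, 1_000 * GB)         # after deduplication and cleanup
tokens = (280e9, 330e9)                      # quoted token-count range

# Implied bytes per token at each end of the range (assumed low/low, high/high pairing).
bytes_per_token = tuple(b / t for b, t in zip(clean_bytes, tokens))

# How many times larger DeepSeek V3's 14.8T-token corpus is.
DEEPSEEK_V3_TOKENS = 14.8e12
ratio = tuple(DEEPSEEK_V3_TOKENS / t for t in tokens)

print(f"raw corpus: {raw_bytes / 10**12:.2f} TB")
print(f"bytes per token: {bytes_per_token[0]:.2f} to {bytes_per_token[1]:.2f}")
print(f"DeepSeek V3 corpus is {ratio[1]:.0f}x to {ratio[0]:.0f}x larger")
```

Roughly three bytes per token is plausible for Cyrillic text, where each letter takes two bytes in UTF-8, but the real ratio depends on the tokenizer that is actually used.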

Why This Model Would Be Different From Today's AI

If you ask GPT-4o or Claude Opus 4.7 about Ukrainian law right now, they often hallucinate. They confuse legal articles, mix up versions of law from before and after 2022, and can't reliably tell the difference between administrative and civil proceedings.

A model trained on 96 million Ukrainian court decisions would be different. It wouldn't just memorize facts—it would learn how Ukrainian courts apply law. It would understand that Supreme Court decisions shape how lower courts reason. It would see patterns in how legal interpretation evolves over years and decades.

Because each court decision in the dataset is tagged with metadata (which court, which judge, what date, which articles of law were cited), the model learns structure. It's not just "there's an article about liability somewhere"—it's "pursuant to Article 611 of the Civil Code (revision of June 17, 2020), in cases of penalty recovery, the following applies..."
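To make the idea of metadata-tagged decisions concrete, here is a minimal sketch of how one record might be turned into training text. The `CourtDecision` class, its field names, the header layout, and every value in the example are hypothetical placeholders; the article does not describe SecondLayer's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CourtDecision:
    """Hypothetical record layout; not SecondLayer's real schema."""
    court: str
    judge: str
    decided_on: date
    cited_articles: list[str] = field(default_factory=list)
    text: str = ""


def to_training_text(d: CourtDecision) -> str:
    """Prefix the decision body with a plain-text metadata header."""
    header = [
        f"COURT: {d.court}",
        f"JUDGE: {d.judge}",
        f"DATE: {d.decided_on.isoformat()}",
        f"CITES: {'; '.join(d.cited_articles)}",
    ]
    return "\n".join(header) + "\n\n" + d.text


# All values below are placeholders for illustration only.
example = CourtDecision(
    court="Placeholder Court",
    judge="Placeholder Judge",
    decided_on=date(2021, 3, 15),
    cited_articles=["Civil Code, Art. 611"],
    text="<decision text goes here>",
)
print(to_training_text(example))
```

Keeping the metadata in the same text stream as the ruling is one simple way to let a model associate a citation with its court and date; other encodings (special tokens, structured JSON) would work too.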

The Architecture: DeepSeek V3 Scaled Up

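The thought experiment uses DeepSeek V3 as the blueprint. This model uses an approach called Mixture of Experts (MoE). Imagine a team of specialist lawyers: when a civil case comes in, only the civil law expert activates; when it's tax law, only the tax expert speaks. In the real architecture a small router network makes that choice for every token, and the experts' specialties are learned during training rather than assigned by hand. Because most experts stay idle for any given token, the model needs far less compute than a dense model of the same size.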
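The routing idea can be shown with a toy example. This is a minimal sketch of generic top-k expert routing in plain Python, not DeepSeek V3's exact gating function; the eight scaling "experts", the random router scores, and `k=2` are all made up for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)            # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    return sum(w * experts[i](x) for i, w in route(router_logits, k))


# Toy setup: 8 "experts", each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]   # stand-in for a learned router

chosen = route(logits, k=2)
print("experts used:", [i for i, _ in chosen])
print("output:", moe_layer(3.0, experts, logits, k=2))
```

Only two of the eight expert functions are ever called for this input, which is where the compute saving comes from: the cost per token scales with the experts that run, not with the total number of experts.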

The real DeepSeek V3 has 671 billion parameters (adjustable weights that hold learned knowledge). Only 37 billion of those activate for each token it processes: just the experts the router selects.

The hypothetical version scales this to 860 billion parameters, with about 47 billion active per token. It's roughly 30% larger, with more specialized experts. Why more experts? Because Ukrainian legal text benefits from extreme specialization. One expert might specialize in cassation reasoning, another in liability disputes, another in tax law. The idea is that the model learns finer distinctions without losing general capability.
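The parameter counts above can be compared directly. This short sketch uses only the four figures quoted in the article and checks how much of each model is active per token and how much larger the hypothetical version is.

```python
# Parameter counts in billions, as quoted in the article.
REAL_TOTAL, REAL_ACTIVE = 671, 37      # DeepSeek V3
HYPO_TOTAL, HYPO_ACTIVE = 860, 47      # hypothetical legal model

real_fraction = REAL_ACTIVE / REAL_TOTAL
hypo_fraction = HYPO_ACTIVE / HYPO_TOTAL
total_growth = HYPO_TOTAL / REAL_TOTAL - 1
active_growth = HYPO_ACTIVE / REAL_ACTIVE - 1

print(f"DeepSeek V3 active share:  {real_fraction:.1%}")
print(f"Hypothetical active share: {hypo_fraction:.1%}")
print(f"Total parameters grow by   {total_growth:.0%}")
print(f"Active parameters grow by  {active_growth:.0%}")
```

Both models keep about 5.5% of their weights active per token, so the scaled-up design preserves the original sparsity while growing total and active parameters by a little under 30%.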

The Hardware: Google's Most Powerful AI Chips

You can't train a model this large on normal computers. Google Cloud Platform offers TPU v5p—specialized chips designed purely for AI training. They pack 95 gigabytes of high-bandwidth memory each and communicate with each other at extraordinary speed.

The minimum viable setup: 2,048 of these chips, spread across 512 machines. One pass through all 280 billion tokens would take 3 to 4 days. For a production model, you'd want at least 3 passes to really let the model learn. That puts the full run at 9 to 12 days of continuous training.
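One way to test whether 3 to 4 days per pass is plausible is the common rule of thumb that training costs about 6 FLOPs per active parameter per token. The sketch below applies it to the article's numbers. The 459 TFLOP/s bf16 peak per TPU v5p chip is an assumption taken from Google's published specifications, and the rule of thumb itself ignores attention and communication overhead, so treat the result as a rough plausibility check only.

```python
# Plausibility check using the "6 x active parameters x tokens" FLOPs estimate.
TOKENS_PER_PASS = 280e9        # lower token estimate from the article
ACTIVE_PARAMS = 47e9           # active parameters of the hypothetical model
CHIPS = 2048                   # chip count from the article
PEAK_FLOPS_PER_CHIP = 459e12   # assumption: published bf16 peak for TPU v5p
SECONDS_PER_DAY = 86_400

flops_per_pass = 6 * ACTIVE_PARAMS * TOKENS_PER_PASS


def implied_utilization(days_per_pass: float) -> float:
    """Fraction of cluster peak needed to finish one pass in the given time."""
    required_flops_per_sec = flops_per_pass / (days_per_pass * SECONDS_PER_DAY)
    return required_flops_per_sec / (CHIPS * PEAK_FLOPS_PER_CHIP)


def tokens_per_chip_second(days_per_pass: float) -> float:
    """Average training throughput each chip must sustain."""
    return TOKENS_PER_PASS / (days_per_pass * SECONDS_PER_DAY) / CHIPS


for days in (3, 4):
    print(f"{days} days/pass: {implied_utilization(days):.0%} of peak, "
          f"{tokens_per_chip_second(days):.0f} tokens per chip-second, "
          f"{3 * days} days for 3 passes")
```

Under these assumptions the quoted timeline needs the cluster to sustain roughly 24% to 32% of its peak throughput, which is an ambitious but not unreasonable target for a large MoE training run.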

The Cost: $2.5 to $4.2 Million

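Google charges about $4.20 per chip per hour on-demand, or $2.50 per chip per hour if you commit to a 3-year contract. Twelve days on 2,048 chips is about 590,000 chip-hours, which works out to roughly $1.5 million at the committed rate and $2.5 million on-demand. The source puts the training run alone at $2.5 to $4.2 million, so the upper end of that range implies more chip-hours than the 12-day run described here.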
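The cost arithmetic is simple enough to reproduce. This sketch multiplies chip count, run length, and hourly price using the figures quoted in the article; the prices are the article's, not verified current GCP rates, and the function name is just illustrative.

```python
CHIPS = 2048          # chip count from the article
ON_DEMAND = 4.20      # USD per chip-hour, as quoted in the article
COMMITTED = 2.50      # USD per chip-hour with a 3-year commitment, as quoted


def run_cost(days: float, price_per_chip_hour: float, chips: int = CHIPS) -> float:
    """Total cost in USD of keeping `chips` busy for `days` full days."""
    chip_hours = chips * days * 24
    return chip_hours * price_per_chip_hour


for days in (9, 12):
    print(f"{days} days: "
          f"${run_cost(days, COMMITTED) / 1e6:.2f}M committed, "
          f"${run_cost(days, ON_DEMAND) / 1e6:.2f}M on-demand")
```

Changing `days` or `chips` shows how sensitive the budget is to run length and cluster size, which makes it easy to compare other scenarios against the range quoted in the article.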

Add another $200,000 to $500,000 for experimenting with smaller test models and fine-tuning the final version. Plus several hundred thousand more for storing model checkpoints during training.

For a specialized legal model serving millions of queries per year in production, this is economically defensible. It's a significant bet, but not absurd for an enterprise legal platform.

Why Now? Why This Matters in 2026

AI has become useful for many tasks, but specialized reasoning in non-English languages remains a gap. Ukrainian businesses, courts, and lawyers still depend heavily on human legal expertise for complex cases. A model trained on Ukrainian law could change that—faster legal research, better case prediction, more accessible guidance.

The authors aren't saying they've built this. They're asking: could we? What would it cost? What would we gain? The data exists. The hardware exists. The methods exist. It's a question of economics and whether the legal system wants this kind of tool.

Conclusion

The thought experiment is compelling because it's grounded in real constraints. The team has the data. Google has the hardware. The architecture is proven. The only questions are cost, practical execution, and whether specialized legal AI is worth the investment. For a non-English legal system that current frontier models struggle with, the answer might be yes.

Merits

Would understand Ukrainian law far better than any general-purpose AI available today
Mixture-of-Experts architecture runs cheaply at inference—only relevant experts activate, saving compute
Could serve millions of legal queries per year in production economically
Learns not just what the law says but how courts apply it in practice
Metadata in the dataset (court type, judge, date) enables more precise reasoning
Would advance legal tech in a language where specialized AI is scarce

Demerits

Extremely expensive upfront cost ($2.5 to $4.2 million for training alone)
Requires rare, hard-to-access hardware (TPU v5p pods) that most organizations can't rent
Dataset is niche—most organizations don't have 2 terabytes of legal corpus
Model is deeply specialized in Ukrainian law; less useful for general tasks
Ongoing infrastructure costs for storage and model maintenance
High specialization means poor generalization to other jurisdictions or legal systems
Risk of overfitting to one country's legal system

Caution

This article is educational and explores a hypothetical scenario based on publicly available technical information. Any specific values mentioned—costs, training timelines, model sizes, pricing per chip-hour—should be verified against current GCP pricing and the original source before relying on them for decision-making. All placeholder values in illustrative examples must be replaced with actual values appropriate to your use case and environment. Readers should consult the original technical publication, domain experts, and Google Cloud documentation before attempting similar projects at production scale.

Frequently asked questions

What is a Mixture of Experts (MoE) model and how does it save compute?
How much does it cost to train a large AI model on Google Cloud Platform?
Why would Ukrainian law need a specialized AI model?
Can you train an AI model to specialize in a specific domain like law?
What's the difference between total parameters and active parameters in an MoE model?
How long does it take to train a model with hundreds of billions of parameters?
Would a specialized legal AI model eventually replace lawyers?
What other countries or industries could benefit from domain-specific AI models?
