In 2022, running an LLM locally was a researcher's side project. In 2026, it has become a credible option, sometimes preferable to OpenAI or Anthropic for an enterprise. Reasons pile up: data sovereignty, predictable costs, AI Act compliance, independence from a single vendor.
The problem: the landscape has become so crowded it's hard to read. Mistral, Llama, Qwen, DeepSeek, Gemma, Phi. Each family ships multiple sizes, multiple variants (generalist, code, reasoning, multimodal), multiple licenses. A CIO approaching this for the first time can easily lose two months sorting through it.
This guide gives you the reference points to decide fast. Not an absolute benchmark ranking (scores move every quarter; we'll get to that) but a stable analytical framework. At the end, we explain how we work at GettIA to deploy the right local LLM for our clients.
## The landscape at a glance
| Family | Origin | Typical license | Key strengths | Watch out for |
|---|---|---|---|---|
| Mistral | 🇫🇷 France | Apache 2.0 (most) | Native French, AI Act friendly, European ecosystem | Flagship models (Large, Medium) are closed |
| Llama | 🇺🇸 Meta | Llama Community License | Mature ecosystem, massive tooling, multilingual | Restrictive license above 700M MAU |
| Qwen | 🇨🇳 Alibaba | Apache 2.0 (most) | Versatility, strong benchmarks, multilingual | Chinese origin, geopolitical question by sector |
| DeepSeek | 🇨🇳 DeepSeek | MIT (weights) | Top-tier reasoning, excellent cost/performance | Same geopolitical question as Qwen |
| Gemma | 🇺🇸 Google | Gemma Terms | Multilingual, edge variants (2B, 4B), multimodal | License with usage clauses |
| Phi | 🇺🇸 Microsoft | MIT | Very efficient small models | Mostly English |
That's the backdrop. Now let's detail what matters for an enterprise decision, family by family.
## Mistral: the natural choice for a French context
Mistral AI is the European reference, and that matters beyond marketing. Most of their models (Mistral 7B, Mistral Small, the historical Mixtral, Codestral for code, Ministral for edge) ship under Apache 2.0, an ultra-permissive license that covers any commercial use without hidden clauses.
What we use in practice:
- Mistral 7B Instruct and Ministral for edge (laptops, user workstations) in Q4_K_M. VRAM ~5 GB, runs a decent French chatbot on a modern laptop.
- Mistral Small (≈24B) in Q4_K_M for an enterprise server with a mid-range GPU. ~15-17 GB VRAM. Good French precision and reasoning for structured summarization or analysis prompts.
- Codestral for code use cases (completion, documentation, review).
Watch out: Mistral has gone upmarket with closed models (Mistral Large, Mistral Medium) sold via their API. Those aren't downloadable and require a commercial contract with Mistral. So if your brief says "100% on-premise, nothing at a third party", stick to the open-weight versions.
Why we lean toward it for French clients: Mistral models are trained on a significant proportion of French, which shows in summary quality, spelling, and administrative phrasing. The AI Act also looks favorably on models published by EU actors.
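To make the laptop scenario concrete, here's a minimal sketch of querying a quantized Mistral 7B served locally by Ollama. The model tag and port are assumptions (Ollama's defaults); adapt them to whatever you actually pulled.

```python
# Minimal sketch: query a quantized Mistral 7B served locally by Ollama.
# Assumptions: Ollama listens on its default port 11434 and the model was
# pulled as "mistral:7b-instruct-q4_K_M" (adjust the tag to your setup).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build an Ollama chat payload; stream=False returns one JSON reply."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt: str, model: str = "mistral:7b-instruct-q4_K_M") -> str:
    """Send the prompt to the local server and return the assistant's answer."""
    data = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

if __name__ == "__main__":
    print(ask("Résume en trois points les avantages d'un LLM local."))
```

Everything stays on the machine: no API key, no telemetry, just a local HTTP call.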
## Llama: the mature ecosystem, with a license asterisk
Meta publishes Llama models under a community license that allows commercial use unless your company has more than 700 million monthly active users (read: unless you are Meta or one of its handful of direct competitors). For 99.9% of companies, that clause never bites.
What we use in practice:
- Llama 3.3 70B Instruct in Q4_K_M (~45 GB VRAM): our go-to for a server with a data center GPU (A100 80GB, H100, L40S). Reasoning quality comparable to GPT-4o 2024 on most use cases.
- Llama 3.1 8B Instruct in Q4_K_M: the lightweight version for edge or multi-user parallel deployments.
- Llama 3.2 11B Vision for multimodal use cases (info extraction from scanned docs, image auditing).
Strengths: massive tooling ecosystem (llama.cpp, vLLM, MLX, Ollama, LM Studio…), quantizations available in all formats, active fine-tuning community.
What to watch: Llama models are multilingual but biased toward English in training. On technical French content, you'll sometimes see less natural turns of phrase than with Mistral. Test on your own data before deciding.
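That last advice is worth operationalizing: a throwaway harness that runs the same French prompts through two candidate models and dumps the outputs side by side is usually enough for a first business-side review. A minimal sketch, where `generate()` is a placeholder you wire to your local endpoint (Ollama, vLLM, etc.):

```python
# Sketch of a "test on your own data" harness: same French prompts through
# two candidate models, outputs dumped side by side for human review.
# generate() is a placeholder -- wire it to your local inference server.
import csv

PROMPTS = [
    "Résume ce compte rendu de réunion en 5 points.",
    "Rédige un courrier administratif de relance.",
]

def generate(model: str, prompt: str) -> str:
    """Placeholder: replace with a real call to your local endpoint."""
    return f"[{model} output for: {prompt[:30]}...]"

def side_by_side(prompts, model_a: str, model_b: str) -> list:
    """One row per prompt, with both models' outputs, ready for review."""
    return [
        {"prompt": p, model_a: generate(model_a, p), model_b: generate(model_b, p)}
        for p in prompts
    ]

if __name__ == "__main__":
    rows = side_by_side(PROMPTS, "mistral-small", "llama3.3:70b")
    with open("french_ab_test.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Hand the CSV to the business users who will actually read the outputs; their verdict matters more than any leaderboard delta.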
## Qwen: the dark horse that competes on benchmarks
Alibaba has pushed Qwen (especially Qwen 2.5 and the recent Qwen 3 releases) to a level where they often contest Llama and Mistral on public leaderboards. And with an Apache 2.0 license on most variants, it's a clean commercial choice.
What we use in practice:
- Qwen 2.5 14B and Qwen 2.5 32B Instruct: very good size/perf balance for a mid-market server.
- Qwen 2.5 72B Instruct: Llama 3.3 70B equivalent, sometimes better on reasoning and math benchmarks.
- Qwen 2.5-Coder: code-specialized, very competitive with Codestral.
The geopolitical question: Qwen is developed by Alibaba, a Chinese company. For the vast majority of private enterprises, this changes nothing: the weights are open, run 100% on your infrastructure, with no telemetry and no dependency on external servers. But for defense, nuclear, national security, or sovereign public administration clients, Chinese provenance may be a blocker in security review. Settle this upstream with the security officer.
Multilingualism is a real strength: Qwen is explicitly trained multilingual with effort on European languages. On French test sets, it holds up very honorably.
## DeepSeek: the reasoning specialist at a small cost
DeepSeek (a Chinese company) shook the market in early 2025 with DeepSeek V3 (MoE architecture, 671B total parameters, 37B active) and especially DeepSeek R1, a reasoning model trained at a very low cost that rivals OpenAI's reasoning models (o1, o3) on several benchmarks.
Weights ship under an MIT license, ultra-permissive.
What we use in practice:
- DeepSeek R1 (or its smaller distillations) when the client has a pointed reasoning use case: contract analysis, complex multi-step problem solving, planning.
- DeepSeek V3 for generalist chatbots with GPT-4o quality on a hefty enough infrastructure (MoE needs a lot of VRAM even if active params are limited).
Strengths: unbeatable quality/inference-cost ratio on reasoning use cases. DeepSeek's research output is prolific.
Watch out: the same geopolitical questions as Qwen around provenance. And the MoE architecture demands far more total VRAM than a dense model of comparable quality: all experts must be loaded in memory, even though inference activates only a subset of them per token.
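The arithmetic is worth spelling out: for a MoE model you size VRAM on total parameters, not active ones. A rough weights-only estimate (ignoring KV cache and runtime overhead), using an approximate 4.5 bits/weight for Q4-class quantization:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (weights only, no KV cache/overhead)."""
    return params_billion * bits_per_weight / 8

# DeepSeek V3: 671B total parameters, ~37B active per token.
total_gb  = weight_memory_gb(671, 4.5)  # ~377 GB: what must fit in VRAM
active_gb = weight_memory_gb(37, 4.5)   # ~21 GB: what a forward pass uses
```

Hence the multi-GPU requirement: even heavily quantized, the full expert set outweighs a single 80 GB card several times over.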
## The outsiders that count: Gemma and Phi
Gemma 2 and Gemma 3 (Google) ship in small sizes (2B, 9B, 27B for Gemma 2; 1B, 4B, 12B, 27B for Gemma 3 with multimodal variants). The Gemma license imposes a few usage conditions (listed prohibited uses) that need a legal pass, but doesn't prevent standard commercial use.
Phi-4 (Microsoft, ~14B parameters) is a small model remarkably efficient for its size, under MIT license. Excellent option when hardware is constrained and usage is mostly English.
For both, we pick them especially when the use case requires very small models (edge, IoT, mid-range laptops) or when RAM constraints are tight.
## The real decision criteria (not the benchmarks)
Leaderboards (Hugging Face's Open LLM Leaderboard, Artificial Analysis, LMSYS Chatbot Arena) reshuffle every month. Chasing benchmarks is a time sink in enterprise. What actually decides:
1. License compatibility with your use
- Apache 2.0 / MIT: ideal, no commercial restrictions. Mistral (open variants), Qwen, DeepSeek, Phi.
- Llama Community License: OK unless you're a tech giant (>700M MAU).
- Gemma Terms: OK with light legal review.
- Closed-weight or API-only models: ruled out for a truly on-premise project.
2. French capability (if that's the primary language)
On the tests we systematically run at GettIA (meeting summaries, administrative writing, French contract analysis), our rough preference order:
- Mistral Small / Mistral 7B: most natural vocabulary and phrasing
- Qwen 2.5: very good, sometimes slightly stiffer phrasing
- Llama 3.3: correct but with occasional unwanted anglicisms
- DeepSeek V3: correct, especially strong on reasoning regardless of language
- Gemma 3: acceptable but not its main terrain
- Phi-4: avoid for French business content
3. Available hardware (and its cost)
A quick decision tree:
- User laptop (modern CPU, no dedicated GPU) → 3B to 8B in Q4_K_M (Mistral 7B, Llama 3.1 8B, Ministral, Phi-4-mini)
- Pro workstation with consumer GPU (RTX 4090, ~24 GB VRAM) → 14B to 24B in Q4_K_M (Mistral Small, Qwen 2.5 14B; Gemma 27B is a tight fit)
- Server with 1× H100 or equivalent (80 GB) → 70B-72B in Q4_K_M (Llama 3.3 70B, Qwen 72B)
- Multi-GPU cluster (2× H100 / B200) → MoE models (DeepSeek V3, Qwen 3 235B)
In 2026, renting H100 at a French sovereign host (Scaleway, OVHcloud, Outscale) costs between €2 and €4/hour depending on commitment. Purchase-wise, an H100 80 GB runs around €25-30k. A B200 (Blackwell) is significantly more expensive but multiplies throughput.
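The tree above can be approximated with one formula: weights take roughly `parameters × bits-per-weight / 8` bytes, plus 10-20% for KV cache and runtime buffers. A sketch, where the bits-per-weight figures are approximate llama.cpp averages rather than exact values:

```python
# Back-of-the-envelope VRAM sizing behind the decision tree above.
# Bits-per-weight values are approximate llama.cpp averages (assumption);
# real usage adds KV cache and buffers, folded into a rough overhead factor.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.70, "Q8_0": 8.50, "FP16": 16.0}

def estimate_vram_gb(params_billion: float, quant: str,
                     overhead: float = 1.15) -> float:
    """Weights-only estimate times a rough overhead factor for cache/buffers."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb * overhead, 1)

# Mistral 7B in Q4_K_M  -> ~5 GB  (laptop territory)
# Mistral Small 24B     -> ~17 GB (fits a 24 GB RTX 4090)
# Llama 3.3 70B         -> ~49 GB (comfortable on a single 80 GB H100)
```

This reproduces the article's rules of thumb and lets you sanity-check any new model announcement in ten seconds.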
4. Specific use case
- Generalist internal chatbot / RAG → Mistral Small, Llama 3.3, Qwen 32B (based on hardware)
- Code and development → Codestral or Qwen 2.5-Coder
- Pointed reasoning (analysis, planning) → DeepSeek R1 or distillations
- Multimodal (text + image) → Llama 3.2 Vision, Gemma 3 (multimodal variants), Qwen2-VL
- Edge / extreme constraints → Phi-4-mini, Ministral, Gemma 3 4B
5. Regulatory and geopolitical constraints
- Defense, sovereign government, critical infrastructure → Mistral (European) first, avoid Qwen and DeepSeek due to Chinese origin
- Subject to AI Act high-risk category → favor models with robust public documentation and identifiable origin team
- International client with US presence → Llama may naturally fit for stack alignment reasons
## The simple decision tree, in practice
If you had to decide in 30 seconds, here's what we'd advise:
"French client, large enterprise, generalist FR use case" → Mistral Small 3 or Llama 3.3 70B depending on hardware
"Sovereign or defense sector" → Mistral exclusively
"Complex reasoning project" → DeepSeek R1 (or distillation if hardware-constrained)
"Code development project" → Codestral or Qwen 2.5-Coder
"Multilingual project, 10+ languages" → Qwen 2.5 72B (best coverage/quality ratio)
"Edge deployment on user laptops" → Mistral 7B or Ministral
Not a universal matrix, just a starting point we refine against the project's specific constraints.
## How we do it at GettIA
Installing a local LLM at a client isn't checking a box in a table. It's a sequence we've honed on recent projects (RAG chatbot for Peps Digital, sovereign note-taker for a space industry player, others):
- Needs audit: primary use case, expected usage volume, language, GDPR constraints, AI Act, sector-specific. We come out with a shortlist of 2 to 3 candidate models.
- Benchmarks on your data: we run the candidates on a sample of your actual inputs (conversations, documents, business requests) and have business users compare output quality, not MMLU scores.
- Setup with appropriate quantization: based on your target hardware, we choose Q4_K_M, Q5, Q8 or FP16 to maximize the perf/VRAM ratio.
- Continuous evaluation pipeline: we set up reproducible test sets to catch regressions when you update the model in 6 months.
- Training your team: the model lives in your environment. Your devs must know how to redeploy it, update it, monitor it without us.
- Migration-friendly abstraction: we use abstraction layers (llama.cpp, vLLM with OpenAI-compatible adapters) so that if a better model ships later, you change one parameter.
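That last point, concretely: because the major local backends (vLLM, llama.cpp server, Ollama) all expose an OpenAI-compatible endpoint, the application never names a backend. A sketch, where the URL and model tags are illustrative assumptions:

```python
# Sketch of the migration-friendly abstraction: the app talks to an
# OpenAI-compatible endpoint, so swapping models is a one-line config change.
# The base_url and model tags here are illustrative assumptions.
import json
import urllib.request

CONFIG = {
    "base_url": "http://localhost:8000/v1",  # vLLM, llama.cpp server, Ollama...
    "model": "mistral-small",                # the one parameter you change later
}

def build_payload(model: str, prompt: str) -> dict:
    """OpenAI-style chat completion payload, backend-agnostic."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, config: dict = CONFIG) -> str:
    """Send the prompt to whatever backend the config points at."""
    data = json.dumps(build_payload(config["model"], prompt)).encode("utf-8")
    req = urllib.request.Request(f"{config['base_url']}/chat/completions",
                                 data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Migrating later: edit CONFIG["model"] (and maybe base_url) -- nothing else.
```

When a better model ships in six months, the application code doesn't move; only the config does.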
Got a local LLM project in the works? Book a slot, we block 30 minutes to understand your context (use case, hardware, sector, language) and point you to the right choice. Free consultation, we don't sell what doesn't help.