Fine-Tuning Local LLMs with QLoRA: From Experiment to Production GGUF

25 February 2026 · 4 min read · CREATIAX
QLoRA LoRA GGUF vLLM Qwen Unsloth RunPod Fine-Tuning

There is no shortage of fine-tuning tutorials. Most show you how to get a model running on a toy dataset. Few explain what actually changes when you need the result to work reliably on production data with specific domain requirements.

This is the pipeline we use — covering dataset curation, QLoRA training on a single A100, evaluation, quantisation, and serving with vLLM or Ollama.

When Fine-Tuning Is Worth It

Fine-tuning is not always the right answer. Before committing to it, validate that the use case cannot be solved with:

  • Better prompting: structured prompts with few-shot examples cover most formatting and style requirements
  • RAG: if the need is domain knowledge (retrieving facts from documents), retrieval beats baking knowledge into weights
  • System prompt + tool use: for specialised behaviour in a multi-turn context

Fine-tuning earns its cost when you need: (1) consistent output format that prompting cannot reliably produce, (2) domain-specific reasoning patterns that are not in the base model, or (3) response style that cannot be achieved through prompting at acceptable latency.

Dataset Curation Is 80% of the Work

The quality of a fine-tuned model is almost entirely determined by dataset quality. This is not a figure of speech.

What makes a good fine-tuning dataset:

  • 500-5,000 high-quality examples for instruction fine-tuning (more is not always better if quality drops)
  • Consistent format: every example in the same instruction/response schema
  • Representative of the actual production distribution — not just the easy cases
  • Negative examples: include cases where the correct response is “I cannot answer this” or a refusal
  • Deduplication: even small datasets often contain near-duplicates that bias training
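The deduplication point is easy to act on at this scale. A minimal sketch for flagging near-duplicates with stdlib `difflib` — the `instruction`/`response` field names are an assumed schema, and the O(n²) pairwise comparison is only viable for datasets of a few thousand examples (use MinHash/LSH beyond that):

```python
from difflib import SequenceMatcher

def near_duplicates(examples, threshold=0.9):
    """Return index pairs (i, j) of examples whose combined
    instruction + response text overlaps above `threshold`.
    O(n^2) pairwise comparison - fine for a few thousand examples."""
    texts = [ex["instruction"] + " " + ex["response"] for ex in examples]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            # ratio() is 2*matches / total length, in [0, 1]
            if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs
```

Review flagged pairs by hand rather than deleting blindly: two examples can be textually similar but teach different behaviours.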

Common mistakes:

  • Generating training data with a larger model and using it directly. The student learns the teacher’s errors as well as its patterns. Review and filter synthetically generated data.
  • Training only on positive examples. The model learns what to say but not when to decline.
  • Not splitting by time or source. If your val set has the same documents as your training set, your evaluation metrics are optimistic.
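The time-based split from the last point can be sketched in a few lines — the `created` field is an assumed part of the example schema:

```python
from datetime import date

def split_by_time(examples, cutoff):
    """Time-based train/val split: everything created before `cutoff`
    trains, everything on or after it validates. Avoids leakage when
    the same source document yields several examples close together
    in time."""
    train = [ex for ex in examples if ex["created"] < cutoff]
    val = [ex for ex in examples if ex["created"] >= cutoff]
    return train, val
```

The same idea applies to splitting by source: group examples by document ID first, then assign whole groups to train or validation so no document straddles the boundary.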

The QLoRA Training Setup

For most enterprise fine-tuning tasks, QLoRA (Quantised Low-Rank Adaptation) on a 7B-14B base model hits the sweet spot of capability and training cost. The full model stays frozen in 4-bit quantisation; only the small LoRA adapter matrices are trained.

Hardware: A single 80GB A100 on RunPod handles a 14B model with QLoRA at comfortable batch sizes. For 7B models, a 40GB A100 is sufficient.

Recommended stack: Unsloth + HuggingFace Transformers + TRL. Unsloth’s memory optimisations give 2-4x training speed over vanilla PEFT with no accuracy cost.

Key hyperparameters that actually matter:

  • r (LoRA rank): 16 for most tasks, 64 for complex reasoning tasks. Higher rank = more parameters = more expressive but more prone to overfitting on small datasets.
  • learning_rate: 2e-4 is a safe starting point. Lower (1e-4) if the base model is already close; higher (3e-4) only if the task is very different from pretraining.
  • num_train_epochs: Watch the validation loss. Stop when it stops decreasing — typically 1-3 epochs on a well-curated dataset. Beyond that you are memorising, not generalising.
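To see why rank drives capacity, the trainable-parameter arithmetic is worth doing once. A rough back-of-envelope calculator — it assumes square hidden × hidden target modules (the four attention projections of a typical 7B-class model) and ignores grouped-query attention and the rectangular MLP matrices, so treat the numbers as order-of-magnitude only:

```python
def lora_trainable_params(hidden_size, rank, num_layers, targets_per_layer):
    """Trainable LoRA parameters for square (hidden x hidden) target
    modules: each adapted matrix gains A (hidden x r) and B (r x hidden),
    i.e. 2 * hidden * r parameters per module."""
    per_module = 2 * hidden_size * rank
    return per_module * targets_per_layer * num_layers
```

For hidden size 4096, 32 layers, and 4 attention projections per layer, r=16 gives about 16.8M trainable parameters; r=64 quadruples that — tiny next to 7B frozen weights, but four times the capacity to overfit a small dataset.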

Evaluation Before Deployment

A fine-tuned model that scores well on your eval set but fails on real inputs is the most common post-deployment surprise.

Minimum evaluation requirements:

  1. Held-out test set: 10-15% of original data, never seen during training
  2. Out-of-distribution examples: inputs that are valid but not represented in your training data
  3. Adversarial inputs: attempts to elicit the wrong format, boundary cases, and inputs the model should decline
  4. Regression on general capability: run the fine-tuned model on a standard benchmark subset (MMLU, HellaSwag) to verify you have not degraded general reasoning. Any drop of more than 2-3 points warrants investigation.
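Requirements 1-3 reduce to a small scoring loop. A minimal sketch, assuming a JSON-output task and a refusal phrase like the one suggested in the dataset section — both are illustrative choices, not a fixed API:

```python
import json

def evaluate(outputs, references):
    """Score model outputs on two axes: strict format validity
    (here: parseable JSON) and refusal recall on cases the model
    should decline. `references` carry a `should_decline` flag."""
    valid = 0
    declines_expected = declines_correct = 0
    for out, ref in zip(outputs, references):
        try:
            json.loads(out)
            valid += 1
        except ValueError:
            pass
        if ref.get("should_decline"):
            declines_expected += 1
            if "cannot answer" in out.lower():
                declines_correct += 1
    return {
        "format_valid": valid / len(outputs),
        "decline_recall": declines_correct / max(declines_expected, 1),
    }
```

Track both numbers across checkpoints: format validity usually improves quickly, while refusal behaviour is the first thing to regress when training runs too long.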

Quantisation and Export to GGUF

Once the adapter is trained and merged with the base model, you need to export to a format your inference server understands.

For llama.cpp / Ollama: GGUF format. Export with llama.cpp's convert_hf_to_gguf.py (the script formerly named convert.py), then quantise with llama-quantize. Q4_K_XL (Unsloth Dynamic Quantisation variant) gives the best quality-to-size tradeoff at 4-bit. Q5_K_M is worth the extra VRAM if you are on a 24GB consumer GPU.
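The export is two commands. A sketch that builds the argv lists for `subprocess.run` — script and binary names match recent llama.cpp checkouts but may differ in older ones, the paths are assumptions, and note that plain llama-quantize produces the standard K-quants (Q4_K_M, Q5_K_M); the Unsloth dynamic variants come from Unsloth's own export tooling:

```python
def gguf_export_cmds(model_dir, out_base, quant="Q4_K_M"):
    """Build the two llama.cpp commands for HF -> GGUF export:
    convert the merged model to an FP16 GGUF, then quantise it.
    Returns two argv lists suitable for subprocess.run."""
    f16 = f"{out_base}-f16.gguf"
    convert = ["python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
               "--outfile", f16, "--outtype", "f16"]
    quantise = ["llama.cpp/llama-quantize", f16,
                f"{out_base}-{quant}.gguf", quant]
    return convert, quantise
```

Converting to FP16 first, then quantising, keeps one full-precision GGUF around so you can produce multiple quant levels without re-running the (slow) conversion step.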

For vLLM: GGUF is not supported — use AWQ or GPTQ quantisation instead, or run the merged model in FP16/BF16 if VRAM permits.

Serving: For production, vLLM with an OpenAI-compatible API endpoint is the standard. Set --max-model-len based on your actual use case — do not default to the model's maximum context length. For concurrent users, set --gpu-memory-utilization to 0.85-0.90 and benchmark with your actual concurrency profile.
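Keeping the launch flags in code makes them reviewable and testable. A sketch that assembles the `vllm serve` invocation with the two flags above — the model path and values are illustrative starting points, not recommendations for any particular deployment:

```python
def vllm_serve_args(model_path, max_model_len=4096, gpu_mem=0.90):
    """Argv for an OpenAI-compatible vLLM server. max_model_len is
    sized to the use case, not the model's maximum context;
    gpu-memory-utilization leaves headroom for CUDA overheads."""
    return [
        "vllm", "serve", model_path,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_mem),
    ]
```

Note the flag is spelled --gpu-memory-utilization (US spelling) regardless of the prose convention used elsewhere.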

The fine-tuned, quantised model running on your own hardware is the end state: no per-token API costs, a full audit trail, and no dependency on external APIs.

CREATIAX Engineering
Agentic AI · Private LLMs · MLSecOps