How to Build Hands On Large Language Models: The Production-Grade Blueprint
A deep dive into building Hands On Large Language Models. Full tech stack breakdown, implementation code, and scaling strategies for enterprise use cases.
Building Hands On Large Language Models in a production context doesn’t mean writing a Transformer from scratch in NumPy (unless you are doing research). In 2026, “Hands On” engineering means owning the Fine-Tuning and Serving pipeline.
While RAG handles knowledge, Fine-Tuning handles behavior. If you rely solely on Prompt Engineering for complex formatting, JSON adherence, or domain-specific reasoning, your system is brittle.
This blueprint covers the “Zero-to-Production” architecture for training a custom LoRA (Low-Rank Adaptation) adapter and serving it with high throughput.
🏗️ The Architecture
We are moving beyond API wrappers. We are modifying the model weights themselves.
- Data Ingestion: Raw domain data is converted into Instruction format (System/User/Assistant); see the JSONL sketch after this list.
- Efficient Training (QLoRA): We use 4-bit quantization to fine-tune a 70B+ model on a single GPU.
- Adapter Management: We do not merge weights permanently. We save lightweight Adapter layers (~100MB).
- Multi-LoRA Serving: A single vLLM instance loads the Base Model once and dynamically swaps Adapters per request.
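For concreteness, one record of that Instruction format could look like the following. This is a minimal sketch: the <|...|> tags, file name, and content are illustrative placeholders, not a specific model's template.

```python
# A minimal sketch of one training record in the instruction format the
# pipeline below consumes. The <|...|> tags are illustrative placeholders;
# in practice you render them with the base model's own chat template.
import json

record = {
    "text": (
        "<|system|>You are a financial analysis assistant.<|end|>"
        "<|user|>Summarize the Q3 revenue drivers from this filing: ...<|end|>"
        "<|assistant|>Revenue grew 12% quarter over quarter, driven by ...<|end|>"
    )
}

# One JSON object per line -> train_data.jsonl
with open("train_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```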
🛠️ The Stack
- Training Framework: Unsloth (2x faster, 60% less memory than raw PyTorch) + Hugging Face TRL.
- Serving Engine: vLLM (Production standard for high-throughput inference).
- Tracking: WandB (Weights & Biases) for experiment tracking.
- Infrastructure: NVIDIA A100 or H100 (Cloud or On-Prem).
💻 Implementation
This code is split into two critical components: the Training Pipeline and the Production Server.
Part 1: The Training Pipeline (train.py)
This script handles the fine-tuning. It uses Unsloth to patch the Llama-3 architecture for maximum speed.
```python
# train.py
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Configuration - Centralized control for production reproducibility
CONFIG = {
    "model_id": "unsloth/llama-3-8b-bnb-4bit",  # Pre-quantized for memory efficiency
    "max_seq_length": 2048,
    "lora_r": 16,        # Rank: higher = more parameters to train, but slower
    "lora_alpha": 16,
    "output_dir": "./production_adapters/finance_v1",
    "dataset_path": "your_org/financial_reports_instruct",
}

def train_adapter():
    # 2. Load Model with Unsloth Optimizations
    # This handles RoPE scaling and memory mapping automatically
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=CONFIG["model_id"],
        max_seq_length=CONFIG["max_seq_length"],
        dtype=None,        # Auto-detect (Float16/Bfloat16)
        load_in_4bit=True,
    )
    # 3. Add LoRA Adapters
    # We only train specific modules to keep the adapter small (~50MB)
    model = FastLanguageModel.get_peft_model(
        model,
        r=CONFIG["lora_r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=CONFIG["lora_alpha"],
        lora_dropout=0,  # 0 is optimized for Unsloth
        bias="none",
        use_gradient_checkpointing="unsloth",  # Critical for VRAM savings
    )

    # 4. Data Preparation (Mock Example)
    # In production, load from S3 or a Feature Store
    # Format: {"text": "<|system|>...<|user|>...<|assistant|>..."}
    dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

    # 5. The Trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=CONFIG["max_seq_length"],
        dataset_num_proc=2,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            max_steps=60,  # Replace with num_train_epochs in a real run
            learning_rate=2e-4,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=1,
            output_dir=CONFIG["output_dir"],
            optim="adamw_8bit",  # 8-bit optimizer saves massive memory
            report_to="wandb",   # MANDATORY for production observability
        ),
    )

    print("🚀 Starting Training...")
    trainer.train()

    # 6. Save ONLY the Adapter
    # Do not save the full merged model if you plan to use Multi-LoRA serving
    model.save_pretrained(CONFIG["output_dir"])
    tokenizer.save_pretrained(CONFIG["output_dir"])
    print(f"✅ Adapter saved to {CONFIG['output_dir']}")

if __name__ == "__main__":
    train_adapter()
```
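The mock dataset above expects the text field to already be rendered in the model's chat format. A minimal preprocessing sketch of that step follows, assuming raw records with hypothetical instruction and response fields and a tokenizer that ships a chat template (instruct variants do); adjust both to your own schema and model.

```python
# format_data.py -- preprocessing sketch (not part of the pipeline above).
# Assumes raw records like {"instruction": ..., "response": ...}; field names
# and file paths are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
SYSTEM_PROMPT = "You are a financial analysis assistant."

def to_text(example):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # Render the conversation with the model's own chat template
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

raw = load_dataset("json", data_files="raw_records.jsonl", split="train")
formatted = raw.map(to_text, remove_columns=raw.column_names)
formatted.to_json("train_data.jsonl")  # JSON Lines with a single "text" column
```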
Part 2: The Production Server (serve.py)
We don’t use Flask. We use vLLM’s Async Engine directly or its OpenAI-compatible server. A minimal client sketch for the OpenAI-compatible route comes first; below it is a robust implementation that embeds the Async Engine to serve the base model and dynamically load the adapter we just trained.
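The client side of the OpenAI-compatible route is just the standard openai SDK pointed at your inference node. This sketch assumes the server was launched with LoRA enabled and the adapter registered under the name finance_v1 (for example via vLLM's --lora-modules option); those names are illustrative.

```python
# Sketch: calling a vLLM OpenAI-compatible endpoint instead of embedding the engine.
# Assumes the server exposes the adapter under the model name "finance_v1".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="finance_v1",  # adapter name as registered on the server
    prompt="Summarize the key risks in this 10-K excerpt: ...",
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].text)
```

The embedded-engine implementation below gives finer control over request IDs, token metrics, and per-request adapter routing.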
```python
# serve.py
import os

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.lora.request import LoRARequest

# 1. Server Configuration
app = FastAPI(title="Enterprise LLM Inference Node")
BASE_MODEL = "unsloth/llama-3-8b-bnb-4bit"
ADAPTER_PATH = "./production_adapters/finance_v1"

# 2. Initialize vLLM Engine
# enable_lora=True is the critical flag here
engine_args = AsyncEngineArgs(
    model=BASE_MODEL,
    enable_lora=True,
    max_loras=4,                  # Max concurrent adapters
    max_lora_rank=16,
    gpu_memory_utilization=0.85,  # Reserve buffer for overhead
    max_model_len=4096,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

class InferenceRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 512
    use_adapter: bool = True

@app.post("/generate")
async def generate(request: InferenceRequest):
    request_id = str(os.urandom(8).hex())
    # 3. Define Sampling Parameters
    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    # 4. Dynamic LoRA Loading
    # If use_adapter is True, we attach the LoRA request.
    # If False, it uses the base model (vanilla Llama-3).
    lora_request = None
    if request.use_adapter:
        # In a real system, 'lora_int_id' maps to a customer ID or task ID
        lora_request = LoRARequest("finance_adapter", 1, ADAPTER_PATH)

    try:
        # 5. Inference
        # We iterate over the async generator
        results_generator = engine.generate(
            request.prompt,
            sampling_params,
            request_id=request_id,
            lora_request=lora_request,
        )

        # Non-streaming response for simplicity (use SSE for production streaming)
        final_output = None
        async for request_output in results_generator:
            final_output = request_output

        return {
            "id": request_id,
            "text": final_output.outputs[0].text,
            "metrics": {
                "input_tokens": len(final_output.prompt_token_ids),
                "output_tokens": len(final_output.outputs[0].token_ids),
            },
        }
    except Exception as e:
        # Log to Sentry/Datadog here
        print(f"Error: {e}")
        raise HTTPException(status_code=500, detail="Inference failed")

if __name__ == "__main__":
    # workers=1 because vLLM handles internal concurrency
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
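The /generate handler above buffers the full completion before responding. A streaming variant is a small change: wrap the same async generator in a Server-Sent Events response. A minimal sketch, assuming the engine, InferenceRequest, and adapter setup from serve.py are already in scope:

```python
# Sketch: streaming variant of /generate using Server-Sent Events.
# Extends serve.py: assumes `app`, `engine`, `InferenceRequest`, `LoRARequest`,
# `SamplingParams`, `ADAPTER_PATH`, and `os` are already defined/imported there.
import json
from fastapi.responses import StreamingResponse

@app.post("/generate/stream")
async def generate_stream(request: InferenceRequest):
    request_id = str(os.urandom(8).hex())
    sampling_params = SamplingParams(
        temperature=request.temperature, max_tokens=request.max_tokens
    )
    lora_request = (
        LoRARequest("finance_adapter", 1, ADAPTER_PATH) if request.use_adapter else None
    )

    async def event_stream():
        sent = 0
        # vLLM yields cumulative outputs; emit only the new delta each time
        async for output in engine.generate(
            request.prompt, sampling_params,
            request_id=request_id, lora_request=lora_request,
        ):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                yield f"data: {json.dumps({'delta': delta})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```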
⚠️ Production Pitfalls (The “Senior” Perspective)
When you take this from a Colab notebook to a Kubernetes cluster, here is what breaks:
1. The “Catastrophic Forgetting” Trap
Problem: Your fine-tuned model becomes great at your specific task (e.g., SQL generation) but forgets how to speak English properly.
Fix: Always include a small percentage (10-20%) of general-purpose instruction data in your fine-tuning dataset to maintain baseline capabilities, as in the sketch below.
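A minimal sketch of that blend using Hugging Face datasets' interleave_datasets; the general_instruct.jsonl path is a placeholder for whatever general-purpose instruction set you choose.

```python
# Sketch: blend ~15% general-purpose instructions into the domain dataset.
# File paths are placeholders for your own sources.
from datasets import load_dataset, interleave_datasets

domain = load_dataset("json", data_files="train_data.jsonl", split="train")
general = load_dataset("json", data_files="general_instruct.jsonl", split="train")

mixed = interleave_datasets(
    [domain, general],
    probabilities=[0.85, 0.15],   # ~15% general data to preserve base behavior
    seed=42,
    stopping_strategy="all_exhausted",
)
```

Interleaving (rather than concatenating) keeps the general examples spread across the whole run instead of clustered at one end of training.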
2. LoRA Rank Misconfiguration
Problem: Setting lora_r=64 thinking “more is better”.
Fix: For most text tasks, r=8 or r=16 is sufficient. Higher ranks increase VRAM usage during training and serving without proportional accuracy gains. Only increase Rank if the domain is radically different from the base model (e.g., teaching Llama to speak a new language).
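A quick back-of-the-envelope check makes the cost concrete: each adapted projection adds roughly r * (d_in + d_out) trainable parameters, so adapter size grows linearly with rank. A sketch using Llama-3-8B-style dimensions (illustrative estimate, not a measured benchmark):

```python
# Rough adapter-size estimate for a Llama-3-8B-style model.
# Each LoRA pair adds A (d_in x r) + B (r x d_out) parameters per target module.
hidden = 4096          # model hidden size
intermediate = 14336   # MLP intermediate size
kv_dim = 1024          # grouped-query K/V projection width (8 KV heads x 128)
layers = 32

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

def adapter_params(r):
    per_layer = (
        lora_params(hidden, hidden, r)              # q_proj
        + 2 * lora_params(hidden, kv_dim, r)        # k_proj, v_proj
        + lora_params(hidden, hidden, r)            # o_proj
        + 2 * lora_params(hidden, intermediate, r)  # gate_proj, up_proj
        + lora_params(intermediate, hidden, r)      # down_proj
    )
    return per_layer * layers

for r in (8, 16, 64):
    params = adapter_params(r)
    print(f"r={r:>2}: ~{params/1e6:.1f}M params, ~{params*2/1e6:.0f} MB in fp16")
```

At r=16 this lands around 42M parameters (roughly 80-85 MB in fp16), in line with the lightweight-adapter figures quoted above; r=64 quadruples that for little gain on most text tasks.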
3. Inference Cold Starts
Problem: Loading a new LoRA adapter from disk for every request adds 500ms+ latency.
Fix: Use vLLM’s Multi-LoRA capability (shown above). It keeps the base model in VRAM and swaps only the tiny adapter weights in the attention layers. This allows you to serve 10 different fine-tuned models from ONE GPU.
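Operationally, Multi-LoRA routing is just a lookup from tenant or task to a LoRARequest with a unique integer ID. A minimal sketch, with placeholder adapter names and paths:

```python
# Sketch: route requests to different adapters on one vLLM engine.
# Paths and names are placeholders; lora_int_id must be unique per adapter
# and the adapter count must stay within the engine's max_loras setting.
from vllm.lora.request import LoRARequest

ADAPTERS = {
    "finance": LoRARequest("finance_adapter", 1, "./production_adapters/finance_v1"),
    "legal":   LoRARequest("legal_adapter",   2, "./production_adapters/legal_v1"),
    "support": LoRARequest("support_adapter", 3, "./production_adapters/support_v1"),
}

def resolve_adapter(task: str):
    # Returns None for unknown tasks, which falls back to the base model
    return ADAPTERS.get(task)
```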
4. Evaluation Blindness
Problem: Deploying based on “it feels right.”
Fix: You need an evaluation pipeline. Before promoting an adapter to production, run it against a “Golden Dataset” of 100 Q&A pairs and measure semantic similarity (using a judge model like GPT-4) against the expected answers.
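A minimal sketch of such a gate, assuming a golden_set.jsonl of prompt/expected pairs, the /generate endpoint from serve.py, and an OpenAI-hosted judge model; all file names, the endpoint URL, and the judge model are placeholders.

```python
# eval_gate.py -- sketch of a pre-deployment evaluation gate.
# Assumes golden_set.jsonl rows like {"prompt": ..., "expected": ...} and that
# the candidate adapter is live behind the /generate endpoint from serve.py.
import json
import requests
from openai import OpenAI

judge = OpenAI()  # judge credentials assumed to be configured via environment

JUDGE_PROMPT = (
    "Score 1 if the candidate answer is semantically equivalent to the "
    "expected answer, else 0.\nExpected: {expected}\nCandidate: {candidate}\n"
    "Reply with only 0 or 1."
)

def eval_adapter(threshold: float = 0.9) -> bool:
    with open("golden_set.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    passed = 0
    for row in rows:
        candidate = requests.post(
            "http://localhost:8000/generate",
            json={"prompt": row["prompt"], "temperature": 0.0},
            timeout=120,
        ).json()["text"]
        verdict = judge.chat.completions.create(
            model="gpt-4o",  # judge model; swap for your preferred grader
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                expected=row["expected"], candidate=candidate)}],
        ).choices[0].message.content.strip()
        passed += verdict == "1"
    score = passed / len(rows)
    print(f"Golden set pass rate: {score:.0%}")
    return score >= threshold  # only promote the adapter above this bar
```

Wire this into CI so an adapter cannot be promoted unless the gate returns True.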
🚀 Final Verdict
This architecture gives you the control of custom models with the economics of shared infrastructure.
- MVP: Use OpenAI Fine-tuning.
- Scale: Use the Unsloth + vLLM pipeline described above.
You are now managing weights, not just prompts. That is true “Hands On” LLM engineering.