How to Build Hands On Large Language Models: The Production-Grade Blueprint
A deep dive into building Hands On Large Language Models. Full tech stack breakdown, implementation code, and scaling strategies for enterprise use cases.
Building Hands On Large Language Models in a production context doesn’t mean writing a Transformer from scratch in NumPy (unless you are doing research). In 2026, “Hands On” engineering means owning the Fine-Tuning and Serving pipeline.
While RAG handles knowledge, Fine-Tuning handles behavior. If you rely solely on Prompt Engineering for complex formatting, JSON adherence, or domain-specific reasoning, your system is brittle.
This blueprint covers the “Zero-to-Production” architecture for training a custom LoRA (Low-Rank Adaptation) adapter and serving it with high throughput.
🏗️ The Architecture
We are moving beyond API wrappers. We are modifying the model weights themselves.
- Data Ingestion: Raw domain data is converted into Instruction format (System/User/Assistant); see the JSONL sketch after this list.
- Efficient Training (QLoRA): We use 4-bit quantization to fine-tune a 70B+ model on a single GPU.
- Adapter Management: We do not merge weights permanently. We save lightweight Adapter layers (~100MB).
- Multi-LoRA Serving: A single vLLM instance loads the Base Model once and dynamically swaps Adapters per request.
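For concreteness, one record of that Instruction format could look like the following. This is a minimal sketch: the <|...|> tags, file name, and content are illustrative placeholders, not a specific model's template.

```python
# A minimal sketch of one training record in the instruction format the
# pipeline below consumes. The <|...|> tags are illustrative placeholders;
# in practice you render them with the base model's own chat template.
import json

record = {
    "text": (
        "<|system|>You are a financial analysis assistant.<|end|>"
        "<|user|>Summarize the Q3 revenue drivers from this filing: ...<|end|>"
        "<|assistant|>Revenue grew 12% quarter over quarter, driven by ...<|end|>"
    )
}

# One JSON object per line -> train_data.jsonl
with open("train_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```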
🛠️ The Stack
- Training Framework: Unsloth (2x faster, 60% less memory than raw PyTorch) + Hugging Face TRL.
- Serving Engine: vLLM (Production standard for high-throughput inference).
- Tracking: WandB (Weights & Biases) for experiment tracking.
- Infrastructure: NVIDIA A100 or H100 (Cloud or On-Prem).
💻 Implementation
This code is split into two critical components: the Training Pipeline and the Production Server.
Part 1: The Training Pipeline (train.py)
This script handles the fine-tuning. It uses Unsloth to patch the Llama-3 architecture for maximum speed.
```python
# train.py
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Configuration - Centralized control for production reproducibility
CONFIG = {
    "model_id": "unsloth/llama-3-8b-bnb-4bit",  # Pre-quantized for memory efficiency
    "max_seq_length": 2048,
    "lora_r": 16,        # Rank: higher = more parameters to train, but slower
    "lora_alpha": 16,
    "output_dir": "./production_adapters/finance_v1",
    "dataset_path": "your_org/financial_reports_instruct",
}

def train_adapter():
    # 2. Load Model with Unsloth Optimizations
    # This handles RoPE scaling and memory mapping automatically
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=CONFIG["model_id"],
        max_seq_length=CONFIG["max_seq_length"],
        dtype=None,        # Auto-detect (Float16/Bfloat16)
        load_in_4bit=True,
    )
    # 3. Add LoRA Adapters
    # We only train specific modules to keep the adapter small (~50MB)
    model = FastLanguageModel.get_peft_model(
        model,
        r=CONFIG["lora_r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=CONFIG["lora_alpha"],
        lora_dropout=0,  # 0 is optimized for Unsloth
        bias="none",
        use_gradient_checkpointing="unsloth",  # Critical for VRAM savings
    )

    # 4. Data Preparation (Mock Example)
    # In production, load from S3 or a Feature Store
    # Format: {"text": "<|system|>...<|user|>...<|assistant|>..."}
    dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

    # 5. The Trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=CONFIG["max_seq_length"],
        dataset_num_proc=2,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            max_steps=60,  # Replace with num_train_epochs in a real run
            learning_rate=2e-4,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=1,
            output_dir=CONFIG["output_dir"],
            optim="adamw_8bit",  # 8-bit optimizer saves massive memory
            report_to="wandb",   # MANDATORY for production observability
        ),
    )

    print("🚀 Starting Training...")
    trainer.train()

    # 6. Save ONLY the Adapter
    # Do not save the full merged model if you plan to use Multi-LoRA serving
    model.save_pretrained(CONFIG["output_dir"])
    tokenizer.save_pretrained(CONFIG["output_dir"])
    print(f"✅ Adapter saved to {CONFIG['output_dir']}")

if __name__ == "__main__":
    train_adapter()
```
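The mock dataset above expects the text field to already be rendered in the model's chat format. A minimal preprocessing sketch of that step follows, assuming raw records with hypothetical instruction and response fields and a tokenizer that ships a chat template (instruct variants do); adjust both to your own schema and model.

```python
# format_data.py -- preprocessing sketch (not part of the pipeline above).
# Assumes raw records like {"instruction": ..., "response": ...}; field names
# and file paths are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
SYSTEM_PROMPT = "You are a financial analysis assistant."

def to_text(example):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # Render the conversation with the model's own chat template
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

raw = load_dataset("json", data_files="raw_records.jsonl", split="train")
formatted = raw.map(to_text, remove_columns=raw.column_names)
formatted.to_json("train_data.jsonl")  # JSON Lines with a single "text" column
```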
Part 2: The Production Server (serve.py)
We don’t use Flask. We use vLLM’s Async Engine directly or its OpenAI-compatible server. A minimal client sketch for the OpenAI-compatible route comes first; below it is a robust implementation that embeds the Async Engine to serve the base model and dynamically load the adapter we just trained.
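The client side of the OpenAI-compatible route is just the standard openai SDK pointed at your inference node. This sketch assumes the server was launched with LoRA enabled and the adapter registered under the name finance_v1 (for example via vLLM's --lora-modules option); those names are illustrative.

```python
# Sketch: calling a vLLM OpenAI-compatible endpoint instead of embedding the engine.
# Assumes the server exposes the adapter under the model name "finance_v1".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="finance_v1",  # adapter name as registered on the server
    prompt="Summarize the key risks in this 10-K excerpt: ...",
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].text)
```

The embedded-engine implementation below gives finer control over request IDs, token metrics, and per-request adapter routing.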
```python
# serve.py
import os

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.lora.request import LoRARequest

# 1. Server Configuration
app = FastAPI(title="Enterprise LLM Inference Node")
BASE_MODEL = "unsloth/llama-3-8b-bnb-4bit"
ADAPTER_PATH = "./production_adapters/finance_v1"

# 2. Initialize vLLM Engine
# enable_lora=True is the critical flag here
engine_args = AsyncEngineArgs(
    model=BASE_MODEL,
    enable_lora=True,
    max_loras=4,                  # Max concurrent adapters
    max_lora_rank=16,
    gpu_memory_utilization=0.85,  # Reserve buffer for overhead
    max_model_len=4096,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

class InferenceRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 512
    use_adapter: bool = True

@app.post("/generate")
async def generate(request: InferenceRequest):
    request_id = str(os.urandom(8).hex())
    # 3. Define Sampling Parameters
    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    # 4. Dynamic LoRA Loading
    # If use_adapter is True, we attach the LoRA request.
    # If False, it uses the base model (vanilla Llama-3).
    lora_request = None
    if request.use_adapter:
        # In a real system, 'lora_int_id' maps to a customer ID or task ID
        lora_request = LoRARequest("finance_adapter", 1, ADAPTER_PATH)

    try:
        # 5. Inference
        # We iterate over the async generator
        results_generator = engine.generate(
            request.prompt,
            sampling_params,
            request_id=request_id,
            lora_request=lora_request,
        )

        # Non-streaming response for simplicity (use SSE for production streaming)
        final_output = None
        async for request_output in results_generator:
            final_output = request_output

        return {
            "id": request_id,
            "text": final_output.outputs[0].text,
            "metrics": {
                "input_tokens": len(final_output.prompt_token_ids),
                "output_tokens": len(final_output.outputs[0].token_ids),
            },
        }
    except Exception as e:
        # Log to Sentry/Datadog here
        print(f"Error: {e}")
        raise HTTPException(status_code=500, detail="Inference failed")

if __name__ == "__main__":
    # workers=1 because vLLM handles internal concurrency
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
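The /generate handler above buffers the full completion before responding. A streaming variant is a small change: wrap the same async generator in a Server-Sent Events response. A minimal sketch, assuming the engine, InferenceRequest, and adapter setup from serve.py are already in scope:

```python
# Sketch: streaming variant of /generate using Server-Sent Events.
# Extends serve.py: assumes `app`, `engine`, `InferenceRequest`, `LoRARequest`,
# `SamplingParams`, `ADAPTER_PATH`, and `os` are already defined/imported there.
import json
from fastapi.responses import StreamingResponse

@app.post("/generate/stream")
async def generate_stream(request: InferenceRequest):
    request_id = str(os.urandom(8).hex())
    sampling_params = SamplingParams(
        temperature=request.temperature, max_tokens=request.max_tokens
    )
    lora_request = (
        LoRARequest("finance_adapter", 1, ADAPTER_PATH) if request.use_adapter else None
    )

    async def event_stream():
        sent = 0
        # vLLM yields cumulative outputs; emit only the new delta each time
        async for output in engine.generate(
            request.prompt, sampling_params,
            request_id=request_id, lora_request=lora_request,
        ):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                yield f"data: {json.dumps({'delta': delta})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```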
⚠️ Production Pitfalls (The “Senior” Perspective)
When you take this from a Colab notebook to a Kubernetes cluster, here is what breaks:
1. The “Catastrophic Forgetting” Trap
Problem: Your fine-tuned model becomes great at your specific task (e.g., SQL generation) but forgets how to speak English properly.
Fix: Always include a small percentage (10-20%) of general-purpose instruction data in your fine-tuning dataset to maintain baseline capabilities, as in the sketch below.
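A minimal sketch of that blend using Hugging Face datasets' interleave_datasets; the general_instruct.jsonl path is a placeholder for whatever general-purpose instruction set you choose.

```python
# Sketch: blend ~15% general-purpose instructions into the domain dataset.
# File paths are placeholders for your own sources.
from datasets import load_dataset, interleave_datasets

domain = load_dataset("json", data_files="train_data.jsonl", split="train")
general = load_dataset("json", data_files="general_instruct.jsonl", split="train")

mixed = interleave_datasets(
    [domain, general],
    probabilities=[0.85, 0.15],   # ~15% general data to preserve base behavior
    seed=42,
    stopping_strategy="all_exhausted",
)
```

Interleaving (rather than concatenating) keeps the general examples spread across the whole run instead of clustered at one end of training.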
2. LoRA Rank Misconfiguration
Problem: Setting lora_r=64 thinking “more is better”.
Fix: For most text tasks, r=8 or r=16 is sufficient. Higher ranks increase VRAM usage during training and serving without proportional accuracy gains. Only increase Rank if the domain is radically different from the base model (e.g., teaching Llama to speak a new language).
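A quick back-of-the-envelope check makes the cost concrete: each adapted projection adds roughly r * (d_in + d_out) trainable parameters, so adapter size grows linearly with rank. A sketch using Llama-3-8B-style dimensions (illustrative estimate, not a measured benchmark):

```python
# Rough adapter-size estimate for a Llama-3-8B-style model.
# Each LoRA pair adds A (d_in x r) + B (r x d_out) parameters per target module.
hidden = 4096          # model hidden size
intermediate = 14336   # MLP intermediate size
kv_dim = 1024          # grouped-query K/V projection width (8 KV heads x 128)
layers = 32

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

def adapter_params(r):
    per_layer = (
        lora_params(hidden, hidden, r)              # q_proj
        + 2 * lora_params(hidden, kv_dim, r)        # k_proj, v_proj
        + lora_params(hidden, hidden, r)            # o_proj
        + 2 * lora_params(hidden, intermediate, r)  # gate_proj, up_proj
        + lora_params(intermediate, hidden, r)      # down_proj
    )
    return per_layer * layers

for r in (8, 16, 64):
    params = adapter_params(r)
    print(f"r={r:>2}: ~{params/1e6:.1f}M params, ~{params*2/1e6:.0f} MB in fp16")
```

At r=16 this lands around 42M parameters (roughly 80-85 MB in fp16), in line with the lightweight-adapter figures quoted above; r=64 quadruples that for little gain on most text tasks.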
3. Inference Cold Starts
Problem: Loading a new LoRA adapter from disk for every request adds 500ms+ latency.
Fix: Use vLLM’s Multi-LoRA capability (shown above). It keeps the base model in VRAM and swaps only the tiny adapter weights in the attention layers. This allows you to serve 10 different fine-tuned models from ONE GPU.
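Operationally, Multi-LoRA routing is just a lookup from tenant or task to a LoRARequest with a unique integer ID. A minimal sketch, with placeholder adapter names and paths:

```python
# Sketch: route requests to different adapters on one vLLM engine.
# Paths and names are placeholders; lora_int_id must be unique per adapter
# and the adapter count must stay within the engine's max_loras setting.
from vllm.lora.request import LoRARequest

ADAPTERS = {
    "finance": LoRARequest("finance_adapter", 1, "./production_adapters/finance_v1"),
    "legal":   LoRARequest("legal_adapter",   2, "./production_adapters/legal_v1"),
    "support": LoRARequest("support_adapter", 3, "./production_adapters/support_v1"),
}

def resolve_adapter(task: str):
    # Returns None for unknown tasks, which falls back to the base model
    return ADAPTERS.get(task)
```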
4. Evaluation Blindness
Problem: Deploying based on “it feels right.”
Fix: You need an evaluation pipeline. Before promoting an adapter to production, run it against a “Golden Dataset” of 100 Q&A pairs and measure semantic similarity (using a judge model like GPT-4) against the expected answers.
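A minimal sketch of such a gate, assuming a golden_set.jsonl of prompt/expected pairs, the /generate endpoint from serve.py, and an OpenAI-hosted judge model; all file names, the endpoint URL, and the judge model are placeholders.

```python
# eval_gate.py -- sketch of a pre-deployment evaluation gate.
# Assumes golden_set.jsonl rows like {"prompt": ..., "expected": ...} and that
# the candidate adapter is live behind the /generate endpoint from serve.py.
import json
import requests
from openai import OpenAI

judge = OpenAI()  # judge credentials assumed to be configured via environment

JUDGE_PROMPT = (
    "Score 1 if the candidate answer is semantically equivalent to the "
    "expected answer, else 0.\nExpected: {expected}\nCandidate: {candidate}\n"
    "Reply with only 0 or 1."
)

def eval_adapter(threshold: float = 0.9) -> bool:
    with open("golden_set.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    passed = 0
    for row in rows:
        candidate = requests.post(
            "http://localhost:8000/generate",
            json={"prompt": row["prompt"], "temperature": 0.0},
            timeout=120,
        ).json()["text"]
        verdict = judge.chat.completions.create(
            model="gpt-4o",  # judge model; swap for your preferred grader
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                expected=row["expected"], candidate=candidate)}],
        ).choices[0].message.content.strip()
        passed += verdict == "1"
    score = passed / len(rows)
    print(f"Golden set pass rate: {score:.0%}")
    return score >= threshold  # only promote the adapter above this bar
```

Wire this into CI so an adapter cannot be promoted unless the gate returns True.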
🚀 Final Verdict
This architecture gives you the control of custom models with the economics of shared infrastructure.
- MVP: Use OpenAI Fine-tuning.
- Scale: Use the Unsloth + vLLM pipeline described above.
You are now managing weights, not just prompts. That is true “Hands On” LLM engineering.