A running record of AI systems built, experiments run, and architectures designed — from LLM pipelines to production inference engines.
LLMs · Mar 2026
Building a Sub-40ms Inference Engine with vLLM and Custom KV-Cache Optimisation
How I reduced p99 inference latency by 63% on a 70B parameter model through continuous batching, paged attention tuning, and a custom speculative decoding pipeline. Full architecture walkthrough, benchmarks, and the lessons that didn't make it into the paper.
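The speculative-decoding idea behind that pipeline can be sketched in a few lines: a cheap draft model proposes a block of tokens, the target model verifies them, and the longest agreeing prefix is accepted. The toy "models" below are illustrative stand-ins, not the post's actual 70B setup:

```python
# Toy greedy speculative decoding. A cheap draft model proposes k tokens;
# the target model verifies them and the longest agreeing prefix is kept.
# Both "models" here are deterministic stand-ins for demonstration only.

def target_model(context):
    # Stand-in for the large target model: greedy next token.
    return context[-1] + 1

def draft_model(context, k):
    # Stand-in draft model: correct for the first 2 positions, then guesses
    # badly, so each step shows both acceptance and rejection.
    nxt = context[-1] + 1
    return [nxt + i if i < 2 else 0 for i in range(k)]

def speculative_step(context, k=4):
    """Accept the longest draft prefix the target agrees with, then append
    one token from the target itself (so progress is always >= 1 token)."""
    proposal = draft_model(context, k)
    accepted = []
    for tok in proposal:
        if target_model(context + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: discard the rest of the draft block
    accepted.append(target_model(context + accepted))
    return accepted

tokens = [0]
for _ in range(3):
    tokens += speculative_step(tokens)  # 3 tokens accepted per step here
```

The latency win comes from the target model verifying k draft tokens in one forward pass instead of k sequential passes; the loop above models only the accept/reject logic.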
Multi-Agent Orchestration: Lessons from 6 Months in Production
What actually breaks in multi-agent systems at scale — loop detection, tool call reliability, memory contention, and the surprising cost of over-planning.
Designing an autoscaling strategy on AWS that handles 10× traffic spikes without pre-warming costs eating your margin. Spot instances, queue depth triggers, and warm pool tuning.
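A queue-depth trigger of the kind mentioned above reduces to a small target-tracking rule: size the fleet so each replica sees a fixed backlog, and damp scale-in so brief lulls don't flap the pool. Parameter names and values here are illustrative assumptions, not the post's actual configuration:

```python
import math

def desired_replicas(queue_depth, current, target_per_replica=20,
                     min_replicas=2, max_replicas=50, scale_in_damping=0.5):
    """Queue-depth target tracking (illustrative sketch).

    Scale out so each replica handles ~target_per_replica queued requests;
    on scale-in, only remove a fraction of the surplus per evaluation
    period to avoid thrashing a warm pool."""
    raw = math.ceil(queue_depth / target_per_replica)
    if raw < current:
        # Damped scale-in: shed only part of the surplus each period.
        raw = current - math.ceil((current - raw) * scale_in_damping)
    return max(min_replicas, min(max_replicas, raw))
```

On AWS this maps naturally onto a custom CloudWatch metric (queue depth per in-service instance) driving a target-tracking policy; the function above just makes the arithmetic explicit.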
QLoRA Fine-tuning on Consumer Hardware: A Practical Guide
Fine-tuning a 13B model to 99% of full fine-tune quality on a single 24GB GPU — quantisation choices, rank selection, and gradient accumulation tricks.
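The low-rank update at the heart of (Q)LoRA fits in a few lines: the frozen weight W is augmented with a rank-r delta (alpha/r)·B·A, and only A and B train. A minimal NumPy sketch with illustrative dimensions (in QLoRA proper, W would be stored 4-bit quantised and dequantised on the fly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16             # hidden size, LoRA rank, scaling (illustrative)

W = rng.normal(size=(d, d))         # frozen base weight (4-bit NF4 in QLoRA)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialised
                                    # so the adapter starts as an exact no-op

def adapted_forward(x):
    # Base path plus low-rank update: (W + (alpha/r) * B @ A) @ x,
    # computed without ever materialising the dense d x d delta.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d,))
# Zero-initialised B means training starts from the base model's behaviour:
assert np.allclose(adapted_forward(x), W @ x)
# Trainable parameters: 2*r*d = 1024 vs d*d = 4096 frozen.
```

Rank selection trades off exactly these two counts: doubling r doubles the trainable parameters (and adapter expressiveness) while the frozen, quantised W dominates memory either way.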
Why Chain-of-Thought Prompting Degrades Under Latency Constraints
An empirical analysis of CoT reasoning quality vs. token budget constraints — and a structured prompt compression technique that preserves 88% of reasoning depth.
Building deterministic, type-safe tool interfaces for LLM agents — schema design, retry logic, and observability patterns that make debugging tractable.
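The deterministic, type-safe interface pattern can be sketched as one dataclass per tool plus strict validation of the model-emitted JSON. The tool name, fields, and retry shape below are hypothetical examples, not the post's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchArgs:
    # One frozen dataclass per tool = one typed, immutable call record.
    # "search" and its fields are hypothetical, for illustration only.
    query: str
    max_results: int = 5

def parse_tool_call(raw: str) -> SearchArgs:
    """Validate a model-emitted JSON tool call into a typed object,
    rejecting unknown keys so schema drift fails loudly instead of silently."""
    payload = json.loads(raw)
    unknown = set(payload) - {"query", "max_results"}
    if unknown:
        raise ValueError(f"unknown tool arguments: {sorted(unknown)}")
    args = SearchArgs(**payload)
    if not isinstance(args.query, str) or not isinstance(args.max_results, int):
        raise TypeError("tool arguments have wrong types")
    return args

def call_with_retry(raw: str, attempts: int = 3):
    """Retry wrapper: every failure becomes a structured error record. In a
    real agent loop you would feed that record back to the model and retry
    with a corrected call; here the same input is simply re-validated."""
    last_err = None
    for _ in range(attempts):
        try:
            return parse_tool_call(raw)
        except (ValueError, TypeError, json.JSONDecodeError) as e:
            last_err = {"error": type(e).__name__, "detail": str(e)}
    return last_err
```

Returning a structured error rather than raising is the observability hook: the same record can be logged, counted per tool, and handed back to the model as corrective context.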