A running record of AI systems built, experiments run, and architectures designed — from LLM pipelines to production inference engines.
LLMs · Mar 2026
Building a Sub-40ms Inference Engine with vLLM and Custom KV-Cache Optimisation
How I reduced p99 inference latency by 63% on a 70B parameter model through continuous batching, paged attention tuning, and a custom speculative decoding pipeline. Full architecture walkthrough, benchmarks, and the lessons that didn't make it into the paper.
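The speculative-decoding idea behind that pipeline can be sketched in a few lines: a cheap draft model proposes a block of tokens, the target model verifies them, and the longest agreeing prefix is accepted. The toy "models" below are illustrative stand-ins, not the post's actual 70B setup:

```python
# Toy greedy speculative decoding. A cheap draft model proposes k tokens;
# the target model verifies them and the longest agreeing prefix is kept.
# Both "models" here are deterministic stand-ins for demonstration only.

def target_model(context):
    # Stand-in for the large target model: greedy next token.
    return context[-1] + 1

def draft_model(context, k):
    # Stand-in draft model: correct for the first 2 positions, then guesses
    # badly, so each step shows both acceptance and rejection.
    nxt = context[-1] + 1
    return [nxt + i if i < 2 else 0 for i in range(k)]

def speculative_step(context, k=4):
    """Accept the longest draft prefix the target agrees with, then append
    one token from the target itself (so progress is always >= 1 token)."""
    proposal = draft_model(context, k)
    accepted = []
    for tok in proposal:
        if target_model(context + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: discard the rest of the draft block
    accepted.append(target_model(context + accepted))
    return accepted

tokens = [0]
for _ in range(3):
    tokens += speculative_step(tokens)  # 3 tokens accepted per step here
```

The latency win comes from the target model verifying k draft tokens in one forward pass instead of k sequential passes; the loop above models only the accept/reject logic.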
Multi-Agent Orchestration: Lessons from 6 Months in Production
What actually breaks in multi-agent systems at scale — loop detection, tool call reliability, memory contention, and the surprising cost of over-planning.
Designing an autoscaling strategy on AWS that handles 10× traffic spikes without pre-warming costs eating your margin. Spot instances, queue depth triggers, and warm pool tuning.
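A queue-depth trigger of the kind mentioned above reduces to a small target-tracking rule: size the fleet so each replica sees a fixed backlog, and damp scale-in so brief lulls don't flap the pool. Parameter names and values here are illustrative assumptions, not the post's actual configuration:

```python
import math

def desired_replicas(queue_depth, current, target_per_replica=20,
                     min_replicas=2, max_replicas=50, scale_in_damping=0.5):
    """Queue-depth target tracking (illustrative sketch).

    Scale out so each replica handles ~target_per_replica queued requests;
    on scale-in, only remove a fraction of the surplus per evaluation
    period to avoid thrashing a warm pool."""
    raw = math.ceil(queue_depth / target_per_replica)
    if raw < current:
        # Damped scale-in: shed only part of the surplus each period.
        raw = current - math.ceil((current - raw) * scale_in_damping)
    return max(min_replicas, min(max_replicas, raw))
```

On AWS this maps naturally onto a custom CloudWatch metric (queue depth per in-service instance) driving a target-tracking policy; the function above just makes the arithmetic explicit.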
QLoRA Fine-tuning on Consumer Hardware: A Practical Guide
Fine-tuning a 13B model to 99% of full fine-tune quality on a single 24GB GPU — quantisation choices, rank selection, and gradient accumulation tricks.
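The low-rank update at the heart of (Q)LoRA fits in a few lines: the frozen weight W is augmented with a rank-r delta (alpha/r)·B·A, and only A and B train. A minimal NumPy sketch with illustrative dimensions (in QLoRA proper, W would be stored 4-bit quantised and dequantised on the fly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16             # hidden size, LoRA rank, scaling (illustrative)

W = rng.normal(size=(d, d))         # frozen base weight (4-bit NF4 in QLoRA)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialised
                                    # so the adapter starts as an exact no-op

def adapted_forward(x):
    # Base path plus low-rank update: (W + (alpha/r) * B @ A) @ x,
    # computed without ever materialising the dense d x d delta.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d,))
# Zero-initialised B means training starts from the base model's behaviour:
assert np.allclose(adapted_forward(x), W @ x)
# Trainable parameters: 2*r*d = 1024 vs d*d = 4096 frozen.
```

Rank selection trades off exactly these two counts: doubling r doubles the trainable parameters (and adapter expressiveness) while the frozen, quantised W dominates memory either way.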
Why Chain-of-Thought Prompting Degrades Under Latency Constraints
An empirical analysis of CoT reasoning quality vs. token budget constraints — and a structured prompt compression technique that preserves 88% of reasoning depth.
Building deterministic, type-safe tool interfaces for LLM agents — schema design, retry logic, and observability patterns that make debugging tractable.
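The deterministic, type-safe interface pattern can be sketched as one dataclass per tool plus strict validation of the model-emitted JSON. The tool name, fields, and retry shape below are hypothetical examples, not the post's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchArgs:
    # One frozen dataclass per tool = one typed, immutable call record.
    # "search" and its fields are hypothetical, for illustration only.
    query: str
    max_results: int = 5

def parse_tool_call(raw: str) -> SearchArgs:
    """Validate a model-emitted JSON tool call into a typed object,
    rejecting unknown keys so schema drift fails loudly instead of silently."""
    payload = json.loads(raw)
    unknown = set(payload) - {"query", "max_results"}
    if unknown:
        raise ValueError(f"unknown tool arguments: {sorted(unknown)}")
    args = SearchArgs(**payload)
    if not isinstance(args.query, str) or not isinstance(args.max_results, int):
        raise TypeError("tool arguments have wrong types")
    return args

def call_with_retry(raw: str, attempts: int = 3):
    """Retry wrapper: every failure becomes a structured error record. In a
    real agent loop you would feed that record back to the model and retry
    with a corrected call; here the same input is simply re-validated."""
    last_err = None
    for _ in range(attempts):
        try:
            return parse_tool_call(raw)
        except (ValueError, TypeError, json.JSONDecodeError) as e:
            last_err = {"error": type(e).__name__, "detail": str(e)}
    return last_err
```

Returning a structured error rather than raising is the observability hook: the same record can be logged, counted per tool, and handed back to the model as corrective context.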