
[{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/tags/claude-code/","section":"Tags","summary":"","title":"Claude Code","type":"tags"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/tags/code-generation/","section":"Tags","summary":"","title":"Code Generation","type":"tags"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/series/dissecting-the-claude-code-harness/","section":"Series","summary":"","title":"Dissecting the Claude Code Harness","type":"series"},{"content":" 1. Sub-Agents — Scaling Beyond a Single Loop # The single-threaded agentic loop is simple and predictable, but it cannot parallelize work. Claude Code addresses this with sub-agents — child agent instances that run their own isolated loops.\nHow sub-agents work # When the main agent encounters a task that benefits from parallelism (e.g., \u0026ldquo;run tests, check linting, and update docs\u0026rdquo;), it can spawn sub-agents via the SpawnAgent tool. Each sub-agent:\nHas its own isolated context window — preventing \u0026ldquo;context collapse\u0026rdquo; in the parent session. Receives a scoped task description — a focused instruction, not the full conversation history. Has restricted tool permissions — sub-agents can be granted a subset of the parent\u0026rsquo;s tools. Returns a structured result to the parent when complete. Implementation # // The SpawnAgent tool — creates a child AgentLoop with isolated context const SpawnAgentTool: Tool = { name: \u0026#34;SpawnAgent\u0026#34;, description: \u0026#34;Spawn a sub-agent with its own isolated context to perform a focused task.\u0026#34;, permissionCategory: \u0026#34;spawn\u0026#34;, inputSchema: z.object({ task: z.string().describe(\u0026#34;The focused task description for the sub-agent\u0026#34;), allowedTools: z .array(z.string()) .optional() .describe(\u0026#34;Subset of tools the sub-agent can use\u0026#34;), }), async execute(input) { const { task, allowedTools } = input as { task: string; allowedTools?: string[]; }; // Create a scoped tool registry for the sub-agent const scopedRegistry = new ToolRegistry(); const parentTools = registry; // reference to parent\u0026#39;s registry // Only register allowed tools (or all if not specified) const toolNames = allowedTools ?? 
Array.from(parentTools.listNames()); for (const name of toolNames) { if (name === \u0026#34;SpawnAgent\u0026#34;) continue; // prevent recursive spawning try { scopedRegistry.register(parentTools.get(name)); } catch { // Tool not found — skip } } // Sub-agent gets its own context manager (isolated context window) const childContextManager = new ContextManager( process.cwd(), 100_000, // sub-agents get a smaller context budget ); // Sub-agent gets full permission (parent already approved the spawn) const childPermissions = new PermissionSystem(\u0026#34;auto\u0026#34;, { denyPatterns: [], allowedPaths: [process.cwd()], }); const childLoop = new AgentLoop( scopedRegistry, childPermissions, new HookRunner(), // sub-agents inherit hook config in production childContextManager, ); // Run the sub-agent and return its result to the parent const result = await childLoop.run(task); return `[Sub-agent completed]\\n${result}`; }, }; // --- Parallel sub-agent orchestration --- // The parent agent does not call SpawnAgent in parallel itself — // it issues multiple SpawnAgent tool_use blocks in a single response, // and the harness executes them concurrently: async function executeToolsConcurrently( toolCalls: ToolUseBlock[], executeTool: (tc: ToolUseBlock) =\u0026gt; Promise\u0026lt;string\u0026gt;, ): Promise\u0026lt;Map\u0026lt;string, string\u0026gt;\u0026gt; { const results = new Map\u0026lt;string, string\u0026gt;(); // Separate SpawnAgent calls (can run in parallel) from others (sequential) const spawnCalls = toolCalls.filter((tc) =\u0026gt; tc.name === \u0026#34;SpawnAgent\u0026#34;); const otherCalls = toolCalls.filter((tc) =\u0026gt; tc.name !== \u0026#34;SpawnAgent\u0026#34;); // Run spawn calls concurrently const spawnResults = await Promise.all( spawnCalls.map(async (tc) =\u0026gt; ({ id: tc.id, result: await executeTool(tc), })), ); for (const { id, result } of spawnResults) { results.set(id, result); } // Run other calls sequentially (preserve ordering guarantees) for (const tc of otherCalls) { results.set(tc.id, await executeTool(tc)); } return results; } This is architecturally similar to a worker pool in distributed systems: the parent acts as an orchestrator, the sub-agents are workers, and the tool interface is the communication protocol.\nWhy isolation matters # Without isolation, parallel tool execution would mutate the parent\u0026rsquo;s conversation history concurrently — creating race conditions and incoherent context. By giving each sub-agent its own context, the harness maintains the single-writer invariant that keeps the system predictable.\n2. MCP — Model Context Protocol # Claude Code supports the Model Context Protocol (MCP), an open standard for connecting AI assistants to external tools and data sources. MCP acts as a universal adapter layer:\nTool servers — External services that expose tools (databases, APIs, monitoring systems) via a standardized protocol. Resource providers — Services that provide context (documentation, codebase indices, knowledge bases). Implementation # // MCP tools are registered into the same ToolRegistry as built-in tools. // The harness treats them identically — same schema validation, // same permission gates, same hook system. interface MCPServerConfig { name: string; url: string; // e.g. \u0026#34;http://localhost:3001/mcp\u0026#34; } async function registerMCPTools( server: MCPServerConfig, registry: ToolRegistry, ): Promise\u0026lt;void\u0026gt; { // 1. 
Discover available tools from the MCP server const response = await fetch(`${server.url}/tools/list`, { method: \u0026#34;POST\u0026#34; }); const { tools } = (await response.json()) as { tools: { name: string; description: string; inputSchema: object }[]; }; // 2. Register each MCP tool as a local tool with a remote executor for (const mcpTool of tools) { registry.register({ name: `mcp_${server.name}_${mcpTool.name}`, description: `[MCP: ${server.name}] ${mcpTool.description}`, permissionCategory: \u0026#34;network\u0026#34;, // all MCP tools go through network gates inputSchema: z.any(), // schema comes from the MCP server async execute(input) { const result = await fetch(`${server.url}/tools/call`, { method: \u0026#34;POST\u0026#34;, headers: { \u0026#34;Content-Type\u0026#34;: \u0026#34;application/json\u0026#34; }, body: JSON.stringify({ name: mcpTool.name, arguments: input }), }); const { content } = (await result.json()) as { content: { type: string; text: string }[]; }; return content.map((c) =\u0026gt; c.text).join(\u0026#34;\\n\u0026#34;); }, }); } } From the harness\u0026rsquo;s perspective, MCP tools are indistinguishable from built-in tools: they go through the same schema validation, permission gates, and hook system. This means organizations can extend Claude Code\u0026rsquo;s capabilities without modifying the harness itself — a critical property for enterprise adoption.\n3. Skills — On-Demand Procedural Knowledge # MCP gives the agent new tools (the ability to do things). Skills give the agent new expertise (knowledge of how to do things). A skill is a self-contained directory — containing instructions, scripts, templates, and configuration — that the harness injects into the conversation on demand, teaching the model a specific workflow and giving it executable utilities to carry it out, without permanently consuming context tokens.\nSkill directory structure # A skill is not just a single file; it\u0026rsquo;s a directory:\n.claude/skills/ └── deploy/ ├── SKILL.md # Required — entry point (instructions + config) ├── scripts/ │ ├── deploy.sh # Helper script the skill references │ └── health-check.py # Another utility ├── assets/ │ └── deploy-config.yaml # Reference implementation └── references/ └── topic1.md # Additional documentation The scripts/ directory is particularly important: skills can bundle executable helpers that the model runs via the Bash tool during skill execution. 
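To make this concrete, here is an illustrative tool_use payload (using the ToolUseBlock type from Part 1) showing the model invoking a bundled helper through the ordinary Bash tool; the call ID and paths are invented for the example:
// Illustrative only: the model runs a skill\u0026#39;s bundled script via the
// ordinary Bash tool; there is no dedicated \u0026#34;skill execution\u0026#34; machinery.
const skillScriptCall: ToolUseBlock = {
  type: \u0026#34;tool_use\u0026#34;,
  id: \u0026#34;toolu_example\u0026#34;, // made-up call ID
  name: \u0026#34;Bash\u0026#34;,
  input: { command: \u0026#34;bash .claude/skills/deploy/scripts/deploy.sh staging\u0026#34; },
};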
This makes skills more than just instructions — they\u0026rsquo;re portable workflow packages.\nThe progressive-disclosure pattern # Skills use a three-level loading strategy designed to conserve the context window:\nLevel What\u0026rsquo;s loaded When Context cost Level 1: Metadata Skill name and description from YAML frontmatter Always — injected at session start Very low (~50 tokens per skill) Level 2: Instructions Full SKILL.md body (the \u0026ldquo;playbook\u0026rdquo;) On demand — when the skill is triggered Moderate (hundreds to low thousands of tokens) Level 3: Supporting files Scripts, examples, templates in the skill directory Lazy — only when the running skill reads them Variable This is analogous to how an operating system loads shared libraries: metadata (the symbol table) is always available, but the actual code is only paged in when a symbol is referenced.\nThe SKILL.md file # The SKILL.md file has two parts: YAML frontmatter (configuration) and a markdown body (instructions).\n--- name: deploy description: Deploy the application to staging or production using our CI/CD pipeline allowed-tools: [Bash, ReadFile, Grep] # restrict which tools this skill can use disable-model-invocation: true # prevent autonomous triggering (require /deploy) context: fork # run in an isolated sub-agent context --- ## Steps 1. Run `npm run build` and verify it exits cleanly. 2. Run the test suite with `npm test`. If any tests fail, stop and report. 3. Check the current branch — only `main` can deploy to production. 4. For staging: run the bundled deploy script: ```bash bash scripts/deploy.sh staging ``` For production: run bash scripts/deploy.sh production, then verify the health check using the bundled script: python3 scripts/health-check.py https://api.example.com/health Rules # Never deploy if there are uncommitted changes. Always run tests before deploying, even if the user says to skip them. After a production deploy, post a summary to #deployments on Slack. Frontmatter configuration # Field Purpose name Becomes the /slash-command and the identifier used by the UseSkill tool. Level 1 — always in context. description The signal Claude uses to match user intent to this skill. Level 1 — always in context. allowed-tools Restricts which tools the model can call while this skill is active. Omit to allow all tools. disable-model-invocation When true, prevents Claude from triggering this skill autonomously — it can only be invoked manually via /deploy. Essential for workflows with side effects. context Set to fork to run the skill in an isolated sub-agent context, preventing it from polluting the parent session\u0026rsquo;s history. The markdown body is Level 2 — loaded only when the skill is triggered. Notice that the instructions freely reference bundled scripts (scripts/deploy.sh, scripts/health-check.py) and harness tools (Bash, ReadFile). The model uses these references to orchestrate tool calls during execution.\nHow skills are triggered # Skills can be activated in two ways:\nAutonomous discovery — The model reads the skill descriptions (Level 1) and decides, based on the user\u0026rsquo;s task, that a skill is relevant. It then invokes the skill to load Level 2 instructions. This requires no user action. Manual invocation — The user types a slash command (e.g., /deploy). This is preferred for workflows with side effects, where timing matters. 
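A minimal sketch of how the harness might route the manual path, mapping a slash command onto the UseSkill tool defined in the implementation below; the dispatcher function and its parsing are assumptions for illustration, not the leaked implementation:
// Hypothetical dispatch sketch. A /slash-command maps directly onto the
// skill of the same name. Autonomous discovery needs no dispatcher: the
// model sees Level 1 metadata in context and may emit a UseSkill tool_use
// on its own (unless disable-model-invocation is set for that skill).
async function routeUserInput(input: string): Promise\u0026lt;string | null\u0026gt; {
  if (input.startsWith(\u0026#34;/\u0026#34;)) {
    const skillName = input.slice(1).split(/\\s+/)[0];
    return await UseSkillTool.execute({ skillName });
  }
  return null; // not a slash command; handled as a normal user prompt
}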
Personal vs project skills # Scope Location Use case Personal ~/.claude/skills/ Preferences that follow you across projects — commit message style, preferred test frameworks, code review checklists Project /skills/ (in the repo) Team workflows that travel with the codebase — deployment procedures, coding standards, architecture patterns Project skills are version-controlled and shared automatically with anyone who clones the repository.\nImplementation # import * as fs from \u0026#34;fs/promises\u0026#34;; import * as path from \u0026#34;path\u0026#34;; import * as yaml from \u0026#34;yaml\u0026#34;; interface SkillConfig { allowedTools?: string[]; // e.g. [\u0026#34;Bash\u0026#34;, \u0026#34;ReadFile\u0026#34;, \u0026#34;Grep\u0026#34;] disableModelInvocation?: boolean; // true = manual /slash-command only context?: \u0026#34;inline\u0026#34; | \u0026#34;fork\u0026#34;; // fork = run in isolated sub-agent } interface SkillMetadata { name: string; description: string; basePath: string; // directory containing SKILL.md config: SkillConfig; } interface LoadedSkill extends SkillMetadata { instructions: string; // the markdown body (Level 2) scripts: string[]; // relative paths to files in scripts/ } class SkillRegistry { private skills = new Map\u0026lt;string, SkillMetadata\u0026gt;(); // Called at startup — discovers all skills and loads Level 1 (metadata only) async discoverSkills(searchPaths: string[]): Promise\u0026lt;void\u0026gt; { for (const searchPath of searchPaths) { const entries = await fs.readdir(searchPath, { withFileTypes: true }); for (const entry of entries) { if (!entry.isDirectory()) continue; const skillDir = path.join(searchPath, entry.name); const skillFile = path.join(skillDir, \u0026#34;SKILL.md\u0026#34;); try { const raw = await fs.readFile(skillFile, \u0026#34;utf-8\u0026#34;); const metadata = this.parseFrontmatter(raw); // Discover bundled scripts (Level 3) const scripts = await this.discoverScripts(skillDir); this.skills.set(metadata.name, { ...metadata, basePath: skillDir, scripts, }); } catch { // No SKILL.md in this directory — skip } } } } // Scan the scripts/ directory for executable helpers private async discoverScripts(skillDir: string): Promise\u0026lt;string[]\u0026gt; { const scriptsDir = path.join(skillDir, \u0026#34;scripts\u0026#34;); try { const entries = await fs.readdir(scriptsDir); return entries.map((e) =\u0026gt; path.join(\u0026#34;scripts\u0026#34;, e)); } catch { return []; // no scripts/ directory } } // Level 1: returns metadata for all skills (always in context) getMetadataSummary(): string { const lines = [\u0026#34;Available skills:\u0026#34;]; for (const [name, skill] of this.skills) { lines.push(` /${name} — ${skill.description}`); } return lines.join(\u0026#34;\\n\u0026#34;); } // Level 2: loads the full instructions for a specific skill async loadSkill(name: string): Promise\u0026lt;LoadedSkill\u0026gt; { const metadata = this.skills.get(name); if (!metadata) throw new Error(`Unknown skill: ${name}`); const raw = await fs.readFile( path.join(metadata.basePath, \u0026#34;SKILL.md\u0026#34;), \u0026#34;utf-8\u0026#34;, ); const instructions = this.extractBody(raw); const scripts = metadata.scripts ?? 
(await this.discoverScripts(metadata.basePath)); return { ...metadata, instructions, scripts }; } // Level 3: read a supporting file from the skill\u0026#39;s directory async loadSupportingFile( skillName: string, relativePath: string, ): Promise\u0026lt;string\u0026gt; { const metadata = this.skills.get(skillName); if (!metadata) throw new Error(`Unknown skill: ${skillName}`); const filePath = path.join(metadata.basePath, relativePath); return await fs.readFile(filePath, \u0026#34;utf-8\u0026#34;); } private parseFrontmatter(raw: string): Omit\u0026lt;SkillMetadata, \u0026#34;scripts\u0026#34;\u0026gt; { const match = raw.match(/^---\\n([\\s\\S]*?)\\n---/); if (!match) throw new Error(\u0026#34;No frontmatter found\u0026#34;); const parsed = yaml.parse(match[1]) as { name: string; description: string; \u0026#34;allowed-tools\u0026#34;?: string[]; \u0026#34;disable-model-invocation\u0026#34;?: boolean; context?: \u0026#34;inline\u0026#34; | \u0026#34;fork\u0026#34;; }; return { name: parsed.name, description: parsed.description, basePath: \u0026#34;\u0026#34;, // filled in by caller config: { allowedTools: parsed[\u0026#34;allowed-tools\u0026#34;], disableModelInvocation: parsed[\u0026#34;disable-model-invocation\u0026#34;], context: parsed.context, }, }; } private extractBody(raw: string): string { return raw.replace(/^---[\\s\\S]*?---\\n*/, \u0026#34;\u0026#34;).trim(); } } // --- The Skill tool: a meta-tool that loads instructions into context --- const UseSkillTool: Tool = { name: \u0026#34;UseSkill\u0026#34;, description: \u0026#34;Load a skill\u0026#39;s instructions into the conversation to guide task execution.\u0026#34;, permissionCategory: \u0026#34;read\u0026#34;, inputSchema: z.object({ skillName: z.string().describe(\u0026#34;Name of the skill to load\u0026#34;), }), async execute(input) { const { skillName } = input as { skillName: string }; try { const skill = await skillRegistry.loadSkill(skillName); // The skill\u0026#39;s instructions + metadata are returned as a tool result, // which means they enter the conversation history and guide // the model\u0026#39;s next steps. const sections = [ `[Skill loaded: ${skill.name}]`, `Base path: ${skill.basePath}`, ]; // Surface bundled scripts so the model knows what\u0026#39;s available if (skill.scripts.length \u0026gt; 0) { sections.push( `\\nBundled scripts (can be executed via Bash):`, ...skill.scripts.map((s) =\u0026gt; ` - ${s}`), ); } // Surface tool restrictions if configured if (skill.config.allowedTools) { sections.push( `\\nAllowed tools: ${skill.config.allowedTools.join(\u0026#34;, \u0026#34;)}`, ); } sections.push(\u0026#34;\u0026#34;, skill.instructions); return sections.join(\u0026#34;\\n\u0026#34;); } catch (err: any) { return `Error loading skill: ${err.message}`; } }, }; // --- Skill-aware context building --- // During context initialization, skill metadata (Level 1) is injected // alongside CLAUDE.md so the model knows what skills exist. 
class SkillAwareContextManager extends ContextManager { private skillRegistry: SkillRegistry; constructor( projectRoot: string, maxTokens: number, skillRegistry: SkillRegistry, ) { super(projectRoot, maxTokens); this.skillRegistry = skillRegistry; } override buildInitialContext(): Message[] { const messages = super.buildInitialContext(); // Inject skill metadata as a system message // This is Level 1 — just names and descriptions, very cheap const skillSummary = this.skillRegistry.getMetadataSummary(); if (skillSummary) { messages.push({ role: \u0026#34;system\u0026#34;, content: `[Skills]\\n${skillSummary}\\n\\nYou can use the UseSkill tool to load any skill when relevant.`, }); } return messages; } } The key insight: skills are instructions + utilities, not services # Unlike MCP, skills do not run as separate processes. They are loaded into the conversation as instructions, and the model uses the harness\u0026rsquo;s existing tools to act on them. But skills are not \u0026ldquo;just markdown\u0026rdquo; either — they can bundle:\nExecutable scripts (scripts/) that the model calls via the Bash tool during execution. Templates and examples (examples/, resources/) that the model reads for reference. Tool restrictions (allowed-tools) that scope what the model can do while the skill is active. Isolation config (context: fork) that runs the skill in a sub-agent to protect the parent session. The result is a portable workflow package — instructions plus the utilities needed to carry them out — that requires no server, no daemon, and no deployment. A skill is just a directory you can git push.\n4. Skills vs MCP — When to Use Which # Skills and MCP are complementary but serve fundamentally different purposes. The simplest mental model: MCP gives Claude new hands; Skills give Claude new expertise.\nBut can\u0026rsquo;t skills just call APIs? # Yes — and this is worth being precise about because the overlap is real.\nA skill can bundle a scripts/jira-client.py that handles OAuth, manages tokens, retries on failure, and returns structured JSON. The model reads the skill\u0026rsquo;s instructions, which describe exactly how to call the script:\n## Available scripts - `python3 scripts/jira-client.py get-issue --key \u0026lt;ISSUE_KEY\u0026gt;` — returns issue JSON - `python3 scripts/jira-client.py create-comment --key \u0026lt;ISSUE_KEY\u0026gt; --body \u0026lt;TEXT\u0026gt;` — posts a comment The model is perfectly capable of reasoning about this interface from the instructions. It knows the flag names, the expected values, and the script\u0026rsquo;s capabilities — because the skill told it. For simple and moderate API usage, this works well and is often the better choice because it\u0026rsquo;s simpler to set up than an MCP server.\nSo when does MCP actually earn its complexity? Three situations:\n1. Harness-level validation (catching errors before execution)\nWhen the model calls a script via Bash, the harness sees one parameter: a command string. If the model hallucinates a flag name (--issue-key instead of --key), the error surfaces after the script runs and returns stderr. The model then has to parse the error, understand what went wrong, and retry — burning a full agentic loop iteration.\nWith MCP, the tool\u0026rsquo;s JSON Schema is registered with the harness. 
The Zod layer validates the input before the call reaches the server:\n// MCP: harness catches this BEFORE execution tool_use: { name: \u0026#34;mcp_jira_get_issue\u0026#34;, input: { issueKey: 123 } } // → Zod error: \u0026#34;issueKey must be a string\u0026#34; — returned instantly, no execution // Skill script: error surfaces AFTER execution tool_use: { name: \u0026#34;Bash\u0026#34;, input: { command: \u0026#34;python3 scripts/jira-client.py get-issue --key\u0026#34; } } // → Script runs, fails with \u0026#34;error: --key requires an argument\u0026#34;, model parses stderr This matters at scale. If the model makes ten tool calls per task, even a 5% error rate means a wasted iteration every other task. Pre-execution validation eliminates an entire class of errors.\n2. Tool discovery at scale\nWhen you have 5 scripts, the model can learn their interfaces from skill instructions. When you have 50 MCP tools across 8 servers, something changes: all MCP tool schemas are always visible in the API\u0026rsquo;s tools array. The model can browse them, compare parameters, and pick the right tool without loading any skill instructions first.\nWith skills, tool discovery requires loading skill instructions (Level 2) before the model even knows what\u0026rsquo;s available. For large tool ecosystems — an organization with MCP servers for GitHub, Jira, Postgres, Slack, Datadog, and more — the \u0026ldquo;always visible\u0026rdquo; property of MCP schemas is a significant advantage.\n3. Cross-platform portability\nAn MCP server works with Claude Code, Cursor, Windsurf, Copilot, and any other MCP-compatible AI assistant. A skill script in .claude/skills/deploy/scripts/ is tied to Claude Code\u0026rsquo;s Bash tool. If your team uses multiple AI tools, MCP gives you one interface that works everywhere.\nWhat this means in practice # Capability Skill script MCP tool Skill workaround Verdict Model reasoning Reads interface from instructions Reads JSON Schema from tools array N/A — both work Draw Input validation Errors surface at runtime Zod rejects before execution Script validates its own args before calling the API Draw — both prevent the bad call; MCP is marginally faster Discovery (5 tools) Skill descriptions cover it Schemas in tools array N/A — both work Draw Discovery (50+ tools) Must load skill instructions All schemas always visible Rich Level 1 descriptions or a \u0026ldquo;catalog\u0026rdquo; skill Slight MCP edge — but skill catalogs close the gap Authentication Env vars, token cache Server manages OAuth/refresh Script handles tokens itself Draw Persistent state Fresh process each call Server holds connections Sidecar daemon via Unix socket Draw — but the sidecar is an MCP server without the protocol Cross-platform Tied to Claude Code Any MCP-compatible assistant Ship scripts with adapter wrappers per platform MCP wins — one interface vs N adapters The real decision: skills can do almost everything MCP does, but the workarounds add up. A sidecar daemon for persistence, a catalog skill for discovery, adapter wrappers for portability — at some point you\u0026rsquo;ve built an MCP-equivalent system without the standardized protocol. 
MCP\u0026rsquo;s value isn\u0026rsquo;t any single capability; it\u0026rsquo;s that one protocol solves all of these at once.\nComparison # Dimension Skills MCP What it provides Procedural knowledge + utility scripts — how to do something Typed, authenticated connectivity — the ability to do something reliably Analogy An SOP manual with utility scripts attached A typed SDK for an external system Implementation Markdown instructions + bundled scripts (SKILL.md + scripts/) Client-server architecture via JSON-RPC Runs as Injected instructions; bundled scripts run via Bash Persistent external process (MCP server) API calls Yes — via curl, Python, etc. in shell scripts (untyped) Yes — via typed, schema-validated tool definitions Token cost Very low (Level 1 always; Level 2+ on demand) Higher (full tool schemas always exposed) Requires infrastructure No — just a directory you can git push Yes — an MCP server process must be running Tool control Can restrict available tools via allowed-tools No built-in tool restrictions Shareable Via git (project skills in .claude/skills/) Via server deployment or npm packages Best for Workflows, runbooks, scripts, encoding judgment Reliable interfaces to APIs, databases, SaaS platforms Can Skills Completely Replace MCP? # Yes. If you look closely at the architecture of the Claude Code harness, every capability that MCP provides can be completely replaced by a well-architected Skills implementation.\n1. Replacing Pre-execution Validation Instead of relying on the harness\u0026rsquo;s Zod layer, a skill script can implement robust internal validation before making any API calls. For example, python3 scripts/billing.py charge --amount 100 --currency USD can validate that --amount is positive and --currency is a valid ISO code using argparse or pydantic before hitting the billing API. The functional result is identical: the costly call never happens. The only difference is that the validation runs in the script process rather than the harness process, surfacing errors to the model via standard output/error (which the model handles effortlessly).\n2. Replacing Tool Discovery at Scale You can replace MCP\u0026rsquo;s always-visible tool schemas by using a \u0026ldquo;catalog\u0026rdquo; skill. The Level 1 metadata (name + description) is always in context, so a rich description serves as a discovery mechanism:\n--- name: infra-tools description: | Infrastructure CLI tools: - query-db: Run SQL queries against staging/production Postgres - deploy: Deploy services to staging or production - metrics: Query Datadog metrics for the last N hours - slack-notify: Post messages to Slack channels --- When managing 50+ tools, a catalog skill lists all available scripts. Because Level 1 descriptions are tiny compared to full JSON Schema definitions, this approach is actually more context-efficient than loading 50 full MCP schemas into the harness at startup.\n3. Replacing Persistent Connections MCP servers hold persistent connections (database pools, WebSockets, long-lived sessions). Skills can achieve this exact architecture by talking to a sidecar daemon. 
You run the daemon in the background to hold the persistent connections, and the skill\u0026rsquo;s Bash scripts communicate with it via Unix sockets or localhost HTTP:\n# scripts/db-query.sh # Talks to a persistent sidecar instead of opening a new connection each time curl -s --unix-socket /tmp/db-proxy.sock \\ -X POST -d \u0026#34;{\\\u0026#34;sql\\\u0026#34;: \\\u0026#34;$1\\\u0026#34;, \\\u0026#34;params\\\u0026#34;: $2}\u0026#34; \\ http://localhost/query This transforms the skill from a stateless script into an interface for a stateful microservice, matching MCP\u0026rsquo;s persistence capability.\n4. Replacing Cross-Platform Portability While MCP defines a standard JSON-RPC protocol across tools like Cursor and Windsurf, Python and Bash scripts are inherently portable themselves. To support multiple AI assistants, you simply ship your scripts with thin adapter wrappers (e.g., a Cursor extension that shells out to your python script, or a Windsurf plugin that does the same). The core logic remains in the script, making it deeply agnostic to the specific AI agent running it.\nThe Architecture of a Full Replacement: If you want to bypass the complexity of deploying and managing MCP servers, you can build a complete equivalent using Skills + Scripts + Sidecars + Catalogs. While this involves writing validation logic and managing daemon processes yourself, it provides supreme flexibility—you are working entirely with standard scripts and bash commands, completely decoupled from the JSON-RPC spec of the Model Context Protocol.\nWhen to use Skills # You need procedural guidance — a repeatable workflow with specific steps, conditions, and rules. You want to encode judgment — \u0026ldquo;if the PR touches the payments module, always run the fraud-detection test suite.\u0026rdquo; You want consistency — the same workflow applied identically across sessions without re-explaining it. You\u0026rsquo;re making one-off API calls — a quick curl in a script is simpler than standing up an MCP server. You\u0026rsquo;re optimizing for context — skills load just-in-time, keeping the baseline context footprint minimal. How they compose # The most powerful workflows stack Skills on top of MCP:\nMCP provides the connection — e.g., an MCP server exposes your JIRA API. A Skill provides the methodology — e.g., a review-pr skill says: \u0026ldquo;First use the JIRA MCP to fetch the linked ticket. Then read the changed files. Then check for breaking changes against our API compatibility guidelines. Finally, post a review comment.\u0026rdquo; 5. Putting It All Together # With all the layers defined, here is how the harness bootstraps and runs:\nasync function main() { // 1. Build the tool registry const registry = new ToolRegistry(); registry.register(ReadFileTool); registry.register(BashTool); registry.register(EditFileTool); registry.register(SpawnAgentTool); registry.register(UseSkillTool); // 2. Connect MCP servers (if configured) await registerMCPTools( { name: \u0026#34;postgres\u0026#34;, url: \u0026#34;http://localhost:3001/mcp\u0026#34; }, registry, ); // 3. Discover skills (personal + project) const skillRegistry = new SkillRegistry(); await skillRegistry.discoverSkills([ path.join(process.env.HOME || \u0026#34;~\u0026#34;, \u0026#34;.claude\u0026#34;, \u0026#34;skills\u0026#34;), // personal path.join(process.cwd(), \u0026#34;.claude\u0026#34;, \u0026#34;skills\u0026#34;), // project ]); // 4. 
Configure permissions const permissions = new PermissionSystem(\u0026#34;default\u0026#34;, { denyPatterns: [/rm\\s+-rf\\s+\\//, /curl.*\\|.*sh/], allowedPaths: [process.cwd()], }); // 5. Register hooks const hooks = new HookRunner(); hooks.register({ event: \u0026#34;PreToolUse\u0026#34;, command: `bash -c \u0026#39;if echo \u0026#34;$TOOL_INPUT\u0026#34; | grep -q \u0026#34;node_modules\u0026#34;; then echo \u0026#34;BLOCKED\u0026#34;; exit 1; fi\u0026#39;`, }); // 6. Initialize skill-aware context manager const contextManager = new SkillAwareContextManager( process.cwd(), 200_000, skillRegistry, ); // 7. Create the agent loop and run const agent = new AgentLoop(registry, permissions, hooks, contextManager); const result = await agent.run(\u0026#34;Deploy the app to staging\u0026#34;); // The agent will autonomously discover the \u0026#39;deploy\u0026#39; skill from metadata, // load its instructions via UseSkill, and follow the steps. console.log(result); } main().catch(console.error); 6. Architectural Lessons # Stepping back, the Claude Code harness teaches several generalizable lessons about building agentic systems:\nThe model is not the product # Only ~2% of Claude Code\u0026rsquo;s codebase is \u0026ldquo;AI-related\u0026rdquo; in the sense of prompt engineering or model interaction. The remaining 98% is operational infrastructure: state management, safety, tool execution, context optimization. If you are building an agentic system, expect a similar ratio.\nDistributed systems patterns apply # The harness is effectively a distributed system with a single worker (the LLM) and multiple services (the tools):\nPattern Harness analogue Worker pool Sub-agents Service interface Tool registry Middleware Hooks Log rotation Context compaction Configuration management CLAUDE.md Circuit breaker Reactive compact + retry If you have experience building distributed systems, you already have the mental models needed to reason about agentic architectures.\nSafety is infrastructure, not a feature # The permission system, hooks, and schema validation are not bolted-on safety features — they are load-bearing infrastructure that the entire execution model depends on. The deny-first design, deterministic hooks, and layered gates are what make it safe to give an LLM write access to your codebase.\nStatelessness is a feature, not a bug # The model\u0026rsquo;s statelessness is often framed as a limitation, but Claude Code leverages it as a feature. Because every API call is independent, the harness can:\nCompact the context without side effects — the model doesn\u0026rsquo;t \u0026ldquo;notice\u0026rdquo; missing history. Fork sessions — two users can branch from the same conversation and diverge. Resume sessions — the harness reconstructs context from persisted state; the model doesn\u0026rsquo;t need to \u0026ldquo;wake up.\u0026rdquo; The harness transforms a liability (no memory) into a capability (flexible state management).\nConclusion # Claude Code is a masterclass in the unglamorous but essential work of building agentic infrastructure. The agentic loop is simple; the tool registry is modular; the permission system is layered; the context management is multi-staged; and the extensibility surfaces (hooks, MCP, skills, sub-agents) are designed for growth without touching the core loop.\nThe real insight is architectural: the intelligence is in the model, but the reliability is in the harness. 
If you\u0026rsquo;re building systems that give LLMs agency over real-world environments, the harness is where most of your engineering effort should go.\n","date":"26 April 2026","externalUrl":null,"permalink":"/posts/dissecting-the-claude-code-harness-part-2-extensibility-scale/","section":"Posts","summary":"1. Sub-Agents — Scaling Beyond a Single Loop # The single-threaded agentic loop is simple and predictable, but it cannot parallelize work. Claude Code addresses this with sub-agents — child agent instances that run their own isolated loops.\n","title":"Dissecting the Claude Code Harness - Part 2: Extensibility \u0026 Scale","type":"posts"},{"content":"I\u0026rsquo;m an SDE3 at Salesforce, writing deep dives into System Design and Agentic AI — with real Java implementations you can learn from.\n","date":"26 April 2026","externalUrl":null,"permalink":"/","section":"Home","summary":"I’m an SDE3 at Salesforce, writing deep dives into System Design and Agentic AI — with real Java implementations you can learn from.\n","title":"Home","type":"page"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"LLM","type":"tags"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"26 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" Introduction # Claude Code is Anthropic\u0026rsquo;s terminal-based AI coding agent. On the surface it looks like a CLI that \u0026ldquo;just talks to Claude,\u0026rdquo; but under the hood it is a stateful software layer that sits between a stateless language model and your local development environment. The model provides the reasoning; the harness provides the hands, eyes, and workspace.\nIn March 2026, the complete source was accidentally exposed via a 59.8 MB JavaScript source map bundled in version 2.1.88 of the @anthropic-ai/claude-code npm package (a known Bun bundler bug shipped the .map files into production). The leak gave the community an unprecedented look at roughly 512,000 lines of unobfuscated TypeScript, and what they found was striking: approximately 98% of the codebase is infrastructure, not AI \u0026ldquo;decision scaffolding.\u0026rdquo; Claude Code is, at its core, a distributed-systems-style runtime for a single LLM.\nThis post walks through the architectural pillars that make it work; for each layer, we\u0026rsquo;ll build a simplified implementation in TypeScript to make the mechanics concrete. We start by understanding what the harness is, model the LLM call itself as a plain function, and then build every other layer around it.\nThe harness # At a high level, the harness is everything that turns a reasoning model into a full-fledged coding agent:\nIt receives the user prompt. It decides what context to load (files, prior messages, session state). It routes tool calls and commands, enforces permissions, and aggregates results. It manages the agent loop across multiple turns, along with context compaction. 
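Before breaking the harness into stages, it helps to pin down its surface area as a single TypeScript interface. This is an orientation sketch; the method names are assumptions, and the leaked source spreads these responsibilities across many modules:
// Orientation sketch only; names are illustrative, not leaked APIs.
interface Harness {
  bootstrap(): Promise\u0026lt;void\u0026gt;; // discover environment, load configuration
  runQuery(prompt: string): Promise\u0026lt;string\u0026gt;; // the agent loop (Section 1)
  persistSession(): Promise\u0026lt;void\u0026gt;; // transcript and project-memory updates
}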
Core stages in the harness # Broadly, we can divide the harness into three main stages per interaction:\nBootstrap Discover environment (workspace root, repo status, OS, etc.) Load configuration (permissions, default tools, MCP servers/skills, feature flags) Prefetch resources such as keychain access and project scans. Query Engine (Agent loop) Build the model input: system prompt + CLAUDE.md + memory + conversation history + current task context (e.g., file contents, user prompt, diffs). Let the model propose actions (tool calls, subagent spawns, plans). Route tool calls through a permission pipeline (allow/ask/deny) and execute the ones that pass. Update session memory and compact context as it approaches the token limit. Decide whether to continue another turn or stop, based on model output. Response rendering and persistence Stream markdown/text back to the UI. Persist the session transcript locally so you can resume, rewind, or diff across sessions. Update project-level memory (CLAUDE.md, trace.md, etc.) based on what the model learned. The model call as a function # Throughout this post, the LLM is treated as a pure function: given a list of messages, it returns a response containing either text or tool-use requests. Everything else (state, safety, context) is the harness\u0026rsquo;s responsibility.\n// The model is a pure function: messages in, response out. // It has no memory, no side effects, no access to the filesystem. interface Message { role: \u0026#34;system\u0026#34; | \u0026#34;user\u0026#34; | \u0026#34;assistant\u0026#34; | \u0026#34;tool_result\u0026#34;; content: string | ToolUseBlock[]; } interface TextBlock { type: \u0026#34;text\u0026#34;; content: string; } interface ToolUseBlock { type: \u0026#34;tool_use\u0026#34;; id: string; // unique call ID, e.g. \u0026#34;toolu_01A...\u0026#34; name: string; // tool name, e.g. \u0026#34;Bash\u0026#34;, \u0026#34;ReadFile\u0026#34; input: Record\u0026lt;string, unknown\u0026gt;; // JSON payload matching the tool\u0026#39;s Zod schema } interface ModelResponse { content: (TextBlock | ToolUseBlock)[]; stopReason: \u0026#34;end_turn\u0026#34; | \u0026#34;tool_use\u0026#34;; } // This is the only interface to the LLM. // Every other component in this post wraps or feeds this function. async function callModel(messages: Message[]): Promise\u0026lt;ModelResponse\u0026gt; { // In production: HTTP POST to https://api.anthropic.com/v1/messages // with model, max_tokens, tools, and the messages array. // For this post, treat it as an opaque async function. return await anthropicAPI.createMessage({ messages }); } 1. The Agentic Loop # Every agentic system needs a control loop. Claude Code\u0026rsquo;s control loop is deceptively simple — a single-threaded while loop:\nPlan → Act → Observe → Repeat More concretely:\nReason (Model) — The harness sends the current prompt plus the full conversation history to the Claude API. Claude evaluates the task and responds with either a text answer (task complete) or a structured tool_use request (e.g., \u0026ldquo;read this file,\u0026rdquo; \u0026ldquo;run this shell command\u0026rdquo;). Execute (Tool System) — The harness receives the tool_use block, parses it, validates permissions, and executes the corresponding local tool. Observe (Tool Result) — The output is wrapped in a tool_result message and appended to the conversation history. Repeat — The updated history is sent back to the model for the next iteration. The loop terminates when the model returns a plain text response with no further tool calls. 
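Concretely, a single Reason → Act → Observe iteration appends a pair of messages like this (an abbreviated illustration using the types above; the call ID and file contents are invented):
// One illustrative iteration: the model requests a file read, the harness
// executes it, and the observation comes back as a tool_result message.
const iteration: Message[] = [
  {
    role: \u0026#34;assistant\u0026#34;,
    content: [
      {
        type: \u0026#34;tool_use\u0026#34;,
        id: \u0026#34;toolu_01\u0026#34;, // made-up call ID
        name: \u0026#34;ReadFile\u0026#34;,
        input: { path: \u0026#34;/repo/src/index.ts\u0026#34; },
      },
    ],
  },
  {
    role: \u0026#34;tool_result\u0026#34;,
    content: JSON.stringify({
      tool_use_id: \u0026#34;toolu_01\u0026#34;,
      content: \u0026#34;export const VERSION = 1;\u0026#34;,
    }),
  },
  // The next callModel(messages) sees both entries and reasons over them.
];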
This is the same ReAct (Reason + Act) pattern found in most agentic frameworks, but Claude Code\u0026rsquo;s implementation is notable for what it doesn\u0026rsquo;t do: there is no explicit planning module, no chain-of-thought tree, no separate \u0026ldquo;critic\u0026rdquo; model. The loop trusts a single LLM to both plan and execute, and compensates with strong guardrails in the surrounding infrastructure.\nImplementation # class AgentLoop { private messages: Message[] = []; private toolRegistry: ToolRegistry; private permissionSystem: PermissionSystem; private hookRunner: HookRunner; private contextManager: ContextManager; constructor( toolRegistry: ToolRegistry, permissionSystem: PermissionSystem, hookRunner: HookRunner, contextManager: ContextManager, ) { this.toolRegistry = toolRegistry; this.permissionSystem = permissionSystem; this.hookRunner = hookRunner; this.contextManager = contextManager; } async run(userPrompt: string): Promise\u0026lt;string\u0026gt; { // Inject persistent context (CLAUDE.md, memory files) before anything else this.messages = this.contextManager.buildInitialContext(); this.messages.push({ role: \u0026#34;user\u0026#34;, content: userPrompt }); // The core loop: call model, execute tools, repeat while (true) { // Compact context if approaching the token limit this.messages = this.contextManager.compactIfNeeded(this.messages); // 1. REASON: call the model const response = await callModel(this.messages); // Append the assistant\u0026#39;s response to history this.messages.push({ role: \u0026#34;assistant\u0026#34;, content: response.content }); // 2. CHECK: did the model finish? (no tool calls) if (response.stopReason === \u0026#34;end_turn\u0026#34;) { const textBlocks = response.content.filter( (b): b is TextBlock =\u0026gt; b.type === \u0026#34;text\u0026#34;, ); return textBlocks.map((b) =\u0026gt; b.content).join(\u0026#34;\\n\u0026#34;); } // 3. ACT: execute each tool call const toolCalls = response.content.filter( (b): b is ToolUseBlock =\u0026gt; b.type === \u0026#34;tool_use\u0026#34;, ); for (const toolCall of toolCalls) { const result = await this.executeTool(toolCall); // 4. OBSERVE: feed the result back as a tool_result message this.messages.push({ role: \u0026#34;tool_result\u0026#34;, content: JSON.stringify({ tool_use_id: toolCall.id, content: result, }), }); } // Loop continues — the model sees the tool results on the next iteration } } private async executeTool(toolCall: ToolUseBlock): Promise\u0026lt;string\u0026gt; { const tool = this.toolRegistry.get(toolCall.name); // Schema validation (Zod) const parsed = tool.inputSchema.safeParse(toolCall.input); if (!parsed.success) { return `Error: Invalid input — ${parsed.error.message}`; } // Permission check const allowed = await this.permissionSystem.evaluate(toolCall, tool); if (!allowed) { return `Error: Permission denied for tool \u0026#34;${toolCall.name}\u0026#34;`; } // Pre-tool hooks const hookResult = await this.hookRunner.runPreToolUse(toolCall); if (hookResult.blocked) { return `Blocked by hook: ${hookResult.reason}`; } // Execute const output = await tool.execute(parsed.data); // Post-tool hooks await this.hookRunner.runPostToolUse(toolCall, output); return output; } } Why single-threaded? # A single-threaded loop keeps the execution model predictable. Each tool call is a synchronous, blocking operation from the loop\u0026rsquo;s perspective. There is no concurrent mutation of the conversation state, no race condition between tool executions. 
This dramatically simplifies debugging and makes the harness\u0026rsquo;s behavior reproducible — critical properties when the agent has write access to your filesystem and shell.\nThe trade-off is throughput: a single loop cannot parallelize independent tasks. Claude Code addresses this through sub-agents, which spawn isolated child loops for parallel work.\n2. The Tool Layer # The model never directly interacts with your filesystem, terminal, or network. Every action goes through the harness\u0026rsquo;s tool registry — a set of ~40 self-contained modules.\nTool anatomy # Each tool is implemented as a discrete module that defines three things:\nInput schema — Validated at runtime using Zod. The model must produce a JSON payload that conforms to the schema or the call is rejected before execution. Permission requirements — Declarative metadata specifying what safety gates the tool must pass (read-only, write, destructive, network, etc.). Execution logic — The actual implementation: file I/O, bash execution, git operations, search, etc. This design enforces a strict separation of concerns: the model is responsible for reasoning (what to do), and the harness is responsible for execution (how to do it safely). The model has no direct access to fs, child_process, or any system API — it can only express intent through structured tool calls.\nImplementation # import { z, ZodSchema } from \u0026#34;zod\u0026#34;; import * as fs from \u0026#34;fs/promises\u0026#34;; import { execSync } from \u0026#34;child_process\u0026#34;; // Tool permission categories type PermissionCategory = | \u0026#34;read\u0026#34; | \u0026#34;write\u0026#34; | \u0026#34;destructive\u0026#34; | \u0026#34;network\u0026#34; | \u0026#34;spawn\u0026#34;; // Every tool implements this interface interface Tool { name: string; description: string; permissionCategory: PermissionCategory; inputSchema: ZodSchema; execute(input: unknown): Promise\u0026lt;string\u0026gt;; } // The registry: a simple map of tool name → tool implementation class ToolRegistry { private tools = new Map\u0026lt;string, Tool\u0026gt;(); register(tool: Tool): void { this.tools.set(tool.name, tool); } get(name: string): Tool { const tool = this.tools.get(name); if (!tool) throw new Error(`Unknown tool: ${name}`); return tool; } // Returns tool definitions in the format the Anthropic API expects toAPIFormat(): object[] { return Array.from(this.tools.values()).map((tool) =\u0026gt; ({ name: tool.name, description: tool.description, input_schema: tool.inputSchema, })); } } // --- Example tool: ReadFile --- const ReadFileTool: Tool = { name: \u0026#34;ReadFile\u0026#34;, description: \u0026#34;Read the contents of a file at the given absolute path.\u0026#34;, permissionCategory: \u0026#34;read\u0026#34;, inputSchema: z.object({ path: z.string().describe(\u0026#34;Absolute path to the file\u0026#34;), startLine: z.number().optional().describe(\u0026#34;1-indexed start line\u0026#34;), endLine: z.number().optional().describe(\u0026#34;1-indexed end line\u0026#34;), }), async execute(input) { const { path, startLine, endLine } = input as { path: string; startLine?: number; endLine?: number; }; const content = await fs.readFile(path, \u0026#34;utf-8\u0026#34;); const lines = content.split(\u0026#34;\\n\u0026#34;); const start = (startLine ?? 1) - 1; const end = endLine ?? 
lines.length; return lines.slice(start, end).join(\u0026#34;\\n\u0026#34;); }, }; // --- Example tool: Bash --- const BashTool: Tool = { name: \u0026#34;Bash\u0026#34;, description: \u0026#34;Execute a shell command and return stdout/stderr.\u0026#34;, permissionCategory: \u0026#34;destructive\u0026#34;, // shell access is always high-risk inputSchema: z.object({ command: z.string().describe(\u0026#34;The shell command to execute\u0026#34;), timeout: z.number().optional().default(30000).describe(\u0026#34;Timeout in ms\u0026#34;), }), async execute(input) { const { command, timeout } = input as { command: string; timeout: number }; try { const stdout = execSync(command, { timeout, encoding: \u0026#34;utf-8\u0026#34;, maxBuffer: 1024 * 1024, // 1 MB cap }); return stdout; } catch (err: any) { return `Exit code ${err.status}\\nstderr: ${err.stderr}\\nstdout: ${err.stdout}`; } }, }; // --- Example tool: WriteFile (targeted edit, not full overwrite) --- const EditFileTool: Tool = { name: \u0026#34;EditFile\u0026#34;, description: \u0026#34;Replace a target string in a file with new content.\u0026#34;, permissionCategory: \u0026#34;write\u0026#34;, inputSchema: z.object({ path: z.string(), targetContent: z.string().describe(\u0026#34;Exact string to find and replace\u0026#34;), replacementContent: z.string().describe(\u0026#34;Content to replace it with\u0026#34;), }), async execute(input) { const { path, targetContent, replacementContent } = input as { path: string; targetContent: string; replacementContent: string; }; const content = await fs.readFile(path, \u0026#34;utf-8\u0026#34;); if (!content.includes(targetContent)) { return `Error: target content not found in ${path}`; } const updated = content.replace(targetContent, replacementContent); await fs.writeFile(path, updated); return `Successfully edited ${path}`; }, }; // --- Registering tools --- const registry = new ToolRegistry(); registry.register(ReadFileTool); registry.register(BashTool); registry.register(EditFileTool); Built-in tools # The leaked source revealed roughly 40 built-in tools. Some noteworthy categories:\nCategory Example tools Notes File I/O Read, Write, Edit, MultiEdit The Edit tool uses targeted string replacement, not full-file overwrites — a deliberate choice to minimize blast radius. Shell Bash Executes commands in a sandboxed shell. Output is captured and returned as tool_result. Search Grep, Find, CodebaseSearch Various search strategies for navigating large codebases. Git GitDiff, GitLog, GitStatus First-class git operations without requiring shell exec. Browser BrowserNavigate, BrowserClick For agents that need to interact with web UIs. Sub-agent SpawnAgent Launches a child agent loop with its own isolated context. Why Zod? # Schema validation at the tool boundary catches malformed requests before they reach the execution layer. If the model hallucinates a parameter name, passes the wrong type, or omits a required field, the Zod validator rejects it immediately and the error is fed back to the model — giving it a chance to self-correct in the next iteration. This is far cheaper and safer than executing an invalid command and dealing with the consequences.\n3. The Permission System # An agent with shell and filesystem access is a powerful thing and a dangerous one. 
Claude Code implements a deny-first, layered permission system to manage this risk.\nPermission modes # The harness supports multiple permission modes along a safety-autonomy gradient:\nMode Behavior Default Every potentially destructive action (file writes, shell commands) requires explicit user approval via an interactive prompt. This is the most restrictive mode. Plan The model can read and search freely, but must present a plan for approval before executing any mutations. Auto-accept Pre-approved tool categories (e.g., file reads, searches) execute without prompts; writes and shell commands still require approval. Auto Most actions execute without prompts. Only high-risk operations (e.g., rm -rf, network requests to unknown hosts) trigger safety gates. Implementation # import * as readline from \u0026#34;readline\u0026#34;; type PermissionMode = \u0026#34;default\u0026#34; | \u0026#34;plan\u0026#34; | \u0026#34;auto-accept\u0026#34; | \u0026#34;auto\u0026#34;; interface PermissionPolicy { denyPatterns: RegExp[]; // e.g. [/rm\\s+-rf/, /curl.*\\|.*sh/] allowedPaths: string[]; // e.g. [\u0026#34;/Users/dev/project\u0026#34;] } class PermissionSystem { private mode: PermissionMode; private policy: PermissionPolicy; constructor(mode: PermissionMode, policy: PermissionPolicy) { this.mode = mode; this.policy = policy; } async evaluate(toolCall: ToolUseBlock, tool: Tool): Promise\u0026lt;boolean\u0026gt; { // Gate 1: Policy deny-list (always checked, regardless of mode) if (this.isDeniedByPolicy(toolCall)) { console.log(`🚫 Policy denied: ${toolCall.name}`); return false; } // Gate 2: Mode-based evaluation switch (this.mode) { case \u0026#34;auto\u0026#34;: // Auto mode: allow everything that passes the policy return true; case \u0026#34;auto-accept\u0026#34;: // Auto-accept: reads are fine, writes need approval if (tool.permissionCategory === \u0026#34;read\u0026#34;) return true; return await this.promptUser(toolCall); case \u0026#34;plan\u0026#34;: // Plan mode: reads are fine, any mutation needs approval if (tool.permissionCategory === \u0026#34;read\u0026#34;) return true; return await this.promptUser(toolCall); case \u0026#34;default\u0026#34;: default: // Default: everything except reads needs approval if (tool.permissionCategory === \u0026#34;read\u0026#34;) return true; return await this.promptUser(toolCall); } } private isDeniedByPolicy(toolCall: ToolUseBlock): boolean { const inputStr = JSON.stringify(toolCall.input); // Check deny-list patterns (e.g., rm -rf, curl piped to sh) for (const pattern of this.policy.denyPatterns) { if (pattern.test(inputStr)) return true; } // Check path restrictions if (\u0026#34;path\u0026#34; in (toolCall.input as any)) { const path = (toolCall.input as any).path as string; const inAllowedPath = this.policy.allowedPaths.some((p) =\u0026gt; path.startsWith(p), ); if (!inAllowedPath) return true; } return false; } private async promptUser(toolCall: ToolUseBlock): Promise\u0026lt;boolean\u0026gt; { const rl = readline.createInterface({ input: process.stdin, output: process.stdout, }); return new Promise((resolve) =\u0026gt; { const preview = JSON.stringify(toolCall.input).slice(0, 200); rl.question( `\\n⚠️ Tool: ${toolCall.name}\\n Input: ${preview}\\n Allow? (y/n): `, (answer) =\u0026gt; { rl.close(); resolve(answer.toLowerCase() === \u0026#34;y\u0026#34;); }, ); }); } } How a tool call is evaluated # Every tool_use request passes through a permission classifier before reaching the execution layer:\nSchema validation — Is the request well-formed? 
(Zod layer) Mode check — Does the current permission mode allow this tool category? Policy evaluation — Does the tool call match any deny-list patterns? (e.g., certain shell commands, paths outside the workspace) Hook evaluation — Do any registered PreToolUse hooks block the call? User prompt — If all gates pass but the mode requires confirmation, the user is prompted. Only after all five gates pass does the tool execute.\nThe safety-autonomy trade-off # Research on real-world usage patterns shows that users tend to shift toward more autonomous modes as they habituate to the tool — a \u0026ldquo;safety-autonomy gradient.\u0026rdquo; The system defaults to conservative, human-in-the-loop approval precisely because of this tendency: the most dangerous moment is when a user trusts the agent just enough to stop reading the prompts.\n4. Hooks — Middleware for Agents # Hooks are Claude Code\u0026rsquo;s mechanism for deterministic, user-defined control at lifecycle boundaries. They are conceptually identical to middleware in a web framework: shell commands that intercept events, inspect payloads, and can block or modify execution.\nHook lifecycle events # Hook When it fires Common use PreToolUse Before a tool call is executed Block dangerous commands, enforce coding standards, log tool usage PostToolUse After a tool call completes Validate outputs, trigger follow-up actions, audit trails Notification When the agent produces a notification Route alerts to Slack, email, or other channels Stop When the agent signals task completion Run post-task validation, trigger CI/CD pipelines Implementation # import { execSync } from \u0026#34;child_process\u0026#34;; type HookEvent = \u0026#34;PreToolUse\u0026#34; | \u0026#34;PostToolUse\u0026#34; | \u0026#34;Notification\u0026#34; | \u0026#34;Stop\u0026#34;; interface HookDefinition { event: HookEvent; command: string; // shell command to execute } interface HookResult { blocked: boolean; reason?: string; } class HookRunner { private hooks: HookDefinition[] = []; register(hook: HookDefinition): void { this.hooks.push(hook); } async runPreToolUse(toolCall: ToolUseBlock): Promise\u0026lt;HookResult\u0026gt; { const relevantHooks = this.hooks.filter((h) =\u0026gt; h.event === \u0026#34;PreToolUse\u0026#34;); for (const hook of relevantHooks) { try { // Pass tool call info as environment variables execSync(hook.command, { encoding: \u0026#34;utf-8\u0026#34;, env: { ...process.env, TOOL_NAME: toolCall.name, TOOL_INPUT: JSON.stringify(toolCall.input), TOOL_ID: toolCall.id, }, timeout: 5000, }); // Exit code 0 → allowed } catch (err: any) { // Non-zero exit code → blocked return { blocked: true, reason: err.stdout?.trim() || `Hook blocked: ${hook.command}`, }; } } return { blocked: false }; } async runPostToolUse( toolCall: ToolUseBlock, toolOutput: string, ): Promise\u0026lt;void\u0026gt; { const relevantHooks = this.hooks.filter((h) =\u0026gt; h.event === \u0026#34;PostToolUse\u0026#34;); for (const hook of relevantHooks) { try { execSync(hook.command, { encoding: \u0026#34;utf-8\u0026#34;, env: { ...process.env, TOOL_NAME: toolCall.name, TOOL_INPUT: JSON.stringify(toolCall.input), TOOL_OUTPUT: toolOutput, }, timeout: 5000, }); } catch { // PostToolUse hooks are advisory — failures are logged, not fatal console.warn(`PostToolUse hook failed: ${hook.command}`); } } } } // --- Example: registering hooks --- const hookRunner = new HookRunner(); // Block modifications to lock files hookRunner.register({ event: \u0026#34;PreToolUse\u0026#34;, command: `bash -c 
\u0026#39; if echo \u0026#34;$TOOL_INPUT\u0026#34; | grep -q \u0026#34;package-lock.json\\\\|yarn.lock\u0026#34;; then echo \u0026#34;BLOCKED: Lock file modifications are not allowed.\u0026#34; exit 1 fi exit 0 \u0026#39;`, }); // Log every tool execution to a file hookRunner.register({ event: \u0026#34;PostToolUse\u0026#34;, command: `bash -c \u0026#39; echo \u0026#34;[$(date)] $TOOL_NAME: $TOOL_INPUT\u0026#34; \u0026gt;\u0026gt; /tmp/agent-audit.log \u0026#39;`, }); Why deterministic hooks matter # The key insight is that hooks run outside the LLM\u0026rsquo;s non-deterministic reasoning. A hook that blocks rm -rf / will always block it, regardless of what the model believes is appropriate. This provides a hard safety boundary that cannot be prompt-injected or reasoned around.\nBecause hooks are just shell scripts, they can integrate with any existing tooling: linters, security scanners, policy engines, notification systems.\n5. Context Management # The most complex subsystem in the harness is context management — the machinery that maintains the illusion of a continuous, aware assistant on top of a fundamentally stateless model.\nThe problem # Each API call to Claude is independent. The model has no memory between calls. The harness must:\nReconstruct the full conversational context on every call. Keep that context within the model\u0026rsquo;s token limit (the context window). Ensure critical information (project conventions, security constraints) is never lost. As sessions grow longer — accumulating file contents, tool outputs, back-and-forth dialogue — the context window fills up. Naive truncation loses critical information. Claude Code solves this with a multi-layer compaction pipeline.\nImplementation # import * as fs from \u0026#34;fs/promises\u0026#34;; import * as path from \u0026#34;path\u0026#34;; interface CompactionResult { messages: Message[]; compacted: boolean; } class ContextManager { private maxTokens: number; private projectRoot: string; constructor(projectRoot: string, maxTokens: number = 200_000) { this.projectRoot = projectRoot; this.maxTokens = maxTokens; } // Load persistent, compaction-proof context buildInitialContext(): Message[] { const messages: Message[] = []; // 1. System-level: CLAUDE.md is always first (never compacted) const claudeMd = this.loadClaudeMd(); if (claudeMd) { messages.push({ role: \u0026#34;system\u0026#34;, content: claudeMd }); } // 2. 
Memory files from ~/.claude/MEMORY.md const memory = this.loadMemoryFiles(); if (memory) { messages.push({ role: \u0026#34;system\u0026#34;, content: memory }); } return messages; } // The multi-layer compaction pipeline compactIfNeeded(messages: Message[]): Message[] { const usage = this.estimateTokenUsage(messages); const ratio = usage / this.maxTokens; // Stage 1: Snip compact at 80% — evict cold messages from the middle if (ratio \u0026gt; 0.8) { messages = this.snipCompact(messages); } // Stage 2: Microcompact at 85% — shrink content, preserve cache keys if (ratio \u0026gt; 0.85) { messages = this.microcompact(messages); } // Stage 3: Auto compact at 95% — LLM-based summarization if (ratio \u0026gt; 0.95) { messages = this.autoCompact(messages); } return messages; } // Stage 1: Remove old tool results from the middle of conversation private snipCompact(messages: Message[]): Message[] { const keep = 5; // keep first N and last N messages if (messages.length \u0026lt;= keep * 2) return messages; const head = messages.slice(0, keep); const tail = messages.slice(-keep); const middle = messages.slice(keep, -keep); // Only remove tool_result messages from the middle (they\u0026#39;re bulky) const filtered = middle.filter((m) =\u0026gt; m.role !== \u0026#34;tool_result\u0026#34;); return [...head, ...filtered, ...tail]; } // Stage 2: Truncate long tool outputs while keeping cache-friendly prefix private microcompact(messages: Message[]): Message[] { return messages.map((msg) =\u0026gt; { if (msg.role === \u0026#34;tool_result\u0026#34; \u0026amp;\u0026amp; typeof msg.content === \u0026#34;string\u0026#34;) { if (msg.content.length \u0026gt; 2000) { return { ...msg, content: msg.content.slice(0, 1000) + \u0026#34;\\n... [truncated] ...\\n\u0026#34; + msg.content.slice(-500), }; } } return msg; }); } // Stage 3: Summarize the conversation using the LLM itself private autoCompact(messages: Message[]): Message[] { const systemMessages = messages.filter((m) =\u0026gt; m.role === \u0026#34;system\u0026#34;); const conversationMessages = messages.filter((m) =\u0026gt; m.role !== \u0026#34;system\u0026#34;); // Ask the model to summarize the conversation so far // (In production this is a separate, cheaper model call) const summary = this.summarizeSync(conversationMessages); return [ ...systemMessages, { role: \u0026#34;assistant\u0026#34; as const, content: `[SystemCompactBoundaryMessage] Summary of previous work:\\n${summary}`, }, ]; } // Stage 4: Reactive compact — called when API returns prompt_too_long reactiveCompact(messages: Message[]): Message[] { // Emergency: aggressively summarize and retry const systemMessages = messages.filter((m) =\u0026gt; m.role === \u0026#34;system\u0026#34;); const summary = this.summarizeSync( messages.filter((m) =\u0026gt; m.role !== \u0026#34;system\u0026#34;), ); return [ ...systemMessages, { role: \u0026#34;assistant\u0026#34; as const, content: `[ReactiveCompact] ${summary}`, }, ]; } private loadClaudeMd(): string | null { try { const filePath = path.join(this.projectRoot, \u0026#34;CLAUDE.md\u0026#34;); // fs.readFileSync used here for simplicity in the initializer return require(\u0026#34;fs\u0026#34;).readFileSync(filePath, \u0026#34;utf-8\u0026#34;); } catch { return null; } } private loadMemoryFiles(): string | null { try { const memoryPath = path.join( process.env.HOME || \u0026#34;~\u0026#34;, \u0026#34;.claude\u0026#34;, \u0026#34;MEMORY.md\u0026#34;, ); return require(\u0026#34;fs\u0026#34;).readFileSync(memoryPath, \u0026#34;utf-8\u0026#34;); } 
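// A missing MEMORY.md is expected on fresh installs; fall back to null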
catch { return null; } } private estimateTokenUsage(messages: Message[]): number { // Rough estimate: 1 token ≈ 4 characters const totalChars = messages.reduce((sum, m) =\u0026gt; { const content = typeof m.content === \u0026#34;string\u0026#34; ? m.content : JSON.stringify(m.content); return sum + content.length; }, 0); return Math.ceil(totalChars / 4); } private summarizeSync(messages: Message[]): string { // In production, this calls the model with a summarization prompt. // Simplified here for illustration. const totalMessages = messages.length; const toolCalls = messages.filter( (m) =\u0026gt; typeof m.content !== \u0026#34;string\u0026#34;, ).length; return `Completed ${totalMessages} interaction steps including ${toolCalls} tool calls.`; } } The compaction pipeline # The pipeline consists of five stages, each more aggressive than the last:\nStage 1: Snip Compact # Removes older assistant and tool messages from the middle of the conversation that are deemed unlikely to be needed. Think of it as evicting cold cache lines — recent and very early messages are preserved, while the middle (often repetitive exploration) is trimmed.\nStage 2: Microcompact # Shrinks content while preserving Anthropic API prompt cache keys. This is a cost and latency optimization: by keeping the cache-friendly prefix of the conversation intact, the harness avoids re-processing tokens the API has already seen.\nStage 3: Auto Compact # When the context reaches ~95% of the window limit, the harness triggers an LLM-based summarization. The raw conversation history is replaced with a concise summary, marked by a SystemCompactBoundaryMessage. This is the most visible form of compaction — the agent \u0026ldquo;forgets\u0026rdquo; the raw details but retains a high-level understanding of what happened.\nStage 4: Reactive Compact # An emergency mechanism. If the API returns a prompt_too_long error, the harness compacts context mid-request and retries automatically. This ensures the agent never hard-fails due to context overflow.\nStage 5: Context Collapse # For very long tool chains, the harness collapses entire sequences of tool calls and results into compact representations that retain only key outcomes. A 20-step file exploration might collapse to: \u0026ldquo;Explored src/ directory; identified index.ts as entry point; found 3 test files.\u0026rdquo;\nPersistent context: CLAUDE.md # Because compaction can (and will) discard information, Claude Code uses CLAUDE.md as a persistent, compaction-proof instruction layer. This markdown file is placed in the project root and is automatically loaded into the context at the start of every session. It typically contains:\nProject conventions — Naming, styling, testing, and deployment guidelines. Architecture notes — Core files, libraries, and project structure. Workflow rules — Behaviors, constraints, and common commands. Compact instructions — Explicit guidance on what information must survive compaction. CLAUDE.md is treated as a system-level instruction — it is injected before the conversation history and is never subject to compaction. This is the primary mechanism for ensuring the model \u0026ldquo;remembers\u0026rdquo; project-critical information across long sessions.\nSession persistence # Beyond CLAUDE.md, the harness maintains session state in ~/.claude/:\nMEMORY.md — An index file pointing to topic-specific markdown files that are loaded automatically. 
Session history — Each session\u0026rsquo;s message log, tool usage, and results are persisted as JSONL files, enabling claude --resume \u0026lt;session-id\u0026gt; to pick up where you left off. ","date":"24 April 2026","externalUrl":null,"permalink":"/posts/dissecting-the-claude-code-harness-part-1-the-execution-engine/","section":"Posts","summary":"Introduction # Claude Code is Anthropic’s terminal-based AI coding agent. On the surface it looks like a CLI that “just talks to Claude,” but under the hood it is a stateful software layer that sits between a stateless language model and your local development environment. The model provides the reasoning; the harness provides the hands, eyes, and workspace.\n","title":"Dissecting the Claude Code Harness - Part 1: The Execution Engine","type":"posts"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/architecture/","section":"Tags","summary":"","title":"Architecture","type":"tags"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":" Introduction # As part of this post, we will be designing a real-time matching and dispatch system for a ride-hailing platform focusing on geospatial indexing, low-latency matching, streaming data, pricing and consistency trade-offs.\n1. Requirements # Functional Requirements # Rider:\nRequest a ride (pickup, dropoff, ride type) Get a quote with ETA and estimated price Confirm ride, see assigned driver and live location Driver:\nReceive ride offers, accept or reject. Stream location while en route and during trip. Go online/offline. System:\nSelect the best driver (multi-factor, not just nearest) Compute ETAs for pickup and drop Run dynamic pricing / surge per area. Persist trips, payments, ratings and audit logs. Non-Functional Requirements # Key tradeoffs:\nAvailability over strong consistency on read paths: Riders should almost always see some ETA and price, even if it\u0026rsquo;s slightly stale. Low latency: Matching decision \u0026lt; 200 - 300ms under typical load. Location update end-to-end (driver -\u0026gt; rider map) in \u0026lt; 500ms. Scalability: Handle \u0026gt;= 1M GPS updates per second globally. Scale horizontally across regions and data centers. Correctness: A driver should never be fully committed to two trips at once - strong consistency around assignment. Observability and cost efficiency: Streaming-first, cheap writes, heavy analytics off hot path. Note: We accept eventual consistency for ETAs and surge levels, but require strong consistency for ride assignment and payment settlement.\n2. High-Level Architecture # Core Components # We can think in terms of 3 planes:\nOnline request plane - matching and real-time user interactions. Streaming plane - GPS ingestion, event processing, feature computation. Batch/analytics plane - offline modeling and business insights. 
Mobile Apps -\u0026gt; API Gateway -\u0026gt; Connection Service (WebSockets)
Connection Service -\u0026gt; Ride Service -\u0026gt; Payment
Connection Service -\u0026gt; Location Service
Connection Service -\u0026gt; Matching Service -\u0026gt; Geospatial Index
GPS -\u0026gt; Ingestion -\u0026gt; Stream -\u0026gt; ETA / Routing -\u0026gt; Pricing and Surge
Stream -\u0026gt; Analytics Plane
APIs # POST /rides/quote\nInput: pickup_lat, pickup_lng, dropoff_lat, dropoff_lng, ride_type Output: estimated_price, estimated_eta POST /rides\nInput: pickup, drop, price, ride_type, client_token Output: ride_id, status (eg: requested) POST /rides/{ride_id}/cancel (for both rider and driver)\nPOST /rides/{ride_id}/complete (driver)\nPOST /driver/{driver_id}/status (whether driver is online or offline)\nReal Time Over Sockets # Rider Channel:\nride_update (matched, driver en route, started, completed). driver_location (stream) Driver Channel:\nride_offer (new ride request with countdown) ride_update (GPS updates, cancelled ride, ride started, ride completed) Geospatial Indexing: Quadtrees vs S2 vs H3 # We need to handle millions of GPS updates per second, and indexing raw lat/longitude in a traditional RDBMS does not scale. Instead, we will use a geospatial index that will:\nPartition the world into cells Map each (lat, lng) to a cell ID Maintain in-memory state per cell and small neighbourhoods. Quadtrees / Geohash # Idea: Recursively subdivide the world into quadrants; each level adds precision. Geohash encodes lat/lng into a string; prefix similarity means proximity. Pros: Simple to implement. String prefixes enable range queries. Cons: Cell shapes distort with latitude. Neighbourhood queries across cell boundaries are messier. Google S2 # S2 partitions the sphere into hierarchical cells using a space-filling curve (Hilbert) over faces of a cube projected onto Earth.\nPros: Native spherical geometry (good for global systems). Good locality properties via Hilbert curve. Widely used (Google Maps, BigQuery GIS, MongoDB etc). Cons: Higher implementation complexity. Uber H3 # H3 is Uber\u0026rsquo;s open-source hexagonal hierarchical spatial index.\nHexagonal grid over the globe, resolutions 0 - 15. Each cell has a 64-bit index; supports k-ring neighbours, polygons etc. Pros: Hexagons have more uniform neighbour distances than squares. Built-in functions make nearest-neighbour lookups, k-ring queries (neighbours within N rings), and aggregation across resolutions straightforward. Location Tracking: Handling Millions of GPS Pings # When a driver sends a GPS update every 1-5 seconds:\nIngress: Driver -\u0026gt; WebSocket -\u0026gt; Connection Service -\u0026gt; Location Service Matching / Tracking: Convert (lat, lng) -\u0026gt; cell ID (eg: H3 index) Update in-memory driver state: drivers_by_cell[cell_id] -\u0026gt; set of available drivers in that cell driver_state[driver_id] -\u0026gt; last location, status Redis cache: Set TTL for driver locations to automatically remove stale drivers. This ensures matching queries never hit the database and are served from memory.\nPartitioning and Scaling # To handle high write volume:\nPartition by region and cell ID: Each shard manages a subset of cells. Topic partition key: region, cell_id prefix Tunable update frequency: Adaptive throttling (slower updates when driver is stopped) Backpressure and drop policies: If downstream is overloaded, drop older updates for the same driver - only the latest matters per driver. 
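To make the per-ping hot path concrete, here is a minimal sketch using Uber\u0026rsquo;s H3 Java bindings (the v3-style geoToH3/kRing API named in this section). The resolution, map layout, and class names are illustrative assumptions; production state would live in Redis with TTLs rather than in process memory:
import com.uber.h3core.H3Core;
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LocationIndex {
  private static final int RESOLUTION = 9; // illustrative H3 resolution
  private final H3Core h3;
  // drivers_by_cell and driver_state from the text, as concurrent maps
  private final Map\u0026lt;Long, Set\u0026lt;String\u0026gt;\u0026gt; driversByCell = new ConcurrentHashMap\u0026lt;\u0026gt;();
  private final Map\u0026lt;String, DriverState\u0026gt; driverState = new ConcurrentHashMap\u0026lt;\u0026gt;();

  public LocationIndex() throws IOException {
    this.h3 = H3Core.newInstance();
  }

  public void onGpsPing(String driverId, double lat, double lng, long epochMillis) {
    long cell = h3.geoToH3(lat, lng, RESOLUTION);
    // Last-write-wins per driver: only the latest ping matters
    DriverState previous = driverState.put(driverId, new DriverState(cell, lat, lng, epochMillis));
    if (previous != null \u0026amp;\u0026amp; previous.cell() != cell) {
      Set\u0026lt;String\u0026gt; oldBucket = driversByCell.get(previous.cell());
      if (oldBucket != null) {
        oldBucket.remove(driverId); // driver moved to a new cell
      }
    }
    driversByCell.computeIfAbsent(cell, c -\u0026gt; ConcurrentHashMap.newKeySet()).add(driverId);
    // In production: also write the driver state to Redis with a short TTL so stale drivers expire
  }

  record DriverState(long cell, double lat, double lng, long epochMillis) {}
}
Candidate retrieval for matching then becomes a kRing lookup over these cell buckets, which is exactly the query path described in the matching section below.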
Note: We aim for a write-optimized, streaming-first architecture: We treat the GPS stream as the ground truth, update an in-memory index for sub-second queries, and persist raw events separately to a data lake for analytics and ML.\nMatching Logic: Finding the \u0026ldquo;Best\u0026rdquo; Driver # Best is rarely \u0026ldquo;closest\u0026rdquo; - it\u0026rsquo;s a weighted combination of:\nETA to pickup (road, traffic) Driver rating and cancellation history Driver\u0026rsquo;s current trip (direction) Vehicle constraints (type) Platform incentives (driver utilization etc) Step 1: Candidate Retrieval Given a pickup point:\nMap (lat,lng) -\u0026gt; cell_id Query current cell + k-ring neighbours (eg: 2-3 km). Collect available drivers from these cells, with a cap (eg: top 100 drivers by proximity). Using H3:\ngeoToH3(lat,lng,res) -\u0026gt; cell_id kRing(cell_id, k) -\u0026gt; set of neighbour cell ids Step 2: Scoring Function Define a match score: score = w1*eta + w2*rating + w3*vehicle_type + w4*driver_utilization + ...\nStep 3: Offer and Reservation To avoid double-assignment, we use short-lived reservations:\nMatching Service chooses top candidate driver. It calls a Driver Allocation Store (Redis) with compare-and-set (CAS): Set driver_status = \u0026quot;reserved\u0026quot; if current status is \u0026quot;available\u0026quot;. If CAS succeeds: Create RideOffer Send offer via WebSocket to driver. Driver accepts or rejects via WebSocket -\u0026gt; ConnectionService -\u0026gt; Matching / Ride Service If accepted, Ride Service updates trip status to \u0026quot;assigned\u0026quot; and notifies both rider and driver. On timeout / rejection, Matching Service removes reservation, makes driver available and tries next candidate. Note: Two riders may request the same driver concurrently, but only one reservation can win: the CAS/locks enforce exclusive assignment.\nWebSockets for Real-Time Signalling # Both rider and driver maintain persistent WebSocket connections to Connection Service. The Connection Service knows which instance holds the WebSocket for each user. Benefits:\nFull-duplex, low latency updates vs polling. Separates stateful connection management from stateless matching logic. Data Consistency: Preventing Double Booking # Per-driver linearizability: A single \u0026ldquo;home\u0026rdquo; shard / key for each driver. All state transitions go through that shard. Idempotent APIs: Client includes request_id on mutation calls. Server stores (request_id -\u0026gt; result) with some TTL. Ride state machine with explicit transitions: Requested -\u0026gt; OfferPending -\u0026gt; Matched -\u0026gt; Ongoing -\u0026gt; Completed/Cancelled. Every transition is a compare-and-set on current state. ETA Engine # A basic ETA engine has 2 layers:\nRouting layer: Represents the road network as a graph Nodes: intersections / waypoints. Edges: road segments with base travel times and constraints Given source and destination:\nUse A* or Dijkstra\u0026rsquo;s algorithm to find the shortest path Traffic layer: Integrate with real-time traffic data to adjust edge weights Pricing and Surge # We can come up with a base fare model: base_price = (per_km * distance) + (per_min * time) + (surge_multiplier * base_fare)\nMeasuring Supply and Demand For each cell:\nSupply: number of available drivers in / around that cell. Demand: recent ride requests / search queries originating from that cell. 
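Given those per-cell counts, a minimal surge computation might look like the sketch below. The ratio-based formula and the cap are illustrative assumptions, not a real pricing model:
public final class SurgeCalculator {
  private static final double MAX_SURGE = 3.0; // illustrative cap

  // demand = recent requests in the cell; supply = available drivers in / around it
  public double surgeMultiplier(int demand, int supply) {
    if (demand == 0) {
      return 1.0; // no pressure, no surge
    }
    // Guard against empty cells; a ratio above 1 means demand outstrips supply
    double ratio = (double) demand / Math.max(1, supply);
    return Math.min(Math.max(1.0, ratio), MAX_SURGE);
  }
}
The multiplier is recomputed periodically per cell from the streaming counts and cached, which is what the \u0026ldquo;fetch surge multiplier\u0026rdquo; step below reads.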
Computing Price:\nMap pick-up and drop to cells Estimate ETA and trip duration using ETA engine Compute base fare components Fetch surge multiplier for pickup cell Compute final price and show to user with time-limited validity (eg: 2 minutes). 3. Potential Bottlenecks and Mitigations # Hotspots (eg: New Year\u0026rsquo;s eve): A sudden spike in demand in a few cells can cause a high queue backlog in the Matching Service and Connection Service saturation, i.e. too many WebSockets in a single region. Mitigations can include:\nLimit search radius or candidate count for driver matching during overload Degrade gracefully, i.e. approximate ETAs, cached routes, cheaper heuristics Admission Control: if Matching Service queue length exceeds limit, surface an error - \u0026ldquo;High demand in your area, try again later\u0026rdquo;. Connection Service Scaling: Use consistent hashing of ride_ids across nodes. During high demand, auto scale and rebalance. 4. Global Scaling # Multi-AZ (Availability Zone) deployment with automatic failover to prevent downtime. In this setup, you have a Primary instance in one AZ and a Standby instance in a second AZ. If the Primary goes dark, the system automatically fails over to the Standby so your application stays online. Per-region H3/S2 index: Global coordination only at higher levels (e.g., cross-border trips). Caching layers: Surge multipliers, popular routes, static map data. Backpressure signals from downstream services: ETA / routing service indicates overload → Matching Service switches to heuristic ETAs. Replayability: All requests and location streams written to Kafka enable reprocessing with improved algorithms and debugging incidents post hoc. ","date":"18 April 2026","externalUrl":null,"permalink":"/posts/designing-a-ride-hailing-system/","section":"Posts","summary":"Introduction # As part of this post, we will be designing a real-time matching and dispatch system for a ride-hailing platform focusing on geospatial indexing, low-latency matching, streaming data, pricing and consistency trade-offs.\n","title":"Designing a Ride-Hailing System","type":"posts"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/geospatial/","section":"Tags","summary":"","title":"Geospatial","type":"tags"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/categories/real-time/","section":"Categories","summary":"","title":"Real-Time","type":"categories"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/categories/system-design/","section":"Categories","summary":"","title":"System Design","type":"categories"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/websockets/","section":"Tags","summary":"","title":"WebSockets","type":"tags"},{"content":"","date":"11 April 2026","externalUrl":null,"permalink":"/tags/e-commerce/","section":"Tags","summary":"","title":"E-Commerce","type":"tags"},{"content":"","date":"11 April 2026","externalUrl":null,"permalink":"/categories/engineering/","section":"Categories","summary":"","title":"Engineering","type":"categories"},{"content":"","date":"11 April 2026","externalUrl":null,"permalink":"/tags/order-management/","section":"Tags","summary":"","title":"Order Management","type":"tags"},{"content":"As part of this post, we\u0026rsquo;ll be covering the design of a modern, production-grade Order Management System (OMS) with a focus on multi-fulfillment, cancellations, refunds, inventory synchronization, and multi-region deployment.\nLet\u0026rsquo;s first start with the 
requirements.\nRequirements # Functional Requirements # Core order lifecycle: Create order with multiple line items, shipping options, and payment methods. Order state machine: Support states such as PENDING → CONFIRMED → PARTIALLY_FULFILLED → FULFILLED → CANCELLED → REFUNDED. Split shipments: Support split shipments and partial fulfillment when items originate from multiple locations or arrive at different times. Cancellations: Allow customer and system-initiated cancellations in various states (pre-fulfillment, mid-fulfillment) with clear rules. Refunds: Support refunds (full and partial), including multi-payment or mixed-method scenarios (card, wallet, store credit). Multi-fulfillment: Route each line item to an optimal fulfillment node (warehouse, store, 3PL, marketplace drop-shipper). Multiple shipments: Track multiple shipments per order with independent tracking IDs and statuses. Backorders and preorders: Support delayed fulfillment while the order remains active. Inventory and payments: Reserve inventory atomically as part of the order creation saga; release on failure or cancellation. Inventory sync: Prevent overselling across channels with near real-time inventory sync and event-driven updates. Payment gateways: Integrate with one or more payment gateways for authorization, capture, and refund. Multi-channel and integrations: Receive orders from internal checkout, marketplaces, and POS; normalize into a canonical order model. Fulfillment updates: Push fulfillment updates and cancellations back to channels and customer notification systems. Multi-region deployment: Deploy OMS in multiple regions, each with a full stack of services fronted by a global load balancer. Data synchronization: Keep critical data (orders, payments, inventory) synchronized across regions using a mix of strong and eventual consistency depending on domain constraints. Non-Functional Requirements # High Availability and resilience: One failure in a downstream flow should not take down the entire order flow. Scalability: Capable of handling peak events such as flash sales and promotions. Consistency: Clear consistency model for orders and inventory (strong vs eventual consistency). Observability: Comprehensive logging, monitoring, and tracing. Extensibility: Easy to add new fulfillment types, payment methods, or regions without major rewrites. High Level Design # Order Lifecycle and Domain Model # Order Lifecycle Stages # A typical e-commerce order lifecycle contains the following high-level stages:\nOrder captured: Request received from channel with cart items, prices, and customer data. Order validated: Items, pricing, and addresses verified; taxes and shipping calculated. Payment authorization: Payment instrument authorized for total amount. Inventory reservation: Stock reserved or allocated at chosen location(s). Fulfillment: Warehouse/store picks, packs, and ships or hands over for pickup. Shipment and delivery: Carrier tracking pushed; order marked shipped/delivered. Post-order events: Cancellations, returns, exchanges, refunds, inventory adjustments. Core Entities # Key domain entities include:\nOrder: Immutable identity, with overall status (CREATED, CONFIRMED, FULFILLING, SHIPPED, COMPLETED, CANCELLED, RETURNED). OrderItem: Per-SKU line with quantity, price, and fulfillment status. Payment: Records authorization, capture, refund events with idempotent transaction keys. InventoryItem / StockLevel: Per SKU, location, availability, and reservations. 
FulfillmentRequest: A unit of work sent to a fulfillment node (warehouse, store, 3PL). ReturnRequest: Tracks customer-initiated returns, RMA, and refund disposition. Core Services # Saga Orchestration # In a microservices-based Order Management System (OMS), you can\u0026rsquo;t easily use a single \u0026ldquo;giant\u0026rdquo; database transaction to ensure everything succeeds or fails together. If the Payment service is down but the Inventory service already deducted stock, you have a data consistency nightmare.\nThe Saga Pattern solves this by breaking a large, distributed transaction into a sequence of smaller, local transactions.\nHow a Saga Works Instead of one big lock on the data, each service performs its own local transaction and publishes an event or message. This triggers the next service in the chain. If any step fails, the Saga executes compensating transactions—essentially \u0026ldquo;undo\u0026rdquo; operations—to revert the changes made by previous steps.\nThere are two primary ways to coordinate these steps:\nEvent-Based (Choreography) There is no central \u0026ldquo;boss.\u0026rdquo; Each service listens for events and decides what to do next.\nPros: Simple to start; low coupling. Cons: Hard to track the \u0026ldquo;state\u0026rdquo; of an order as the number of services grows. It can become a \u0026ldquo;spaghetti\u0026rdquo; of events. Orchestration (Centralized) A central \u0026ldquo;Orchestrator\u0026rdquo; (the Saga Manager) tells each service what to do and when.\nPros: Easier to debug and monitor; the logic for the entire business process is in one place. Cons: Risk of the orchestrator becoming a \u0026ldquo;fat\u0026rdquo; service that knows too much about everyone else\u0026rsquo;s business. Let\u0026rsquo;s start with an event-based design for Order, Payment, and Inventory services:\npublic class OrderService { private final EventBus eventBus; private final OrderRepository orderRepository; public OrderService(EventBus eventBus, OrderRepository orderRepository) { this.eventBus = eventBus; this.orderRepository = orderRepository; } public Order createOrder(CreateOrderCommand cmd) { Order order = Order.pending(cmd); orderRepository.save(order); eventBus.publish(new OrderCreatedEvent(order)); return order; } @EventListener public void on(PaymentCompletedEvent event) { Order order = orderRepository.findById(event.orderId()); order.markPaymentCompleted(event.paymentId()); orderRepository.save(order); } @EventListener public void on(InventoryReservedEvent event) { Order order = orderRepository.findById(event.orderId()); order.confirm(); orderRepository.save(order); eventBus.publish(new OrderConfirmedEvent(order.getId())); } @EventListener public void on(PaymentFailedEvent event) { Order order = orderRepository.findById(event.orderId()); order.cancel(\u0026#34;PAYMENT_FAILED\u0026#34;); orderRepository.save(order); eventBus.publish(new OrderCancelledEvent(order.getId(), \u0026#34;PAYMENT_FAILED\u0026#34;)); } @EventListener public void on(InventoryFailedEvent event) { Order order = orderRepository.findById(event.orderId()); order.startCompensation(\u0026#34;INVENTORY_FAILED\u0026#34;); orderRepository.save(order); eventBus.publish(new CompensatePaymentCommand(order.getId(), event.reason())); } } Multi-Fulfillment # public class FulfillmentGroup { private Long id; private Long orderId; private String fulfillmentNodeId; // warehouse, store, 3PL private FulfillmentStatus status; private ShippingMethod shippingMethod; private String trackingNumber; private 
List\u0026lt;FulfillmentLine\u0026gt; lines; } public class FulfillmentLine { private Long id; private Long fulfillmentGroupId; private Long orderItemId; private int quantity; } Inventory Reservation # @Transactional public ReservationResult reserveItems(String orderId, List\u0026lt;ReservationRequest\u0026gt; requests) { List\u0026lt;InventoryReservation\u0026gt; reservations = new ArrayList\u0026lt;\u0026gt;(); for (ReservationRequest req : requests) { InventoryRow row = inventoryRepository.lockForUpdate(req.getSku(), req.getLocationId()); int available = row.getOnHand() - row.getReserved(); if (available \u0026lt; req.getQuantity()) { throw new InsufficientInventoryException(req.getSku(), req.getLocationId()); } row.setReserved(row.getReserved() + req.getQuantity()); inventoryRepository.save(row); reservations.add(new InventoryReservation(orderId, req.getSku(), req.getLocationId(), req.getQuantity())); } reservationRepository.saveAll(reservations); eventBus.publish(new InventoryReservedEvent(orderId, reservations)); return new ReservationResult(reservations); } Cancellations and Refunds # public void cancelOrder(String orderId, CancelReason reason) { Order order = orderRepository.findById(orderId); order.cancel(reason); orderRepository.save(order); eventBus.publish(new OrderCancelledEvent(orderId, reason)); } @EventListener public void on(OrderCancelledEvent event) { // In Inventory Service reservationRepository.findByOrderId(event.orderId()).forEach(res -\u0026gt; { InventoryRow row = inventoryRepository.lockForUpdate(res.getSku(), res.getLocationId()); row.setReserved(row.getReserved() - res.getQuantity()); inventoryRepository.save(row); }); // Publish release event to other consumers if needed eventBus.publish(new InventoryReleasedEvent(event.orderId())); } Multi-Region Strategy # Each region has its own Order, Payment, Inventory, and Fulfillment services plus local databases. Orders are sticky to a \u0026ldquo;home\u0026rdquo; region determined by user profile or channel. Events that need to be globally visible (e.g., inventory changes, loyalty updates) are replicated to other regions asynchronously using topics or cross-region database replication. Global reporting and reconciliation use eventually consistent data. ","date":"11 April 2026","externalUrl":null,"permalink":"/posts/order-management-system/","section":"Posts","summary":"As part of this post, we’ll be covering the design of a modern, production-grade Order Management System (OMS) with a focus on multi-fulfillment, cancellations, refunds, inventory synchronization, and multi-region deployment.\nLet’s first start with the requirements.\nRequirements # Functional Requirements # Core order lifecycle: Create order with multiple line items, shipping options, and payment methods. Order state machine: Support states such as PENDING → CONFIRMED → PARTIALLY_FULFILLED → FULFILLED → CANCELLED → REFUNDED. Split shipments: Support split shipments and partial fulfillment when items originate from multiple locations or arrive at different times. Cancellations: Allow customer and system-initiated cancellations in various states (pre-fulfillment, mid-fulfillment) with clear rules. Refunds: Support refunds (full and partial), including multi-payment or mixed-method scenarios (card, wallet, store credit). Multi-fulfillment: Route each line item to an optimal fulfillment node (warehouse, store, 3PL, marketplace drop-shipper). Multiple shipments: Track multiple shipments per order with independent tracking IDs and statuses. 
Backorders and preorders: Support delayed fulfillment while the order remains active. Inventory and payments: Reserve inventory atomically as part of the order creation saga; release on failure or cancellation. Inventory sync: Prevent overselling across channels with near real-time inventory sync and event-driven updates. Payment gateways: Integrate with one or more payment gateways for authorization, capture, and refund. Multi-channel and integrations: Receive orders from internal checkout, marketplaces, and POS; normalize into a canonical order model. Fulfillment updates: Push fulfillment updates and cancellations back to channels and customer notification systems. Multi-region deployment: Deploy OMS in multiple regions, each with a full stack of services fronted by a global load balancer. Data synchronization: Keep critical data (orders, payments, inventory) synchronized across regions using a mix of strong and eventual consistency depending on domain constraints. Non-Functional Requirements # High Availability and resilience: One failure in a downstream flow should not take down the entire order flow. Scalability: Capable of handling peak events such as flash sales and promotions. Consistency: Clear consistency model for orders and inventory (strong vs eventual consistency). Observability: Comprehensive logging, monitoring, and tracing. Extensibility: Easy to add new fulfillment types, payment methods, or regions without major rewrites. High Level Design # Order Lifecycle and Domain Model # Order Lifecycle Stages # A typical e-commerce order lifecycle contains the following high-level stages:\n","title":"Order Management System","type":"posts"},{"content":"","date":"9 April 2026","externalUrl":null,"permalink":"/tags/mobile-wallet/","section":"Tags","summary":"","title":"Mobile Wallet","type":"tags"},{"content":"As part of this post, we\u0026rsquo;ll be covering the design of a mobile wallet payment system that supports -\nTop-ups (add money to wallet from bank/card) P2P transfers (wallet -\u0026gt; wallet) Basic fraud detection Concurrency with clear trade-offs between strong and eventual consistency at scale. Let\u0026rsquo;s start with a basic design and then we can scale it up.\n1. Single node with relational DB # CREATE TABLE wallet ( id BIGINT PRIMARY KEY, owner_id BIGINT NOT NULL, balance_cents BIGINT NOT NULL, version BIGINT NOT NULL DEFAULT 0 ); CREATE TABLE wallet_transaction ( id BIGSERIAL PRIMARY KEY, from_wallet_id BIGINT, to_wallet_id BIGINT, amount_cents BIGINT NOT NULL, payment_status VARCHAR(32) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now() ); We use a single DB transaction per operation to ensure atomicity and acquire row locks.\nBut there are still lots of issues with this basic design:\nBalance is a mutable column: bugs can overwrite it. No strong audit guarantees: we can\u0026rsquo;t easily replay transactions or recover from failures. This sets the stage for a ledger-based design\n2. Ledger-Based Design # Modern wallets usually move from \u0026ldquo;balance column\u0026rdquo; to a ledger-based design with double-entry style accounting.\nLedger Schema # Instead of directly mutating balance, we only append immutable ledger entries. 
Balance is then calculated by summing up the entries for a wallet.\nCREATE TABLE wallet ( id BIGINT PRIMARY KEY, owner_id BIGINT NOT NULL ); CREATE TABLE ledger_entry ( id BIGSERIAL PRIMARY KEY, wallet_id BIGINT NOT NULL, amount_cents BIGINT NOT NULL, -- positive for credit, negative for debit transaction_id BIGINT NOT NULL, entry_type VARCHAR(32) NOT NULL, -- e.g. TRANSFER_DEBIT, TRANSFER_CREDIT, TOP_UP created_at TIMESTAMP NOT NULL DEFAULT now() ); CREATE TABLE transaction ( id BIGSERIAL PRIMARY KEY, external_id VARCHAR(64), -- for idempotency or PSP reference type VARCHAR(32) NOT NULL, status VARCHAR(32) NOT NULL, from_wallet_id BIGINT, to_wallet_id BIGINT, amount_cents BIGINT NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now(), UNIQUE (external_id) ); @Transactional public TransactionEntity topUp(Long walletId, long amountCents, String externalId) { TransactionEntity tx = transactionRepository .findByExternalId(externalId) .orElseGet(() -\u0026gt; { TransactionEntity t = new TransactionEntity(); t.setExternalId(externalId); t.setType(\u0026#34;TOP_UP\u0026#34;); t.setFromWalletId(null); t.setToWalletId(walletId); t.setAmountCents(amountCents); t.setStatus(\u0026#34;PENDING\u0026#34;); return transactionRepository.save(t); }); if (\u0026#34;SUCCESS\u0026#34;.equals(tx.getStatus())) { return tx; // idempotent replay } LedgerEntryEntity ledger = new LedgerEntryEntity(); ledger.setWalletId(walletId); ledger.setAmountCents(amountCents); ledger.setTransactionId(tx.getId()); ledger.setEntryType(\u0026#34;TOP_UP_CREDIT\u0026#34;); ledgerRepository.save(ledger); tx.setStatus(\u0026#34;SUCCESS\u0026#34;); return transactionRepository.save(tx); } Balance Query # public long getBalance(Long walletId) { Long sum = ledgerRepository.sumAmountByWalletId(walletId); return sum != null ? sum : 0L; } There are still issues with this design:\nPerformance: naive summing is expensive; we need caching or a materialized balance. Concurrency control: concurrent writes to the same wallet can still race and need explicit coordination. 3. Ledger + Materialized Balance # CREATE TABLE ledger_entry ( id BIGSERIAL PRIMARY KEY, wallet_id BIGINT NOT NULL, amount_cents BIGINT NOT NULL, -- +10000 for credit, -5000 for debit tx_id BIGINT NOT NULL, entry_type VARCHAR(32) NOT NULL, -- TOPUP_CREDIT, TRANSFER_DEBIT created_at TIMESTAMP DEFAULT now() ); -- Index for fast wallet scans CREATE INDEX idx_ledger_wallet_created ON ledger_entry(wallet_id, created_at); Materialized Balance Table # For fast reads, maintain a projection updated inside the same transaction as ledger writes.\nCREATE TABLE wallet_balance ( wallet_id BIGINT PRIMARY KEY, available_balance_cents BIGINT NOT NULL DEFAULT 0, locked_balance_cents BIGINT NOT NULL DEFAULT 0, updated_at TIMESTAMP DEFAULT now(), version BIGINT DEFAULT 0 -- for optimistic locking ); Double-Entry Extension (for advanced audits) # For full accounting compliance, extend the ledger to double‑entry:\n-- Every txn has debit + credit entries across accounts (user wallet ↔ treasury) ALTER TABLE ledger_entry ADD COLUMN account_id BIGINT; ALTER TABLE ledger_entry ADD COLUMN direction VARCHAR(10); -- DEBIT/CREDIT To catch drift, run a nightly job that recomputes balances and compares against the materialized table. Alert if there is a mismatch.\n4. Concurrency Control for Balances # What are the challenges when dealing with concurrency control?\nDouble-spend / lost updates: when multiple operations hit the same wallet at the same time (e.g., two transfers spending the same money). Duplicate transactions: from retries, flaky networks, payment gateway timeouts. 
High contention: on \u0026ldquo;hot\u0026rdquo; wallets (e.g., popular merchants, exchanges). Race conditions: across services (e.g., wallet service vs. fraud service vs. notification service). Let\u0026rsquo;s try to solve these one by one.\nA. Preventing Double-Spend per Wallet # We should never allow two balance-modifying operations on the same wallet to run concurrently, even under high load.\nFor this, we can use a per-wallet lock so operations on different wallets can proceed concurrently. We can either implement custom striped locks or simply use Guava\u0026rsquo;s striped locks.\nimport java.util.concurrent.ConcurrentHashMap; import java.util.concurrent.locks.Lock; import java.util.concurrent.locks.ReentrantLock; public class BoundedStripedLockManager { private final ConcurrentHashMap\u0026lt;Integer, Lock\u0026gt; stripes = new ConcurrentHashMap\u0026lt;\u0026gt;(); private final int stripesCount; public BoundedStripedLockManager(int stripesCount) { this.stripesCount = stripesCount; // e.g. 1024 } // many keys map to a smaller set of locks private int stripeKey(long walletId) { int h = Long.hashCode(walletId); h ^= (h \u0026gt;\u0026gt;\u0026gt; 16); int idx = h % stripesCount; return idx \u0026lt; 0 ? idx + stripesCount : idx; } public Lock lockForWallet(long walletId) { int key = stripeKey(walletId); return stripes.computeIfAbsent(key, k -\u0026gt; new ReentrantLock()); } } Or simply with Guava:\nimport com.google.common.util.concurrent.Striped; import java.util.concurrent.locks.Lock; public class WalletLockManager { // 1024 stripes =\u0026gt; up to 1024 locks spread across wallet IDs private final Striped\u0026lt;Lock\u0026gt; walletLocks = Striped.lock(1024); public Lock lockForWallet(long walletId) { return walletLocks.get(walletId); } } public class WalletService { private final WalletLockManager lockManager; public WalletService(WalletLockManager lockManager) { this.lockManager = lockManager; } public void transfer(long fromWalletId, long toWalletId, long amountCents) { // Enforce deterministic lock order to avoid deadlocks. // Caveat: with striped locks, two different wallet pairs can map to the same // stripes in opposite order; Guava\u0026#39;s Striped.bulkGet returns locks in a // consistent order if you need a stronger guarantee. long firstId = Math.min(fromWalletId, toWalletId); long secondId = Math.max(fromWalletId, toWalletId); Lock firstLock = lockManager.lockForWallet(firstId); Lock secondLock = lockManager.lockForWallet(secondId); firstLock.lock(); try { secondLock.lock(); try { doTransfer(fromWalletId, toWalletId, amountCents); } finally { secondLock.unlock(); } } finally { firstLock.unlock(); } } private void doTransfer(long fromWalletId, long toWalletId, long amountCents) { // DB transaction: check balances, write ledger entries, update projections, etc. } } You map many wallets to a limited number of locks (striping) so you don\u0026rsquo;t have to maintain one lock object per wallet ID, which would otherwise mean a huge number of tiny lock objects in a big wallet system.\nB. DB Level Concurrency: Row Locks and Optimistic Retries # Even with in-memory locks, multiple instances could race on the same wallet database row. You need DB-level concurrency control as well.\na. Pessimistic Locking # public interface WalletBalanceRepository extends JpaRepository\u0026lt;WalletBalanceEntity, Long\u0026gt; { @Lock(LockModeType.PESSIMISTIC_WRITE) @Query(\u0026#34;select b from WalletBalanceEntity b where b.walletId = :walletId\u0026#34;) WalletBalanceEntity findByWalletIdForUpdate(@Param(\u0026#34;walletId\u0026#34;) long walletId); } b. Optimistic Locking with Retries # Under high contention, pessimistic locking can cause long waits and deadlocks. 
Optimistic locking avoids this by only checking for conflicts at commit time. Attach a version field to wallet_balance. On update, JPA will throw OptimisticLockException if the version has changed since load. Retry the operation with fresh state.\n@Entity @Table(name = \u0026#34;wallet_balance\u0026#34;) public class WalletBalanceEntity { @Id private Long walletId; private Long availableBalanceCents; @Version private Long version; // getters/setters } @Service public class OptimisticWalletService { private static final int MAX_RETRIES = 5; @Autowired private WalletBalanceRepository balanceRepository; @Autowired private LedgerRepository ledgerRepository; public void transfer(long fromWalletId, long toWalletId, long amountCents, long txId) { for (int attempt = 1; attempt \u0026lt;= MAX_RETRIES; attempt++) { try { doTransferOnce(fromWalletId, toWalletId, amountCents, txId); return; } catch (ObjectOptimisticLockingFailureException ex) { if (attempt == MAX_RETRIES) { throw new ConcurrentModificationException(\u0026#34;Too much contention, please retry later\u0026#34;, ex); } // backoff could be added here } } } @Transactional protected void doTransferOnce(long fromWalletId, long toWalletId, long amountCents, long txId) { WalletBalanceEntity from = balanceRepository.findById(fromWalletId).orElseThrow(); WalletBalanceEntity to = balanceRepository.findById(toWalletId).orElseThrow(); if (from.getAvailableBalanceCents() \u0026lt; amountCents) { throw new IllegalStateException(\u0026#34;Insufficient funds\u0026#34;); } from.setAvailableBalanceCents(from.getAvailableBalanceCents() - amountCents); to.setAvailableBalanceCents(to.getAvailableBalanceCents() + amountCents); balanceRepository.save(from); balanceRepository.save(to); ledgerRepository.insertTransferEntries(fromWalletId, toWalletId, amountCents, txId); } } C. Idempotency # The request can contain an idempotency key generated at the client side through a random ID generator, and then its uniqueness can be enforced at the DB level.\nALTER TABLE transaction ADD COLUMN idempotency_key VARCHAR(64), ADD CONSTRAINT uk_tx_idempotency UNIQUE (idempotency_key, from_wallet_id); D. Available vs Locked Balance Under Concurrency # Available balance: money that can be spent. Locked balance: reserved for in-flight operations. E. Concurrency in Caches and Projections # You\u0026rsquo;ll likely cache wallet balances (Redis, in-memory) and maintain projections for history and analytics. Challenges include:\n2 threads updating the cache concurrently and overriding each other. Cache becoming inconsistent with DB under failure or retries. Therefore, we should:\nTreat the ledger and balance table as the only source of truth; caches are ephemeral. Use atomic operations in the cache (e.g., Redis INCR guarded by Lua scripts) if you adjust balances in the cache. F. Fraud Detection and Risk # Before committing a transaction, apply fast rules. For example:\nDaily transaction amount limits per user tier (KYC‑based). Velocity checks: number of transactions in last N minutes. Device/IP velocity: too many attempts from same device. Risky patterns: new device + large amount. 5. 
End to End Flow # ","date":"9 April 2026","externalUrl":null,"permalink":"/posts/mobile-wallet-payment-system/","section":"Posts","summary":"As part of this post, we’ll be covering the design of a mobile wallet payment system that supports -\nTop-ups (add money to wallet from bank/card) P2P transfers (wallet -\u003e wallet) Basic fraud detection Concurrency with clear trade-offs between strong and eventual consistency at scale. Let’s start with a basic design and then we can scale it up.\n","title":"Mobile Wallet Payment System","type":"posts"},{"content":"","date":"9 April 2026","externalUrl":null,"permalink":"/tags/payment-system/","section":"Tags","summary":"","title":"Payment System","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/categories/concurrency/","section":"Categories","summary":"","title":"Concurrency","type":"categories"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/java/","section":"Tags","summary":"","title":"Java","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/java-21/","section":"Tags","summary":"","title":"Java 21","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/series/mastering-concurrency-in-java/","section":"Series","summary":"","title":"Mastering Concurrency in Java","type":"series"},{"content":"In Part 1, Part 2, and Part 3, we covered hazards, primitives, and execution models. In this final part, we will focus on structured concurrency, fan-out/fan-in, fail-fast cancellations, timeout propagation, resource scoping, and observability with thread dumps and Java Flight Recorder (JFR), emphasizing when to choose which pattern.\n1. Structured Concurrency # Structured concurrency treats a group of related tasks as a single unit whose lifetime is bounded by a lexical scope, rather than a set of detached threads that outlive the caller.\nIn Java 21, this is embodied in StructuredTaskScope, which lets a parent thread fork subtasks, wait for them as a group, and then guarantees that all subtasks are either completed or cancelled when the scope exits.\nAt a high level, structured concurrency gives three key benefits:\nUnified Error Handling - exceptions from subtasks are aggregated and rethrown to the parent in a controlled way. Prompt cancellation - failure or success conditions can automatically cancel sibling tasks via policies like ShutdownOnFailure or ShutdownOnSuccess. Improved observability - scopes create natural units for logging, metrics and profiling. 
When to choose structured concurrency?\nYou have a request‑scoped orchestration that fans out to multiple backends (e.g., pricing + inventory + recommendations) and you want a single, well‑defined lifecycle for all the work associated with that request.\nYou want fail‑fast semantics: as soon as one backend fails or violates an SLO, you cancel the rest and return an error or partial result.\nYou need deadline and cancellation propagation: a user cancels, or an upstream deadline expires, and every in‑flight subtask must stop quickly.\nYou care about operability: it should be easy to answer “what are we doing for this request right now?” from logs, metrics, or JFR recordings.\nimport java.time.Duration; import java.util.concurrent.ExecutionException; import java.util.concurrent.StructuredTaskScope; // Domain results record Price(double amount) {} record Inventory(int available) {} record Recommendation(String text) {} // Aggregated response record ProductView(Price price, Inventory inventory, Recommendation recommendation) {} class ProductService { private final PriceClient priceClient; private final InventoryClient inventoryClient; private final RecommendationClient recommendationClient; ProductService(PriceClient priceClient, InventoryClient inventoryClient, RecommendationClient recommendationClient) { this.priceClient = priceClient; this.inventoryClient = inventoryClient; this.recommendationClient = recommendationClient; } public ProductView buildProductView(String productId, Duration deadline) throws InterruptedException, ExecutionException { // Scoped to this request; all child tasks must finish or be cancelled try (var scope = new StructuredTaskScope.ShutdownOnFailure()) { var priceTask = scope.fork(() -\u0026gt; priceClient.fetchPrice(productId)); var inventoryTask = scope.fork(() -\u0026gt; inventoryClient.fetchInventory(productId)); var recTask = scope.fork(() -\u0026gt; recommendationClient.fetchRecommendations(productId)); // Wait for all tasks or first failure; you could combine with a timer thread for deadline scope.join(); scope.throwIfFailed(); // propagate first failure if any // After this point, all subtasks are either successful or cancelled Price price = priceTask.get(); Inventory inventory = inventoryTask.get(); Recommendation rec = recTask.get(); return new ProductView(price, inventory, rec); } } } interface PriceClient { Price fetchPrice(String productId) throws Exception; } interface InventoryClient { Inventory fetchInventory(String productId) throws Exception; } interface RecommendationClient { Recommendation fetchRecommendations(String productId) throws Exception; } The try‑with‑resources block defines the lifetime of all subtasks; nothing leaks beyond it. ShutdownOnFailure encodes a policy: this request only makes sense if all subtasks succeed. Cancellation is automatic: if any task fails, the scope cancels the rest, which is easier than manually tracking and cancelling Futures. 2. Fan‑Out / Fan‑In # Fan‑out/fan‑in is a concurrency pattern where a parent splits work into independent subtasks (fan‑out), runs them in parallel, then aggregates their results (fan‑in).\nFan‑out/fan‑in is primarily a latency optimization pattern: instead of calling backends sequentially, you call them in parallel and pay only the slowest latency, plus a small orchestration overhead. It also helps articulate parallelism vs. 
concurrency: you want to fan out only when work is independent and safely parallelizable.\nIn a system, you choose fan-out/fan-in when:\nYou have several independent IO‑bound calls (e.g., to microservices, caches, or databases) that can execute safely in parallel.\nYou are optimizing p99 latency by shaving off sequential waits.\nYou need to aggregate results into a single response (e.g., search results from multiple shards, pricing from multiple providers, or recommendations from different sources).\nThis can be implemented using both Java 8+ style CompletableFuture or Java 21+ StructuredTaskScope.\nimport java.util.List; import java.util.concurrent.CompletableFuture; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.stream.Collectors; // CompletableFuture class SearchService { private final ExecutorService ioExecutor = Executors.newFixedThreadPool(32); private final ShardClient shardClient; SearchService(ShardClient shardClient) { this.shardClient = shardClient; } public List\u0026lt;SearchResult\u0026gt; search(String query, List\u0026lt;String\u0026gt; shardIds) { // fan-out: issue one async call per shard List\u0026lt;CompletableFuture\u0026lt;SearchResult\u0026gt;\u0026gt; futures = shardIds.stream() .map(shardId -\u0026gt; CompletableFuture.supplyAsync( () -\u0026gt; shardClient.searchShard(shardId, query), ioExecutor)) .collect(Collectors.toList()); // fan-in: join all results CompletableFuture\u0026lt;Void\u0026gt; all = CompletableFuture.allOf( futures.toArray(new CompletableFuture[0])); // this blocks the current thread; in real code you might return the CF instead all.join(); return futures.stream() .map(CompletableFuture::join) .collect(Collectors.toList()); } } interface ShardClient { SearchResult searchShard(String shardId, String query); } record SearchResult(String shardId, List\u0026lt;String\u0026gt; documents) {} import java.util.List; import java.util.concurrent.ExecutionException; import java.util.concurrent.StructuredTaskScope; // StructuredTaskScope class SearchService21 { private final ShardClient shardClient; SearchService21(ShardClient shardClient) { this.shardClient = shardClient; } public List\u0026lt;SearchResult\u0026gt; search(String query, List\u0026lt;String\u0026gt; shardIds) throws InterruptedException, ExecutionException { try (var scope = new StructuredTaskScope.ShutdownOnFailure()) { List\u0026lt;StructuredTaskScope.Subtask\u0026lt;SearchResult\u0026gt;\u0026gt; subtasks = shardIds.stream() .map(shardId -\u0026gt; scope.fork(() -\u0026gt; shardClient.searchShard(shardId, query))) .toList(); scope.join(); // wait for all or first failure scope.throwIfFailed(); // propagate first failure return subtasks.stream() .map(StructuredTaskScope.Subtask::get) .toList(); } } } 3. 
Fail‑Fast Cancellation and Timeout Propagation # Fail‑fast cancellation is a policy where a composite operation aborts as soon as a critical subtask fails or a constraint is violated, rather than waiting for all subtasks to complete.\nIn Java 21, StructuredTaskScope.ShutdownOnFailure embodies this policy: when any subtask fails, the scope cancels remaining tasks and rethrows the error to the caller.\nIn classic Java, this has to be coded manually by cancelling Futures or using orchestration logic around CompletableFutures.\nimport java.util.concurrent.ExecutionException; import java.util.concurrent.StructuredTaskScope; class Aggregator { String aggregate() throws InterruptedException, ExecutionException { try (var scope = new StructuredTaskScope.ShutdownOnFailure()) { var fast = scope.fork(() -\u0026gt; slowCall(\u0026#34;fast\u0026#34;, 100)); var slow = scope.fork(() -\u0026gt; slowCall(\u0026#34;slow\u0026#34;, 5_000)); var failing = scope.fork(() -\u0026gt; failCall()); scope.join(); // wait for all tasks to finish or be cancelled scope.throwIfFailed(); // rethrow first failure // If we reach here, no subtask failed return fast.get() + slow.get() + failing.get(); } } private String slowCall(String name, long millis) throws InterruptedException { Thread.sleep(millis); return name; } private String failCall() { throw new IllegalStateException(\u0026#34;Upstream service failure\u0026#34;); } } import java.util.List; import java.util.concurrent.CompletableFuture; import java.util.concurrent.CompletionException; import java.util.stream.Collectors; class AggregatorCF { String aggregate(List\u0026lt;String\u0026gt; ids) { List\u0026lt;CompletableFuture\u0026lt;String\u0026gt;\u0026gt; futures = ids.stream() .map(this::callServiceAsync) .collect(Collectors.toList()); CompletableFuture\u0026lt;Void\u0026gt; all = CompletableFuture.allOf( futures.toArray(new CompletableFuture[0])); try { all.join(); // blocks until all complete or one fails } catch (CompletionException ex) { // Cancel all remaining work on first failure futures.forEach(f -\u0026gt; f.cancel(true)); throw ex; } return futures.stream() .map(CompletableFuture::join) .collect(Collectors.joining(\u0026#34;,\u0026#34;)); } private CompletableFuture\u0026lt;String\u0026gt; callServiceAsync(String id) { return CompletableFuture.supplyAsync(() -\u0026gt; { // do remote call, possibly throwing return \u0026#34;value-\u0026#34; + id; }); } } 4. 
Timeout Propagation # Timeout propagation is the practice of computing a deadline at the outermost layer of a request (e.g., HTTP server) and passing it explicitly to all downstream operations so that they can enforce consistent time limits.\nIn Java this often shows up as a Duration or deadline Instant parameter, combined with APIs like orTimeout, completeOnTimeout, or executor methods that take timeouts.\nJava 9+ CompletableFuture added methods like orTimeout and completeOnTimeout to help enforce per‑operation timeouts.\nStructured concurrency complements this by allowing a single deadline for an entire scope; a common pattern is to race the scope against a timer task and cancel the scope if the deadline expires.\nTimeout propagation prevents “zombie” work that continues after the client has given up, which is critical for throughput, fairness, and avoiding cascading failures.\nimport java.time.Duration; import java.time.Instant; import java.util.List; import java.util.concurrent.ExecutionException; import java.util.concurrent.StructuredTaskScope; import java.util.concurrent.TimeoutException; class TimeoutOrchestrator21 { public List\u0026lt;Result\u0026gt; orchestrate(List\u0026lt;String\u0026gt; ids, Duration overallTimeout) throws InterruptedException, ExecutionException { Instant deadline = Instant.now().plus(overallTimeout); try (var scope = new StructuredTaskScope.ShutdownOnFailure()) { var tasks = ids.stream() .map(id -\u0026gt; scope.fork(() -\u0026gt; callServiceRespectingDeadline(id, deadline))) .toList(); try { scope.joinUntil(deadline); // waits until subtasks are done or the deadline passes } catch (TimeoutException e) { scope.shutdown(); // cancel remaining subtasks scope.join(); // wait for the cancellation to take effect throw new RuntimeException(\u0026#34;Overall timeout expired\u0026#34;, e); } scope.throwIfFailed(); return tasks.stream().map(StructuredTaskScope.Subtask::get).toList(); } } private Result callServiceRespectingDeadline(String id, Instant deadline) throws Exception { long remainingMillis = Duration.between(Instant.now(), deadline).toMillis(); if (remainingMillis \u0026lt;= 0) { throw new TimeoutException(\u0026#34;Deadline already expired\u0026#34;); } // Use remainingMillis with your HTTP client / DB driver timeouts return new Result(id); } } import java.time.Duration; import java.time.Instant; import java.util.List; import java.util.concurrent.CompletableFuture; import java.util.concurrent.TimeoutException; class TimeoutOrchestrator { public CompletableFuture\u0026lt;List\u0026lt;Result\u0026gt;\u0026gt; orchestrate(List\u0026lt;String\u0026gt; ids, Duration overallTimeout) { Instant deadline = Instant.now().plus(overallTimeout); List\u0026lt;CompletableFuture\u0026lt;Result\u0026gt;\u0026gt; futures = ids.stream() .map(id -\u0026gt; callServiceWithDeadline(id, deadline)) .toList(); CompletableFuture\u0026lt;Void\u0026gt; all = CompletableFuture.allOf( futures.toArray(new CompletableFuture[0]) ); return all.thenApply(v -\u0026gt; futures.stream() .map(CompletableFuture::join) .toList() ); } private CompletableFuture\u0026lt;Result\u0026gt; callServiceWithDeadline(String id, Instant deadline) { long remainingMillis = Duration.between(Instant.now(), deadline).toMillis(); if (remainingMillis \u0026lt;= 0) { CompletableFuture\u0026lt;Result\u0026gt; failed = new CompletableFuture\u0026lt;\u0026gt;(); failed.completeExceptionally(new TimeoutException(\u0026#34;Deadline already expired\u0026#34;)); return failed; }
return serviceCallAsync(id) .orTimeout(remainingMillis, java.util.concurrent.TimeUnit.MILLISECONDS); } private CompletableFuture\u0026lt;Result\u0026gt; serviceCallAsync(String id) { // Implementation calls remote service asynchronously return CompletableFuture.supplyAsync(() -\u0026gt; new Result(id)); } } record Result(String id) {} 5. Resource Scoping # Resource scoping ties the lifetime of resources (threads, thread pools, database connections, file handles, metrics contexts, etc.) to clearly defined lexical scopes or ownership boundaries.\nIn Java, this shows up as try‑with‑resources, an ExecutorService that is created and shut down within a component, and now StructuredTaskScope, which ensures that spawned tasks cannot outlive the scope.\nBad patterns include globally shared executors that are never shut down, or threads started in one layer that accidentally keep running after the caller has timed out. Resource scoping reduces leaks, contention, and coordination bugs.\nimport java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; class ScopedExecutorExample { void processBatch(Runnable task, int parallelism) { // Executor is scoped to this batch only try (var executor = Executors.newFixedThreadPool(parallelism)) { for (int i = 0; i \u0026lt; parallelism; i++) { executor.submit(task); } // close() shuts the executor down and waits for submitted tasks to finish } } } import java.util.concurrent.ExecutionException; import java.util.concurrent.StructuredTaskScope; class ResourceScopedService { public void handleRequest(String requestId) throws InterruptedException, ExecutionException { try (var scope = new StructuredTaskScope.ShutdownOnFailure()) { // All subtasks are logically scoped to this request scope.fork(() -\u0026gt; logAuditTrail(requestId)); scope.fork(() -\u0026gt; updateAnalytics(requestId)); scope.join(); scope.throwIfFailed(); // Any unfinished work is cancelled when the scope closes } } private Void logAuditTrail(String requestId) { // perform logging, possibly slow IO return null; } private Void updateAnalytics(String requestId) { // send events to analytics pipeline return null; } } 6. Observability with Thread Dumps and JFR # A thread dump is a snapshot of all Java threads and their stack traces at a moment in time, typically captured using tools like jstack or Java Mission Control.\nJFR is a low‑overhead profiling and diagnostics tool built into the JVM that records events such as method samples, lock contention, GC, and IO, which can be started at launch or attached to a running process.
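\nBeyond attaching these tools, application code can also publish custom events through the jdk.jfr API, so that domain operations show up in recordings next to the built-in JVM events. A minimal sketch (the event name, label, and field are illustrative, not a built-in event):
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

@Name(\u0026#34;demo.Checkout\u0026#34;)
@Label(\u0026#34;Checkout\u0026#34;)
class CheckoutEvent extends Event {
    @Label(\u0026#34;Order Id\u0026#34;)
    String orderId;
}

class CheckoutService {
    void checkout(String orderId) {
        CheckoutEvent event = new CheckoutEvent();
        event.orderId = orderId;
        event.begin(); // start the event clock
        try {
            // ... the work being measured ...
        } finally {
            event.commit(); // records duration and thread; visible in JFR tooling
        }
    }
}
\n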
","date":"8 April 2026","externalUrl":null,"permalink":"/posts/mastering-concurrency-in-java-part-4-deep-dives-and-modern-patterns/","section":"Posts","summary":"In Part 1, Part 2, and Part 3, we covered hazards, primitives, and execution models. In this final part, we will focus on structured concurrency, fan-out/fan-in, fail-fast cancellations, timeout propagation, resource scoping, and observability with thread dumps and Java Flight Recorder (JFR), emphasizing when to choose which pattern.\n","title":"Mastering Concurrency In Java - Part 4: Deep Dives and Modern Patterns","type":"posts"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/multithreading/","section":"Tags","summary":"","title":"Multithreading","type":"tags"},{"content":"In Part 1 and Part 2, we covered the fundamentals and building blocks of concurrency. In this part, we will discuss the execution models of concurrency - classic thread pools, task queues, Future/Callable, CompletableFuture, and, from Java 21+, virtual threads and virtual-thread-per-task executors. Choosing among them is ultimately about latency, throughput, and operational simplicity, not about syntax.\nThe core questions around concurrency primitives are -\nHow many concurrent units of work can the service handle before it degrades? What is the dominant cost per unit (CPU, remote I/O, memory)? How much complexity can the team safely own in production? Traditional platform threads are expensive because each is backed by an OS thread with significant memory and scheduling overhead. Virtual threads in Java 21 are much cheaper: millions of virtual threads can share a small pool of carrier (OS) threads, blocking freely on I/O while the JVM transparently mounts and unmounts them. This shifts the default from “avoid blocking at all costs” to “write simple blocking code unless there is a concrete reason to go async or reactive.”\nWith the above, let\u0026rsquo;s discuss the execution models -\n1. Thread Pools # A thread pool is an ExecutorService that owns a bounded number of worker threads and a task queue. Tasks (typically Runnable or Callable) are submitted to the pool and executed by available worker threads -\nBounded Concurrency - Limits the number of concurrent tasks to protect the CPU and downstream dependencies. Amortized thread creation cost - Reuses threads across tasks instead of paying creation / destruction cost per task. Centralized policy - You can tune pool size, queue capacity, and rejection policy based on workload. When do you choose Thread Pools # You need strict control over concurrency to protect CPU or a fragile dependency (e.g., a database or 3rd-party API) from overload. The workload is CPU bound and you want a pool size tuned to your CPU cores (e.g., 1-2 threads per core). You are on pre-Java 21 or cannot use virtual threads (legacy runtime, compliance constraints), so each thread is still expensive. You need prioritization or separate pools per traffic class (e.g., user vs batch) to prevent starvation. Example:\nExecutorService ioBoundPool = Executors.newFixedThreadPool(10); for (int i = 0; i \u0026lt; 100; i++) { ioBoundPool.submit(() -\u0026gt; { // do io bound work }); } ioBoundPool.shutdown(); 2. Task Queues # A task queue is typically a bounded BlockingQueue decoupling producers from consumers inside a process.\nWhen do you choose Task Queues # You want an internal buffer between request handling and slow work - e.g., image processing or heavy DB aggregations. You want to smooth out bursts of work without overwhelming downstream consumers. You are modelling a later transition to a proper distributed queue; starting with in-memory queues keeps the programming model similar.
import java.util.concurrent.*; public class TaskQueueExample { private final BlockingQueue\u0026lt;Runnable\u0026gt; queue = new LinkedBlockingQueue\u0026lt;\u0026gt;(100); private final ExecutorService workerPool = Executors.newFixedThreadPool(5); public TaskQueueExample() { // one worker loop per pool thread for (int i = 0; i \u0026lt; 5; i++) { workerPool.submit(this::workerLoop); } } private void workerLoop() { try { while (!Thread.currentThread().isInterrupted()) { Runnable task = queue.take(); task.run(); } } catch (InterruptedException e) { Thread.currentThread().interrupt(); } } // queue tasks public void populateQueue(String username) { try { queue.put(() -\u0026gt; doHeavyIO(username)); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } } private void doHeavyIO(String username) { // simulate heavy I/O try { Thread.sleep(1000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } } } 3. Future and Callable # Callable represents a task that returns a value and may throw checked exceptions. Submitting a Callable to an ExecutorService returns a Future, which is a handle for the task’s eventual result. Future allows:\nBlocking waits with get(). Timed waits with get(timeout). Cancellation with cancel(). However, Future has notable limitations: no built-in composition operators, no callbacks on completion (without extra wrappers), and get() is blocking and potentially awkward for complex pipelines.\nWhen to choose Future / Callable # Simple parallelism: fan out a small number of CPU-bound or I/O-bound operations and then join. You are on an older version of Java and cannot use CompletableFuture. private final ExecutorService pool = Executors.newFixedThreadPool(10); public List\u0026lt;String\u0026gt; searchAllShards(String query) throws InterruptedException { List\u0026lt;Callable\u0026lt;String\u0026gt;\u0026gt; tasks = new ArrayList\u0026lt;\u0026gt;(); for (int shardId = 0; shardId \u0026lt; 3; shardId++) { int finalShardId = shardId; tasks.add(() -\u0026gt; searchShard(finalShardId, query)); } List\u0026lt;Future\u0026lt;String\u0026gt;\u0026gt; futures = pool.invokeAll(tasks, 50, TimeUnit.MILLISECONDS); List\u0026lt;String\u0026gt; results = new ArrayList\u0026lt;\u0026gt;(); for (Future\u0026lt;String\u0026gt; f : futures) { try { if (!f.isCancelled()) { results.add(f.get()); } } catch (ExecutionException e) { // log and skip this shard } } return results; } private String searchShard(int shardId, String query) { // blocking call to shard return \u0026#34;results-from-\u0026#34; + shardId + \u0026#34; for query \u0026#34; + query; } 4. CompletableFuture # CompletableFuture (Java 8+) extends Future with the ability to complete manually, register callbacks, and compose stages without blocking. It is the foundation of Java’s async fluent API:\nsupplyAsync / runAsync to start async computations. thenApply, thenCompose, thenAccept for transformations and side effects. allOf, anyOf to combine multiple futures. exceptionally, handle for explicit error handling. By default, async stages run on ForkJoinPool.commonPool() unless a custom Executor is supplied. Using custom executors is important to avoid saturating the common pool with blocking I/O.\nWhen do you choose CompletableFuture # Even in a world with virtual threads, CompletableFuture is valuable when:\nYou need true non-blocking behavior on a constrained thread pool, often in libraries that cannot assume virtual threads. You are composing distributed I/O-heavy fan-out/fan-in workloads where overlap matters (e.g.
hitting 5 microservices at once and timing out partial results). You want to keep a reactive-ish style without adopting a full reactive stack (Project Reactor, RxJava). You are building APIs or SDKs where consumers may or may not use virtual threads; returning CompletionStage keeps you neutral.\nCompletableFuture gives resource efficiency and composability but at the cost of cognitive complexity and harder debugging.\nimport java.util.List; import java.util.concurrent.*; import java.util.stream.Collectors; public class CompletableFutureExample { private final ExecutorService ioPool = Executors.newFixedThreadPool(64); public CompletableFuture\u0026lt;List\u0026lt;String\u0026gt;\u0026gt; fetchProfile(String userId) { CompletableFuture\u0026lt;String\u0026gt; basicFuture = CompletableFuture.supplyAsync(() -\u0026gt; fetchBasic(userId), ioPool); CompletableFuture\u0026lt;String\u0026gt; postsFuture = CompletableFuture.supplyAsync(() -\u0026gt; fetchPosts(userId), ioPool); CompletableFuture\u0026lt;String\u0026gt; friendsFuture = CompletableFuture.supplyAsync(() -\u0026gt; fetchFriends(userId), ioPool); return CompletableFuture .allOf(basicFuture, postsFuture, friendsFuture) .thenApply(v -\u0026gt; List.of( basicFuture.join(), postsFuture.join(), friendsFuture.join() )); } private String fetchBasic(String userId) { // HTTP call to /basic return \u0026#34;basic-info\u0026#34;; } private String fetchPosts(String userId) { // HTTP call to /posts return \u0026#34;posts\u0026#34;; } private String fetchFriends(String userId) { // HTTP call to /friends return \u0026#34;friends\u0026#34;; } } 5. Virtual Threads (Java 21) # Virtual threads are lightweight threads managed by the JVM rather than the OS. Many virtual threads are multiplexed onto a smaller number of carrier (platform) threads via a scheduler built on ForkJoinPool.\nVirtual threads block like normal threads in user code, but when they hit a blocking I/O operation (Socket.read, FileChannel, JDBC, etc.), they unmount from the carrier thread, allowing it to run other virtual threads.\nBecause stacks are stored in heap segments and can be parked/unparked cheaply, you can have hundreds of thousands or millions of virtual threads in a single JVM.\nThe standard factory Executors.newVirtualThreadPerTaskExecutor() creates an executor that starts a new virtual thread for each submitted task; the thread terminates when its task completes.\nWhen should you choose Virtual-Thread-Per-Task # With Java 21+, the default answer for an I/O-heavy service is often “one virtual thread per request,” because:\nCode stays simple and blocking: no callbacks, no async chains; each request handler reads like synchronous code. You achieve massive concurrency as long as tasks are mostly waiting on I/O, not burning CPU. You can often avoid complex async frameworks (reactive, actor models) and instead rely on cheaper blocking.
import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; public class VirtualThreadPerTaskExample { private final ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor(); public void handleIncomingRequest(String payload) { virtualExecutor.submit(() -\u0026gt; { String userId = parse(payload); // All of this code can block freely due to virtual threads String profile = callProfileService(userId); String timeline = callTimelineService(userId); writeResponse(profile, timeline); }); } private String parse(String payload) { return payload; } private String callProfileService(String userId) { // blocking HTTP/JDBC return \u0026#34;profile\u0026#34;; } private String callTimelineService(String userId) { // blocking HTTP/JDBC return \u0026#34;timeline\u0026#34;; } private void writeResponse(String profile, String timeline) { // write to socket } } Under the hood, many such tasks share a much smaller number of carrier threads; when they block on I/O, they are unmounted, and the carrier runs other tasks.\nWhen Async Pipelines Are Worse Than Blocking Code on Virtual Threads # Cost Model: Async Graphs vs. Cheap Blocking # Before virtual threads, the main motivations for complex async pipelines (CompletableFuture, reactive streams, actor systems) were:\nOS threads were expensive; you wanted to minimize thread count. Blocking I/O meant tying up a thread, so you used non-blocking APIs and callbacks to multiplex work. Virtual threads change this cost model by making blocking cheap and letting the JVM multiplex many blocked virtual threads onto a small number of carriers. As a result, several downsides of async pipelines become more significant relative to their benefits -\nCognitive complexity: deep chains of thenCompose/thenCombine are harder to reason about than straight-line code. Debugging and observability: stack traces in async code are fragmented across callbacks, making it harder to reconstruct the logical call stack. Error handling: exception flows are non-local, with exceptionally and handle sprinkled through the graph. Context propagation: propagating tracing, security context, or transaction context across async boundaries is fragile. When blocking is cheap, these costs often outweigh their benefits for typical request–response microservices.\nExample: Aggregator Service – Async vs. 
Virtual Threads # Async with CompletableFuture # import java.util.concurrent.*; public class AggregatorWithCompletableFuture { private final ExecutorService ioPool = Executors.newFixedThreadPool(64); public CompletableFuture\u0026lt;AggregatedResult\u0026gt; aggregate(String userId) { CompletableFuture\u0026lt;String\u0026gt; profileFuture = CompletableFuture.supplyAsync(() -\u0026gt; callProfile(userId), ioPool); CompletableFuture\u0026lt;String\u0026gt; timelineFuture = CompletableFuture.supplyAsync(() -\u0026gt; callTimeline(userId), ioPool); CompletableFuture\u0026lt;String\u0026gt; notificationsFuture = CompletableFuture.supplyAsync(() -\u0026gt; callNotifications(userId), ioPool); return CompletableFuture .allOf(profileFuture, timelineFuture, notificationsFuture) .orTimeout(200, TimeUnit.MILLISECONDS) .thenApply(v -\u0026gt; new AggregatedResult( profileFuture.join(), timelineFuture.join(), notificationsFuture.join() )) .exceptionally(ex -\u0026gt; fallback(userId, ex)); } private String callProfile(String userId) { return \u0026#34;profile\u0026#34;; } private String callTimeline(String userId) { return \u0026#34;timeline\u0026#34;; } private String callNotifications(String userId) { return \u0026#34;notifications\u0026#34;; } record AggregatedResult(String profile, String timeline, String notifications) {} private AggregatedResult fallback(String userId, Throwable ex) { // log and return degraded result return new AggregatedResult(\u0026#34;partial-profile\u0026#34;, \u0026#34;empty-timeline\u0026#34;, \u0026#34;empty-notifications\u0026#34;); } } This is efficient on a limited pool, but the control flow (timeouts, fallback, composition) is non-trivial.\nBlocking on Virtual Threads # import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.concurrent.TimeUnit; public class AggregatorWithVirtualThreads { private final ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor(); public AggregatedResult aggregate(String userId) { // Each aggregation runs on its own virtual thread Future\u0026lt;AggregatedResult\u0026gt; future = virtualExecutor.submit(() -\u0026gt; doAggregate(userId)); try { return future.get(200, TimeUnit.MILLISECONDS); } catch (Exception e) { // cancel the timed-out work so it does not linger as zombie work future.cancel(true); return fallback(userId, e); } } private AggregatedResult doAggregate(String userId) { String profile = callProfile(userId); // blocking String timeline = callTimeline(userId); // blocking String notifications = callNotifications(userId); // blocking return new AggregatedResult(profile, timeline, notifications); } private String callProfile(String userId) { return \u0026#34;profile\u0026#34;; } private String callTimeline(String userId) { return \u0026#34;timeline\u0026#34;; } private String callNotifications(String userId) { return \u0026#34;notifications\u0026#34;; } record AggregatedResult(String profile, String timeline, String notifications) {} private AggregatedResult fallback(String userId, Throwable ex) { // log and return degraded result return new AggregatedResult(\u0026#34;partial-profile\u0026#34;, \u0026#34;empty-timeline\u0026#34;, \u0026#34;empty-notifications\u0026#34;); } } Control flow is now straightforward, with a single blocking call and standard try/catch handling, while virtual threads preserve high concurrency.\nWhen Async Pipelines Are Strictly Worse # I/O-bound request–response services where each request performs a handful of remote calls and modest CPU work. Virtual threads give you sufficient concurrency with far less complexity.
Teams without deep async expertise: subtle bugs in async graphs, lost exceptions, and context propagation issues are operationally expensive. Debug-critical systems where clean stack traces and simple profiling are essential. CompletableFuture-style async graphs still make sense when you:\nMust run on pre–Java 21 runtimes or cannot enable virtual threads. Are writing libraries that must be neutral about execution model and therefore expose CompletionStage. Need tight integration with non-blocking I/O APIs or reactive clients where virtual threads would still block. ","date":"7 April 2026","externalUrl":null,"permalink":"/posts/mastering-concurrency-in-java-part-3-execution-models/","section":"Posts","summary":"In Part 1 and Part 2, we covered the fundamentals and building blocks of concurrency. In this part, we will discuss the execution models of concurrency - classic thread pools, task queues, Future/Callable, CompletableFuture, and, from Java 21+, virtual threads and virtual-thread-per-task executors. Choosing among them is ultimately about latency, throughput, and operational simplicity, not about syntax.\n","title":"Mastering Concurrency In Java - Part 3: Execution Models","type":"posts"},{"content":"In Part 1, we discussed the core concurrency hazards and control concepts in Java: race conditions, visibility, atomicity, deadlocks, starvation, livelock, contention, backpressure, interruption, and cancellation.\nIn this part, we will discuss the coordination primitives - synchronized, volatile, Atomics, Locks, Semaphores, Blocking Queues, and Concurrent Collections.\n1. synchronized # Every Java object implicitly carries a monitor lock. Synchronized methods and blocks acquire and release that monitor around a critical section.\nsynchronized (lockObject) { // critical section } What does it guarantee?\nMutual Exclusion on the monitor: At most one thread at a time executes code guarded by the same monitor object. Visibility and ordering: Exiting a synchronized block flushes writes to main memory, and entering a synchronized block invalidates the local cache, forcing a reload of variables from main memory, so the acquiring thread sees the latest values. Reentrancy: The owning thread that holds the monitor can reacquire it multiple times without blocking itself. What does it not guarantee?\nFairness: There is no guarantee that threads will acquire the lock in the order they request it. Timeout: There is no way to time out while waiting for the lock to be released; it simply blocks. Deadlock or starvation freedom: It does not inherently prevent deadlocks or starvation. When to choose synchronized?\nThe critical sections are small, the contention is moderate, and you want to keep the code simple. There is no need for timed, interruptible, or fair acquisition. A single monitor protects basic in-memory invariants, and every code path acquires it before proceeding. public class SynchronizedCounter { private int count = 0; public synchronized void increment() { count++; } public synchronized int get() { return count; } } 2. volatile # The volatile keyword marks a field so that reads and writes go directly to main memory, with additional ordering guarantees.\nWhat does volatile guarantee?\nVisibility: A write to a volatile field by one thread is promptly visible to reads of that field by other threads; they do not see stale cached values. Ordering fence: Reads and writes to a volatile field also preserve order, preventing instruction reordering around the access.
What does it not guarantee?\nAtomicity: It does not make compound operations atomic. For example, incrementing a volatile int is still a non-atomic read-modify-write. Mutual Exclusion: Multiple threads can read and write to a volatile variable concurrently. It does not serialize them like a lock. When to choose volatile?\nOne-writer, many-reader flags: Ideal for shutdown signals, configuration toggles, health indicators, or status fields where only one thread mutates and others only observe. Example:\npublic class VolatileSingleWriterMultipleReadersExample { private volatile boolean running = true; public void startWorker() { Thread worker = new Thread(() -\u0026gt; { while (running) { doWork(); } }); worker.start(); } public void stop() { running = false; // visible to worker without locks } private void doWork() { // perform unit of work } } 3. Atomics # The java.util.concurrent.atomic package provides classes such as AtomicInteger, AtomicLong, and AtomicReference that offer lock-free, atomic read-modify-write operations implemented using Compare-And-Swap (CAS) primitives.\nWhat do Atomics guarantee?\nAtomic read-modify-write on a single variable. Visibility of updates. Non-blocking progress under contention: CAS-based updates avoid explicit blocking; threads retry on contention rather than sleeping on a monitor. What do Atomics not guarantee?\nAtomicity of compound operations spanning more than one variable. Fairness or bounded retry delays for a given thread. Example:\nimport java.util.concurrent.atomic.AtomicLong; public class AtomicCounter { private final AtomicLong value = new AtomicLong(); public void increment() { value.incrementAndGet(); } public long get() { return value.get(); } } 4. Locks: ReentrantLock # The java.util.concurrent.locks package provides Lock implementations such as ReentrantLock that offer more control than synchronized, including timed and interruptible acquisition, explicit unlocking, and configurable fairness policies.\nWhat do Locks guarantee?\nMutual Exclusion and visibility: At most one thread at a time executes code guarded by the same lock. Reentrancy: The owning thread that holds the lock can reacquire it multiple times. It must unlock the same number of times it acquired it. Optional Fairness: ReentrantLock can be constructed with a fairness policy, so that waiting threads acquire the lock in FIFO order of arrival. What do Locks not guarantee?\nAutomatic Release: Unlike synchronized, locks must be explicitly released by the caller. Failure to do so leaves the lock permanently held and blocks every later acquirer. Absolute Freedom From Starvation: They can implement a fairness policy but cannot control OS thread scheduling. When to choose Locks?\nTimed or interruptible lock acquisition is required. A fairness policy is required. Complex lock acquisition patterns (e.g., hand-over-hand locking) are required. Example:\nimport java.util.concurrent.TimeUnit; import java.util.concurrent.locks.Lock; import java.util.concurrent.locks.ReentrantLock; public class TryLockExample { private final Lock lock = new ReentrantLock(true); // fair lock public boolean doWithLockOrTimeout() throws InterruptedException { if (lock.tryLock(50, TimeUnit.MILLISECONDS)) { try { // critical section return true; } finally { lock.unlock(); } } else { // fall back: log, queue, or degrade functionality return false; } } } 5. Semaphores: Bounded Concurrency / Throttling # A Semaphore maintains a set of permits.
acquire() blocks when no permits are available (the tryAcquire variants return immediately instead), and release() returns a permit to the pool.\nA semaphore with one permit behaves much like a lock; with more permits, it limits concurrent access to a resource, which maps directly to throttling and connection-pooling patterns.\nWhat do Semaphores guarantee?\nUpper bound on concurrent permit holders: At most N permits can be held at any one time. Optional Fairness: The constructor takes a fairness flag to favor FIFO acquisition of permits, reducing starvation for long-waiting threads. Mutual exclusion when permits = 1 (behaves like an unstructured lock). What do they not guarantee?\nAutomatic permit balancing: Forgetting to call release() leads to permanent permit leakage. Protection of critical sections: Semaphores only limit concurrency; they do not inherently protect the critical section itself from race conditions or visibility issues (unless permits = 1). Backpressure: Semaphores do not inherently provide backpressure; they only limit the number of concurrent requests, not the rate at which they arrive. When to choose Semaphores?\nLimiting the number of concurrent calls to a downstream dependency while still allowing many application threads to exist. Implementing bounded resource pools (e.g., database connections, file handles). Example:\nimport java.util.concurrent.Semaphore; public class BoundedResource { private final Semaphore semaphore = new Semaphore(3); // 3 permits public void accessResource() { semaphore.acquireUninterruptibly(); try { // Use resource } finally { semaphore.release(); } } } Note: Semaphores aren\u0026rsquo;t used for rate limiting because they don\u0026rsquo;t have any concept of time, only of how many concurrent threads are allowed to execute. They don\u0026rsquo;t work across distributed systems either.\n6. Blocking Queues # java.util.concurrent.BlockingQueue represents a thread-safe FIFO queue that supports blocking operations for insertion and removal.\nWhat do Blocking Queues guarantee?\nThread-safe insertion and removal: Multiple threads can interact with the queue without race conditions or visibility issues. Blocking Semantics: Methods such as put() or take() block when the queue is full or empty, respectively, until space or an element becomes available. Optional Bounded Capacity: Implementations like ArrayBlockingQueue and bounded LinkedBlockingQueue enforce a maximum size, preventing unbounded memory growth. What do they not guarantee?\nStrict fairness across producers / consumers. Automatic load shedding: A bounded queue can block or reject producers, but deciding when to drop, buffer elsewhere, or apply rate limiting remains an application-level concern. When to deliberately use Blocking Queues: Blocking queues are the in-process analog of message queues, and are used when:\nThe design naturally decomposes into producers and consumers. Backpressure and decoupling are desired. Example:\nimport java.util.concurrent.BlockingQueue; import java.util.concurrent.LinkedBlockingQueue; public class ProducerConsumerExample { private final BlockingQueue\u0026lt;String\u0026gt; queue = new LinkedBlockingQueue\u0026lt;\u0026gt;(10); public void produce(String item) throws InterruptedException { queue.put(item); // blocks if full } public String consume() throws InterruptedException { return queue.take(); // blocks if empty } } 7.
Concurrent Collections # The java.util.concurrent package provides collection implementations designed for concurrent access, such as ConcurrentHashMap, ConcurrentLinkedQueue, and CopyOnWriteArrayList.\nWhat do they guarantee?\nThread-safe access without global locking. Atomic compound operations: ConcurrentMap defines methods that add, remove, or replace entries only if certain conditions hold (for example, “add if absent”), implemented atomically to avoid race conditions. Non-blocking reads in many cases: Implementations such as ConcurrentHashMap are heavily optimized for concurrent reads, often allowing them without global locks. What do they not guarantee?\nGlobal atomicity across multiple operations. Iteration consistency: iterators are typically weakly consistent, so writes made during iteration may or may not become visible. When to use:\nShared caches and registries under concurrent access; copy-on-write variants specifically suit workloads where reads vastly outnumber writes (e.g., listener lists, configuration registries). Example:\nimport java.util.concurrent.ConcurrentHashMap; public class CacheExample { private final ConcurrentHashMap\u0026lt;String, String\u0026gt; cache = new ConcurrentHashMap\u0026lt;\u0026gt;(); public void putIfMissing(String key, String value) { cache.putIfAbsent(key, value); } } Conclusion # Choosing the right concurrency primitive is about balancing safety, performance, and complexity.\nUse synchronized or Atomics for simple state. Use Locks or Semaphores when you need fine-grained control or resource throttling. Use Blocking Queues for producer-consumer architectures. Use Concurrent Collections to scale shared state access. ","date":"6 April 2026","externalUrl":null,"permalink":"/posts/mastering-concurrency-in-java-part-2-the-fundamentals/","section":"Posts","summary":"In Part 1, we discussed the core concurrency hazards and control concepts in Java: race conditions, visibility, atomicity, deadlocks, starvation, livelock, contention, backpressure, interruption, and cancellation.\nIn this part, we will discuss the coordination primitives - synchronized, volatile, Atomics, Locks, Semaphores, Blocking Queues, and Concurrent Collections.\n","title":"Mastering Concurrency In Java - Part 2: The Fundamentals","type":"posts"},{"content":" Introduction # Here, we will discuss the core concurrency hazards and control concepts in Java: race conditions, visibility, atomicity, deadlocks, starvation, livelock, contention, backpressure, interruption, and cancellation.\nRace Conditions # A race condition occurs when the correctness of a task depends on the relative timing or interleaving of threads accessing shared mutable state.
The bug is not “thread A ran before B”, it is “if they interleave in this specific way, invariants break.”\nLet\u0026rsquo;s consider the below example and try to figure out what could be going wrong here :-\nclass BrokenCounter { private int count = 0; public void increment() { // Non-atomic read-modify-write count = count + 1; } public int get() { return count; } } void demo() throws InterruptedException { BrokenCounter counter = new BrokenCounter(); Thread t1 = new Thread(() -\u0026gt; { for (int i = 0; i \u0026lt; 1_000_000; i++) { counter.increment(); } }); Thread t2 = new Thread(() -\u0026gt; { for (int i = 0; i \u0026lt; 1_000_000; i++) { counter.increment(); } }); t1.start(); t2.start(); t1.join(); t2.join(); System.out.println(counter.get()); // Almost never 2_000_000 } The problem here is that the read‑modify‑write on count is not atomic; when the two threads interleave, increments are lost.\nA fixed version of this counter will look like :-\nclass AtomicCounter { private final java.util.concurrent.atomic.AtomicInteger count = new java.util.concurrent.atomic.AtomicInteger(); public void increment() { count.incrementAndGet(); // atomic } public int get() { return count.get(); } } Visibility # Visibility is whether a write to a shared variable by one thread becomes observable by another thread. This matters because, in Java, threads may cache variables in their local memory, leading to situations where one thread’s updates to a variable are never seen by another. Without proper happens‑before edges (locks, volatile, etc.), threads can see stale values indefinitely.\nclass StoppableWorker implements Runnable { private boolean running = true; // not volatile @Override public void run() { while (running) { // do work } System.out.println(\u0026#34;Stopped\u0026#34;); } public void stop() { running = false; // may never be seen } } void demo() throws InterruptedException { StoppableWorker worker = new StoppableWorker(); Thread t = new Thread(worker); t.start(); Thread.sleep(100); worker.stop(); // might not stop the thread } The problem here is that the worker may spin forever because the write to running might never become visible to the loop.\nclass VisibleWorker implements Runnable { private volatile boolean running = true; // establishes happens-before @Override public void run() { while (running) { // do work } System.out.println(\u0026#34;Stopped\u0026#34;); } public void stop() { running = false; } } Atomicity # Atomicity means an operation (or group of operations) appears to happen as one indivisible step: no other thread observes partial state.
Most domain‑level actions (transfer funds, update a status plus timestamp) require multiple field updates, so they are not atomic by default.\nclass Account { int balance; } class Bank { public void transfer(Account from, Account to, int amount) { if (from.balance \u0026gt;= amount) { from.balance -= amount; // step 1 to.balance += amount; // step 2 } } } The problem with the above is that two threads calling transfer concurrently on shared accounts can violate invariants (negative balances, lost money).\nclass SafeBank { public void transfer(Account from, Account to, int amount) { // naive: lock ordering problems, but shows atomicity synchronized (from) { synchronized (to) { if (from.balance \u0026gt;= amount) { from.balance -= amount; to.balance += amount; } } } } } Though there is a lock ordering problem that we\u0026rsquo;ll solve next, atomicity is now at the method level: other threads never see a half‑applied transfer.\nDeadlocks # Deadlock is a liveness failure where a set of threads permanently blocks, each waiting for a resource held by another in the set, so nobody can make progress.\nIn the atomicity example above, there could be a scenario where t1 holds lockA and t2 holds lockB and t1 tries to acquire lockB while t2 tries to acquire lockA, resulting in a deadlock.\nTo avoid this at code level, enforce a global lock ordering, avoid holding locks while calling external code, and prefer higher‑level constructs (e.g., ExecutorService + message passing) over manual nested locks.\nIn the example above, always acquire the lock on the account with the smaller account number first, as sketched below, so that the deadlock is avoided.
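\nA minimal sketch of that ordering rule (assuming Account gains a unique numeric id field; the field is illustrative):
class OrderedBank {
    public void transfer(Account from, Account to, int amount) {
        // Global lock order: always lock the account with the smaller id first,
        // so two concurrent transfers can never hold the two locks in opposite order.
        // (Assumes Account has a unique int id field.)
        Account first = from.id \u0026lt; to.id ? from : to;
        Account second = from.id \u0026lt; to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                if (from.balance \u0026gt;= amount) {
                    from.balance -= amount;
                    to.balance += amount;
                }
            }
        }
    }
}
\n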
Starvation # Starvation occurs when some threads or tasks are perpetually denied CPU or lock access, so they never make progress even though others do. This is often a scheduling or policy bug, not a correctness bug.\nclass StarvationDemo { private final Object lock = new Object(); void run() { // \u0026#34;Selfish\u0026#34; thread hogging the lock Thread hog = new Thread(() -\u0026gt; { while (true) { synchronized (lock) { // long computation } } }); hog.setPriority(Thread.MAX_PRIORITY); hog.start(); // Lower-priority worker that may rarely acquire the lock Thread victim = new Thread(() -\u0026gt; { while (true) { synchronized (lock) { // work that seldom runs } } }); victim.setPriority(Thread.MIN_PRIORITY); victim.start(); } } The victim thread may be effectively starved.\nLivelock # Livelock is a liveness failure where threads are not blocked but still can’t make progress because they keep changing state in response to each other in a way that prevents completion.\nclass DiningFriend { private final String name; private volatile boolean polite = true; public DiningFriend(String name) { this.name = name; } public void eatWith(DiningFriend partner) { while (true) { if (polite \u0026amp;\u0026amp; partner.polite) { // Both decide to yield to the other... System.out.println(name + \u0026#34;: you go first\u0026#34;); try { Thread.sleep(10); } catch (InterruptedException ignored) {} } else { // One eventually decides to eat System.out.println(name + \u0026#34;: I will eat now\u0026#34;); break; } } } } Both threads can keep toggling polite flags and never actually “eat”.\nFix: Introduce randomness or a deterministic tie‑breaker (ID ordering) so someone wins.\nContention # Contention is the performance cost when many threads compete for the same lock or mutually exclusive resource. The program is correct but throughput drops and latency gets noisy because threads are queuing on locks.\nclass HotCounter { private long count = 0; public synchronized void increment() { count++; } public synchronized long get() { return count; } } A single monitor can become a scalability bottleneck under heavy multi‑threaded updates.\nReduced contention with striping # class StripedCounter { private final java.util.concurrent.atomic.AtomicLongArray cells; public StripedCounter(int stripes) { this.cells = new java.util.concurrent.atomic.AtomicLongArray(stripes); } public void increment() { int idx = (int) (Thread.currentThread().getId() % cells.length()); cells.incrementAndGet(idx); } public long get() { long sum = 0; for (int i = 0; i \u0026lt; cells.length(); i++) { sum += cells.get(i); } return sum; } } Backpressure # Backpressure is how a system signals producers to slow down or stop when downstream components are saturated. Without it, you get unbounded queues, OOMs, or meltdown.\nimport java.util.concurrent.ArrayBlockingQueue; import java.util.concurrent.ThreadPoolExecutor; import java.util.concurrent.TimeUnit; class BackpressureDemo { private final ThreadPoolExecutor executor; BackpressureDemo() { int core = 8; int max = 8; int queueCapacity = 1000; this.executor = new ThreadPoolExecutor( core, max, 60, TimeUnit.SECONDS, new ArrayBlockingQueue\u0026lt;\u0026gt;(queueCapacity), new ThreadPoolExecutor.CallerRunsPolicy() // backpressure ); } public void submitTask(Runnable task) { executor.execute(task); } } With CallerRunsPolicy, when the queue is full, the caller thread runs the task, effectively slowing the producer.\nInterruption # Interruption in Java is a cooperative signal that a thread should stop what it is doing (or change behavior). Thread.interrupt() sets the interrupted status; many blocking methods respond by throwing InterruptedException and clearing the flag.\nclass InterruptibleWorker implements Runnable { @Override public void run() { try { while (!Thread.currentThread().isInterrupted()) { // Do a unit of work doWorkChunk(); } } catch (InterruptedException e) { // Restore interrupted status if you want callers to see it Thread.currentThread().interrupt(); } finally { cleanup(); } } private void doWorkChunk() throws InterruptedException { // Simulate blocking operation Thread.sleep(100); } private void cleanup() { // release resources } } Thread worker = new Thread(new InterruptibleWorker()); worker.start(); // later worker.interrupt(); Cancellation # Cancellation is the cooperative process of stopping an in‑flight task in a controlled way, typically because of timeouts, shutdown, or user actions.
Interruption is a common mechanism for cancellation, but cancellation also needs protocol and ownership: who is allowed to cancel what and how that propagates.\nclass CancellableTask implements Callable\u0026lt;Void\u0026gt; { @Override public Void call() throws Exception { while (!Thread.currentThread().isInterrupted()) { // do work doUnit(); } return null; } private void doUnit() throws InterruptedException { // Potentially blocking operation Thread.sleep(100); } } void demo() throws Exception { ExecutorService executor = Executors.newSingleThreadExecutor(); Future\u0026lt;Void\u0026gt; future = executor.submit(new CancellableTask()); // Let it run for a bit Thread.sleep(500); // Cancel with interrupt future.cancel(true); // mayInterruptIfRunning = true executor.shutdownNow(); } The task cooperates by checking isInterrupted() and by letting blocking calls throw InterruptedException, which unwinds the task.\nCancellation via shared flag (non-blocking work) # class FlagCancellableTask implements Runnable { private volatile boolean cancelled = false; public void cancel() { cancelled = true; } @Override public void run() { while (!cancelled) { // do pure CPU work that doesn\u0026#39;t block } } } Note: For blocking operations (e.g., BlockingQueue.put), a volatile flag alone is not enough - you need interruption-aware code or non‑blocking APIs.\nThread.interrupt() just sets the thread’s interrupt flag. Java interruption is cooperative, not pre‑emptive: nothing automatically stops your code when the flag is set. For the interrupt to matter, either:\nThe thread must be blocked in an interruptible call (sleep, wait, join, many java.util.concurrent operations, NIO), in which case that call will throw InterruptedException and clear the flag.\nOr your code must periodically check the flag (isInterrupted() or Thread.interrupted()) and react to it (e.g., break out of the loop, clean up, and return).\n","date":"5 April 2026","externalUrl":null,"permalink":"/posts/mastering-concurrency-in-java-part-1-the-fundamentals/","section":"Posts","summary":"Introduction # Here, we will discuss the core concurrency hazards and control concepts in Java: race conditions, visibility, atomicity, deadlocks, starvation, livelock, contention, backpressure, interruption, and cancellation.\nRace Conditions # A race condition occurs when the correctness of a task depends on the relative timing or interleaving of threads accessing shared mutable state. The bug is not “thread A ran before B”, it is “if they interleave in this specific way, invariants break.”\n","title":"Mastering Concurrency In Java - Part 1: The Fundamentals","type":"posts"},{"content":" Introduction # Search is often the primary way users find products on e-commerce platforms. A high-quality search system should be not only fast and resilient but also highly relevant.\n1. Requirements # Functional Requirements # Keyword Search: Full-text product search over titles, descriptions, categories, and attributes. Fuzzy Matching: Fuzzy matching for typos and approximate terms, particularly on short fields like product titles and brands. Personalization: User-based or segment-based results using behavioural data. Ranking: Results must be ordered by relevance to the user\u0026rsquo;s intent. Autocomplete/typeahead and search suggestions: As the user types, the system should provide relevant suggestions. Faceted Search: Users should be able to search and also filter results based on various attributes like category, colour, price, etc.
Internationalization: Support for multiple languages and locales. Non-Functional Requirements # Low Latency: Search results should appear in under 200ms at 100K queries per second. High Availability: The system should be available 99.9% of the time. Near Real-time Indexing: Product updates should be reflected in search results within seconds. Strong Observability: Observability is key in a search system; we should be able to track metrics like query latency, search relevance, and click-through rate. 2. High-Level Architecture # For this discussion, we will be using Lucene as our search engine.\nCore Concepts in Lucene # Inverted Index and Documents # Lucene represents data as documents composed of fields, which are tokenized into terms and stored in an inverted index mapping Terms -\u0026gt; Documents. This structure enables fast term lookup and relevance scoring without scanning all documents linearly.\nA product document might include:\nText fields: title, description, category_path, brand. Keyword fields: product_id, seller_id, category_id, brand_id. Numeric fields: price, discount, inventory, rating, popularity_score. Facet fields: hierarchical or categorical fields suitable for Lucene\u0026rsquo;s facet module. Analyzers, tokenizers, and filters # Lucene analyzers perform lexical analysis: lowercasing, tokenization, stopword removal, stemming, etc., producing tokens that are written into the index and used for query parsing. Different fields can use different analyzers; for example, titles might use a language-specific analyzer, while SKUs and IDs use a keyword analyzer.\nFuzzyQuery and edit distance # Lucene\u0026rsquo;s FuzzyQuery leverages Damerau–Levenshtein edit distance to match terms within a configurable maximum number of edits (typically up to 2) and rewrites internally into an optimized multi-term query over matching terms. Higher edit distances can explode the term expansion set and are usually not recommended for general search, where n‑gram or spelling-correction approaches are often preferable.
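\nThese two ideas look like this in code (a sketch; the analyzer choices and field names are assumptions, not a finished schema):
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

class LuceneBasics {
    // Language-aware analysis for text fields, verbatim (keyword) analysis for IDs.
    Analyzer productAnalyzer() {
        return new PerFieldAnalyzerWrapper(
            new EnglishAnalyzer(),
            Map.of(\u0026#34;product_id\u0026#34;, new KeywordAnalyzer(),
                   \u0026#34;seller_id\u0026#34;, new KeywordAnalyzer()));
    }

    // \u0026#34;runnig\u0026#34; matches \u0026#34;running\u0026#34; within 2 edits; prefixLength=1 pins the
    // first character, keeping the automaton\u0026#39;s term expansion small.
    Query fuzzyTitleQuery() {
        return new FuzzyQuery(new Term(\u0026#34;title\u0026#34;, \u0026#34;runnig\u0026#34;), 2, 1);
    }
}
\n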
With the above in mind, let\u0026rsquo;s break down our components:\nLogical components # Indexing pipeline\nSource ingestion (catalog DB, CMS, inventory, pricing systems) Normalization and enrichment (category mapping, attribute extraction, language detection). Index builders (Lucene index writers for main index, suggestion index, facet taxonomies). Online query layer\nAPI gateway / GraphQL or REST endpoint for clients. Query parser and search service using Lucene IndexSearcher. Ranking and re-ranking subsystem (combining Lucene scores with business and personalization signals). Autocomplete and suggestion service (Lucene suggesters, possibly separate indices). Facet computation service (Lucene facet module / taxonomy index). Personalization services\nUser profile store (interests, historic interactions, constraints like preferred price ranges). Real-time behavioral feature extraction (session-level events such as clicks and add-to-cart). ML ranking models and feature service for inference. Physical deployment # Multi-node Lucene search cluster, sharded by product ID ranges or category, each shard replicated across AZs. Separate clusters or indices for: Core product search. Autocomplete/suggestion. Analytics or long-term storage (e.g., separate warehouse). API layer stateless; communicates with search cluster via RPC or a thin search gateway. 3. Data Modeling and Index Design # Product document schema # Key principles:\nChoose field types carefully (text vs keyword vs numeric vs facet) and optimize stored vs indexed vs doc values usage. Denormalize related information (brand, category path, basic seller info) into the document for performance. Example schema:\nIdentifiers: product_id (keyword, stored, not analyzed), sku_id (optional, keyword). Text fields: title (text, language-specific analyzer, indexed, term vectors optional), description (text, same analyzer as title but with lower boost), brand_name (text with minimal normalization; can also have a keyword brand_id), category_path (tokenized on separators for matching queries such as \u0026ldquo;running shoes\u0026rdquo;). Numeric / sortable fields: price (numeric doc values, used for range queries and sorting), discount_percent, rating, review_count, popularity_score (numeric doc values). Facet fields: brand_facet, category_facet, colour_facet, size_facet, price_bucket_facet as facet fields integrated with Lucene\u0026rsquo;s facet module and a separate taxonomy index. Personalization hints: segment_ids or audience_tags to support targeted boosting or filtering. Index layout and sharding strategy # Horizontal sharding by product ID range, potentially with hot categories split more finely. Each shard is a separate Lucene index with N replicas; IndexSearcher instances per replica. Global routing either via a search gateway or within an application server layer. Pros of this approach include straightforward scaling and isolation of large segments of data; cons include increased complexity for aggregated operations (e.g., global sort by price), which are handled by coordinating shard-level top‑K results. 4. Indexing Pipeline # Source ingestion and change data capture # Primary product data originates from product catalog and CMS systems. Change Data Capture (CDC) via Kafka - captures inserts, updates, and deletes. Auxiliary data (inventory counts, pricing updates, click and conversion stats) are ingested on separate topics and joined with product data in the indexing pipeline. Maintaining autocomplete and suggestion indices # Lucene\u0026rsquo;s suggest module provides specialized data structures (e.g., AnalyzingInfixSuggester) optimized for autocomplete without indexing all possible query combinations. The suggestion index is built from fields such as product titles and popular queries, and is updated periodically from the product index and query logs.\nFacet taxonomy management # Lucene\u0026rsquo;s facet module typically uses a taxonomy index to map hierarchical category paths to numerical identifiers instead of string labels, enabling efficient aggregation of facet counts over search results. The pipeline must:\nKeep the taxonomy consistent with the product index, updating when categories or facets change. Ensure faceted fields in product documents reference the correct taxonomy ordinals. Think of the taxonomy index as a master lookup table. It says:\n0 -\u0026gt; ROOT 1 -\u0026gt; Category/Electronics 2 -\u0026gt; Category/Electronics/Phones 3 -\u0026gt; Brand/Apple etc. Each product document just stores a list of these numbers (e.g. 2, 3, …) instead of full strings. Your pipeline’s job is to:\nKeep this lookup table complete and up‑to‑date with all categories that exist. Make sure every product document uses the right numbers from that table for its facet fields. If either of those breaks (taxonomy not updated, or wrong IDs stored in docs), facet counts will be wrong or missing.
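\nA minimal sketch of that indexing step with Lucene\u0026rsquo;s facet module (field names follow the schema above; the IndexWriter and DirectoryTaxonomyWriter wiring is assumed):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.IndexWriter;

class FacetIndexer {
    private final FacetsConfig config = new FacetsConfig();
    private final IndexWriter indexWriter;
    private final DirectoryTaxonomyWriter taxoWriter;

    FacetIndexer(IndexWriter indexWriter, DirectoryTaxonomyWriter taxoWriter) {
        this.indexWriter = indexWriter;
        this.taxoWriter = taxoWriter;
        // category_facet holds paths like Electronics/Phones
        config.setHierarchical(\u0026#34;category_facet\u0026#34;, true);
    }

    void indexProduct(String productId, String brand, String... categoryPath) throws Exception {
        Document doc = new Document();
        doc.add(new StringField(\u0026#34;product_id\u0026#34;, productId, Field.Store.YES));
        doc.add(new FacetField(\u0026#34;brand_facet\u0026#34;, brand));
        doc.add(new FacetField(\u0026#34;category_facet\u0026#34;, categoryPath));
        // build() swaps the FacetFields for taxonomy ordinals, registering
        // any new labels in the taxonomy index as a side effect.
        indexWriter.addDocument(config.build(taxoWriter, doc));
    }
}
\n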
5. Query Processing Flow # API and request model # The client sends a search request containing:\nquery: raw user query string. filters: list of filters (e.g., brand: Nike, price: [1000, 4000], in_stock: true). facets: requested facet dimensions and configurations (e.g., which facets to compute, maximum number of buckets). sort: primary and secondary sort keys; default is relevance. user_context: user ID, locale, device type, experiment IDs. pagination: page number, page size. Query understanding and parsing # Steps:\nPreprocess query: normalize whitespace, lowercasing, language detection, and optional spell correction. Tokenize: using the same analyzer as title/description fields where appropriate to ensure analysis symmetry between indexing and querying. Rewrite into Lucene Queries: Primary full-text query (e.g., MultiFieldQueryParser over title^3, brand_name^2, category_path^1.5, description). Optional FuzzyQuery clauses for misspellings on key fields like title or brand_name when term frequency is low or when short queries with high typo risk are detected. Filter queries for facets and other constraints, usually implemented as BooleanQuery with FILTER clauses for structured fields.
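\nA sketch of that rewrite step (boosts mirror the ones above; the analyzer choice is an assumption):
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;

class QueryRewriter {
    Query rewrite(String userQuery) throws ParseException {
        // Per-field boosts: a title match counts 3x, brand 2x, category 1.5x.
        Map\u0026lt;String, Float\u0026gt; boosts = Map.of(
            \u0026#34;title\u0026#34;, 3.0f,
            \u0026#34;brand_name\u0026#34;, 2.0f,
            \u0026#34;category_path\u0026#34;, 1.5f,
            \u0026#34;description\u0026#34;, 1.0f);
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
            new String[] {\u0026#34;title\u0026#34;, \u0026#34;brand_name\u0026#34;, \u0026#34;category_path\u0026#34;, \u0026#34;description\u0026#34;},
            new StandardAnalyzer(),
            boosts);
        return parser.parse(userQuery);
    }
}
\n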
public class SearchHit { private final String productId; private final float score; public SearchHit(String productId, float score) { this.productId = productId; this.score = score; } public String getProductId() { return productId; } public float getScore() { return score; } } // Aggregated response back to the caller. import org.apache.lucene.search.TotalHits; import java.util.List; public class SearchResponse { private final TotalHits totalHits; private final List\u0026lt;SearchHit\u0026gt; hits; public SearchResponse(TotalHits totalHits, List\u0026lt;SearchHit\u0026gt; hits) { this.totalHits = totalHits; this.hits = hits; } public TotalHits getTotalHits() { return totalHits; } public List\u0026lt;SearchHit\u0026gt; getHits() { return hits; } } // Abstraction over a shard-local Lucene IndexSearcher. import org.apache.lucene.search.Query; import org.apache.lucene.search.TopDocs; public interface ShardClient { int getShardId(); // Returns topK hits (by score) for this shard. TopDocs searchShard(Query query, int topK) throws Exception; // Load business fields (e.g., product_id) for a given doc. String loadProductId(int luceneDocId) throws Exception; } import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.queryparser.classic.QueryParser; import org.apache.lucene.search.*; import java.util.ArrayList; import java.util.List; import java.util.concurrent.*; public class SearchCoordinator { private final List\u0026lt;ShardClient\u0026gt; shardClients; private final ExecutorService executor; private final QueryParser queryParser; public SearchCoordinator(List\u0026lt;ShardClient\u0026gt; shardClients) { this.shardClients = shardClients; this.executor = Executors.newFixedThreadPool(shardClients.size()); this.queryParser = new QueryParser(\u0026#34;title\u0026#34;, new StandardAnalyzer()); // field \u0026amp; analyzer as appropriate } public SearchResponse search(SearchRequest request) throws Exception { Query query = queryParser.parse(request.getQueryString()); int numShards = shardClients.size(); int from = request.getFrom(); int size = request.getSize(); // Simple heuristic: ask each shard for enough docs to safely cover \u0026#39;from + size\u0026#39; int perShardTopK = computePerShardTopK(from, size, numShards); // 1) Fan out in parallel List\u0026lt;Future\u0026lt;TopDocs\u0026gt;\u0026gt; futures = new ArrayList\u0026lt;\u0026gt;(numShards); for (ShardClient shard : shardClients) { futures.add(executor.submit(() -\u0026gt; shard.searchShard(query, perShardTopK))); } // 2) Collect shard results and set shardIndex on ScoreDocs (required for TopDocs.merge) TopDocs[] shardHits = new TopDocs[numShards]; TotalHits.Relation relation = TotalHits.Relation.EQUAL_TO; long totalHitsValue = 0L; for (int i = 0; i \u0026lt; numShards; i++) { TopDocs td = futures.get(i).get(); // wait for shard i int shardId = shardClients.get(i).getShardId(); for (ScoreDoc sd : td.scoreDocs) { sd.shardIndex = shardId; // important for correct merge across shards } shardHits[i] = td; totalHitsValue += td.totalHits.value; if (td.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) { relation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO; } } TotalHits totalHits = new TotalHits(totalHitsValue, relation); // 3) Merge and paginate globally // Lucene provides TopDocs.merge(start, topN, TopDocs[]) TopDocs merged = TopDocs.merge(from, size, shardHits); // 4) Materialize business hits by going back to shards and loading product IDs List\u0026lt;SearchHit\u0026gt; hits =
new ArrayList\u0026lt;\u0026gt;(merged.scoreDocs.length); for (ScoreDoc sd : merged.scoreDocs) { ShardClient shard = findShardById(sd.shardIndex); String productId = shard.loadProductId(sd.doc); hits.add(new SearchHit(productId, sd.score)); } return new SearchResponse(totalHits, hits); } private ShardClient findShardById(int shardId) { for (ShardClient shard : shardClients) { if (shard.getShardId() == shardId) { return shard; } } throw new IllegalStateException(\u0026#34;Unknown shardId: \u0026#34; + shardId); } private int computePerShardTopK(int from, int size, int numShards) { // Very simple heuristic; in production you’d tune this and cap it. int depth = from + size; int base = Math.max(size, depth / numShards); return Math.min(1000, base * 2); // example cap } } import org.apache.lucene.document.Document; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; import org.apache.lucene.search.TopDocs; public class LuceneShardClient implements ShardClient { private final int shardId; private final IndexSearcher searcher; public LuceneShardClient(int shardId, IndexSearcher searcher) { this.shardId = shardId; this.searcher = searcher; } @Override public int getShardId() { return shardId; } @Override public TopDocs searchShard(Query query, int topK) throws Exception { // Simple score-based topK; could use sort-based search instead return searcher.search(query, topK); } @Override public String loadProductId(int luceneDocId) throws Exception { Document doc = searcher.doc(luceneDocId); return doc.get(\u0026#34;product_id\u0026#34;); // or however you store it } } Using the Search Gateway\nList\u0026lt;ShardClient\u0026gt; shards = List.of( new LuceneShardClient(0, shard0Searcher), new LuceneShardClient(1, shard1Searcher), new LuceneShardClient(2, shard2Searcher) ); SearchCoordinator coordinator = new SearchCoordinator(shards); SearchRequest req = new SearchRequest(\u0026#34;running shoes\u0026#34;, 0, 10); SearchResponse resp = coordinator.search(req); for (SearchHit hit : resp.getHits()) { System.out.println(hit.getProductId() + \u0026#34; score=\u0026#34; + hit.getScore()); } Personalization and learning to rank # Personalization uses user/session specific features to adjust ranking:\nUser features: historical preferences (brands, categories), price sensitivity, frequent size or color choices. Contextual features: device, locale, referrer, time of day. Query-level features: query intent classification (navigational vs exploratory), semantic query category. A typical design uses a two-stage ranking pipeline:\nStage 1: Lucene returns top N candidates (e.g., 100–1000) using the base relevance and lightweight boosts. Stage 2: A learning-to-rank (LTR) model (e.g., gradient boosted trees or neural ranker) re-scores these candidates using a feature vector built from: Textual relevance scores and term statistics. Product attributes and business signals. User and context features. The stage-2 model is hosted in a separate service or library loaded in search nodes, with cached models and feature extraction optimized to stay under latency budgets.\nQuery-time controls and explainability # Provide internal tools to inspect ranking decisions, exposing Lucene\u0026rsquo;s explain output and LTR feature contributions for a given query and product. Allow controlled overrides (e.g., \u0026ldquo;pin this product to the top for query X\u0026rdquo;) via a merchandiser-configured rules engine. 7. 
Fuzzy Search Design # When to apply fuzzy search # Fuzzy search handles typos and approximate spellings but must be constrained to avoid noise and performance issues, especially as Lucene restricts FuzzyQuery to at most two edits for performance reasons. The system applies fuzzy logic when:\nThe query is short (1–3 tokens) and likely to contain typos. Exact term matching yields few or no results. The query tokens have low document frequency, signaling rarity and possible misspelling. Implementation options # Direct FuzzyQuery on fields such as title or brand_name with tuned maxEdits and prefixLength. Spell-check index: a dedicated Lucene index or suggest-based dictionary with correctly spelled terms; query terms are corrected before or in parallel with search, and corrected queries are used to retrieve results. N‑gram fields: index additional n‑gram fields (e.g., 2–3 character shingled tokens) and use them for recall expansion at the cost of larger index size. The chosen approach often combines lightweight spell correction for very short queries with limited fuzzy expansion for specific fields.\n8. Autocomplete and Suggestions # Autocomplete requirements # Autocomplete must respond with suggestions within tens of milliseconds and can be implemented via Lucene suggesters, which maintain optimized data structures in memory to resolve partial queries without enumerating all word combinations.\nSuggestion index design # Build a suggestion index using AnalyzingInfixSuggester or similar:\nKeys: product titles, brand names, categories, and popular queries. Weight: popularity metrics such as past query frequency or product sales. Payload: serialized product or metadata (e.g., product IDs, image URLs) to display rich suggestions. Use analyzers consistent with main index for language handling; infix suggester enables matching within multi-word terms.\nRequest handling # As the user types, the client sends the current prefix; the autocomplete service queries the suggester for top‑K completions.\nYou can also implement debouncing at client side to prevent too many queries from being sent to the autocomplete service. Wrap the “call autocomplete API” function in a debounce, e.g. wait 150–300 ms after the last keypress before firing; intermediate keystrokes reset the timer. This alone can cut requests by ~80–90% while still feeling instant to users if you keep the delay under ~250 ms of perceived latency. Alternatively or additionally, use throttling (“at most one request every X ms”) for long typing sessions.\nFurther, do not hit the backend until the user has typed at least N characters (commonly 2–3). Ignore non‑semantic keys (arrows, tab, shift, ctrl, etc.) and avoid firing requests on every backspace or repeated same value.\nSuggestions may be:\nQuery suggestions (likely full queries that lead to good results). Direct product suggestions (individual products or brands) for fast navigation. The service may filter suggestions using context (user locale, category context, prior interactions) and maintain separate suggesters per locale or per major category.\n","date":"4 April 2026","externalUrl":null,"permalink":"/posts/designing-product-search/","section":"Posts","summary":"Introduction # Search is often the primary way users find products on e-commerce platforms. 
A high-quality search system should be not only fast and resilient but also highly relevant.\n","title":"Designing Product Search","type":"posts"},{"content":"","date":"4 April 2026","externalUrl":null,"permalink":"/tags/elasticsearch/","section":"Tags","summary":"","title":"Elasticsearch","type":"tags"},{"content":"","date":"4 April 2026","externalUrl":null,"permalink":"/tags/l2r/","section":"Tags","summary":"","title":"L2R","type":"tags"},{"content":"","date":"4 April 2026","externalUrl":null,"permalink":"/tags/ranking/","section":"Tags","summary":"","title":"Ranking","type":"tags"},{"content":"","date":"4 April 2026","externalUrl":null,"permalink":"/categories/search/","section":"Categories","summary":"","title":"Search","type":"categories"},{"content":"","date":"2 April 2026","externalUrl":null,"permalink":"/categories/backend/","section":"Categories","summary":"","title":"Backend","type":"categories"},{"content":"","date":"2 April 2026","externalUrl":null,"permalink":"/tags/flash-sale/","section":"Tags","summary":"","title":"Flash Sale","type":"tags"},{"content":"","date":"2 April 2026","externalUrl":null,"permalink":"/tags/high-concurrency/","section":"Tags","summary":"","title":"High Concurrency","type":"tags"},{"content":" Introduction # A high-concurrency flash sale system is designed to handle a large number of concurrent users who are trying to purchase a limited number of items in a short period of time.\n1. Understanding the Requirements # Let\u0026rsquo;s consider the core functional requirements for our flash sale system:\nAdmins can create, schedule, pause and stop flash sale events with limited inventory per SKU and discount rules. Users can browse sale listings, see near-real-time product availability and countdown timers, and attempt purchases during the sale window. System must accept purchase attempts, validate inventory, apply discounts, and record orders. And now for the non-functional requirements:\nHandle sudden 10-100x spikes vs normal traffic; peak may be hundreds of thousands to millions of requests per minute at the edge. Low latency for \u0026ldquo;hot path\u0026rdquo; APIs (inventory check / reservation) - sub-100ms p95 at the app layer. Very high availability during the event, strong consistency for inventory and orders (no double-selling the same item), and graceful degradation instead of crashes. 2. High-Level Design # At a high level, you want a layered design that pushes load handling to the edges and keeps the \u0026ldquo;critical section\u0026rdquo; (inventory decrement + order creation) as small and isolated as possible.\nTypical Components # CDN: Static pages, product images etc. can be served from the CDN. API Gateway: Handles authentication, rate limiting, request routing to appropriate services. (See API Gateway for further details). Rate Limiter: Implements the token bucket algorithm to handle sudden spikes in traffic; a minimal sketch follows this list. (See Rate Limiter for further details). Flash Sale Service: Validates sale window, user eligibility, and interacts with Inventory + Queue. Inventory Service: Maintains hot inventory in an in-memory store like Redis. Message Queue (Kafka, MQ): Buffers accepted purchase attempts and decouples spikes from downstream systems. Order Service: Consumes from the queue, creates orders, coordinates with Payment Service and persists to DB. Payment Service: Integrates with payment gateways; updates order state. DB: Relational DB (Postgres, MySQL) as system of record for orders, payments and inventory.
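Since the rate limiter is the component that absorbs the spike at the gate, here is a minimal, illustrative token bucket in Java. This is a sketch under assumed names (TokenBucket, tryAcquire), not the implementation from the linked Rate Limiter post:\npublic class TokenBucket { private final long capacity; private final double refillPerNano; private double tokens; private long lastRefill; public TokenBucket(long capacity, double refillPerSecond) { this.capacity = capacity; this.refillPerNano = refillPerSecond / 1_000_000_000.0; this.tokens = capacity; this.lastRefill = System.nanoTime(); } // Refills lazily based on elapsed time, then spends one token if available public synchronized boolean tryAcquire() { long now = System.nanoTime(); tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano); lastRefill = now; if (tokens \u0026gt;= 1.0) { tokens -= 1.0; return true; } return false; } }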
Sample Component Diagram using Redis for inventory # 3. Data Modelling for a Flash Sale # The data model should make the \u0026ldquo;inventory-critical path\u0026rdquo; small and easily protected, while the rest uses standard e-commerce schemas.\nKey Entities # Product: Basic product info, SKU, images, description etc. FlashSaleEvent: Defines the sale window, start time, end time, status (upcoming, active, ended) etc. FlashSaleProduct: Links products to a sale event, with sale-specific price and inventory. Reservation: Temporary hold on inventory for a user during the purchase flow, contains fields like reservation_id, user_id, product_id, quantity, expires_at, status (active, expired, converted). Order: Final order record, created after successful payment. Contains fields like order_id, user_id, product_id, quantity, price, status (pending, paid, failed). During the sale, Redis is the single source of truth for inventory, with the DB updated asynchronously; the DB remains the system of record for orders and payments.\n4. Inventory Oversell Prevention Strategies # Preventing oversell is the heart of any flash sale system; the design should ensure that no path can decrement the inventory to below zero.\na. Database Only # Keep remaining_quantity in the DB and use row-level locks or optimistic locking to prevent oversell.\n(Optimistic Locking is a strategy where we assume that the data is not going to be contended and we don\u0026rsquo;t use locks. Instead, we use a version column (or timestamp, or checksum) to detect if the data has been modified by another transaction. If it has, we abort the transaction and retry.)\nb. Redis as the Hot Inventory Store # For high concurrency, maintain inventory counters in Redis and use atomic operations / Lua scripts. A Lua script runs atomically within Redis, ensuring that no two requests can decrement the inventory below zero (a minimal sketch follows section 5 below).\nc. Queue-Based Strict Serialization # For extreme fairness / robustness, you can treat inventory as a ticket dispenser behind a queue. Enqueue each validated purchase attempt to a per-event queue. A small pool of consumers dequeues attempts, decrements inventory, and creates reservations/orders. This serializes all inventory writes and is extremely robust, but adds latency and requires a separate \u0026ldquo;inventory worker\u0026rdquo; process.\n5. Reservation vs Direct Order Creation # You need to choose whether to reserve stock when a user clicks \u0026ldquo;Buy Now\u0026rdquo; and finalize after payment, or to create the order directly.\nReservation Model (Common) # For every Buy Now request, you atomically decrement inventory in Redis, create a Reservation record in Redis with a short TTL (e.g. 10 minutes) or in DB, and wait for the user to complete payment. If the payment succeeds before expiry, the reservation is converted to a confirmed order. If not, the reservation expires and inventory is released. You could run a recovery job that scans expired reservations and adds stock back.\nDirect Order Model # Immediately create an order with Pending_Payment state and decrement inventory. If payment fails or times out, you mark the order cancelled and release the inventory. This can be done using a cron job that scans for orders in Pending_Payment state and releases the inventory.
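Both models begin with the atomic decrement from strategy 4(b). Here is a minimal sketch of that step using a Lua script via Spring Data Redis; the key name, the script itself, and the redisTemplate/productId/quantity wiring are illustrative assumptions, not a production implementation:\nimport org.springframework.data.redis.core.script.DefaultRedisScript; // Atomically checks and decrements stock; returns -1 if there is not enough left String script = \u0026#34;local left = tonumber(redis.call(\u0026#39;GET\u0026#39;, KEYS[1]) or \u0026#39;0\u0026#39;) \u0026#34; + \u0026#34;local want = tonumber(ARGV[1]) \u0026#34; + \u0026#34;if left \u0026lt; want then return -1 end \u0026#34; + \u0026#34;return redis.call(\u0026#39;DECRBY\u0026#39;, KEYS[1], want)\u0026#34;; Long remaining = redisTemplate.execute(new DefaultRedisScript\u0026lt;\u0026gt;(script, Long.class), List.of(\u0026#34;flashsale:stock:\u0026#34; + productId), String.valueOf(quantity)); boolean reserved = remaining != null \u0026amp;\u0026amp; remaining \u0026gt;= 0; // -1 means insufficient stock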
The direct order model is conceptually simple but requires a robust recovery mechanism to handle failures and prevent dead inventory.\nWhen reservation is better: when conversion rate is uncertain and you expect many failed/abandoned attempts (typical in flash sales), reservations keep \u0026ldquo;trash\u0026rdquo; out of the main order table. It prevents the order table from filling up with failed orders and keeps the critical path clean.\n","date":"2 April 2026","externalUrl":null,"permalink":"/posts/high-concurrency-flash-sale-systems/","section":"Posts","summary":"Introduction # A high-concurrency flash sale system is designed to handle a large number of concurrent users who are trying to purchase a limited number of items in a short period of time.\n","title":"High-Concurrency Flash Sale Systems","type":"posts"},{"content":"","date":"29 March 2026","externalUrl":null,"permalink":"/tags/consistent-hashing/","section":"Tags","summary":"","title":"Consistent Hashing","type":"tags"},{"content":"","date":"29 March 2026","externalUrl":null,"permalink":"/categories/distributed-systems/","section":"Categories","summary":"","title":"Distributed Systems","type":"categories"},{"content":"Imagine a small app where each client talks directly to a specific server, say https://10.0.0.1:8080. As traffic grows, that one server becomes a bottleneck, and if it crashes, the whole app is effectively down for any client pointing at it.\nTo scale out, you might spin up more servers like https://10.0.0.2:8080, https://10.0.0.3:8080, and so on. But now each client must somehow \u0026ldquo;know\u0026rdquo; which server to talk to and when to switch, which is brittle and hard to manage. DNS-based tricks (round-robin DNS) help a bit but react slowly to failures and don\u0026rsquo;t give you fine-grained control over distribution or health.\nThat operational pain is why Load Balancers exist: they sit in front of a pool of backend servers and act as a traffic director, distributing incoming requests across them based on some algorithm.\npublic class LoadBalancer { private final List\u0026lt;BackendServer\u0026gt; servers; public LoadBalancer(List\u0026lt;BackendServer\u0026gt; servers){ this.servers = servers; } public void handleRequest(String request){ BackendServer server = chooseServer(request); server.handleRequest(request); } private BackendServer chooseServer(String request){ // selection algorithm goes here; see the strategies below return servers.get(0); // placeholder } } Layer 4 vs. Layer 7 Load Balancing # Load balancers typically operate at different layers of the OSI model:\nLayer 4 (Transport Layer): Operates at the network and transport layer (TCP/UDP). It makes routing decisions based on IP addresses and port numbers. It is extremely fast because it doesn\u0026rsquo;t inspect the application-level data (like HTTP headers or cookies). Layer 7 (Application Layer): Operates at the application layer (HTTP/HTTPS). It can inspect the content of the request (URL, headers, cookies, query parameters). This allows for much more sophisticated routing, such as sending traffic for /api/v1/users to one set of servers and /api/v1/orders to another. Server Selection Algorithms # To decide which server should receive a request, load balancers use various algorithms:\n1. Round Robin # A classic algorithm where requests are distributed sequentially across the list of servers.
Simple but doesn\u0026rsquo;t account for server load or capacity.\nprivate final AtomicInteger currentIndex = new AtomicInteger(0); private BackendServer chooseServer(String request){ int index = currentIndex.getAndUpdate(i -\u0026gt; (i + 1) % servers.size()); return servers.get(index); } Real load balancers monitor backend health and avoid sending traffic to servers that are down or misbehaving. Typical implementations use periodic health checks (e.g., HTTP GET /health) and mark servers healthy or unhealthy based on the response.\n2. Sticky Sessions # In a normal load balancer, each request is routed independently. This is fine for stateless applications, but it\u0026rsquo;s a problem for stateful applications where the server needs to maintain session information for each client.\nOne way to handle this is by maintaining a ConcurrentHashMap of session ID to server mapping on the load balancer. However, this only lives in one JVM instance. If you have multiple load balancer instances, you need to sync session information across them, which is complex and inefficient.\nThis is why many platforms use cookie-based or hashing-based affinity rather than in-memory session maps.\n3. Hashing Based Sticky Sessions # A straightforward way to get stickiness is:\nChoose a key like client IP or session ID. Compute a hash h(key). Route to server at index h(key) % N, where N is the number of servers. This works as long as the set of servers never changes. But as soon as you add or remove a server, N changes and almost all keys get remapped to different indices, which destroys cache locality and sticky mappings.\nConsistent Hashing: Step by Step # Consistent Hashing is a scheme that allows us to minimize the number of keys that get remapped when the number of servers changes.\nHash Space as a Ring: Imagine all possible hash values (e.g., 0 to 2^32 - 1) arranged on a circle; this is the \u0026ldquo;hash ring\u0026rdquo;. Place Servers on the Ring: For each backend server, hash its identifier (e.g., IP address or hostname) to a point on the ring. Place Keys on the Ring: For each key (e.g., client IP or session ID), hash it into the space and find the first server clockwise from the key\u0026rsquo;s position. That server is responsible for the key. Adding or Removing a Server: When a server is added, it takes over only the keys that fall between its position and the previous server\u0026rsquo;s position (counter-clockwise). When a server is removed, its keys are reassigned to the next server clockwise. Virtual Nodes for Better Balance # If you place each physical server once on the ring, the distribution can be uneven. To reduce this imbalance, each physical server is mapped to multiple points on the ring. These are called Virtual Nodes.\nNow, let\u0026rsquo;s incrementally design a small, self contained Java implementation that contains - a. a dynamic pool of backend servers. b. a hash function to distribute our keys onto the consistent hash ring. c. a consistent hash ring with virtual nodes. d. a router for requests based on some sticky key (eg: session ID or client IP)\na. 
Backend Server # public class BackendServer { private final String id; private final String host; private final int port; public BackendServer(String id, String host, int port){ this.id = id; this.host = host; this.port = port; } public String getId(){ return id; } public String getHost(){ return host; } public int getPort(){ return port; } @Override public String toString(){ return \u0026#34;BackendServer{\u0026#34; + \u0026#34;id=\u0026#39;\u0026#34; + id + \u0026#39;\\\u0026#39;\u0026#39; + \u0026#34;, host=\u0026#39;\u0026#34; + host + \u0026#39;\\\u0026#39;\u0026#39; + \u0026#34;, port=\u0026#34; + port + \u0026#39;}\u0026#39;; } } b. Hash Function # public final class HashUtils { private static final String HASH_ALGORITHM = \u0026#34;SHA-256\u0026#34;; public static long hashToLong(String key) { try { MessageDigest md = MessageDigest.getInstance(HASH_ALGORITHM); byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8)); // Take the first 8 bytes and convert to a long ByteBuffer buffer = ByteBuffer.wrap(digest); return buffer.getLong(); } catch (NoSuchAlgorithmException e) { // Should not happen for SHA-256 in a standard JDK throw new RuntimeException(\u0026#34;Hash algorithm not available\u0026#34;, e); } } } c. Consistent Hash Ring # public class ConsistentHashRing { private final NavigableMap\u0026lt;Long, BackendServer\u0026gt; ring = new TreeMap\u0026lt;\u0026gt;(); private final int virtualNodes; public ConsistentHashRing(int virtualNodes) { if (virtualNodes \u0026lt;= 0) { throw new IllegalArgumentException(\u0026#34;virtualNodes must be \u0026gt; 0\u0026#34;); } this.virtualNodes = virtualNodes; } public synchronized void addServer(BackendServer server) { Objects.requireNonNull(server, \u0026#34;server must not be null\u0026#34;); for (int i = 0; i \u0026lt; virtualNodes; i++) { String vnodeKey = server.getId() + \u0026#34;#\u0026#34; + i; long hash = HashUtils.hashToLong(vnodeKey); ring.put(hash, server); } } public synchronized void removeServer(BackendServer server) { Objects.requireNonNull(server, \u0026#34;server must not be null\u0026#34;); for (int i = 0; i \u0026lt; virtualNodes; i++) { String vnodeKey = server.getId() + \u0026#34;#\u0026#34; + i; long hash = HashUtils.hashToLong(vnodeKey); ring.remove(hash); } } public synchronized BackendServer getServerForKey(String key) { if (ring.isEmpty()) { return null; } long hash = HashUtils.hashToLong(key); // Find first entry \u0026gt;= hash var entry = ring.ceilingEntry(hash); if (entry == null) { // Wrap around to the first entry entry = ring.firstEntry(); } return entry.getValue(); } public synchronized int size() { return ring.size(); } public synchronized Collection\u0026lt;BackendServer\u0026gt; getAllServersSnapshot() { // This returns a collection view; to be safe, you might want to copy it return ring.values(); } } d. ConsistentHashLoadBalancer # public class ConsistentHashLoadBalancer { private final ConsistentHashRing ring; public ConsistentHashLoadBalancer(int virtualNodesPerServer) { this.ring = new ConsistentHashRing(virtualNodesPerServer); } public void registerServer(BackendServer server) { ring.addServer(server); } public void unregisterServer(BackendServer server) { ring.removeServer(server); } /** * Route based on a sticky key, e.g. session ID, client IP, or custom header. 
*/ public BackendServer routeRequest(String stickyKey) { if (stickyKey == null) { throw new IllegalArgumentException(\u0026#34;stickyKey must not be null\u0026#34;); } return ring.getServerForKey(stickyKey); } } Now let\u0026rsquo;s verify the above usage with a test\npublic class ConsistentHashLoadBalancerTest { @Test public void testConsistentLoadBalancer() { ConsistentHashLoadBalancer lb = new ConsistentHashLoadBalancer(100); BackendServer s1 = new BackendServer(\u0026#34;server-A\u0026#34;, \u0026#34;10.0.0.1\u0026#34;, 8080); BackendServer s2 = new BackendServer(\u0026#34;server-B\u0026#34;, \u0026#34;10.0.0.2\u0026#34;, 8080); BackendServer s3 = new BackendServer(\u0026#34;server-C\u0026#34;, \u0026#34;10.0.0.3\u0026#34;, 8080); lb.registerServer(s1); lb.registerServer(s2); lb.registerServer(s3); String[] clients = { \u0026#34;user-1-session\u0026#34;, \u0026#34;user-2-session\u0026#34;, \u0026#34;user-3-session\u0026#34;, \u0026#34;user-4-session\u0026#34;, \u0026#34;user-5-session\u0026#34; }; System.out.println(\u0026#34;=== Initial routing ===\u0026#34;); Map\u0026lt;String, BackendServer\u0026gt; initialMapping = new LinkedHashMap\u0026lt;\u0026gt;(); for (String client : clients) { BackendServer server = lb.routeRequest(client); initialMapping.put(client, server); System.out.printf(\u0026#34;Client %s -\u0026gt; %s%n\u0026#34;, client, server); } // Now add a new server and see who moved BackendServer s4 = new BackendServer(\u0026#34;server-D\u0026#34;, \u0026#34;10.0.0.4\u0026#34;, 8080); lb.registerServer(s4); System.out.println(\u0026#34;\\n=== After adding server-D ===\u0026#34;); int moved = 0; for (String client : clients) { BackendServer server = lb.routeRequest(client); BackendServer before = initialMapping.get(client); boolean same = before.getId().equals(server.getId()); if (!same) { moved++; } System.out.printf( \u0026#34;Client %s -\u0026gt; %s (was %s)%s%n\u0026#34;, client, server, before, same ? 
\u0026#34;\u0026#34; : \u0026#34; \u0026lt;-- moved\u0026#34; ); } System.out.printf(\u0026#34;%nTotal clients moved: %d out of %d%n\u0026#34;, moved, clients.length); } } Output:\n=== Initial routing === Client user-1-session -\u0026gt; BackendServer{id=\u0026#39;server-A\u0026#39;, host=\u0026#39;10.0.0.1\u0026#39;, port=8080} Client user-2-session -\u0026gt; BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080} Client user-3-session -\u0026gt; BackendServer{id=\u0026#39;server-B\u0026#39;, host=\u0026#39;10.0.0.2\u0026#39;, port=8080} Client user-4-session -\u0026gt; BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080} Client user-5-session -\u0026gt; BackendServer{id=\u0026#39;server-A\u0026#39;, host=\u0026#39;10.0.0.1\u0026#39;, port=8080} === After adding server-D === Client user-1-session -\u0026gt; BackendServer{id=\u0026#39;server-D\u0026#39;, host=\u0026#39;10.0.0.4\u0026#39;, port=8080} (was BackendServer{id=\u0026#39;server-A\u0026#39;, host=\u0026#39;10.0.0.1\u0026#39;, port=8080}) \u0026lt;-- moved Client user-2-session -\u0026gt; BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080} (was BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080}) Client user-3-session -\u0026gt; BackendServer{id=\u0026#39;server-B\u0026#39;, host=\u0026#39;10.0.0.2\u0026#39;, port=8080} (was BackendServer{id=\u0026#39;server-B\u0026#39;, host=\u0026#39;10.0.0.2\u0026#39;, port=8080}) Client user-4-session -\u0026gt; BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080} (was BackendServer{id=\u0026#39;server-C\u0026#39;, host=\u0026#39;10.0.0.3\u0026#39;, port=8080}) Client user-5-session -\u0026gt; BackendServer{id=\u0026#39;server-A\u0026#39;, host=\u0026#39;10.0.0.1\u0026#39;, port=8080} (was BackendServer{id=\u0026#39;server-A\u0026#39;, host=\u0026#39;10.0.0.1\u0026#39;, port=8080}) Total clients moved: 1 out of 5 Conclusion # Consistent Hashing is the gold standard for distributing traffic in stateful systems. While simple algorithms like Round Robin or standard modulo hashing serve as a good starting point, they fail to provide the stability needed when backend nodes are dynamic.\nThis pattern is fundamental to modern distributed architecture, powering distributed caches like Redis and Memcached, partitioned databases like Cassandra and DynamoDB, and global Content Delivery Networks (CDNs) where maximizing cache hit rates is critical. By mapping both servers and keys onto a unified hash ring and using techniques like virtual nodes, we can achieve high performance and minimal data movement during scaling events. Mastering these concepts is essential for building the resilient, auto-scaling infrastructure that powers the web today.\n","date":"29 March 2026","externalUrl":null,"permalink":"/posts/load-balancers-and-consistent-hashing/","section":"Posts","summary":"Imagine a small app where each client talks directly to a specific server, say https://10.0.0.1:8080. 
As traffic grows, that one server becomes a bottleneck, and if it crashes, the whole app is effectively down for any client pointing at it.\n","title":"Load Balancers and Consistent Hashing","type":"posts"},{"content":"","date":"29 March 2026","externalUrl":null,"permalink":"/tags/load-balancing/","section":"Tags","summary":"","title":"Load Balancing","type":"tags"},{"content":"","date":"29 March 2026","externalUrl":null,"permalink":"/tags/scalability/","section":"Tags","summary":"","title":"Scalability","type":"tags"},{"content":"","date":"28 March 2026","externalUrl":null,"permalink":"/tags/caching/","section":"Tags","summary":"","title":"Caching","type":"tags"},{"content":"Modern APIs frequently access databases or run complex business logic that introduces significant latency and consumes CPU and I/O resources. Without caching, every request pays the full cost of database queries, network calls, and computation. This can lead to slow response times and poor scalability as traffic increases.\nCaching is a technique for storing expensive-to-compute or slow-to-fetch data in a faster storage layer so repeated requests can be served quickly without hitting the original source each time. For Java APIs, a solid caching strategy can drastically improve latency, throughput and reliability, especially under heavy load.\nCaching is especially valuable for:\nExpensive database queries that aggregate or join large tables, such as analytics dashboards or complex search queries. Frequently accessed but rarely changing data, such as reference data, configuration, product catalogs or content metadata. API responses from third-party services that are rate-limited or charge per call, where caching reduces cost and protects against throttling. CPU-intensive computations like report generation, HTML rendering, or recommendation results which can be reused across many requests. A very common choice for caching is Redis.\nWhat Redis Brings to Caching # Redis is an open-source, in-memory key-value data structure store that keeps data in RAM rather than on disk, making access orders of magnitude faster than traditional databases.\nFor caching, Redis offers several critical features:\nIn-memory storage for very low read and write latency. Configurable Time To Live (TTL) for automatic cache expiration. Eviction policies like LRU and LFU to remove the least useful data when memory is full. Atomic operations for thread-safe cache updates. Optional persistence and replication, allowing data to survive restarts. Typical manual caching steps include:\nOn a read, compute a cache key (for example, user:123:profile) and query Redis. If present, deserialize and return the value. If not present, fetch from the database, serialize and store in Redis with a TTL, then return the value. On writes, either update the cache entry (write-through) or invalidate/delete the key and allow it to be repopulated on the next read (write-invalidate/cache-aside). Core Caching Principles # 1. What to Cache (Data Selection) # Not all data should be cached. Good candidates include:\nData that is read frequently and updated infrequently, such as user profiles, product details, or configuration settings. Expensive-to-calculate results such as complex reports, aggregations, or multi-step workflows. Responses from external APIs where rate limits or cost constraints apply. Poor candidates include highly volatile data that changes very frequently or data with strong transactional consistency requirements that cannot tolerate staleness.\n2.
Read-Write Patterns and Workload # Understanding the API\u0026rsquo;s read/write ratio and access distribution / data shape is essential for determining cache value.\nRead-heavy workloads benefit greatly from caching because a single cached object may serve thousands of requests. Write-heavy or strongly consistent workloads may require more conservative caching (short TTLs, write-through, or event driven invalidation) to avoid stale responses. Skewed traffic patterns (e.g. hot keys, popular items) can be handled well with caching but also pose risks of cache stampedes when keys expire. 3. Key Design and Namespacing # Cache keys must be designed so that they uniquely and predictably identify cached items.\nKeys typically incorporate entity type, identifier and sometimes version or locale, such as user:123:profile:v1. Namespacing keys by domain or service (for example, catalog:product:42) simplifies bulk invalidation and avoids collisions across teams. Versioning keys allows mass invalidation when schemas or interpretation logic changes by simply bumping a version prefix/suffix, without individually deleting old entries. Clear conventions for key construction should be documented and enforced to avoid subtle cache bugs.\n4. Consistency and Freshness Requirements # A core design trade-off in caching is between consistency and data freshness.\nTTL-based expiration: Each key is given a TTL; once expired, the entry is refetched on the next request. This provides eventual consistency and predictable staleness windows. Write-through: Updates are written to the cache at the write time, either updating or invalidating entries immediately, providing stronger consistency at the cost of increased write path complexity. Event-driven invalidation: Cache invalidation is triggered by events from the data source (e.g., database change events, message queue updates) rather than relying solely on TTLs. This allows for more immediate cache updates and can reduce data staleness in write-heavy workloads. The strategy should specify, per data type, the acceptable staleness window, whether stale reads are allowed, and how conflicts are resolved when outdated data is served.\n5. Handling Cache Stampede # A cache stampede occurs when many requests simultaneously miss the cache for the same key, causing a surge of requests to the underlying data source. Mitigation techniques include:\nRequest Coalescing: Only one request is allowed to fetch the data from the database and populate the cache, while others wait for the cache to be populated. Staggered Expiration: Instead of setting the same TTL for all keys, set random TTLs for each key to avoid simultaneous expiration. Cache Warming: Pre-populate the cache with frequently accessed data to avoid cache misses. 6. Serialization Format and Object Size # Java objects must be serialized before being stored in Redis.\nCommon formats include JSON, or binary formats. Large objects can hurt performance, increase network latency and memory usage. Sometimes it\u0026rsquo;s preferred to only store the fields required by the API or partition the large object into multiple keys. 7. Observability and Tuning # A caching strategy is not complete without observability. a. Track cache hit/miss ratios, latency distributions, error rates, and eviction counts to understand whether the cache is effective. b. 
Use insights from monitoring to adjust TTLs, eviction policies, and which data is cached as access patterns evolve.\nReal World Example # Now let\u0026rsquo;s consider a real-world example of caching in a Java API. Consider an e-commerce API with the following endpoint:\nGET /products/{id}: Fetch product details by ID. The backing store is a relational database (e.g., Postgres/SQL Server), with Redis used as a distributed cache.\nHigh-level strategy\nWhat to cache? We can cache the individual product details (GET /products/{id}). These are read-heavy and not updated frequently.\nRead/Write Patterns\nRead: First check the cache; if the data is not found, retrieve it from the database, store it in the cache, and return the value. Write: Every time a product is updated or deleted, delete the cache key. Key Design\nSingle Product: product:{id} TTL The product details are not updated frequently, so we can use a long TTL. If changes to your product catalog are seasonal, you could go as high as 30 days.\nStampede protection Simple request coalescing by using a lock or by allowing only one repopulate per key.\nSerialization Format and Object Size If the product object is too big and you see cache fetches taking longer for some products, store only the fields that are returned by the API.\nObservability and Tuning Track cache hit ratios, time taken for cache fetches, cache evictions, and error rates. Use these insights to tune TTLs, eviction policies, and which data is cached as access patterns evolve.\nTalk is cheap, let\u0026rsquo;s look at the above high-level strategy in action.\nFirst let\u0026rsquo;s look at our product API without stampede protection:\n@Service public class ProductService { private final ProductRepository repository; private final RedisTemplate\u0026lt;String, Product\u0026gt; redisTemplate; public ProductService(ProductRepository repository, RedisTemplate\u0026lt;String, Product\u0026gt; redisTemplate) { this.repository = repository; this.redisTemplate = redisTemplate; } public Product getProduct(Long id) { String key = \u0026#34;product:\u0026#34; + id; Product cached = null; try { cached = redisTemplate.opsForValue().get(key); } catch (Exception e) { // Logging } if (cached != null) { return cached; } Product product = repository.findById(id).orElse(null); if (product == null) { return null; } try { redisTemplate.opsForValue().set(key, product, 30, TimeUnit.DAYS); // ~1 month; TimeUnit has no MONTH unit } catch (Exception e) { // Logging } return product; } public Product upsertProduct(Product product){ Product saved = repository.save(product); String key = \u0026#34;product:\u0026#34; + saved.getId(); try { redisTemplate.delete(key); } catch (Exception e){ // Logging / logic to retry deletion } return saved; } } Now let\u0026rsquo;s try to add stampede protection to the above get method.\npublic Product getProduct(Long id) { String key = \u0026#34;product:\u0026#34; + id; Product cached = null; try { cached = redisTemplate.opsForValue().get(key); } catch (Exception e) { // Logging } if (cached != null) { return cached; } String lockKey = \u0026#34;lock:\u0026#34; + key; String lockValue = UUID.randomUUID().toString(); boolean lockAcquired = false; try { lockAcquired = Boolean.TRUE.equals(redisTemplate.opsForValue().setIfAbsent(lockKey, lockValue, 10, TimeUnit.SECONDS)); // tune as needed } catch (Exception e){ // Logging } if (lockAcquired) { try { // Double check in case another request loaded the cache key and released the lock just before this request tried to acquire the lock try { cached = redisTemplate.opsForValue().get(key); } catch (Exception e){ // Logging } if (cached != null) { return cached; } // Populate cache Product product = repository.findById(id).orElse(null); if (product == null) { return null; } try { redisTemplate.opsForValue().set(key, product, 30, TimeUnit.DAYS); } catch (Exception e) { // Logging } return product; } finally { // In production, verify lockValue (e.g., via a Lua script) before deleting so one request cannot release another\u0026#39;s lock try { redisTemplate.delete(lockKey); } catch (Exception e) { // Logging } } } // Lock not acquired: wait briefly for the lock holder to populate the cache try { Thread.sleep(1000); // small backoff, tune as necessary } catch (InterruptedException e) { Thread.currentThread().interrupt(); } // Re-check the cache after backing off try { cached = redisTemplate.opsForValue().get(key); } catch (Exception e) { // Logging } if(cached != null){ return cached; } // Fallback: just hit the db if the cache is still not populated return repository.findById(id).orElse(null); } In the above implementation, we do request coalescing such that only one request goes to the db; subsequent requests back off for 1000 ms, re-check the cache, and fall back to the db if it is still not populated. In case there are any failures while setting the cache, the fallback is to hit the db directly.\n","date":"28 March 2026","externalUrl":null,"permalink":"/posts/how-to-cache/","section":"Posts","summary":"Modern APIs frequently access databases or run complex business logic that introduces significant latency and consumes CPU and I/O resources. Without caching, every request pays the full cost of database queries, network calls, and computation. This can lead to slow response times and poor scalability as traffic increases.\n","title":"How to Cache?","type":"posts"},{"content":"","date":"28 March 2026","externalUrl":null,"permalink":"/tags/performance/","section":"Tags","summary":"","title":"Performance","type":"tags"},{"content":"","date":"28 March 2026","externalUrl":null,"permalink":"/tags/redis/","section":"Tags","summary":"","title":"Redis","type":"tags"},{"content":"","date":"25 March 2026","externalUrl":null,"permalink":"/categories/architecture/","section":"Categories","summary":"","title":"Architecture","type":"categories"},{"content":"","date":"25 March 2026","externalUrl":null,"permalink":"/tags/kafka/","section":"Tags","summary":"","title":"Kafka","type":"tags"},{"content":"","date":"25 March 2026","externalUrl":null,"permalink":"/tags/message-broker/","section":"Tags","summary":"","title":"Message Broker","type":"tags"},{"content":" Introduction # Distributed message brokers are the backbone of modern microservices architectures, enabling asynchronous communication, decoupling services, and providing a reliable way to handle data streams at scale.\nRequirements # The broker is intended to support high-throughput, low-latency messaging with durability and horizontal scalability, similar to Kafka\u0026rsquo;s real-time streaming use cases and its role as a microservice communication backbone.\nFunctional Requirements # Producers publish messages to named topics. Consumers subscribe to topics and read messages in order (at least within a partition). Multiple producers and consumers operate concurrently. Consumer groups provide load balancing - each message is delivered to exactly one consumer within a group, while the same topic can be consumed by multiple consumer groups independently. Non-Functional Requirements # High Availability - The broker should be available even if some brokers fail. Durability - Messages should not be lost even if brokers fail. Scalability - The broker should be able to handle increasing load by adding more brokers. Low Latency - Messages should be delivered with low latency. Approaches # 1.
Single Node In Memory Broker # Let\u0026rsquo;s start with a single broker system and then move to a distributed system. The first version will be a single broker system with an in-memory queue. Producers enqueue messages to the queue and consumers dequeue messages from the queue.\npublic class Message { private final String key; private final byte[] value; public Message(String key, byte[] value) { this.key = key; this.value = value; } public String getKey() { return key; } public byte[] getValue() { return value; } } // In Memory Broker import java.util.Map; import java.util.concurrent.*; public class InMemoryBroker { private final Map\u0026lt;String, BlockingQueue\u0026lt;Message\u0026gt;\u0026gt; topics = new ConcurrentHashMap\u0026lt;\u0026gt;(); public void createTopic(String topic) { topics.putIfAbsent(topic, new LinkedBlockingQueue\u0026lt;\u0026gt;()); } public void produce(String topic, Message message) { BlockingQueue\u0026lt;Message\u0026gt; queue = topics.get(topic); if (queue == null) { throw new IllegalArgumentException(\u0026#34;Topic \u0026#34; + topic + \u0026#34; does not exist\u0026#34;); } queue.add(message); } public Message consume(String topic, long timeout) throws InterruptedException { BlockingQueue\u0026lt;Message\u0026gt; queue = topics.get(topic); if (queue == null) { throw new IllegalArgumentException(\u0026#34;Topic \u0026#34; + topic + \u0026#34; does not exist\u0026#34;); } return queue.poll(timeout, TimeUnit.MILLISECONDS); } } Usage # public class InMemoryBrokerTest { @Test public void testProducerConsumerForInMemoryBroker() throws InterruptedException { InMemoryBroker broker = new InMemoryBroker(); broker.createTopic(\u0026#34;test\u0026#34;); Thread producer = new Thread(() -\u0026gt; { for (int i = 0; i \u0026lt; 10; i++) { try { broker.produce(\u0026#34;test\u0026#34;, new Message(\u0026#34;key\u0026#34; + i, (\u0026#34;value\u0026#34; + i).getBytes())); System.out.println(\u0026#34;Produced: \u0026#34; + i); } catch (Exception e) { Thread.currentThread().interrupt(); return; } } }); Thread consumer = new Thread(() -\u0026gt; { while(true){ try { Message message = broker.consume(\u0026#34;test\u0026#34;, 1000); if(message == null){ break; } System.out.println(\u0026#34;Consumed: \u0026#34; + message.getKey()); } catch (Exception e) { Thread.currentThread().interrupt(); break; } } }); producer.start(); consumer.start(); producer.join(); consumer.join(); } } Output:\nProduced: 0 Consumed: key0 Consumed: key1 Produced: 1 Produced: 2 Consumed: key2 Consumed: key3 Produced: 3 Produced: 4 Consumed: key4 Consumed: key5 Produced: 5 Produced: 6 Produced: 7 Produced: 8 Produced: 9 Consumed: key6 Consumed: key7 Consumed: key8 Consumed: key9 Notice the interleaving: each \u0026ldquo;Produced\u0026rdquo; line is printed after the produce call completes, so even when the producer enqueues a message first, the consumer\u0026rsquo;s log line for that message may appear earlier. This is expected with concurrent execution.\nCurrently the broker is a single in-process instance with no persistence. It lacks durability and scalability.\nLet\u0026rsquo;s add persistence to the broker. We can use a file to store the messages.\n2. Single Node Append Only Log With Replay # Let\u0026rsquo;s make the broker log-based. Each topic is a commit log on disk: append-only, ordered and addressable by offset. Producers append to the end, consumers read from a specific offset and track their own position.
This gives us durability and replay similar to Kafka\u0026rsquo;s commit log.\nKafka does exactly this: each topic is split into partitions, and each partition is a persistent ordered log with monotonically increasing offsets.\nLog Partition Abstraction\npublic interface LogPartition{ long append(Message message) throws Exception; // Read message starting from a given offset ReadResult read(long offset) throws Exception; long endOffset() throws Exception; //lastOffset + 1 } import java.io.*; import java.nio.file.*; import java.util.*; public class FileBackedLogPartition implements LogPartition, Closeable { private final RandomAccessFile raf; public FileBackedLogPartition(Path path) throws IOException { Files.createDirectories(path.getParent()); this.raf = new RandomAccessFile(path.toFile(), \u0026#34;rw\u0026#34;); } @Override public synchronized long append(Message message) throws IOException { long offset = raf.length(); raf.seek(offset); raf.writeInt(message.getKey().length()); raf.write(message.getKey().getBytes()); raf.writeInt(message.getValue().length); raf.write(message.getValue()); return offset; } // synchronized because append and read share the RandomAccessFile\u0026#39;s file pointer @Override public synchronized ReadResult read(long offset) throws IOException { if (offset \u0026gt;= raf.length()) { return null; } raf.seek(offset); int keyLen = raf.readInt(); byte[] keyBytes = new byte[keyLen]; raf.readFully(keyBytes); int valueLen = raf.readInt(); byte[] valueBytes = new byte[valueLen]; raf.readFully(valueBytes); long nextOffset = raf.getFilePointer(); // position after this record Message msg = new Message(new String(keyBytes), valueBytes); return new ReadResult(msg, nextOffset); } @Override public synchronized long endOffset() throws IOException { return raf.length(); } @Override public void close() throws IOException { raf.close(); } } // Broker with per group offsets import java.nio.file.*; import java.util.Map; import java.util.concurrent.ConcurrentHashMap; public class FileBroker { private final Path logDir; private final Map\u0026lt;String, LogPartition\u0026gt; topicsToPartitionMap = new ConcurrentHashMap\u0026lt;\u0026gt;(); // consumerGroup -\u0026gt; topic -\u0026gt; offset private final Map\u0026lt;String,Map\u0026lt;String,Long\u0026gt;\u0026gt; consumerOffsets = new ConcurrentHashMap\u0026lt;\u0026gt;(); public FileBroker(Path logDir){ this.logDir = logDir; } public synchronized void createTopic(String name) throws Exception{ topicsToPartitionMap.computeIfAbsent(name, t -\u0026gt; { try { return new FileBackedLogPartition(logDir.resolve(name+\u0026#34;.log\u0026#34;)); } catch (Exception e) { throw new RuntimeException(e); } }); } public long produce(String topic, Message message) throws Exception{ LogPartition partition = topicsToPartitionMap.get(topic); if(partition == null){ throw new IllegalArgumentException(\u0026#34;Topic \u0026#34; + topic + \u0026#34; does not exist\u0026#34;); } return partition.append(message); } public Message consume(String topic, String consumerGroup) throws Exception{ LogPartition partition = topicsToPartitionMap.get(topic); if(partition == null){ throw new IllegalArgumentException(\u0026#34;Topic \u0026#34; + topic + \u0026#34; does not exist\u0026#34;); } long offset = consumerOffsets.computeIfAbsent(consumerGroup, g -\u0026gt; new ConcurrentHashMap\u0026lt;\u0026gt;()).getOrDefault(topic, 0L); if(offset \u0026gt;= partition.endOffset()){ return null; } ReadResult result = partition.read(offset); if(result == null){ return null; } // check for null before touching the result consumerOffsets.get(consumerGroup).put(topic, result.getNextOffset()); return result.getMessage(); } }
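The listings above use a ReadResult holder that the interface references but never defines; here is a minimal version consistent with how it is used (the decoded message plus the byte offset of the next record, so callers can advance their cursor):\npublic class ReadResult { private final Message message; private final long nextOffset; public ReadResult(Message message, long nextOffset) { this.message = message; this.nextOffset = nextOffset; } public Message getMessage() { return message; } public long getNextOffset() { return nextOffset; } }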
Let\u0026rsquo;s verify durability and replay now:\nimport java.nio.file.*; public class FileBrokerTest { @Test public void testFileBasedBroker() throws Exception { Path logDir = Files.createTempDirectory(\u0026#34;broker\u0026#34;); FileBroker broker = new FileBroker(logDir); broker.createTopic(\u0026#34;test\u0026#34;); for(int i=0;i\u0026lt;3;i++){ broker.produce(\u0026#34;test\u0026#34;, new Message(\u0026#34;key\u0026#34;+i, (\u0026#34;value\u0026#34;+i).getBytes())); } // simulate broker restart FileBroker broker2 = new FileBroker(logDir); broker2.createTopic(\u0026#34;test\u0026#34;); // consumer A reads the messages Message m; while((m = broker2.consume(\u0026#34;test\u0026#34;, \u0026#34;groupA\u0026#34;)) != null){ System.out.println(\u0026#34;Consumed: \u0026#34; + m.getKey()); } // consumer B can also read the same messages Message m2; while((m2 = broker2.consume(\u0026#34;test\u0026#34;, \u0026#34;groupB\u0026#34;)) != null){ System.out.println(\u0026#34;Consumed: \u0026#34; + m2.getKey()); } } } Output:\nConsumed: key0 Consumed: key1 Consumed: key2 Consumed: key0 Consumed: key1 Consumed: key2 Improvements over In Memory Broker:\nDurability: Messages are stored on disk and can be recovered after a restart. Replay: Messages can be replayed from the beginning. Multiple consumers can consume the same message. The above system still has limitations:\nNo partitioning yet, so write throughput is limited by the single log per topic. Offsets are stored in memory, so if the broker restarts, the offsets will be lost. No replication yet, so if the disk crashes, the data will be lost. No sharding yet, so the log size is limited by the disk size of the broker. This leads naturally to a distributed, partitioned design with consumer groups and replication, similar to Kafka\u0026rsquo;s architecture.\n3. Partitioned, Distributed Broker Cluster # The next step is to distribute data and load across multiple brokers using topics, partitions, and consumer groups, in line with Kafka\u0026rsquo;s architecture.\nA cluster consists of multiple brokers. Each topic is split into multiple partitions, each a separate log. For a given topic, partitions are spread across brokers to provide horizontal scalability. A consumer group shares the partitions of a topic such that each partition is consumed by at most one consumer in that group.
Partitioning Strategy:\nimport java.util.Random; public class Partitioner { private final Random random = new Random(); public int choosePartition(String key, int numPartitions){ if(key == null){ return random.nextInt(numPartitions); } // floorMod avoids a negative index when hashCode() is Integer.MIN_VALUE return Math.floorMod(key.hashCode(), numPartitions); } } This strategy is analogous to Kafka\u0026rsquo;s use of hashing the key to select a partition, preserving key-based ordering within a partition.\nNow let\u0026rsquo;s come up with a group-level partition assignment:\nimport java.util.*; import java.util.concurrent.ConcurrentHashMap; public class ConsumerGroupAssignment { public static class TopicPartition{ public final String topic; public final int partition; public TopicPartition(String topic, int partition){ this.topic = topic; this.partition = partition; } } // groupId -\u0026gt; memberId -\u0026gt; partitions private final Map\u0026lt;String, Map\u0026lt;String, List\u0026lt;TopicPartition\u0026gt;\u0026gt;\u0026gt; assignments = new ConcurrentHashMap\u0026lt;\u0026gt;(); public void assign(String groupId, List\u0026lt;String\u0026gt; members, String topic, int numPartitions){ Map\u0026lt;String, List\u0026lt;TopicPartition\u0026gt;\u0026gt; group = new ConcurrentHashMap\u0026lt;\u0026gt;(); int numMembers = members.size(); for(int i=0;i\u0026lt;numPartitions;i++){ String memberId = members.get(i % numMembers); group.computeIfAbsent(memberId, k -\u0026gt; new ArrayList\u0026lt;\u0026gt;()).add(new TopicPartition(topic, i)); } assignments.put(groupId, group); } public List\u0026lt;TopicPartition\u0026gt; ownedPartitions(String groupId, String memberId){ return assignments.getOrDefault(groupId, Collections.emptyMap()).getOrDefault(memberId, Collections.emptyList()); } } Kafka\u0026rsquo;s real implementation uses a group coordinator and more sophisticated assigners like round-robin, sticky, and cooperative partitioning strategies; however, the above implementation gives a good idea of how partition assignment works in a distributed system.\nNow let\u0026rsquo;s also add replication to the above model.\nReplication with Leaders and Followers # Leader-follower Model\nEach partition has one leader and multiple followers. Producers and consumers interact only with the leader for that partition. Followers asynchronously replicate the leader\u0026rsquo;s log and can be promoted to leader upon failure.
Leader Partition:

public interface ReplicaClient {
    void replicate(String topic, int partitionId, long offset, Message msg) throws Exception;
}

public class ReplicatedPartitionLeader {
    private final String topic;
    private final int partitionId;
    private final LogPartition localLog;
    private final List<ReplicaClient> followers;

    public ReplicatedPartitionLeader(String topic, int partitionId, LogPartition localLog,
            List<ReplicaClient> followers) {
        this.topic = topic;
        this.partitionId = partitionId;
        this.localLog = localLog;
        this.followers = followers;
    }

    public LogPartition getLocalLog() {
        return localLog;
    }

    public synchronized long append(Message msg) throws Exception {
        long offset = localLog.append(msg);
        // Replication happens in-line here for simplicity; real followers
        // (e.g., in Kafka) fetch asynchronously from the leader instead.
        for (ReplicaClient follower : followers) {
            follower.replicate(topic, partitionId, offset, msg);
        }
        return offset;
    }
}

Broker node hosting leaders and followers:

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class BrokerNode {
    private final int brokerId;
    private final Map<String, Map<Integer, ReplicatedPartitionLeader>> leaders = new ConcurrentHashMap<>();
    private final Map<String, Map<Integer, LogPartition>> followers = new ConcurrentHashMap<>();

    public BrokerNode(int brokerId) {
        this.brokerId = brokerId;
    }

    public void addLeaderPartition(String topic, int partitionId, LogPartition log,
            List<ReplicaClient> followers) {
        leaders.computeIfAbsent(topic, k -> new ConcurrentHashMap<>())
                .put(partitionId, new ReplicatedPartitionLeader(topic, partitionId, log, followers));
    }

    public void addFollowerPartition(String topic, int partitionId, LogPartition log) {
        followers.computeIfAbsent(topic, k -> new ConcurrentHashMap<>()).put(partitionId, log);
    }

    public long handleProduce(String topic, int partitionId, Message msg) throws Exception {
        ReplicatedPartitionLeader leader = leaders.getOrDefault(topic, Collections.emptyMap()).get(partitionId);
        if (leader == null) {
            throw new Exception("Leader not found for topic " + topic + " and partition " + partitionId);
        }
        return leader.append(msg);
    }

    // read(offset) returns a ReadResult (the message plus the next byte
    // offset), matching how the test below consumes the log.
    public ReadResult handleConsume(String topic, int partitionId, long offset) throws Exception {
        ReplicatedPartitionLeader leader = leaders.getOrDefault(topic, Collections.emptyMap()).get(partitionId);
        if (leader == null) {
            throw new Exception("Leader not found for topic " + topic + " and partition " + partitionId);
        }
        return leader.getLocalLog().read(offset);
    }

    public void handleReplicate(String topic, int partitionId, long offset, Message msg) throws Exception {
        LogPartition log = followers.getOrDefault(topic, Collections.emptyMap()).get(partitionId);
        if (log == null) {
            throw new Exception("Follower not found for topic " + topic + " and partition " + partitionId);
        }
        log.append(msg);
    }
}

Now let's verify the above code:

import java.nio.file.*;
import java.util.List;
import org.junit.Test;

public class DistributedBrokerTest {
    @Test
    public void testDistributedBroker() throws Exception {
        BrokerNode b1 = new BrokerNode(1);
        BrokerNode b2 = new BrokerNode(2);
        BrokerNode b3 = new BrokerNode(3);
        Path base = Paths.get("/tmp/broker-v3");
        Files.createDirectories(base); // ensure the log directory exists
        FileBackedLogPartition p0b1 = new FileBackedLogPartition(base.resolve("orders-0-b1.log"));
        FileBackedLogPartition p0b2 = new FileBackedLogPartition(base.resolve("orders-0-b2.log"));
        FileBackedLogPartition p1b2 = new FileBackedLogPartition(base.resolve("orders-1-b2.log"));
        FileBackedLogPartition p1b3 = new FileBackedLogPartition(base.resolve("orders-1-b3.log"));
        ReplicaClient b2Follower = (t, p, off, msg) -> b2.handleReplicate(t, p, off, msg);
        ReplicaClient b3Follower = (t, p, off, msg) -> b3.handleReplicate(t, p, off, msg);
        b1.addLeaderPartition("orders", 0, p0b1, List.of(b2Follower));
        b2.addFollowerPartition("orders", 0, p0b2);
        b2.addLeaderPartition("orders", 1, p1b2, List.of(b3Follower));
        b3.addFollowerPartition("orders", 1, p1b3);
        Partitioner partitioner = new Partitioner();
        for (int i = 0; i < 4; i++) {
            String key = "order-" + i;
            int partition = partitioner.choosePartition(key, 2);
            BrokerNode leader = (partition == 0) ? b1 : b2;
            long offset = leader.handleProduce("orders", partition,
                    new Message(key, ("payload-" + i).getBytes()));
            System.out.printf("Produced %s -> partition %d, offset %d%n", key, partition, offset);
        }
        // Consume from partition 0 leader (b1)
        long offset0 = 0L;
        while (true) {
            ReadResult rr = b1.handleConsume("orders", 0, offset0);
            if (rr == null) {
                break; // no more data
            }
            Message m = rr.getMessage();
            System.out.println("Consumed from p0: " + m.getKey());
            offset0 = rr.getNextOffset(); // advance to just after this record
        }
        // Consume from partition 1 leader (b2)
        long offset1 = 0L;
        while (true) {
            ReadResult rr = b2.handleConsume("orders", 1, offset1);
            if (rr == null) {
                break;
            }
            Message m = rr.getMessage();
            System.out.println("Consumed from p1: " + m.getKey());
            offset1 = rr.getNextOffset();
        }
    }
}

Output:

Produced order-0 -> partition 1, offset 0
Produced order-1 -> partition 0, offset 0
Produced order-2 -> partition 1, offset 24
Produced order-3 -> partition 0, offset 24
Consumed from p0: order-1
Consumed from p0: order-3
Consumed from p1: order-0
Consumed from p1: order-2

Scalability and Fault Tolerance #

With partitioning and replication, we can achieve horizontal scalability and fault tolerance.

Adding partitions increases consumer parallelism within groups.
Adding brokers increases storage and throughput capacity.
Replication ensures data durability and availability.

Conclusion #

We went from a single-node, in-memory message broker to a distributed, partitioned, replicated message broker cluster. This is a simplified version of Kafka, but it captures the core concepts of partitioning, replication, and consumer groups.
","date":"25 March 2026","externalUrl":null,"permalink":"/posts/the-anatomy-of-a-distributed-message-broker/","section":"Posts","summary":"Introduction # Distributed message brokers are the backbone of modern microservices architectures, enabling asynchronous communication, decoupling services, and providing a reliable way to handle data streams at scale.\n","title":"The Anatomy of a Distributed Message Broker","type":"posts"},{"content":" Life Without a Rate Limiter # Imagine a public web API that allows clients to fetch user data without any rate limiting. Under normal conditions this might work, but during traffic spikes or abuse (e.g., bots or scrapers) the backend can be overwhelmed, leading to resource exhaustion, cascading failures, and poor availability for legitimate users.
Without any form of control, a single noisy neighbor can starve others, increase infrastructure costs, and make it difficult to meet SLAs.

Simple Java service without limiting #

public class UserService {
    public String getUser(String userId) {
        // Pretend this hits database, cache, etc.
        return "User-" + userId;
    }

    public static void main(String[] args) {
        UserService service = new UserService();
        for (int i = 0; i < 1_000_000; i++) {
            service.getUser("123");
        }
    }
}

In this world, any number of calls are accepted; there is no back-pressure, fairness, or abuse protection. The first step is to introduce a minimal gate that can reject excessive traffic.

Version 1: In-Memory Fixed Window Counter #

A fixed window rate limiter divides time into discrete windows (e.g., 1 second or 1 minute) and maintains a counter of requests per window. If the count exceeds a threshold in that window, further requests in that window are rejected, and when the window expires, the counter is reset to zero.

Implementation #

This first implementation is a single in-process limiter.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedWindowPerKeyRateLimiter {
    private static class WindowCounter {
        final AtomicInteger count = new AtomicInteger(0);
        volatile long windowStart;

        WindowCounter(long windowStart) {
            this.windowStart = windowStart;
        }
    }

    private final int maxRequests;
    private final long windowSizeMillis;
    private final Map<String, WindowCounter> counters = new ConcurrentHashMap<>();

    public FixedWindowPerKeyRateLimiter(int maxRequests, long windowSizeMillis) {
        this.maxRequests = maxRequests;
        this.windowSizeMillis = windowSizeMillis;
    }

    public boolean allow(String key) {
        long now = System.currentTimeMillis();
        WindowCounter wc = counters.computeIfAbsent(key, k -> new WindowCounter(now));
        synchronized (wc) {
            if (now - wc.windowStart >= windowSizeMillis) {
                wc.windowStart = now;
                wc.count.set(0);
            }
            if (wc.count.get() < maxRequests) {
                wc.count.incrementAndGet();
                return true;
            }
            return false;
        }
    }
}

Usage test and boundary problem #

public class FixedWindowPerKeyRateLimiterTest {
    public static void main(String[] args) throws InterruptedException {
        // 5 requests per 10 seconds per user
        FixedWindowPerKeyRateLimiter limiter = new FixedWindowPerKeyRateLimiter(5, 10_000);
        String user = "user-1";
        // Send 5 requests; the first one lazily starts the window
        for (int i = 1; i <= 5; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
        // Sleep just enough to cross the window boundary
        Thread.sleep(10_100);
        // Immediately send 5 more requests at the start of the new window
        for (int i = 6; i <= 10; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
    }
}

This implementation protects the system from unlimited traffic, but it has several issues:

Boundary spikes: Bursts at window edges can slip through. Because the algorithm resets the counter at fixed boundaries, the user effectively sends 10 requests in a little over 10 seconds (5 when the window opens and 5 more just after it resets), violating the intuitive "5 per 10 seconds" limit.
Single process only: It doesn't work across multiple service instances.

To smooth out the edge effects at window boundaries, a sliding window algorithm evaluates requests over a rolling time window instead of fixed slots.

Version 2: In-Process Sliding Window Log #

The sliding window log algorithm tracks the timestamp of each accepted request in a per-key log and, on each new request, drops timestamps older than the window and counts the remaining ones. This yields an accurate sliding window that enforces "N requests per W seconds" regardless of window boundaries.

Implementation #

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlidingWindowLogRateLimiter {
    private final int maxRequests;
    private final long windowSizeMillis;
    private final Map<String, Deque<Long>> logs = new ConcurrentHashMap<>();

    public SlidingWindowLogRateLimiter(int maxRequests, long windowSizeMillis) {
        this.maxRequests = maxRequests;
        this.windowSizeMillis = windowSizeMillis;
    }

    public boolean allow(String key) {
        long now = System.currentTimeMillis();
        Deque<Long> deque = logs.computeIfAbsent(key, k -> new ArrayDeque<>());
        synchronized (deque) {
            long threshold = now - windowSizeMillis;
            while (!deque.isEmpty() && deque.peekFirst() < threshold) {
                deque.pollFirst();
            }
            if (deque.size() < maxRequests) {
                deque.addLast(now);
                return true;
            }
            return false;
        }
    }
}

Usage test and limitation #

public class SlidingWindowLogRateLimiterTest {
    public static void main(String[] args) throws InterruptedException {
        // 5 requests per 10 seconds per key
        SlidingWindowLogRateLimiter limiter = new SlidingWindowLogRateLimiter(5, 10_000);
        String user = "user-1";
        // 5 requests sent back-to-back
        for (int i = 1; i <= 5; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
        Thread.sleep(9_000);
        // 6th request: the first 5 are still inside the 10-second sliding window, so it is blocked
        boolean allowed = limiter.allow(user);
        System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        Thread.sleep(1_001);
        // 7th request should be allowed as the first request is now outside the sliding window
        allowed = limiter.allow(user);
        System.out.println(allowed ? "ALLOWED" : "BLOCKED");
    }
}

This approach correctly enforces the rate across arbitrary boundaries, but the log can grow large for high-traffic keys and is still local to a single process, making it hard to use across multiple instances or data centers.

Production systems often prefer algorithms like Token Bucket, which give good control over the average rate while still allowing the bursts that are often a business requirement, and which are more amenable to efficient implementations in distributed stores like Redis.

Version 3: In-Process Token Bucket #

The token bucket algorithm maintains a bucket of tokens that replenishes at a fixed rate up to a maximum capacity; each request consumes one token and is allowed only if a token is available.
This allows short bursts (up to the bucket size) while enforcing a long-term average rate defined by the refill rate.

Implementation #

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TokenBucketRateLimiter {
    private static class Bucket {
        int tokens;
        long lastRefillTimestamp;

        Bucket(int tokens, long lastRefillTimestamp) {
            this.tokens = tokens;
            this.lastRefillTimestamp = lastRefillTimestamp;
        }
    }

    private final int refillTokensPerSecond;
    private final int capacity;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public TokenBucketRateLimiter(int refillTokensPerSecond, int capacity) {
        this.refillTokensPerSecond = refillTokensPerSecond;
        this.capacity = capacity;
    }

    public boolean allow(String key) {
        long now = System.nanoTime();
        Bucket bucket = buckets.computeIfAbsent(key, k -> new Bucket(capacity, now));
        synchronized (bucket) {
            long wholeSeconds = (now - bucket.lastRefillTimestamp) / 1_000_000_000L;
            int tokensToAdd = (int) wholeSeconds * refillTokensPerSecond;
            if (tokensToAdd > 0) {
                bucket.tokens = Math.min(capacity, bucket.tokens + tokensToAdd);
                // Advance only by the whole seconds consumed so fractional
                // elapsed time is not silently discarded on each refill.
                bucket.lastRefillTimestamp += wholeSeconds * 1_000_000_000L;
            }
            if (bucket.tokens >= 1) {
                bucket.tokens -= 1;
                return true;
            }
            return false;
        }
    }
}

Usage test and limitation #

public class TokenBucketRateLimiterTest {
    public static void main(String[] args) throws InterruptedException {
        // Capacity 5, refill 1 token per second per key
        TokenBucketRateLimiter limiter = new TokenBucketRateLimiter(1, 5);
        String user = "user-1";
        // A new bucket starts at full capacity, so a burst of up to 5 is
        // allowed immediately. Burst of 6 requests:
        for (int i = 1; i <= 6; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
        Thread.sleep(2_000);
        // 2 requests after 2 seconds should be allowed (2 tokens refilled)
        for (int i = 1; i <= 2; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
    }
}

The token bucket enforces a sustainable rate with configurable burst tolerance and is widely used in API gateways, microservices, and infrastructure rate limiting. However, this implementation is still in-process and not safe across multiple instances; concurrent instances with their own buckets would each allow the full quota unless a shared store and atomic updates are introduced.

Version 4: Distributed Token Bucket with Redis and Lua #

In a distributed system with many application servers behind a load balancer, rate limiting must be coordinated across instances so that the limit is enforced globally for each key. A common pattern is to store rate limiter state in a centralized store such as Redis and use Lua scripts so that the read–modify–write sequence (refill tokens, check, consume) is executed atomically.

High-level architecture #

Each application server receives requests and computes an identity key (user ID, API key, IP, tenant ID, etc.).
Before processing the request, the server calls a rate limiter component that runs a Redis Lua script implementing token bucket logic for that key.
The script reads the current token count and last refill time, refills tokens based on elapsed time, consumes one token if available, and writes back the updated state in a single atomic operation.
Java + Redis + Lua example (conceptual) #

import redis.clients.jedis.Jedis;

public class RedisTokenBucketRateLimiter {
    private final Jedis jedis;
    private final String scriptSha;

    public RedisTokenBucketRateLimiter(Jedis jedis) {
        this.jedis = jedis;
        // Refill based on elapsed time, then atomically consume one token.
        String script =
            "local key = KEYS[1] " +
            "local capacity = tonumber(ARGV[1]) " +
            "local refill_per_sec = tonumber(ARGV[2]) " +
            "local now = tonumber(ARGV[3]) " +
            "local state = redis.call('HMGET', key, 'tokens', 'timestamp') " +
            "local tokens = tonumber(state[1]) or capacity " +
            "local last_ts = tonumber(state[2]) or now " +
            "local delta = now - last_ts " +
            "if delta > 0 then " +
            "  local refill = delta * refill_per_sec " +
            "  tokens = math.min(capacity, tokens + refill) " +
            "  last_ts = now " +
            "end " +
            "if tokens < 1 then " +
            "  redis.call('HMSET', key, 'tokens', tokens, 'timestamp', last_ts) " +
            "  return 0 " +
            "else " +
            "  tokens = tokens - 1 " +
            "  redis.call('HMSET', key, 'tokens', tokens, 'timestamp', last_ts) " +
            "  redis.call('EXPIRE', key, 3600) " +
            "  return 1 " +
            "end";
        this.scriptSha = jedis.scriptLoad(script);
    }

    public boolean allow(String key, double capacity, double refillPerSecond) {
        long nowSeconds = System.currentTimeMillis() / 1000L;
        Object result = jedis.evalsha(
            scriptSha,
            1,
            key,
            String.valueOf(capacity),
            String.valueOf(refillPerSecond),
            String.valueOf(nowSeconds)
        );
        Long allowed = (Long) result;
        return allowed == 1L;
    }
}

Usage test (integration-style) #

public class RedisTokenBucketRateLimiterTest {
    public static void main(String[] args) throws InterruptedException {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            RedisTokenBucketRateLimiter limiter = new RedisTokenBucketRateLimiter(jedis);
            String user = "user-1";
            double capacity = 5.0;
            double refillPerSecond = 1.0;
            // Simulate concurrent calls from multiple application instances
            for (int i = 1; i <= 7; i++) {
                boolean allowed = limiter.allow("rate:" + user, capacity, refillPerSecond);
                System.out.printf("Global req %d -> %s%n", i, allowed ? "ALLOWED" : "BLOCKED");
            }
        }
    }
}

In this design, all instances share the same Redis-backed bucket per key, and the Lua script ensures the rate limiting decision is atomic even under heavy concurrency, which is a common industry-standard approach. This solves the single-instance limitation and provides strong, globally consistent rate limiting, though it introduces network latency to Redis and a dependency on its availability, which can be mitigated via caching, sharding, or fallback behaviors.

Systemic Resilience and Fault Tolerance #

If the rate limiting middleware is coded defensively to deny all incoming traffic when Redis is down or unreachable, it can lead to a complete outage of the application. Therefore, the rate limiting middleware should be designed to fail open (allow traffic) when Redis is down, falling back to in-memory rate limiting with a lower threshold. This ensures that the application remains available even when the rate limiting infrastructure is unavailable.
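A minimal sketch of that idea, assuming a hypothetical wrapper around the two limiters built earlier (the class name and wiring are illustrative):

// Hypothetical fail-open wrapper: prefer the shared Redis bucket, but if
// Redis is unreachable, fall back to the in-process token bucket from
// Version 3 configured with a lower, per-instance threshold.
public class ResilientRateLimiter {
    private final RedisTokenBucketRateLimiter redisLimiter;
    private final TokenBucketRateLimiter localFallback;

    public ResilientRateLimiter(RedisTokenBucketRateLimiter redisLimiter,
                                TokenBucketRateLimiter localFallback) {
        this.redisLimiter = redisLimiter;
        this.localFallback = localFallback;
    }

    public boolean allow(String key, double capacity, double refillPerSecond) {
        try {
            return redisLimiter.allow(key, capacity, refillPerSecond);
        } catch (Exception e) {
            // Redis down or unreachable: fail open, but still bound the
            // damage with the local per-instance bucket.
            return localFallback.allow(key);
        }
    }
}

The local fallback should be sized conservatively (e.g., the global limit divided by the expected instance count), since each instance now decides independently.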
The API Contract #

A rate limiter shouldn't be viewed as merely a gatekeeper but as a contract between the service provider and the consumer. It defines the expected usage patterns and ensures fair resource allocation, protecting the service from abuse while providing predictable performance for legitimate users.

When a limit is exceeded, the API should return a 429 Too Many Requests status code along with a Retry-After header indicating when the client can retry the request. This allows clients to gracefully handle rate limiting and adjust their request patterns accordingly.
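As a concrete sketch of the rejection path (assuming a Jakarta Servlet stack; the helper class is illustrative):

import java.io.IOException;
import jakarta.servlet.http.HttpServletResponse;

// Illustrative helper: reject an over-limit request with 429 and tell the
// client when it is safe to retry.
public class RateLimitResponses {
    public static void reject(HttpServletResponse response, long retryAfterSeconds)
            throws IOException {
        response.setStatus(429); // 429 Too Many Requests
        response.setHeader("Retry-After", String.valueOf(retryAfterSeconds));
        response.getWriter().write(
            "Rate limit exceeded; retry after " + retryAfterSeconds + " seconds.");
    }
}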
","date":"24 March 2026","externalUrl":null,"permalink":"/posts/rate-limiters/","section":"Posts","summary":"Life Without a Rate Limiter # Imagine a public web API that allows clients to fetch user data without any rate limiting. Under normal conditions this might work, but during traffic spikes or abuse (e.g., bots or scrapers) the backend can be overwhelmed, leading to resource exhaustion, cascading failures, and poor availability for legitimate users. Without any form of control, a single noisy neighbor can starve others, increase infrastructure costs, and make it difficult to meet SLAs.\n","title":"Rate Limiters","type":"posts"},{"content":"","date":"24 March 2026","externalUrl":null,"permalink":"/tags/rate-limiting/","section":"Tags","summary":"","title":"Rate Limiting","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/api-gateway/","section":"Tags","summary":"","title":"API Gateway","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/microservices/","section":"Tags","summary":"","title":"Microservices","type":"tags"},{"content":"","date":"8 March 2026","externalUrl":null,"permalink":"/tags/monolith/","section":"Tags","summary":"","title":"Monolith","type":"tags"},{"content":" Every great architecture starts simple. But as we scale from a single monolith to a swarm of microservices, we hit a wall that only one pattern can break: the API Gateway. Phase 1: The Blissful Monolith # You have one server, one database, and one endpoint. Everything is easy:

Discovery? A non-issue for the UI. The client only needs to know one URL (e.g., api.yourdomain.com). All requests go to the same place, and the monolith routes them internally.
Security? Handled in one place. A single centralized middleware validates the session/token once. Since everything runs in one process, that "Authenticated" state is automatically trusted by every internal function.
Data? A single SQL join away.

The Hidden Bottleneck #

While "simple," this centralized model has a ceiling. As your platform grows, certain features demand disproportionate resources. Imagine you add a heavy "Search and Recommendation" engine. Suddenly, every time a user searches for a product, the monolith's CPU spikes.

The "Noisy Neighbor" Problem: Your critical "Checkout" process now has to compete for CPU and RAM with the heavy, analytical "Search" process. If a sudden spike in search traffic crashes the server, your users can't check out.
Scaling Inefficiency: To keep Search fast, you are forced to scale the entire monolith—cloning your whole codebase across massive, expensive servers just to give that one specific feature more compute power.

The First Split: Extracting the Bottleneck #

To protect the core business and scale efficiently, we make the first logical move: we extract the Search Engine into its own independent service. Now, Search can scale on its own high-CPU instances, while the rest of the monolith handles standard operations efficiently.

We've solved the bottleneck, but we've unknowingly crossed the Rubicon. We are no longer a monolith; we are a Distributed System.

Phase 2: The Breaking Point (The Problem) #

Even with just two destinations (the core Monolith and the new Search Service), the simplicity of Phase 1 shatters. We solved our internal scaling, but pushed a massive amount of complexity onto the client and the network.

1. The "Endpoint Explosion" #

In a monolith, the mobile app knew one URL. Now, the client must maintain a mapping for different hostnames. If we eventually split out "Checkout" or "User Profiles," the client's configuration file grows. The client now needs to know the internal topography of your backend. Every internal infrastructure move now requires an app store update, risking breakages for users on older app versions.

2. The Multi-Round-Trip Latency #

Modern UIs are data-hungry. To render a single "Dashboard," the app might need user data from the Monolith, and tailored product suggestions from the new Search/Recommendation service. On a slow mobile connection, forcing the client to make multiple sequential network calls drastically increases latency. We've introduced a "Network Tax" on the user experience.

3. Security Fragmentation #

How do you ensure both the Monolith and the Search service are equally secure? If you implement JWT validation in both places, you're repeating complex cryptographic logic in multiple codebases (potentially across different languages). If a vulnerability is found in your security middleware, you now have multiple separate deployment cycles to manage just to patch your system.

Phase 3: The Solution (The API Gateway) #

The API Gateway is the "Hero" that restores the simplicity of the monolith while keeping the scaling benefits of microservices. It acts as a smart facade for your entire system.

1. Implementation: The Single Entry Point #

The client goes back to knowing just one URL: api.yourdomain.com. The Gateway handles Internal Routing. It knows that requests to /search go to the new service, while everything else routes to the Monolith. As you split off more pieces, the Gateway simply updates its routing table. The client is completely shielded from your backend evolution.

2. Efficiency: Request Aggregation (API Composition) #

To fix the multi-round-trip latency, the Gateway can perform API Composition (often referred to as the Backend-for-Frontend or BFF pattern). Instead of the mobile app making three separate network calls to build a dashboard, it makes one call to the Gateway. The Gateway talks to the internal services simultaneously over the high-speed internal VPC network, merges the data into a single JSON response, and sends it back. We've traded slow, public network hops for blazing-fast internal ones.
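A sketch of what such composition could look like (assuming Spring WebFlux on the Gateway; the /dashboard endpoint and the internal /users and /suggestions paths are illustrative, reusing the internal hostnames from the routing config below):

import java.util.Map;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@RestController
public class DashboardController {
    private final WebClient monolith = WebClient.create("http://monolith-internal:8080");
    private final WebClient search = WebClient.create("http://search-service-internal:8081");

    // One public round trip fans out to two parallel internal calls.
    @GetMapping("/dashboard/{userId}")
    public Mono<Map<String, Object>> dashboard(@PathVariable String userId) {
        Mono<String> user = monolith.get().uri("/users/{id}", userId)
                .retrieve().bodyToMono(String.class);
        Mono<String> suggestions = search.get().uri("/suggestions?user={id}", userId)
                .retrieve().bodyToMono(String.class);
        // zip subscribes to both Monos concurrently and merges the results
        // into a single JSON payload for the client.
        return Mono.zip(user, suggestions)
                .map(tuple -> Map.of("user", tuple.getT1(), "suggestions", tuple.getT2()));
    }
}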
3. Security: The Centralized Guard #

By centralizing the "Front Door," we offload the heavy lifting from all internal services:

SSL Termination: The Gateway handles HTTPS encryption and decryption at the edge, saving internal CPU cycles.
Unified Authentication: The Gateway validates the user's token once, and simply passes a trusted payload (like a user ID header) to the backend services. Your microservices no longer need to know how to decode a JWT; they just trust the Gateway.
Centralized Policy: Want to add rate-limiting, IP-allowlisting, or WAF (Web Application Firewall) rules? You apply it once at the Gateway, and every sub-service is instantly protected.

Phase 3.5: Putting it into Practice (Spring Cloud Gateway) #

To see how this works in reality, let's look at a modern Java implementation using Spring Cloud Gateway. In just a few lines of code, we can define our routing rules and implement our centralized security guard.

1. The Routing Rules #

Using a simple Java Bean, we can configure our Gateway to route /search traffic to our newly extracted Search service, while letting everything else fall back to the Monolith.

import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayConfig {

    @Bean
    public RouteLocator systemRouteLocator(RouteLocatorBuilder builder, AuthFilter authFilter) {
        return builder.routes()
            // 1. Route specific traffic to the new Search Service
            .route("search_service", r -> r.path("/search/**")
                .filters(f -> f.filter(authFilter)) // Apply our security guard
                .uri("http://search-service-internal:8081"))
            // 2. Default route: Everything else goes to the Monolith
            .route("monolith_service", r -> r.path("/**")
                .filters(f -> f.filter(authFilter)) // Apply our security guard
                .uri("http://monolith-internal:8080"))
            .build();
    }
}

2. The Centralized Guard (Auth Filter) #

Instead of forcing the Monolith and the Search service to both validate JWTs, we do it once at the Gateway. If the token is valid, the Gateway strips it and injects a trusted, simple header (X-User-Id) for the internal microservices to use.

import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.http.HttpStatus;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class AuthFilter implements GatewayFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String authHeader = exchange.getRequest().getHeaders().getFirst("Authorization");
        // 1. Check if token exists and is valid (Implementation hidden for brevity)
        if (authHeader == null || !isValidJwt(authHeader)) {
            exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
            return exchange.getResponse().setComplete(); // Block the request early
        }
        // 2. Extract data and modify the request for internal services
        String userId = extractUserIdFromJwt(authHeader);
        ServerHttpRequest modifiedRequest = exchange.getRequest().mutate()
            .header("X-User-Id", userId) // Pass a simple, trusted header downstream
            .build();
        // 3. Forward the modified request to the routed microservice
        return chain.filter(exchange.mutate().request(modifiedRequest).build());
    }

    // ... private helper methods for JWT validation ...
}

By adding these two classes, we have successfully shielded our client from the endpoint explosion and offloaded security paperwork from our backend services.

Phase 3.6: Beyond Routing—The Swiss Army Knife #

While routing and security get us through the door, a production-grade Gateway solves the "invisible" problems that plague distributed systems: traffic chaos and observability gaps.

1. The "Neighbor from Hell" (Rate Limiting) #

In a monolith, one "noisy neighbor" (a user or bot spamming an endpoint) can consume all database connections, crashing the site for everyone. In microservices, this is even more dangerous—a spike in "Search" traffic can cascade and knock out "Inventory" if they share resources.

The Reasoning: Instead of every microservice writing its own rate-limiting logic (and potentially letting too much traffic through anyway), the Gateway acts as a Pressure Valve. It tracks requests per User ID or IP address at the very edge. If a user exceeds their quota, the Gateway drops the request before it even touches your expensive internal network.

Java Implementation (Spring Cloud Gateway + Redis):

@Bean
public RouteLocator customRouteLocator(RouteLocatorBuilder builder) {
    return builder.routes()
        .route("search_service", r -> r.path("/search/**")
            .filters(f -> f.requestRateLimiter(config -> config
                .setRateLimiter(redisRateLimiter())
                .setKeyResolver(userKeyResolver()))) // Limit by User ID
            .uri("lb://search-service"))
        .build();
}

@Bean
public RedisRateLimiter redisRateLimiter() {
    // 10 requests per second, with a "burst" capacity of 20
    return new RedisRateLimiter(10, 20);
}

@Bean
public KeyResolver userKeyResolver() {
    // Resolve the limit key from the trusted header injected by AuthFilter
    return exchange -> Mono.justOrEmpty(
        exchange.getRequest().getHeaders().getFirst("X-User-Id"));
}

2. The "Silent Failure" (Centralized Observability) #

In a distributed system, a single user request might travel through five different services. If the request fails, where did it die? Without a Gateway, you're searching through five different log files, trying to stitch timestamps together.

The Reasoning: The Gateway becomes the Source of Truth. It generates a unique Correlation ID (Trace ID) for every incoming request. It injects this ID into the headers, ensuring it travels through every microservice. Now, you can search one ID in your logging tool (like ELK or Splunk) and see the entire journey of that request across your entire fleet.

3. The "Legacy Bridge" (Protocol Transformation) #

Your modern frontend might want to speak REST/JSON, but your high-performance internal services might use gRPC, or perhaps an old legacy service only understands XML.

The Reasoning: You don't want to force your mobile developers to learn gRPC or handle SOAP XML. The Gateway acts as a Translator. It accepts a standard JSON POST from the client, transforms the payload into the required internal format, calls the service, and translates the response back to JSON.
The client remains blissfully unaware of the \u0026ldquo;tech debt\u0026rdquo; hiding behind the curtain.\nPhase 3.7: The \u0026ldquo;Safety Net\u0026rdquo; (Deployment Control) # In the old monolith days, a deployment was a \u0026ldquo;hold your breath\u0026rdquo; moment. If the new code had a bug, the whole site went down. With an API Gateway, you can move away from \u0026ldquo;All-or-Nothing\u0026rdquo; releases toward Traffic Shifting.\n1. Blue-Green Deployments (The \u0026ldquo;Instant Flip\u0026rdquo;) # Instead of updating your service in place, you spin up a brand-new version (Green) alongside the old one (Blue).\nThe Reasoning: The Gateway points to Blue. You test Green in isolation. Once you\u0026rsquo;re confident, you tell the Gateway to flip all traffic to Green. If something breaks 30 seconds later? You flip the Gateway back to Blue instantly. No DNS propagation delays, no server restarts—just a routing change.\n2. Canary Releases (The \u0026ldquo;Slow Drip\u0026rdquo;) # A \u0026ldquo;Canary\u0026rdquo; release is the gold standard for risk management. You roll out the new version to only 5% of your users—perhaps your internal employees or a specific geographic region.\nThe Reasoning: The Gateway looks at the incoming request (maybe a cookie, a Header, or just a random weight) and decides where to send it. If the error rates for that 5% remain low, you bump it to 25%, then 50%, and finally 100%. The Gateway acts as a Blast Shield, ensuring a bad bug only affects a tiny fraction of your users.\nJava Implementation (Spring Cloud Gateway Weighted Routing): Spring Cloud Gateway makes Canary releases trivial using the Weight route predicate. In this example, we send 95% of traffic to the stable \u0026ldquo;v1\u0026rdquo; and 5% to the new \u0026ldquo;v2\u0026rdquo; canary.\n@Bean public RouteLocator canaryRouteLocator(RouteLocatorBuilder builder) { return builder.routes() // 1. The Stable Production Service (95% of traffic) .route(\u0026#34;search_v1\u0026#34;, r -\u0026gt; r.path(\u0026#34;/search/**\u0026#34;) .and().weight(\u0026#34;search_group\u0026#34;, 95) .uri(\u0026#34;http://search-v1:8081\u0026#34;)) // 2. The Canary Service (5% of traffic) .route(\u0026#34;search_v2\u0026#34;, r -\u0026gt; r.path(\u0026#34;/search/**\u0026#34;) .and().weight(\u0026#34;search_group\u0026#34;, 5) .uri(\u0026#34;http://search-v2:8082\u0026#34;)) .build(); } Phase 4: Choosing the Right Tool # Depending on your scale and operational capacity, an API Gateway can take many forms:\nManaged Cloud Services: AWS API Gateway, Azure API Management, or Google Cloud API Gateway (Best if you want zero infrastructure maintenance). Self-Hosted / Control Planes: Kong, Tyk, or Apache APISIX (Highly extensible with plugins, great for hybrid-cloud or custom routing needs). Edge Proxies / Ingress: Envoy or HAProxy (Often used as the foundation for modern service meshes, excellent for high-performance, complex traffic routing). Conclusion: The Final Verdict # The transition to an API Gateway marks the \u0026ldquo;coming of age\u0026rdquo; of a system. It’s the moment you stop thinking about individual servers and start thinking about Traffic Flow.\nBy centralizing the \u0026ldquo;boring\u0026rdquo; stuff—Rate Limiting, Security, and Deployment Control—you free your feature teams to do what they do best: build business value. 
The \u0026ldquo;Gateway Tax\u0026rdquo; of a single extra network hop is a small price to pay for a system that is resilient, observable, and easy for frontend developers to love.\nThe path to an API Gateway is paved with the lessons learned from scaling. It’s not just a proxy; it’s the brain of your distributed architecture. When you find yourself repeating code across services or struggling to coordinate releases, the message is clear: it’s time to move to the Gateway.\n","date":"8 March 2026","externalUrl":null,"permalink":"/posts/system-design-the-path-to-api-gateway/","section":"Posts","summary":" Every great architecture starts simple. But as we scale from a single monolith to a swarm of microservices, we hit a wall that only one pattern can break: the API Gateway. Phase 1: The Blissful Monolith # You have one server, one database, and one endpoint. Everything is easy:\n","title":"System Design: The Path to API Gateway","type":"posts"},{"content":"Here\u0026rsquo;s what I\u0026rsquo;m focused on right now, and a history of what I\u0026rsquo;ve built.\n📄 Download Resume (PDF) What I\u0026rsquo;m doing now # Currently building the future of AI at Salesforce as an SDE3 / SMTS. Since late 2025, my primary focus has been designing an Agentic Metadata Generation and Evaluation Framework for Agentforce Vibes. This empowers low-code users to reliably generate metadata through natural language, backed by a holistic, end-to-end evaluation system.\nExperience # Salesforce # SMTS/SDE3 (India) | May 2020 – Present\nCustom Apps Gen AI (Jan’25 – Sep’25)\nOptimized the LLM-based app generation framework via batching, parallelization, and eventual consistency, reducing build time by 25%. Early adopter of Cursor and MCPs within the team; built various internal tooling MCPs (like Sourcegraph) to boost developer productivity and automate Trust efforts. Commerce Generative AI (Jan’23 – Dec’24)\nBuilt and demo\u0026rsquo;ed a product recommendations prototype using Python that integrated LLMs for B2B/B2C e-commerce with text and image prompts. Led the project to production on a Java/Spring tech stack, leading teams in Brazil and the US. Developed the prototype integrated with large compute AI models on Heroku, which became a key company strategy with projected revenues exceeding $1B. Led global teams to design the first LLM-based features in Salesforce Commerce (Product Recommendations, Conversational Add To Cart, and Merchant Insights). Buyer And Entitlements (Jan’22 – Dec'22)\nLed API performance improvements, increasing guardrails by 10x for buyer accounts to onboard large F\u0026amp;B customers by designing a predictive caching framework. Designed multi-tier Redis-based cache systems supporting multiple lookups with a p95 time of \u0026lt;50ms. Address Management \u0026amp; Site Search (Jun’20 – Dec’21)\nLed development of Address APIs driving formats for 4076+ enterprise customers (FedEx, Louis Vuitton). Contributed to Salesforce search platform code reducing GTM time for new search products by 70%. Built custom monitoring dashboards, cutting incident response time from 1 hour to near-instant. Arcesium # Senior Software Engineer | Nov 2017 – May 2020\nRevamped pricing models to reduce manual effort by 90%, facilitating the onboarding of 3 clients with $100M portfolios. Optimized quote-fetching logic to save $1M+ for 20+ clients and built bulk APIs reducing data entry time by 99% (3 hours to 45 secs). 
Wooqer # Platform Engineer | Jul 2017 – Nov 2017\nDeveloped iOS features that reduced long-term maintenance costs by 50%. Technical Skills # Languages: Java (Proficient); Python, Javascript, C, C++ Databases: SQL Server, Oracle, PostgreSQL, Redis Architecture \u0026amp; AI: Distributed Systems, System Design, Evals, RAG, MCPs, Vector Databases, LangChain, Hugging Face Tools: IntelliJ IDEA, Git, Claude CLI, Cursor, Antigravity, LangSmith, Splunk, Grafana Education \u0026amp; Extracurriculars # BTech - Computer Science Engineering, IIIT Kota, India (CGPA: 8.99/10, Rank 3) Extracurriculars: Salesforce TMP Spring’23 Cloud Hackathon winner (Global). Salesforce Commerce Cloud Summer Hackathon winner. Career Coach at InterviewBit (‘20 – ’24). Yoga/Crossfit enthusiast, 100km Cyclothon finisher. ","date":"7 March 2026","externalUrl":null,"permalink":"/about/","section":"Home","summary":"Here’s what I’m focused on right now, and a history of what I’ve built.\n📄 Download Resume (PDF) What I’m doing now # Currently building the future of AI at Salesforce as an SDE3 / SMTS. Since late 2025, my primary focus has been designing an Agentic Metadata Generation and Evaluation Framework for Agentforce Vibes. This empowers low-code users to reliably generate metadata through natural language, backed by a holistic, end-to-end evaluation system.\n","title":"About","type":"page"}]