CFN Cloud
Cloud-native notes on Kubernetes, platform engineering, and modern infrastructure.
Reading tracks
Start with basics, then move into operations, tradeoffs, and troubleshooting.
Kubernetes
A curated reading track for Kubernetes.
GPU
A curated reading track for GPU topics.
System
A curated reading track for system-level topics.
Recent writing
New notes, guides, and long-form pieces from the main archive.
GPU Overprovisioning Solutions: From Oversubscription and Sharing to Isolation
A practical guide to GPU overprovisioning strategies, including scheduler-level oversubscription, time slicing, memory controls, MIG, vGPU, queue backfill, and operational guardrails.
How Startups Should Choose: Serverless GPU vs Dedicated GPU
A practical guide to choosing between serverless GPUs and dedicated GPUs for startups, based on cost structure, delivery speed, performance predictability, operations burden, and team maturity.
Deep Dive into Linux Heap Memory Management: From Basics to Core Exploitation
A comprehensive deep dive into Linux glibc (ptmalloc2) heap memory allocation and reclamation strategies. Explores the arena, chunk, and bin (fast, small, large, unsorted) data structures, and the principles behind classic vulnerabilities such as use-after-free.
OpenClaw vs ZeroClaw vs PicoClaw: Comparing 5 AI Agent Frameworks
A practical comparison of five AI agent frameworks - OpenClaw, ZeroClaw, PicoClaw, Nanobot, and IronClaw - covering size, architecture, security tradeoffs, and adoption fit.
KAI-Scheduler vs HAMi: Two Ways to Share GPUs in Kubernetes (Soft vs Hard Isolation)
An engineering-oriented comparison of KAI-Scheduler’s Reservation Pod approach and HAMi’s hard isolation path, including trade-offs, failure modes (noisy neighbor), and how the two layers can complement each other.
hetGPU: Chasing Cross-Vendor GPU Binary Compatibility
An engineering-oriented guide to hetGPU: how a compiler + runtime stack can make one GPU binary run across NVIDIA/AMD/Intel/Tenstorrent, including SIMT vs MIMD, memory model gaps, and live kernel migration.
Kubernetes vs Docker vs OpenStack: Stop Comparing Tools at Different Layers
A practical boundary guide: Docker packages and runs containers, Kubernetes orchestrates and keeps services stable at scale, and OpenStack turns datacenter hardware into an IaaS resource pool (VM/network/storage).
Kubernetes GPU Virtualization Explained Through gpu-manager Startup Flow
A deep dive into Kubernetes GPU virtualization through gpu-manager startup flow, including device interception, topology awareness, scheduling, and allocation mechanics.
Linux CGroup Deep Dive: Migrating from V1 Chaos to V2 Architecture
A comprehensive, hands-on breakdown of CGroup mechanics, covering core concepts, controller nuances, and actionable troubleshooting for production environments.
Linux Function Calls and Stack Frames
Understand calling conventions, stack frames, call/ret behavior, how to observe them in a debugger, and their security implications, all from the assembly-level view.
ELF Explained: Sections, Segments, Relocations, and Dynamic Linking
Understand ELF files from sections and segments to relocations and dynamic linking, with practical examples for debugging Linux binaries and loader issues.
Kubernetes Troubleshooting Playbook: Pending, CrashLoopBackOff, and Traffic Failures
A practical Kubernetes troubleshooting playbook for Pending Pods, CrashLoopBackOff, readiness failures, networking issues, and node-level problems.
Kubernetes Probe Best Practices: Liveness, Readiness, Startup, and Failure Signals
Use better Kubernetes probes by choosing the right signal, tuning thresholds, and avoiding false restarts, traffic drops, and noisy rollouts.
Kubernetes Tip: Autoscaling Without Thrash (HPA + VPA + Cluster Autoscaler)
How to make autoscaling predictable: right requests, sane HPA behavior, VPA recommendations, and capacity-aware cluster scaling.
Kubernetes NetworkPolicy Best Practices: Default Deny, DNS, and Safe Rollout
A practical rollout path for Kubernetes NetworkPolicy: start with default deny, whitelist DNS and key dependencies, and avoid breaking production traffic.
Kubernetes RBAC Least Privilege: Safer Roles, Bindings, and Access Review
Learn practical Kubernetes RBAC least-privilege patterns, how to reduce overbroad permissions, and which checks catch risky role bindings before incidents.
Kubernetes Tip: Debug Pods with Ephemeral Containers
Safely inspect a live Pod without baking debugging tools into production images.
Kubernetes Tip: Safer Rollouts with PDB + Surge/Unavailable
Combine Deployment rollingUpdate settings with PodDisruptionBudgets to keep availability during upgrades and node maintenance.
Kubernetes Tip: Requests & Limits (Without Surprises)
How CPU/memory requests and limits actually affect scheduling, throttling, OOMKills, and autoscaling.
Kubernetes Probes Explained: Liveness, Readiness, and Startup Checks
Learn how liveness, readiness, and startup probes work in Kubernetes, what each one should check, and how to avoid restart loops and false failures.
Helm + MySQL on Kubernetes: Install a Cluster and Understand the Tradeoffs
Use Helm to deploy a MySQL cluster on Kubernetes while understanding chart defaults, persistence, networking, and production tradeoffs.
kubectl Port-Forward Explained: Safe Debugging Access to Kubernetes Workloads
Learn how kubectl port-forward works, when to use it for debugging, and how it differs from Services, Ingress, and production traffic paths.
MySQL Replication on Kubernetes: Topology, Storage, and Failure Modes
Understand how to run MySQL replication on Kubernetes, including primary-replica design, storage concerns, failover risks, and operational checks.
Kubernetes Headless Service Explained: DNS, Pod Identity, and Stateful Workloads
Understand when to use a headless Service in Kubernetes, how DNS works without a virtual IP, and why it matters for StatefulSets and peer discovery.
Kubernetes StorageClass Explained: Dynamic Provisioning and Defaults
Understand how StorageClass enables dynamic provisioning in Kubernetes, how default classes work, and how to choose the right storage policy.
Kubernetes StatefulSet Explained: Stable Identity, Ordering, and Storage
Learn when to use a StatefulSet in Kubernetes, how stable Pod identity works, and why ordering and persistent storage matter.
Kubernetes PV and PVC Explained: Persistent Storage Basics
Learn how PersistentVolumes and PersistentVolumeClaims work in Kubernetes, how binding happens, and how to troubleshoot storage lifecycle issues.
Ephemeral Volumes
Ephemeral volumes share the Pod's lifetime and suit caches or temporary files.
Kubernetes ConfigMap vs Secret: Configuration, Sensitive Data, and Safe Usage
Understand when to use ConfigMap or Secret in Kubernetes, how they reach Pods, and which practices reduce config drift and secret exposure.
Kubernetes Volumes Explained: EmptyDir, HostPath, and Persistent Storage Basics
Learn the core Kubernetes volume types, what data survives Pod restarts, and how to choose between temporary and persistent storage.