Linux CGroup Deep Dive: From V1 to V2 Resource Governance
A structured walkthrough of CGroup concepts, V1/V2 differences, controllers, and hands-on troubleshooting.
CGroup (Control Group) is the core of Linux resource governance. It answers a practical question: how do you limit and isolate resource usage for different processes on the same machine, and measure it reliably? In the container era, CGroup is the resource foundation for Docker and Kubernetes. But it is not a container-only feature. If you understand CGroup, you can explain why a container is OOM-killed, why a workload capped at one core still shows latency spikes, and why V2 changes how limits behave.
This article extends a basic CGroup introduction into an engineering-focused deep dive. It covers V1 vs V2 structure differences, key controllers, practical configuration, thread vs process behavior, systemd and container mapping, migration notes, and troubleshooting. The goal is to let you solve real issues, not just memorize terms.
1. What CGroup is trying to solve
Before CGroup, resource management relied on:
- Process priority (nice/renice)
- Scheduler policy (CFS, real-time classes, etc.)
- Manual ops experience (“the box is slow, reboot it”)
These tools cannot enforce isolation or hard boundaries. A single process can consume all memory and trigger system OOM; a CPU hog can starve latency-sensitive services; a disk writer can saturate IO and cause tail latency spikes. CGroup provides a formal way to say:
- This group can use at most 1 core
- This group can use at most 512MiB of memory
- This group can write at most 10MB/s
- This group can create at most 200 child processes
One sentence to remember: CGroup is a process-group-level framework for resource quotas and accounting.
2. Core concepts (V1 terms)
CGroup V1 uses these core concepts, which remain relevant for V2:
- Task: the smallest unit, which is a Linux thread (LWP) rather than the classic process abstraction.
- CGroup: a set of tasks with the same resource configuration.
- Hierarchy: a tree of CGroups. Child nodes inherit from parents.
- Subsystem/Controller: the specific resource controller (cpu, memory, io, pids, devices).
A key V1 property: multiple hierarchies can exist, and controllers can be mounted onto different trees. This is flexible, but it makes the model harder to reason about. V2 fixes this with a unified hierarchy.
3. Process, thread, and task semantics
Linux threads are implemented as LWPs (lightweight processes):
- The main thread PID equals the process PID
- Each additional thread has its own TID
- /proc/<pid>/task shows all threads
CGroup V1 exposes two entry files:
- tasks: write a thread ID
- cgroup.procs: write a process ID
This leads to surprising behavior: writing to cgroup.procs moves the entire process, while writing to tasks moves only one thread. In most cases you want process-level governance, so cgroup.procs is the safer entry point. V2 makes these semantics explicit with cgroup.threads.
3.1 Quick experiment: thread ID vs process ID
Here is a minimal C program that creates multiple threads and spins. Use it to observe tasks vs cgroup.procs behavior.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Each thread spins forever so CPU limits are easy to observe. */
static void *spin(void *arg) {
    (void)arg;
    long tid = syscall(SYS_gettid);   /* kernel thread ID (TID) */
    printf("thread tid=%ld\n", tid);
    while (1) { }
    return NULL;
}

int main(void) {
    printf("main pid=%d\n", getpid());  /* PID equals the main thread's TID */
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_create(&t3, NULL, spin, NULL);
    while (1) { }
    return 0;
}
Build and run:
gcc -O2 -pthread t.c -o t && ./t
Observe threads:
ps -T -p <pid>
ls /proc/<pid>/task
Create a CGroup (V1 example) and write:
mkdir /sys/fs/cgroup/cpu/demo
echo <pid> > /sys/fs/cgroup/cpu/demo/cgroup.procs
This moves all threads of the process. If you do:
echo <tid> > /sys/fs/cgroup/cpu/demo/tasks
Only that thread moves, and other threads remain outside. This is a common reason why CPU limits seem to “not work” when tasks is used incorrectly.
4. CGroup V1 filesystem layout
V1 is implemented via cgroupfs. Each CGroup is a directory, with files representing configuration and task lists.
A typical layout:
ls /sys/fs/cgroup
You will see cpu, memory, blkio, cpuset, etc. Each directory is a hierarchy mounted with one or more controllers.
4.1 What files exist in a CGroup directory
For the cpu controller:
mkdir /sys/fs/cgroup/cpu/demo
ls /sys/fs/cgroup/cpu/demo
You typically see four types:
- Subsystem config files (cpu.shares, cpu.cfs_quota_us)
- Task lists (tasks and cgroup.procs)
- Common config files (notify_on_release, cgroup.clone_children)
- Child CGroup directories
4.2 tasks vs cgroup.procs behavior
- Write process PID to cgroup.procs: all threads join
- Write thread ID to tasks: only that thread joins
- Write thread ID to cgroup.procs: whole process joins
- Write a process PID to tasks: only the main thread (whose TID equals the PID) joins
This is why V1 thread semantics are considered confusing. V2 separates thread control with cgroup.threads.
4.3 Quick lookup: controllers and mounts
On V1, you can view controller mounts via cgroup-tools:
lssubsys -m
Or check mount directly:
mount | grep cgroup
This tells you which controllers are attached to which hierarchies.
5. Key V1 controllers
5.1 CPU controller (cpu + cpuacct)
The two main parameters:
- cpu.cfs_period_us (default 100000us)
- cpu.cfs_quota_us
Formula: CPU limit = quota / period.
Example for 50% CPU:
echo 50000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
cpu.shares is a weight, not a hard limit. cpuacct provides usage accounting.
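The quota/period arithmetic is worth checking by hand. A quick sketch (pure arithmetic, safe to run anywhere):

```shell
# Effective CPU share = cfs_quota_us / cfs_period_us.
# With the default period (100000us), a quota of 50000us is half a core:
awk 'BEGIN { printf "%.2f cores\n", 50000 / 100000 }'   # prints "0.50 cores"
# A quota of -1 (the V1 default) means unlimited.
```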
A reproducible test (0.2 core, observe throttling):
# 0.2 core
echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
echo 20000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
echo $$ > /sys/fs/cgroup/cpu/demo/cgroup.procs
# CPU spin (expect periodic throttling)
while :; do :; done
In another terminal:
cat /sys/fs/cgroup/cpu/demo/cpu.stat
If nr_throttled grows, throttling is happening.
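To put a number on "nr_throttled grows", compute the fraction of scheduling periods that were throttled. A sketch against a sample cpu.stat snapshot (the numbers here are hypothetical; on a real host read the file as shown in the comment):

```shell
# Sample V1 cpu.stat content; on a real host use:
#   stat=$(cat /sys/fs/cgroup/cpu/demo/cpu.stat)
stat="nr_periods 1000
nr_throttled 250
throttled_time 5000000000"
nr_periods=$(echo "$stat" | awk '$1 == "nr_periods" {print $2}')
nr_throttled=$(echo "$stat" | awk '$1 == "nr_throttled" {print $2}')
echo "throttled in $((100 * nr_throttled / nr_periods))% of periods"   # 25%
```

A throttling rate above a few percent on a latency-sensitive service is usually worth investigating.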
5.2 cpuset controller
cpuset limits which CPU cores can be used. You must set cpuset.mems and cpuset.cpus first, or you get errors.
mkdir /sys/fs/cgroup/cpuset/demo
echo 0 > /sys/fs/cgroup/cpuset/demo/cpuset.mems
echo 2-3 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus
cpuset is often used to pin databases to specific cores.
5.3 memory controller
Common files:
- memory.limit_in_bytes (hard limit)
- memory.soft_limit_in_bytes (soft limit)
- memory.oom_control
- memory.stat
Exceeding a hard limit first triggers reclaim; if reclaim cannot free enough, the kernel OOM-kills a task inside the group.
Simple memory hog program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate <mb> MiB, 1 MiB at a time, touching every page so the
 * memory is actually charged to the cgroup, then sleep forever. */
int main(int argc, char **argv) {
    int mb = argc > 1 ? atoi(argv[1]) : 512;
    size_t step = 1 << 20;                        /* 1 MiB */
    char *buf = NULL;
    for (int i = 0; i < mb; i++) {
        char *next = realloc(buf, (size_t)(i + 1) * step);
        if (!next) { free(buf); return 1; }
        buf = next;
        memset(buf + (size_t)i * step, 0, step);  /* fault the pages in */
        usleep(20000);                            /* ramp slowly so you can watch */
    }
    pause();
    return 0;
}
Test with a hard limit:
gcc memhog.c -o memhog
mkdir /sys/fs/cgroup/memory/demo
echo 200M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
./memhog 500
When it exceeds the limit:
cat /sys/fs/cgroup/memory/demo/memory.failcnt
dmesg | tail -n 5
5.4 blkio controller
Limit IO weights or bandwidth:
- blkio.weight
- blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
Example: limit writes to 10MB/s
mkdir /sys/fs/cgroup/blkio/demo
# device major:minor for /dev/sda
echo "8:0 10485760" > /sys/fs/cgroup/blkio/demo/blkio.throttle.write_bps_device
echo $$ > /sys/fs/cgroup/blkio/demo/cgroup.procs
dd if=/dev/zero of=/tmp/test.data bs=1M count=200 oflag=direct
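The "8:0" pair in the example assumes /dev/sda; adjust it for your disk. A small sketch showing where the major:minor comes from and how the bandwidth value is derived (the throttle files take bytes per second):

```shell
# Look up major:minor for a block device (commands shown for /dev/sda):
#   lsblk -no MAJ:MIN /dev/sda      # e.g. prints 8:0
#   cat /sys/class/block/sda/dev    # same information from sysfs
# 10MB/s expressed in bytes per second:
awk 'BEGIN { print 10 * 1024 * 1024 }'   # prints 10485760
```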
5.5 pids controller
Limit process counts:
echo 200 > /sys/fs/cgroup/pids/demo/pids.max
Test (use only on test hosts):
mkdir /sys/fs/cgroup/pids/demo
echo 50 > /sys/fs/cgroup/pids/demo/pids.max
echo $$ > /sys/fs/cgroup/pids/demo/cgroup.procs
for i in $(seq 1 200); do sleep 60 & done
When the limit is hit, fork fails with “Resource temporarily unavailable”.
5.6 devices / freezer / net_cls
- devices: restrict device node access (for example /dev/nvidia0)
- freezer: freeze or thaw process groups
- net_cls / net_prio: classify network packets or set priorities
These are critical in multi-tenant or GPU environments.
6. V1 practice: limiting CPU for a group
A complete example:
# Create CGroup
mkdir /sys/fs/cgroup/cpu/wdj
# Limit to 50% CPU
echo 50000 > /sys/fs/cgroup/cpu/wdj/cpu.cfs_quota_us
# Add the process
echo <pid> > /sys/fs/cgroup/cpu/wdj/cgroup.procs
If you only write a single thread ID to tasks, other threads still run at full speed. That creates the illusion of “limits not working”.
7. Structural issues in V1
V1 provides flexibility, but also long-term problems:
- Multiple hierarchies in parallel make reasoning hard
- Controllers attached to different trees cause split logic
- tasks vs cgroup.procs semantics are confusing
These issues drove the design of V2.
8. V2 design goals
V2 introduces a unified hierarchy:
- One tree
- All controllers on that tree
- A process belongs to a single CGroup
This makes governance more consistent and easier to debug.
9. V2 core files
Mount (on most systemd-based distros the unified hierarchy is already mounted at /sys/fs/cgroup, so this step is only needed where the system has not done it for you):
mount -t cgroup2 none /sys/fs/cgroup
Key files:
- cgroup.controllers
- cgroup.subtree_control
- cgroup.procs
Example:
cat /sys/fs/cgroup/cgroup.controllers
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
9.1 Only leaf nodes can host tasks
V2 rule:
- Internal nodes enable controllers
- Leaf nodes host processes and set limits
Example flow:
cd /sys/fs/cgroup
echo "+cpu +memory" > cgroup.subtree_control
mkdir demo
echo "50000 100000" > demo/cpu.max
echo "300M" > demo/memory.max
echo $$ > demo/cgroup.procs
cat demo/cgroup.procs
If you try to attach processes to the parent node while controllers are enabled in its subtree, the write fails: this is V2's no-internal-process rule, usually surfaced as EBUSY.
9.2 cgroup.threads
V2 adds cgroup.threads for thread-level operations.
cd /sys/fs/cgroup/demo
echo <tid> > cgroup.threads
cat cgroup.threads
Thread-level control is powerful but can be unpredictable. Use it carefully.
9.3 cgroup.events
cgroup.events contains populated, which tells you whether the group still has running processes. This is useful for cleanup and lifecycle automation.
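A minimal cleanup sketch based on populated (the sample content below is hypothetical; cgroup.events also supports poll()/inotify for edge-triggered notification instead of polling):

```shell
# Sample cgroup.events content; on a real host use:
#   events=$(cat /sys/fs/cgroup/demo/cgroup.events)
events="populated 1"
if echo "$events" | grep -q '^populated 1'; then
  echo "group still has processes"
else
  echo "group is empty; safe to rmdir"
fi
```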
10. V2 controller configuration
10.1 CPU (cpu.max / cpu.weight)
V2 replaces the cpu.cfs_quota_us / cpu.cfs_period_us pair with a single cpu.max file (format: "MAX PERIOD"):
# Limit to 1 core
echo "100000 100000" > cpu.max
cpu.weight ranges from 1 to 10000 (default 100).
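cpu.weight is relative: under contention, each sibling group gets weight_i / sum(weights) of the CPU. A quick check with two hypothetical sibling groups of weight 100 and 300:

```shell
# Expected CPU split between two sibling groups competing for the same CPUs:
wa=100; wb=300
awk -v a="$wa" -v b="$wb" \
  'BEGIN { printf "A=%.0f%% B=%.0f%%\n", 100*a/(a+b), 100*b/(a+b) }'
# prints "A=25% B=75%"
```

Note the split only matters under contention; an uncontended group can still use idle CPU up to its cpu.max.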
Observe throttling:
cat cpu.stat
# fields include: usage_usec, user_usec, system_usec, nr_periods, nr_throttled, throttled_usec
If nr_throttled grows, the quota is active.
10.2 memory (memory.max / memory.high)
- memory.high: soft limit
- memory.max: hard limit
echo 500M > memory.high
echo 800M > memory.max
Check events:
cat memory.events
# low 0
# high 12
# max 1
# oom 1
# oom_kill 1
memory.high throttles and reclaims; memory.max triggers an OOM kill.
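The K/M/G suffixes accepted by these files are binary (powers of two). A small converter is handy when comparing configured limits against the byte counts reported in memory.current (the helper name to_bytes is mine, not a kernel interface):

```shell
# Convert a human-readable size (as written to memory.high/memory.max)
# into the byte value the kernel reports back.
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;   # already in bytes
  esac
}
to_bytes 500M   # prints 524288000
```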
10.3 io (io.max / io.weight)
V2 unifies IO control with io.max and io.weight.
10.4 pids.max
pids.max keeps the V1 semantics; the only change is that it now lives on the unified tree.
10.5 PSI (pressure stall information)
V2 exposes pressure metrics. PSI helps detect chronic stalls in CPU, memory, or IO.
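Each pressure file (cpu.pressure, memory.pressure, io.pressure) exposes lines of the form `some avg10=... avg60=... avg300=... total=...`. Extracting avg10 (the 10-second stall average, in percent) is enough for a basic alert; a sketch on a hypothetical sample line:

```shell
# Sample PSI line; on a real V2 host use:
#   line=$(grep ^some /sys/fs/cgroup/demo/cpu.pressure)
line="some avg10=12.34 avg60=5.67 avg300=1.23 total=123456"
avg10=$(echo "$line" | sed -n 's/.*avg10=\([0-9.]*\).*/\1/p')
echo "cpu stalled ${avg10}% of the last 10s"
```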
11. systemd and CGroup
systemd manages services with CGroup slices:
- system.slice
- user.slice
- machine.slice
Useful commands:
systemd-cgls
systemctl status <service>
Create a scope with limits:
systemd-run --scope -p MemoryMax=200M -p CPUQuota=50% bash
Find the CGroup path:
systemctl show -p ControlGroup <service>
systemd-cgls
This shows CGroup is a default system capability, not a manual hack.
12. CGroup and containers (Docker/Kubernetes)
Containers are not magic isolation. They are:
- namespaces (view isolation)
- cgroups (resource limits)
12.1 Docker mapping
docker run --memory 512m --cpus 1 nginx
This writes to memory and cpu controllers.
Find container CGroup paths:
docker run -d --name demo --memory 256m --cpus 0.5 nginx
docker inspect --format '{{.Id}}' demo
cat /proc/$(docker inspect --format '{{.State.Pid}}' demo)/cgroup
Inside the container:
docker exec demo cat /proc/1/cgroup
12.2 Kubernetes resource model
- limits.cpu maps to cpu.max or cpu.cfs_quota_us
- limits.memory maps to memory.max or memory.limit_in_bytes
QoS classes (Guaranteed, Burstable, BestEffort) are implemented via CGroup limits and priorities.
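The mapping from limits.cpu to CFS quota is simple arithmetic: with the default 100000us period, quota = millicores × period / 1000. A sketch with hypothetical values:

```shell
# limits.cpu: 500m with the default CFS period:
millicores=500
period=100000
quota=$(( millicores * period / 1000 ))
echo "$quota $period"   # prints "50000 100000", the V2 cpu.max format
```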
13. Thread-level governance
Thread-level control is possible, but usually not recommended. Threads share memory, migrate across CPUs, and can create hard-to-debug behavior. Prefer process-level control unless you are very certain.
If you must, use cgroup.threads (V2 example shown earlier).
14. CPU quota and scheduling pitfalls
CPU is scheduled in slices, not as a continuous stream. With cpu.max:
- quota < period leads to throttling
- throttling creates bursts and pauses
Latency-sensitive services can see periodic jitter. This is by design. Check cpu.stat to confirm throttling. Mitigations:
- Increase period to smooth throttling
- Increase quota for critical services
- Use cpu.weight for relative fairness instead of hard caps
15. Memory limits and OOM behavior
In V2:
- memory.current: usage
- memory.high: soft threshold
- memory.max: hard ceiling
- memory.events: OOM and reclaim signals
memory.high adds pressure and latency but does not kill. memory.max causes OOM Kill. A good practice is to set memory.high as the “pressure zone” and memory.max as the “hard ceiling”. memory.oom.group can kill the whole group to avoid half-dead services.
16. IO controller and device view
IO control is device-based. For example, /dev/sda might be 8:0. You can limit bandwidth:
echo "8:0 rbps=10485760 wbps=10485760" > io.max
This prevents log storms and batch jobs from starving online services. Check io.stat for actual IO usage and latency.
17. Delegation and authority boundaries
V2 supports clean delegation. systemd can hand a subtree to a container runtime, which can create its own children. Requirements:
- Parent enables controllers
- Parent holds no tasks
- Child has write permissions
If rootless containers fail with permission issues, delegation is usually missing.
18. CGroup namespace and visibility
CGroup namespaces hide the host path. Inside a container, you may see /, while the host sees /kubepods.slice/… This means:
- Host-side /proc/<pid>/cgroup shows the full path
- Container-side view is a shortened, namespaced path
Understanding this avoids “path not found” confusion.
19. V1 to V2 migration notes
Migration is not just renaming files:
- V2 controllers are off by default; enable subtree_control
- Only leaf nodes can host tasks
- Old scripts often assume V1 paths
If V1 and V2 both exist, be explicit about which one you target.
20. Troubleshooting checklist
When you suspect limits are wrong:
- Which CGroup is the process in?
- Is it V1 or V2?
- Are controllers enabled?
- Are the limits set where you think they are?
- Is throttling/pressure/OOM happening?
Commands:
cat /proc/<pid>/cgroup
stat -fc %T /sys/fs/cgroup
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
cat /sys/fs/cgroup/<group>/cpu.max
cat /sys/fs/cgroup/<group>/memory.max
cat /sys/fs/cgroup/<group>/cpu.stat
cat /sys/fs/cgroup/<group>/memory.events
dmesg | grep -i oom | tail -n 5
Most issues are caused by disabled controllers, non-leaf nodes, or looking at the wrong path.
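The first two checks can be wrapped into a small detection snippet (the function name classify_cgroupfs is mine): cgroup2fs means the unified V2 hierarchy, while tmpfs at that path usually means V1 per-controller mounts.

```shell
# Classify the filesystem type mounted at /sys/fs/cgroup.
classify_cgroupfs() {
  case "$1" in
    cgroup2fs) echo "unified (v2)" ;;
    tmpfs)     echo "legacy (v1 per-controller mounts)" ;;
    *)         echo "unknown: $1" ;;
  esac
}
classify_cgroupfs "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```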
21. Engineering boundaries
CGroup enforces fairness and ceilings, but it does not fix memory leaks, algorithmic bottlenecks, or flaky dependencies. Use CGroup as a safety valve, and use monitoring and application tuning to solve root causes.
22. Aligning with Kubernetes resource semantics
Kubernetes splits requests and limits:
- limits become CGroup hard limits
- requests affect scheduling, not runtime caps
If requests are far below limits, latency-sensitive services will be squeezed. Best practices:
- Set requests close to steady-state usage for critical services
- Keep CPU headroom for latency-sensitive workloads
- Use memory.high for softer constraints where possible
23. Summary and practice tips
If Linux resource governance is a traffic system, CGroup is the speed limit, lane control, and traffic monitoring. Without it, co-located workloads degrade each other unpredictably. The best way to learn is to run the commands in a test environment. CGroup is not just a concept; it is a tool that directly impacts production stability. If you understand it, you understand container resource boundaries and Kubernetes QoS behavior.
24. Common scenarios in cloud platforms
In multi-tenant platforms, CGroup is not optional. Typical scenarios:
- Co-locate online and batch workloads: protect online services from CPU starvation
- Database and log IO on the same host: throttle log writes to avoid tail latency spikes
- Tenant isolation: use memory.max and pids.max to prevent runaway usage
- GPU and device access: use devices controller to restrict device nodes
CGroup turns governance from “tribal knowledge” into explicit policy.
25. Resource governance mindset
Limits are not better when they are tighter. Too strict leads to throttling and OOM; too loose enables noisy neighbors. A practical approach:
- Use real metrics to set limits
- Guarantee CPU and memory headroom for latency-critical services
- Let throughput workloads absorb more variance
- Set clear hard ceilings and alert on overuse
When you treat CGroup as a policy layer, its value becomes obvious.
26. Self-check exercises (test hosts only)
- Create a demo CGroup, limit a CPU spin loop to 0.2 core, and watch cpu.stat throttling.
- Set memory.high and memory.max and observe latency vs OOM behavior.
- Use io.max to limit disk writes and compare p99 latency before and after.
These exercises turn config files into real system behavior you can feel.