Linux CGroup Deep Dive: From V1 to V2 Resource Governance
A structured walkthrough of CGroup concepts, V1/V2 differences, controllers, and hands-on troubleshooting.
CGroup (Control Group) is the core of Linux resource governance. It answers a practical question: how do you limit and isolate resource usage for different processes on the same machine, and measure it reliably? In the container era, CGroup is the resource foundation for Docker and Kubernetes. But it is not a container-only feature. If you understand CGroup, you can explain why a container is OOM-killed, why a workload capped at one core still shows latency spikes, and why V2 changes how limits behave.
This article extends a basic CGroup introduction into an engineering-focused deep dive. It covers V1 vs V2 structure differences, key controllers, practical configuration, thread vs process behavior, systemd and container mapping, migration notes, and troubleshooting. The goal is to let you solve real issues, not just memorize terms.
1. What CGroup is trying to solve
Before CGroup, resource management relied on:
- Process priority (nice/renice)
- Scheduler policy (CFS, real-time classes, etc.)
- Manual ops experience (“the box is slow, reboot it”)
These tools cannot enforce isolation or hard boundaries. A single process can consume all memory and trigger system OOM; a CPU hog can starve latency-sensitive services; a disk writer can saturate IO and cause tail latency spikes. CGroup provides a formal way to say:
- This group can use at most 1 core
- This group can use at most 512MiB of memory
- This group can write at most 10MB/s
- This group can create at most 200 child processes
One sentence to remember: CGroup is a process-group-level framework for resource quotas and accounting.
2. Core concepts (V1 terms)
CGroup V1 uses these core concepts, which remain relevant for V2:
- Task: the smallest unit, which is a Linux thread (LWP) rather than the classic process abstraction.
- CGroup: a set of tasks with the same resource configuration.
- Hierarchy: a tree of CGroups. Child nodes inherit from parents.
- Subsystem/Controller: the specific resource controller (cpu, memory, io, pids, devices).
A key V1 property: multiple hierarchies can exist, and controllers can be mounted onto different trees. This is flexible, but it makes the model harder to reason about. V2 fixes this with a unified hierarchy.
3. Process, thread, and task semantics
Linux threads are implemented as LWPs (lightweight processes):
- The main thread PID equals the process PID
- Each additional thread has its own TID
- /proc/<pid>/task shows all threads
CGroup V1 exposes two entry files:
- tasks: write a thread ID
- cgroup.procs: write a process ID
This leads to surprising behavior: writing to cgroup.procs moves the entire process, while writing to tasks moves only one thread. In most cases you want process-level governance, so cgroup.procs is the safer entry point. V2 makes these semantics explicit with cgroup.threads.
3.1 Quick experiment: thread ID vs process ID
Here is a minimal C program that creates multiple threads and spins. Use it to observe tasks vs cgroup.procs behavior.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Each thread spins forever so CPU limits are easy to observe. */
static void *spin(void *arg) {
    (void)arg;
    long tid = syscall(SYS_gettid);   /* kernel thread ID (TID) */
    printf("thread tid=%ld\n", tid);
    while (1) { }
    return NULL;
}

int main(void) {
    printf("main pid=%d\n", getpid());  /* PID equals the main thread's TID */
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_create(&t3, NULL, spin, NULL);
    while (1) { }
    return 0;
}
Build and run:
gcc -O2 -pthread t.c -o t && ./t
Observe threads:
ps -T -p <pid>
ls /proc/<pid>/task
Create a CGroup (V1 example) and write:
mkdir /sys/fs/cgroup/cpu/demo
echo <pid> > /sys/fs/cgroup/cpu/demo/cgroup.procs
This moves all threads of the process. If you do:
echo <tid> > /sys/fs/cgroup/cpu/demo/tasks
Only that thread moves, and other threads remain outside. This is a common reason why CPU limits seem to “not work” when tasks is used incorrectly.
4. CGroup V1 filesystem layout
V1 is implemented via cgroupfs. Each CGroup is a directory, with files representing configuration and task lists.
A typical layout:
ls /sys/fs/cgroup
You will see cpu, memory, blkio, cpuset, etc. Each directory is a hierarchy mounted with one or more controllers.
4.1 What files exist in a CGroup directory
For the cpu controller:
mkdir /sys/fs/cgroup/cpu/demo
ls /sys/fs/cgroup/cpu/demo
You typically see four types:
- Subsystem config files (cpu.shares, cpu.cfs_quota_us)
- Task lists (tasks and cgroup.procs)
- Common config files (notify_on_release, cgroup.clone_children)
- Child CGroup directories
4.2 tasks vs cgroup.procs behavior
- Write process PID to cgroup.procs: all threads join
- Write thread ID to tasks: only that thread joins
- Write thread ID to cgroup.procs: whole process joins
- Write a process PID to tasks: only the main thread (whose TID equals the PID) joins
This is why V1 thread semantics are considered confusing. V2 separates thread control with cgroup.threads.
4.3 Quick lookup: controllers and mounts
On V1, you can view controller mounts via cgroup-tools:
lssubsys -m
Or check mount directly:
mount | grep cgroup
This tells you which controllers are attached to which hierarchies.
5. Key V1 controllers
5.1 CPU controller (cpu + cpuacct)
The two main parameters:
- cpu.cfs_period_us (default 100000us)
- cpu.cfs_quota_us
Formula: CPU limit = quota / period.
Example for 50% CPU:
echo 50000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
cpu.shares is a weight, not a hard limit. cpuacct provides usage accounting.
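The quota/period arithmetic is worth checking by hand. A quick sketch (pure arithmetic, safe to run anywhere):

```shell
# Effective CPU share = cfs_quota_us / cfs_period_us.
# With the default period (100000us), a quota of 50000us is half a core:
awk 'BEGIN { printf "%.2f cores\n", 50000 / 100000 }'   # prints "0.50 cores"
# A quota of -1 (the V1 default) means unlimited.
```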
A reproducible test (0.2 core, observe throttling):
# 0.2 core
echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
echo 20000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
echo $$ > /sys/fs/cgroup/cpu/demo/cgroup.procs
# CPU spin (expect periodic throttling)
while :; do :; done
In another terminal:
cat /sys/fs/cgroup/cpu/demo/cpu.stat
If nr_throttled grows, throttling is happening.
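To put a number on "nr_throttled grows", compute the fraction of scheduling periods that were throttled. A sketch against a sample cpu.stat snapshot (the numbers here are hypothetical; on a real host read the file as shown in the comment):

```shell
# Sample V1 cpu.stat content; on a real host use:
#   stat=$(cat /sys/fs/cgroup/cpu/demo/cpu.stat)
stat="nr_periods 1000
nr_throttled 250
throttled_time 5000000000"
nr_periods=$(echo "$stat" | awk '$1 == "nr_periods" {print $2}')
nr_throttled=$(echo "$stat" | awk '$1 == "nr_throttled" {print $2}')
echo "throttled in $((100 * nr_throttled / nr_periods))% of periods"   # 25%
```

A throttling rate above a few percent on a latency-sensitive service is usually worth investigating.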
5.2 cpuset controller
cpuset limits which CPU cores can be used. You must set cpuset.mems and cpuset.cpus first, or you get errors.
mkdir /sys/fs/cgroup/cpuset/demo
echo 0 > /sys/fs/cgroup/cpuset/demo/cpuset.mems
echo 2-3 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus
cpuset is often used to pin databases to specific cores.
5.3 memory controller
Common files:
- memory.limit_in_bytes (hard limit)
- memory.soft_limit_in_bytes (soft limit)
- memory.oom_control
- memory.stat
Exceeding a hard limit first triggers reclaim; if reclaim cannot free enough, the kernel OOM-kills a task inside the group.
Simple memory hog program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate <mb> MiB, 1 MiB at a time, touching every page so the
 * memory is actually charged to the cgroup, then sleep forever. */
int main(int argc, char **argv) {
    int mb = argc > 1 ? atoi(argv[1]) : 512;
    size_t step = 1 << 20;                        /* 1 MiB */
    char *buf = NULL;
    for (int i = 0; i < mb; i++) {
        char *next = realloc(buf, (size_t)(i + 1) * step);
        if (!next) { free(buf); return 1; }
        buf = next;
        memset(buf + (size_t)i * step, 0, step);  /* fault the pages in */
        usleep(20000);                            /* ramp slowly so you can watch */
    }
    pause();
    return 0;
}
Test with a hard limit:
gcc memhog.c -o memhog
mkdir /sys/fs/cgroup/memory/demo
echo 200M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
./memhog 500
When it exceeds the limit:
cat /sys/fs/cgroup/memory/demo/memory.failcnt
dmesg | tail -n 5
5.4 blkio controller
Limit IO weights or bandwidth:
- blkio.weight
- blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
Example: limit writes to 10MB/s
mkdir /sys/fs/cgroup/blkio/demo
# device major:minor for /dev/sda
echo "8:0 10485760" > /sys/fs/cgroup/blkio/demo/blkio.throttle.write_bps_device
echo $$ > /sys/fs/cgroup/blkio/demo/cgroup.procs
dd if=/dev/zero of=/tmp/test.data bs=1M count=200 oflag=direct
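The "8:0" pair in the example assumes /dev/sda; adjust it for your disk. A small sketch showing where the major:minor comes from and how the bandwidth value is derived (the throttle files take bytes per second):

```shell
# Look up major:minor for a block device (commands shown for /dev/sda):
#   lsblk -no MAJ:MIN /dev/sda      # e.g. prints 8:0
#   cat /sys/class/block/sda/dev    # same information from sysfs
# 10MB/s expressed in bytes per second:
awk 'BEGIN { print 10 * 1024 * 1024 }'   # prints 10485760
```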
5.5 pids controller
Limit process counts:
echo 200 > /sys/fs/cgroup/pids/demo/pids.max
Test (use only on test hosts):
mkdir /sys/fs/cgroup/pids/demo
echo 50 > /sys/fs/cgroup/pids/demo/pids.max
echo $$ > /sys/fs/cgroup/pids/demo/cgroup.procs
for i in $(seq 1 200); do sleep 60 & done
When the limit is hit, fork fails with “Resource temporarily unavailable”.
5.6 devices / freezer / net_cls
- devices: restrict device node access (for example /dev/nvidia0)
- freezer: freeze or thaw process groups
- net_cls / net_prio: classify network packets or set priorities
These are critical in multi-tenant or GPU environments.
6. V1 practice: limiting CPU for a group
A complete example:
# Create CGroup
mkdir /sys/fs/cgroup/cpu/wdj
# Limit to 50% CPU
echo 50000 > /sys/fs/cgroup/cpu/wdj/cpu.cfs_quota_us
# Add the process
echo <pid> > /sys/fs/cgroup/cpu/wdj/cgroup.procs
If you only write a single thread ID to tasks, other threads still run at full speed. That creates the illusion of “limits not working”.
7. Structural issues in V1
V1 provides flexibility, but also long-term problems:
- Multiple hierarchies in parallel make reasoning hard
- Controllers attached to different trees cause split logic
- tasks vs cgroup.procs semantics are confusing
These issues drove the design of V2.
8. V2 design goals
V2 introduces a unified hierarchy:
- One tree
- All controllers on that tree
- A process belongs to a single CGroup
This makes governance more consistent and easier to debug.
9. V2 core files
Mount (on most systemd-based distros the unified hierarchy is already mounted at /sys/fs/cgroup, so this step is only needed where the system has not done it for you):
mount -t cgroup2 none /sys/fs/cgroup
Key files:
- cgroup.controllers
- cgroup.subtree_control
- cgroup.procs
Example:
cat /sys/fs/cgroup/cgroup.controllers
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
9.1 Only leaf nodes can host tasks
V2 rule:
- Internal nodes enable controllers
- Leaf nodes host processes and set limits
Example flow:
cd /sys/fs/cgroup
echo "+cpu +memory" > cgroup.subtree_control
mkdir demo
echo "50000 100000" > demo/cpu.max
echo "300M" > demo/memory.max
echo $$ > demo/cgroup.procs
cat demo/cgroup.procs
If you try to attach processes to the parent node while controllers are enabled in its subtree, the write fails: this is V2's no-internal-process rule, usually surfaced as EBUSY.
9.2 cgroup.threads
V2 adds cgroup.threads for thread-level operations.
cd /sys/fs/cgroup/demo
echo <tid> > cgroup.threads
cat cgroup.threads
Thread-level control is powerful but can be unpredictable. Use it carefully.
9.3 cgroup.events
cgroup.events contains populated, which tells you whether the group still has running processes. This is useful for cleanup and lifecycle automation.
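A minimal cleanup sketch based on populated (the sample content below is hypothetical; cgroup.events also supports poll()/inotify for edge-triggered notification instead of polling):

```shell
# Sample cgroup.events content; on a real host use:
#   events=$(cat /sys/fs/cgroup/demo/cgroup.events)
events="populated 1"
if echo "$events" | grep -q '^populated 1'; then
  echo "group still has processes"
else
  echo "group is empty; safe to rmdir"
fi
```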
10. V2 controller configuration
10.1 CPU (cpu.max / cpu.weight)
V2 replaces the cpu.cfs_quota_us / cpu.cfs_period_us pair with a single cpu.max file (format: "MAX PERIOD"):
# Limit to 1 core
echo "100000 100000" > cpu.max
cpu.weight ranges from 1 to 10000 (default 100).
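cpu.weight is relative: under contention, each sibling group gets weight_i / sum(weights) of the CPU. A quick check with two hypothetical sibling groups of weight 100 and 300:

```shell
# Expected CPU split between two sibling groups competing for the same CPUs:
wa=100; wb=300
awk -v a="$wa" -v b="$wb" \
  'BEGIN { printf "A=%.0f%% B=%.0f%%\n", 100*a/(a+b), 100*b/(a+b) }'
# prints "A=25% B=75%"
```

Note the split only matters under contention; an uncontended group can still use idle CPU up to its cpu.max.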
Observe throttling:
cat cpu.stat
# fields include: usage_usec, user_usec, system_usec, nr_periods, nr_throttled, throttled_usec
If nr_throttled grows, the quota is active.
10.2 memory (memory.max / memory.high)
- memory.high: soft limit
- memory.max: hard limit
echo 500M > memory.high
echo 800M > memory.max
Check events:
cat memory.events
# low 0
# high 12
# max 1
# oom 1
# oom_kill 1
memory.high throttles and reclaims; memory.max triggers an OOM kill.
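The K/M/G suffixes accepted by these files are binary (powers of two). A small converter is handy when comparing configured limits against the byte counts reported in memory.current (the helper name to_bytes is mine, not a kernel interface):

```shell
# Convert a human-readable size (as written to memory.high/memory.max)
# into the byte value the kernel reports back.
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;   # already in bytes
  esac
}
to_bytes 500M   # prints 524288000
```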
10.3 io (io.max / io.weight)
V2 unifies IO control with io.max and io.weight.
10.4 pids.max
pids.max keeps the V1 semantics; the only change is that it now lives on the unified tree.
10.5 PSI (pressure stall information)
V2 exposes pressure metrics. PSI helps detect chronic stalls in CPU, memory, or IO.
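Each pressure file (cpu.pressure, memory.pressure, io.pressure) exposes lines of the form `some avg10=... avg60=... avg300=... total=...`. Extracting avg10 (the 10-second stall average, in percent) is enough for a basic alert; a sketch on a hypothetical sample line:

```shell
# Sample PSI line; on a real V2 host use:
#   line=$(grep ^some /sys/fs/cgroup/demo/cpu.pressure)
line="some avg10=12.34 avg60=5.67 avg300=1.23 total=123456"
avg10=$(echo "$line" | sed -n 's/.*avg10=\([0-9.]*\).*/\1/p')
echo "cpu stalled ${avg10}% of the last 10s"
```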
11. systemd and CGroup
systemd manages services with CGroup slices:
- system.slice
- user.slice
- machine.slice
Useful commands:
systemd-cgls
systemctl status <service>
Create a scope with limits:
systemd-run --scope -p MemoryMax=200M -p CPUQuota=50% bash
Find the CGroup path:
systemctl show -p ControlGroup <service>
systemd-cgls
This shows CGroup is a default system capability, not a manual hack.
12. CGroup and containers (Docker/Kubernetes)
Containers are not magic isolation. They are:
- namespaces (view isolation)
- cgroups (resource limits)
12.1 Docker mapping
docker run --memory 512m --cpus 1 nginx
This writes to memory and cpu controllers.
Find container CGroup paths:
docker run -d --name demo --memory 256m --cpus 0.5 nginx
docker inspect --format '{{.Id}}' demo
cat /proc/$(docker inspect --format '{{.State.Pid}}' demo)/cgroup
Inside the container:
docker exec demo cat /proc/1/cgroup
12.2 Kubernetes resource model
- limits.cpu maps to cpu.max or cpu.cfs_quota_us
- limits.memory maps to memory.max or memory.limit_in_bytes
QoS classes (Guaranteed, Burstable, BestEffort) are implemented via CGroup limits and priorities.
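The mapping from limits.cpu to CFS quota is simple arithmetic: with the default 100000us period, quota = millicores × period / 1000. A sketch with hypothetical values:

```shell
# limits.cpu: 500m with the default CFS period:
millicores=500
period=100000
quota=$(( millicores * period / 1000 ))
echo "$quota $period"   # prints "50000 100000", the V2 cpu.max format
```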
13. Thread-level governance
Thread-level control is possible, but usually not recommended. Threads share memory, migrate across CPUs, and can create hard-to-debug behavior. Prefer process-level control unless you are very certain.
If you must, use cgroup.threads (V2 example shown earlier).
14. CPU quota and scheduling pitfalls
CPU is scheduled in slices, not as a continuous stream. With cpu.max:
- quota < period leads to throttling
- throttling creates bursts and pauses
Latency-sensitive services can see periodic jitter. This is by design. Check cpu.stat to confirm throttling. Mitigations:
- Increase period to smooth throttling
- Increase quota for critical services
- Use cpu.weight for relative fairness instead of hard caps
15. Memory limits and OOM behavior
In V2:
- memory.current: usage
- memory.high: soft threshold
- memory.max: hard ceiling
- memory.events: OOM and reclaim signals
memory.high adds pressure and latency but does not kill. memory.max causes OOM Kill. A good practice is to set memory.high as the “pressure zone” and memory.max as the “hard ceiling”. memory.oom.group can kill the whole group to avoid half-dead services.
16. IO controller and device view
IO control is device-based. For example, /dev/sda might be 8:0. You can limit bandwidth:
echo "8:0 rbps=10485760 wbps=10485760" > io.max
This prevents log storms and batch jobs from starving online services. Check io.stat for actual IO usage and latency.
17. Delegation and authority boundaries
V2 supports clean delegation. systemd can hand a subtree to a container runtime, which can create its own children. Requirements:
- Parent enables controllers
- Parent holds no tasks
- Child has write permissions
If rootless containers fail with permission issues, delegation is usually missing.
18. CGroup namespace and visibility
CGroup namespaces hide the host path. Inside a container, you may see /, while the host sees /kubepods.slice/… This means:
- Host-side /proc/<pid>/cgroup shows the full path
- Container-side view is a shortened, namespaced path
Understanding this avoids “path not found” confusion.
19. V1 to V2 migration notes
Migration is not just renaming files:
- V2 controllers are off by default; enable subtree_control
- Only leaf nodes can host tasks
- Old scripts often assume V1 paths
If V1 and V2 both exist, be explicit about which one you target.
20. Troubleshooting checklist
When you suspect limits are wrong:
- Which CGroup is the process in?
- Is it V1 or V2?
- Are controllers enabled?
- Are the limits set where you think they are?
- Is throttling/pressure/OOM happening?
Commands:
cat /proc/<pid>/cgroup
stat -fc %T /sys/fs/cgroup
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
cat /sys/fs/cgroup/<group>/cpu.max
cat /sys/fs/cgroup/<group>/memory.max
cat /sys/fs/cgroup/<group>/cpu.stat
cat /sys/fs/cgroup/<group>/memory.events
dmesg | grep -i oom | tail -n 5
Most issues are caused by disabled controllers, non-leaf nodes, or looking at the wrong path.
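The first two checks can be wrapped into a small detection snippet (the function name classify_cgroupfs is mine): cgroup2fs means the unified V2 hierarchy, while tmpfs at that path usually means V1 per-controller mounts.

```shell
# Classify the filesystem type mounted at /sys/fs/cgroup.
classify_cgroupfs() {
  case "$1" in
    cgroup2fs) echo "unified (v2)" ;;
    tmpfs)     echo "legacy (v1 per-controller mounts)" ;;
    *)         echo "unknown: $1" ;;
  esac
}
classify_cgroupfs "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```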
21. Engineering boundaries
CGroup enforces fairness and ceilings, but it does not fix memory leaks, algorithmic bottlenecks, or flaky dependencies. Use CGroup as a safety valve, and use monitoring and application tuning to solve root causes.
22. Aligning with Kubernetes resource semantics
Kubernetes splits requests and limits:
- limits become CGroup hard limits
- requests affect scheduling, not runtime caps
If requests are far below limits, latency-sensitive services will be squeezed. Best practices:
- Set requests close to steady-state usage for critical services
- Keep CPU headroom for latency-sensitive workloads
- Use memory.high for softer constraints where possible
23. Summary and practice tips
If Linux resource governance is a traffic system, CGroup is the speed limit, lane control, and traffic monitoring. Without it, co-located workloads degrade each other unpredictably. The best way to learn is to run the commands in a test environment. CGroup is not just a concept; it is a tool that directly impacts production stability. If you understand it, you understand container resource boundaries and Kubernetes QoS behavior.
24. Common scenarios in cloud platforms
In multi-tenant platforms, CGroup is not optional. Typical scenarios:
- Co-locate online and batch workloads: protect online services from CPU starvation
- Database and log IO on the same host: throttle log writes to avoid tail latency spikes
- Tenant isolation: use memory.max and pids.max to prevent runaway usage
- GPU and device access: use devices controller to restrict device nodes
CGroup turns governance from “tribal knowledge” into explicit policy.
25. Resource governance mindset
Limits are not better when they are tighter. Too strict leads to throttling and OOM; too loose enables noisy neighbors. A practical approach:
- Use real metrics to set limits
- Guarantee CPU and memory headroom for latency-critical services
- Let throughput workloads absorb more variance
- Set clear hard ceilings and alert on overuse
When you treat CGroup as a policy layer, its value becomes obvious.
26. Self-check exercises (test hosts only)
- Create a demo CGroup, limit a CPU spin loop to 0.2 core, and watch cpu.stat throttling.
- Set memory.high and memory.max and observe latency vs OOM behavior.
- Use io.max to limit disk writes and compare p99 latency before and after.
These exercises turn config files into real system behavior you can feel.