Running a Multithreaded Executable on Linux

From ELF on disk to threads on CPU

User Space
ELF / Loader
Virtual Memory
Kernel Space
Scheduler
Sync / IPC
System Calls
Hardware
ELF Header
Program Headers
.text
.rodata
.data
.bss
.dynamic / .plt / .got
.debug_*
Section Headers
An ELF binary is divided into segments (runtime view via program headers) and sections (link-time view via section headers). The kernel only cares about segments when loading.
execve(2) — kernel entry point
binfmt_elf handler — kernel parses ELF headers
ld-linux.so — dynamic linker/loader
Relocation — PLT / GOT patching
LD_PRELOAD / LD_AUDIT — interposition hooks
.init_array / .fini_array — ctors & dtors
vDSO mapping — fast syscall page
execve replaces the process image. The kernel maps ELF segments, sets up the initial stack with argc/argv/envp/auxv, then jumps to the dynamic linker's entry point.
0x7fff…
Stack (per-thread) ↓ grows down
0x7f00…
mmap region — shared libs, anonymous, files
0x0055…+
Heap ↑ grows up (brk / mmap)
0x0055…
.bss — zero-init data (CoW zero page)
0x0055…
.data — initialised globals (RW)
0x0055…
.rodata — read-only constants (R)
0x0040…
.text — executable code (RX)
[vdso]
vDSO — kernel-mapped fast-path page
0xffff…
[vsyscall] — legacy compat page
Each VMA (vm_area_struct) tracks permissions, file backing, and CoW state. ASLR randomises base addresses. /proc/<pid>/maps lists all VMAs live.
Main Thread (TID=PID)
pthread_create() origin
Stack: argv, env, auxv
Signal handler default
TLS block 0
Worker Thread N
clone(CLONE_VM|…)
Private stack (mmap)
Shared heap/globals
Own TLS block
Worker Thread N+1
clone(CLONE_VM|…)
Private stack (mmap)
Shared heap/globals
Own TLS block
TLS (Thread-Local Storage)
__thread / _Thread_local
FS register base
pthread_self() ptr
errno lives here
pthread_mutex — futex-backed lock
pthread_cond — condition variable
pthread_rwlock — readers-writer lock
pthread_barrier — rendezvous point
sem_t — POSIX semaphore
Thread cancellation — deferred / async
POSIX threads share address space, file descriptors, and signal disposition but have private stacks, TLS, and scheduling attributes. All are Linux tasks under the hood.
syscall / sysenter — ring-3 → ring-0 transition
sys_call_table — dispatch table in kernel
vDSO fast-path — clock_gettime etc.
seccomp BPF filter — syscall allowlisting
ptrace / strace hook — tracee stops
On x86-64 the syscall instruction saves registers, switches stacks to the per-CPU kernel stack, and jumps to entry_SYSCALL_64.
task_struct — per-thread kernel object
mm_struct — shared address space desc.
files_struct — open file descriptor table
signal_struct — signal handlers & pending
cred (uid/gid/caps) — credentials & namespaces
PID / namespaces — pid_ns, mnt_ns, net_ns…
wait_queue — blocking & wakeup infra
Linux uses a single task_struct for both processes and threads. Threads in a process share one mm_struct but each has its own kernel stack.
CFS (Completely Fair Scheduler) — vruntime red-black tree
RT scheduler — SCHED_FIFO / SCHED_RR
Scheduling domains — NUMA / CPU topology
Per-CPU runqueue — rq, cfs_rq, rt_rq
Load balancer — work-stealing across CPUs
Context switch — switch_to(), save/restore regs
CPU affinity — sched_setaffinity()
Kernel preemption — CONFIG_PREEMPT
CFS tracks vruntime per-task and always picks the leftmost node of the red-black tree. Threads compete for slices within a cgroup hierarchy.
Page tables — PGD→PUD→PMD→PTE
TLB / CR3 — address-translation cache
Page fault handler — demand paging, CoW
Copy-on-Write (CoW) — fork / mmap shared pages
SLAB/SLUB allocator — kernel object caches
Buddy allocator — page-granularity free lists
Anonymous mmap — brk, thread stacks, malloc
OOM killer — oom_badness scoring
THP / HugeTLB — 2 MB / 1 GB pages
Swap / zswap — paging to disk
The kernel maintains a 4- (or 5-) level page table per mm_struct. A page fault is the kernel's opportunity to demand-page, CoW-break, or extend the stack.
futex(2) — fast userspace mutex
pipe / pipe2 — anonymous byte stream
eventfd / timerfd — edge-triggered counters
mq_open (POSIX MQ) — message queues
shmget / shm_open — shared memory
kill / sigqueue — async notification
membarrier(2) — cross-thread memory order
pthread_mutex_lock calls futex(FUTEX_WAIT) only when contended — uncontended lock/unlock stays entirely in user space.
CPU Core(s) — instruction fetch / decode / execute
Registers + RIP — GPRs, SIMD, segment regs
L1/L2/L3 Cache — coherence via MESI
MMU + CR3 — HW page-table walker
Local APIC — timer interrupt, IPI
IOMMU (VT-d) — DMA address translation
Spectre / Meltdown mitigations — IBRS, KPTI, STIBP
The scheduler timer interrupt fires via the Local APIC (~250 Hz default). The MMU's CR3 register holds the physical address of the top-level page table — reloaded on every context switch.


1. execve
ELF loaded, linker runs

2. Virtual Memory
VMAs mapped, ASLR applied

3. Threads Scheduled
CFS vruntime → CPU

4. Hardware Executes
Fetch–decode–execute