
Understanding the Linux Kernel: The Linux Kernel Startup


Have you ever wondered what really happens between the moment you press the power button and the moment your login screen shows up? That gap—usually a few seconds—hides one of the most intricate initialization sequences in computing. Today I want to walk you through it.

This is the first article in a series where I’ll try to make sense of the Linux kernel internals together with you. We’ll talk about how Linux boots, how it manages processes and memory, how it deals with hardware, and so on. If you’ve ever been curious about what’s happening under the hood, you’re in the right place.

⚠️ Quick disclaimer

I’m not a kernel expert—I’m learning out loud. The goal here isn’t a deep, exhaustive tour but a useful map: what the main pieces are and how they fit together. For the deep dives, the source is the real teacher.

This article focuses on x86_64. The big picture applies broadly, but specifics vary on ARM, RISC-V, etc.

Now let me throw a metaphor at you, because this is going to be a long ride and we’ll need a thread to hold onto.

Imagine We’re Setting Up a Space Colony

Picture a barren planet. No air to breathe, no roads, no buildings, no power, no comms. We send a small advance team in a dropship. Their mission: turn this rock into a working colony, and do it before life support runs out.

The advance team can’t just unload everyone and start hosting town hall meetings. They have to do things in a very specific order. First the basics: confirm the lander didn’t crash, set up emergency procedures in case anything goes wrong. Then map the terrain, find usable resources, set aside areas for storage. Then bring up the construction equipment, build the first habitats, the power grid, the comms tower. Then start the proper governance: a colonial governor, a dispatch office that handles future crew arrivals, and a maintenance crew that takes over the boring “keep things running” duties. Finally, they wake up the rest of the colonists from cryosleep and hand them the keys to the place.

That’s pretty much what the Linux kernel does at boot. The bootloader is the dropship. Your computer is the barren planet. The advance team is the execution of the startup code in the Linux kernel—the one we’ll be following the whole time. And by the end of this article, that advance team will literally have transformed itself into the standby maintenance crew while a brand-new civilian government takes over. Bear with me—it’ll make sense as we go.

Here’s the rough trip we’re about to take:

Linux kernel boot flow: Bootloader → Assembly Entry (startup_64) → Early C (x86_64_start_kernel) → Arch Setup (setup_arch) → Core Subsystems (start_kernel) → Threading (rest_init) → Finalization (kernel_init) → User Space (/sbin/init)

Let’s start where the bootloader leaves off.

The Handoff: What the Bootloader Hands Us

So GRUB (or whatever bootloader you’re using) hands control to the kernel. What do we actually have to work with?

Honestly, not much.

The CPU is already running, but in one of several modes—roughly, how wide the registers are and how memory works. On x86 that’s 16-bit Real Mode, 32-bit Protected Mode, or 64-bit Long Mode. UEFI puts us straight in Long Mode; legacy BIOS usually leaves us in Protected Mode. We’ll deal with the rest of the CPU’s state (page tables, interrupts) once we get to Phase 1.

Memory is awkward. The kernel was loaded low in RAM (typically around 0x1000000), but it’s compiled to run at a high virtual address (something like 0xffffffff81000000). That mismatch is going to bite us soon.

What else? A memory map from the firmware (the E820 map on x86) telling us where RAM is, what’s reserved, and where the ACPI (Advanced Configuration and Power Interface) tables live; a bag of boot parameters (the command line, the initrd location, etc.); and that’s it. No console, no allocator, no interrupts, no log.

Let’s build something.

Phase 1: The Assembly Trampoline

Unpacking the Kernel First

One small twist before anything else: the file the bootloader handed us is a bzImage, and most of it is compressed. Shipping the kernel compressed saves space on disk and in memory during boot, but the CPU obviously can’t execute compressed bytes. So the very first code that runs isn’t the kernel proper—it’s a tiny decompressor living in arch/x86/boot/compressed/. Its job is to unpack the real kernel image into memory and then jump to it.

The decompressor also picks a random base address to load the kernel at—this is KASLR (Kernel Address Space Layout Randomization), and it makes life harder for attackers who’d like to guess where kernel code lives.

Once the decompressor is done, control jumps into the real kernel.

Into the Real Kernel

Where we land depends on the bootloader. On a legacy 32-bit boot we start at startup_32 in the decompressor, which has to climb the CPU into 64-bit Long Mode itself—building a tiny page table where every virtual address points to the same physical address (an identity mapping—the simplest possible setup), flipping the “you are now a 64-bit chip” bit, turning paging on, and jumping to startup_64. A modern UEFI bootloader skips the climb and jumps straight to startup_64. Either way, every path converges there. And we’re as bare-bones as it gets: almost-pure assembly, no C runtime, no library calls—just instructions and registers.
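To make the identity-mapping idea concrete, here's a small sketch (helper names are mine, not the kernel's) of how x86_64 4-level paging carves a virtual address into table indices. An identity mapping simply fills the tables so that the translated result equals the input address:

```c
#include <stdint.h>

/* Illustrative sketch: x86_64 4-level paging splits a 48-bit virtual
 * address into four 9-bit table indices plus a 12-bit page offset.
 * Helper names are hypothetical, not kernel identifiers. */
static unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1ff; }
static unsigned pdpt_index(uint64_t va) { return (va >> 30) & 0x1ff; }
static unsigned pd_index(uint64_t va)   { return (va >> 21) & 0x1ff; }
static unsigned pt_index(uint64_t va)   { return (va >> 12) & 0x1ff; }
static uint64_t page_offset(uint64_t va){ return va & 0xfff; }
```

The early identity map only needs enough entries to cover the code currently executing, which is why a "tiny" page table suffices.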

So what’s the very first thing the kernel does? Two pieces of plumbing it can’t function without: it points the stack pointer at a small pre-allocated buffer (you can’t call any function without a working stack), and it installs a minimal GDT and IDT—two CPU-required lookup tables for memory segments and exception handlers, respectively. With those in place, the more interesting work can begin.

A Detour for Encrypted Hardware

First interesting step: memory encryption. Some AMD CPUs can transparently encrypt RAM so that someone with physical access to the memory chips can’t just read your data off them. The feature is called SME (Secure Memory Encryption), with a sibling for virtual machines called SEV (Intel’s analogue here is TDX); the simpler “encrypt all RAM with one key” feature on Intel is TME (Total Memory Encryption). If we’re on hardware like that, the kernel has to turn encryption on right now—you can’t go back and encrypt data you’ve already written in the clear.

With encryption sorted, we can ask the next question.

Did the Lander Even Survive?

Before going any further, we should check that the gear works. The advance team isn’t going to start unpacking power generators if the air recyclers don’t even turn on. The kernel does its equivalent with verify_cpu:

verify_cpu:
    # Check for Long Mode support
    # Verify SSE2 (required by x86_64 ABI)
    # Validate other CPU features

Long Mode? Check. SSE2? Check. If something essential is missing, we just stop right here. This is, by the way, exactly why you can’t run a 64-bit kernel on a 32-bit CPU—it’s not that something subtle goes wrong later, it’s that we fail the very first checklist item.
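The checklist boils down to testing feature bits that the CPUID instruction reports. A hedged userspace analogue (not the kernel's assembly, and the function name is mine): Long Mode is EDX bit 29 of CPUID leaf 0x80000001, SSE2 is EDX bit 26 of leaf 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of verify_cpu's logic: check the feature bits
 * CPUID returned and refuse to continue if an essential one is absent. */
#define LM_BIT   (1u << 29)  /* Long Mode, EDX of leaf 0x80000001 */
#define SSE2_BIT (1u << 26)  /* SSE2, EDX of leaf 1 */

static bool cpu_is_usable(uint32_t edx_ext, uint32_t edx_basic)
{
    if (!(edx_ext & LM_BIT))     return false; /* no 64-bit mode: stop here */
    if (!(edx_basic & SSE2_BIT)) return false; /* x86_64 ABI requires SSE2 */
    return true;
}
```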

With the gear verified, it’s time to deal with that awkwardness from the inventory list.

The Address Mismatch Problem

Remember the awkwardness from earlier? The kernel was loaded at one address but compiled for another, and we have to fix it before going further. Think of the advance team carrying a detailed map of the colony, but landing two kilometers off from where the map says “you are here.” Every reference like “the power station is 500m north” is now wrong.

In kernel terms, this fix is called page table fixup. A page table is the lookup table the CPU uses to turn the virtual addresses in your code into the physical addresses where bytes actually live. The kernel ships with a small set of these tables pre-filled by the linker, assuming it’ll be loaded at a specific address—an assumption KASLR just broke.

So startup_64 calls a C helper, __startup_64() (via a position-independent thunk called __pi___startup_64, in case you go grepping the source), which computes the difference between “where the code thinks it is” and “where we actually landed” and patches the page table entries by that offset. Once it returns, virtual addresses translate to the right physical bytes, and the map matches reality.
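The core of the fixup can be sketched in a few lines of C (simplified and with hypothetical names; real page table entries also carry flag bits in their low bits, which this ignores):

```c
#include <stdint.h>

/* Sketch of the __startup_64() fixup idea: compute how far we landed
 * from the link-time address, then shift every physical address stored
 * in the early page tables by that same delta. */
static void fixup_page_tables(uint64_t *entries, int n,
                              uint64_t linked_at, uint64_t loaded_at)
{
    uint64_t delta = loaded_at - linked_at; /* "where we are" minus "where we thought" */
    for (int i = 0; i < n; i++)
        entries[i] += delta; /* patch each entry by the same offset */
}
```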

With addresses sorted, we can finally leave the bare-metal assembly world behind.

Jumping into C

With the page tables fixed, the kernel switches the CPU over to them and jumps to its first real C function, x86_64_start_kernel. From this point on, the kernel runs at the high virtual addresses the linker originally targeted. We’re leaving bare assembly behind—but we get C only barely: no allocator, no console, no library calls. Just C with raw pointers and discipline.

Phase 2: Early C Initialization

The first C function is x86_64_start_kernel in arch/x86/kernel/head64.c. We’re still in advance-team mode, but we have slightly better tools now.

Before the more interesting work, x86_64_start_kernel does some bookkeeping we won’t dwell on: cr4_init_shadow() caches a copy of the CR4 control register so future code can avoid re-reading it from the CPU, and reset_early_page_tables() throws away the identity-mapped page tables that got us this far—we don’t need that training-wheels mapping anymore. The first chore worth talking about is embarrassingly mundane.

Wiping Down the Workbenches

In a normal C program, the runtime takes care of zeroing out uninitialized globals before main() runs. But we are the runtime. Nobody’s going to do it for us:

clear_bss();

This zeroes out the .bss section—the region holding uninitialized globals and statics. It’s the equivalent of unpacking the gear and making sure every storage bin starts empty before anyone puts anything in.
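Conceptually the whole chore is one memset between two linker-provided symbols (in the kernel those are __bss_start and __bss_stop; the function below is a sketch, not the actual code):

```c
#include <string.h>
#include <stddef.h>

/* Sketch of what clear_bss() amounts to: the linker marks where .bss
 * begins and ends, and the kernel zeroes everything in between so
 * every uninitialized global really starts at zero. */
static void clear_bss_sketch(char *bss_start, char *bss_stop)
{
    memset(bss_start, 0, (size_t)(bss_stop - bss_start));
}
```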

Next, we want to get some safety machinery on its feet—even if only as a placeholder.

KASAN, the Safety Inspector with No Office Yet

Right before KASAN, a quick sme_early_init() finishes wiring up the memory-encryption setup we started back in Phase 1, so any page table entries we touch from now on come out encrypted on hardware that needs it.

Then, if the kernel was built with KASAN (Kernel Address Sanitizer)—the safety inspector that catches use-after-free bugs and buffer overflows—we need to bring it up here. The catch: KASAN needs a huge region of shadow memory to track every byte the kernel allocates, and we don’t have a real memory allocator yet.

The trick is to point that entire shadow region at a single zero page. KASAN-instrumented code can read from any shadow address and just see zeros, which keeps it from crashing even though nothing is actually being tracked. Once the real memory allocator comes online later, KASAN gets proper shadow memory and starts doing its job for real.
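A toy version of the placeholder trick (names and layout are illustrative; real KASAN maps page table entries, not array indices):

```c
#include <stdint.h>

/* Sketch of the early-KASAN placeholder: every shadow lookup is routed
 * into one shared page of zeros, so instrumented reads always see
 * "nothing is poisoned" even though nothing is actually tracked yet. */
#define PAGE_SIZE 4096
static unsigned char kasan_zero_page[PAGE_SIZE]; /* statics start zeroed */

static unsigned char *shadow_of(uintptr_t addr)
{
    /* Real KASAN computes (addr >> 3) + shadow_offset into dedicated
     * shadow memory; early boot folds every result onto the zero page. */
    return &kasan_zero_page[(addr >> 3) % PAGE_SIZE];
}
```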

Speaking of safety nets, we also need to handle the case where something blows up.

Emergency Procedures Before You Need Them

If something goes wrong now—a bad memory access, a divide by zero, anything—the CPU needs to know who to call. The way it figures that out on x86 is by looking up a table called the Interrupt Descriptor Table (IDT): a fixed-size array where each entry says “if exception number N happens, jump to this handler function.” If we haven’t set one up, the CPU has no handler to run for the original problem, which itself becomes a second exception, and the handler-lookup for that fails too. After three failures in a row, the CPU gives up and resets the machine—a triple fault, which from the outside just looks like a silent reboot with no error message. Not ideal.

So we install a minimal IDT:

idt_setup_early_handler();

It’s not fancy. It handles the basics—Page Faults, General Protection Faults—and at least ensures that when we eventually get a console, we can print something useful before dying. It’s the colony’s “dial 911” sticker on the wall, before the actual emergency response building exists.
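Stripped of hardware details, an IDT is a lookup table from exception vector to handler. The real table holds gate descriptors rather than plain function pointers, but the shape of the idea is this (all names here are illustrative):

```c
/* Conceptual sketch of an early IDT: one handler slot per exception
 * vector, all pointing at a minimal "print something and stop" routine
 * so no fault can escalate into a silent triple-fault reboot. */
typedef void (*handler_t)(int vector);

static void early_handler(int vector)
{
    (void)vector; /* would print the vector and halt */
}

#define NUM_VECTORS 256
static handler_t idt[NUM_VECTORS];

static void idt_setup_early_sketch(void)
{
    for (int i = 0; i < NUM_VECTORS; i++)
        idt[i] = early_handler; /* every exception gets *some* handler */
}
```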

Saving the Bootloader’s Notes

The bootloader gave us important stuff—the command line, the memory map, the initrd location—but it’s all sitting in scratch memory we’re about to overwrite. So copy_bootdata() copies it into boot_params, a kernel-owned structure (along the way, an internal helper called sanitize_boot_params() zeroes out fields the bootloader had no business filling in). From here on, we can ask “where’s the initrd?” without worrying about the answer being silently overwritten.

Patching the CPU Itself

The very last thing x86_64_start_kernel does is genuinely surprising: it patches the CPU’s own microcode. A modern x86 chip doesn’t execute its instruction set directly—it translates each instruction into smaller internal operations using firmware called microcode that lives on the chip. Intel and AMD ship microcode updates the same way they ship security patches, and the kernel applies them at boot via load_ucode_bsp().

Why now? Because some CPU bugs (Spectre, MDS, and friends) are fixed by the microcode update itself. The advance team finishes updating the lander’s firmware before going anywhere. Once the patch is applied, early C is done. Time to actually look around.

Phase 3: Hardware Discovery and Memory Setup

Now we call setup_arch(), which lives in arch/x86/kernel/setup.c. This is the orbital survey turned into ground truth—the kernel figures out exactly what kind of planet we’re standing on.

The first thing setup_arch() wants to know is what kind of CPU it has to work with.

Cataloging the Team’s Skills

First question: what can this CPU actually do? We don’t know yet whether it has fancy vector instructions, hardware crypto, fast context-switching tricks, or which speculative-execution bugs it’s vulnerable to. So early_cpu_init() asks the chip directly using CPUID instructions and dumps the answers into a struct called boot_cpu_data.

That struct is the kernel’s lookup for “do we have feature X?” later on—it’s how the kernel decides things like “use the AVX-512 memcpy or fall back to the slow one.” It’ll matter a lot when we get to self-patching.

With the team’s skills catalogued, the next big question is the most basic: where can we put stuff?

Reading the Survey

Remember that E820 memory map the firmware gave us? We finally read it for real:

e820__memory_setup() pulls in the firmware’s memory map and cleans it up. Firmware is famously unreliable—regions overlap, ranges are off by one, special areas aren’t marked—so the kernel sanitizes it into a trustworthy version. Then e820__memblock_setup() feeds the clean map into memblock, a primitive early allocator that just tracks “this range is free, this range is reserved.” It’s the only allocator we’ll have for a while—no kmalloc, no vmalloc, just raw chunks of physical memory.
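A toy version of the memblock idea, just to fix the mental model (the real thing lives in mm/memblock.c and tracks both memory and reserved ranges with far more care):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of memblock's bookkeeping: a list of reserved ranges, and
 * "is this address free?" means "is it outside every reserved range?".
 * No kmalloc, no vmalloc, just raw range accounting. */
struct range { uint64_t start, end; };

static struct range reserved[16];
static int nr_reserved;

static void memblock_reserve_sketch(uint64_t start, uint64_t end)
{
    reserved[nr_reserved++] = (struct range){ start, end };
}

static bool addr_is_free(uint64_t addr)
{
    for (int i = 0; i < nr_reserved; i++)
        if (addr >= reserved[i].start && addr < reserved[i].end)
            return false;
    return true;
}
```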

A couple of small helpers slot in around here too: parse_setup_data() reads any extra notes the bootloader left behind (a seed of randomness for the random-number generator, firmware fixups, that kind of thing), and early_ioremap_init() sets up a way to peek at firmware tables (like ACPI) by mapping them into the kernel’s address space—useful because we still need to read those tables before we have a real memory allocator.

Now we’d really like to be able to talk to the outside world.

The First Radio: Early Printk

Up to this point the kernel has been completely silent—if anything had gone wrong, your screen would just reboot with no explanation. Not great for debugging.

This is also where the command line starts to bite: parse_early_param() walks through the boot arguments and dispatches the ones marked early—so if you passed something like earlyprintk=serial,ttyS0, the registered handler now sets up a minimal serial driver. From here on, the kernel can dribble out “we’re alive, here’s what we’re doing” messages over a low-power short-range radio. Not a real comms tower yet—but enough to see what’s happening if something blows up early.

But before mapping memory or anything else, the kernel takes a moment to figure out where it’s running.

Where Are We Running?

The kernel wants to know what kind of machine it’s on. efi_init() hooks up UEFI’s runtime services (if we booted via UEFI), dmi_setup() reads the firmware tables describing the motherboard, vendor, and BIOS version—useful later for applying per-vendor quirks (“oh, this is a 2019 Lenovo laptop with the buggy touchpad firmware, apply workaround”)—and init_hypervisor_platform() checks whether we’re on real hardware or inside a VM (and if so, which hypervisor: KVM, Xen, Hyper-V, VMware). A few smaller pieces happen around here too: an early pass at ACPI table parsing, reserving low memory for the trampoline that’ll wake up the other CPU cores in Phase 6, an early calibration of the CPU’s cycle counter, and some cleanup of cacheability settings.

Mapping All of RAM

Now that the kernel knows the lay of the land, it can finally see the whole RAM. First, kernel_randomize_memory() picks randomized base addresses for the direct map and a few other big regions—same reason as the KASLR we saw earlier, don’t let attackers guess where things live. Then reserve_brk() sets aside a small chunk for the kernel’s own scratch data so it doesn’t get clobbered.

Finally, init_mem_mapping() builds the kernel direct map: a giant region of virtual addresses where every physical page gets a fixed virtual address. After this, the kernel can reach any byte of RAM just by computing an offset.

Roping Off the Important Areas

With the direct map in place, the kernel still needs to mark certain memory regions as off-limits. The initrd lives somewhere in RAM; if a crash dump is configured, that area can’t be reused; ACPI has reserved a few ranges. So reserve_initrd() and arch_reserve_crashkernel() reserve those in memblock. Think of it as the advance team putting “DO NOT BUILD HERE” tape around the supply containers and the emergency shelter.

And now that we have a real map of memory, kasan_init() can finally set up KASAN’s real shadow memory, replacing the zero-page placeholder from Phase 2. The safety inspector finally has a real office.

The advance team has surveyed the planet. They know what they have. Time to start construction.

Phase 4: Core Subsystems Come Online

Quick recap of where we are. The kernel has decompressed itself, fixed its page tables, jumped into C, mapped all of physical RAM, and figured out what kind of machine it’s on. We have a primitive memblock allocator handing out raw chunks of memory, and that’s about it—no real allocator, no scheduler, no interrupts, no console (unless you asked for earlyprintk). Just a single thread running setup code.

Now we step into start_kernel() in init/main.c—the longest function in the boot path, and where we go from “tents and emergency rations” to “actual working town.” It’s basically a couple of hundred lines of carefully ordered calls, each one unlocking something the kernel couldn’t do before.

Rather than walk through all of them, let’s group them into the four big things start_kernel() is really trying to do:

  1. Make this CPU usable.
  2. Make memory and tracing usable.
  3. Make time and concurrency usable.
  4. Make processes, files, networks, and security usable.

Let’s go through them.

Make This CPU Usable

First thing start_kernel() does is arm a tripwire on its own stack with set_task_stack_end_magic()—it writes a magic value at the bottom of the stack so later code can spot a stack smash by checking if that value got clobbered. We’re literally arming the alarm on the floor we’re walking on.

Then a flurry of small calls: figuring out which CPU we’re running on, arming a couple of debugging frameworks, extracting the kernel’s build ID (the string that’ll show up in oops messages later), and setting up just enough of the cgroup subsystem so that the boot thread has a group to live in.

After that comes a quiet but important moment: interrupts get turned off, and a flag flips that says “we’re in early boot.” From here on, almost the entire init sequence runs with interrupts disabled. We won’t turn them back on until much later, once timers, the IRQ controller, RCU, and the timekeeper are all alive.

Timeline of start_kernel(): a long shaded bar covering most subsystem init (mm_core_init, sched_init, rcu_init, timekeeping_init, random_init, etc.) labeled “interrupts disabled,” ending at local_irq_enable() — after which the rest of init runs with interrupts on.

boot_cpu_init() officially marks this CPU as online—until this call, any code iterating over CPUs would see zero of them. And then the famous banner gets pushed into the printk ring buffer:

Linux version 6.x ... (gcc ...) ...

If you didn’t enable earlyprintk, no console exists yet, so the message just sits there patiently waiting for one.

Architecture Setup, Static Keys, and the Command Line

First, setup_arch() (the same function we walked through in Phase 3) actually runs here, as part of start_kernel(). By the time it returns, we’ve got the physical memory layout, the direct-map page tables, and the hypervisor/CPU detection done. Right after, a tiny mm_core_init_early() lays down some preliminary memory-management state that later code expects.

Then come two interesting calls: jump_label_init() and static_call_init(). These are how Linux makes runtime feature flags essentially free. The trick is that the kernel patches its own code at boot—if (some_flag) checks get rewritten into either a NOP or a JMP, so a disabled feature costs literally zero CPU cycles in hot paths. static_call_init() does the same trick for indirect function calls, patching many of them into direct calls so tracepoints, scheduler dispatch, and KVM stay cheap.
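Here's the trick reduced to a toy (illustrative only; the kernel patches real opcode bytes in its text section with text_poke, not an enum in an array):

```c
#include <stdbool.h>

/* Sketch of the jump-label idea: instead of testing a flag at runtime,
 * boot-time code rewrites the instruction at each branch site to either
 * a NOP (feature off) or a JMP to the feature code (feature on), so a
 * disabled feature costs nothing on the hot path. */
enum insn { INSN_NOP, INSN_JMP };

static enum insn branch_sites[8]; /* one pseudo-instruction per call site */

static void static_key_set_sketch(int site, bool enabled)
{
    branch_sites[site] = enabled ? INSN_JMP : INSN_NOP;
}
```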

Right after, early_security_init() brings up the LSM (Linux Security Module) framework just enough to install the most fundamental security hooks. LSM is the pluggable layer that lets modules like SELinux or AppArmor veto sensitive operations. The full bring-up comes later, but a few hooks (like basic capability checks) need to be live even now.

Then the command line gets parsed. setup_command_line() keeps two copies—one preserved verbatim for /proc/cmdline, one the parser is allowed to chew up by overwriting separators with NULs as it tokenizes. After that, setup_per_cpu_areas() allocates the per-CPU memory regions so every CPU has its own private slot for per-CPU variables. Imagine the colony issuing each crew member a personal locker. Then parse_early_param() and parse_args() walk the command line and apply each option (debug, quiet, loglevel=, earlyprintk=, etc.) to whatever subsystem registered for it.

A handful of other housekeeping calls slot in here too: figuring out the real CPU count, finishing the boot CPU’s per-CPU state, resizing the printk ring buffer, laying the first foundations of the filesystem caches, and sorting the exception table that lets the kernel recover gracefully when something like copy_from_user() hits a bad pointer.

All of that was scaffolding. The next call is where the kernel finally gets a real memory system.

Memory Comes Alive: mm_core_init()

This is the big one—the moment the colony stops surviving on emergency rations and brings the actual food production online. Until now, memblock has been our only allocator: a primitive thing that just tracks “this range is free, this range is reserved.” After mm_core_init(), kmalloc(), vmalloc(), and alloc_pages() all work. The kernel becomes a real memory-managed system.

Here’s the layered shape of the memory subsystem we’re about to bring up:

Layered memory subsystem: physical RAM feeds memblock during boot, then hands over free pages to the buddy allocator, which provides page slabs to SLUB and page mappings to vmalloc, all of which serve kernel callers like task_struct and inodes.

Inside, mm_core_init() brings up three layered allocators in turn:

  • The buddy allocator is the page-grained allocator at the bottom. It groups physical pages into power-of-two blocks (1, 2, 4, 8, …) so it can hand out contiguous chunks quickly and merge them back on free. It’s the colony’s land registry, parceling the planet into plots of various sizes. The handoff happens when memblock_free_all() walks every page memblock never reserved and gives it to the buddy. From this moment on, the buddy is the source of truth for free memory.
  • The slab allocator (SLUB is the default) sits on top of the buddy and specializes in objects rather than raw pages—things like task_struct and inode that the kernel allocates and frees constantly. It carves pages into fixed-size slots so a kmalloc(64) doesn’t need a fresh page each time. If you’ve read the Go memory allocator article, this is essentially the same trick as Go’s mspan. Once it’s up, kmalloc() and friends work.
  • The vmalloc allocator handles the case where you need a big region contiguous in virtual address space but can’t find that many contiguous physical pages. It stitches scattered physical pages together via the page tables so they look continuous to the caller.
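The buddy allocator's power-of-two sizing rule is easy to sketch. Requests are expressed as an "order", where order n means 2^n contiguous pages (the helper name below is mine):

```c
#include <stddef.h>

/* Sketch of the buddy allocator's sizing rule: round a byte count up
 * to whole pages, then up to the next power of two, expressed as an
 * order (order n = 2^n pages). */
#define PAGE_SIZE 4096

static int order_for(size_t bytes)
{
    size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    int order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order; /* the allocation will span 2^order contiguous pages */
}
```

The rounding wastes some memory (a 5-page request gets an 8-page block) but makes splitting and re-merging freed blocks with their "buddies" trivial.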

A few smaller pieces come up around this too: a memory-leak detector (kmemleak_init), some indexed-data-structure init, the machinery that lets the kernel safely rewrite its own code (poking_init—we’ll need it very soon for self-patching), and the tracing infrastructure (ftrace_init, early_trace_init).

Memory is alive. Now it’s time to bring up the thing that decides what runs on top of it.

The Scheduler: sched_init()

Going in, the kernel has exactly one thread—init_task, the original thread we’ve been riding since the first assembly instruction—and no real way to pick what to run next. By the time sched_init() returns, the scheduler is alive: every CPU has a runqueue (the list of tasks ready to run on that CPU), and the rules for picking who runs next are in place. There’s still only one task in the system, so nothing interesting happens yet—but the machinery is ready.

What does it actually do?

  • Sanity-checks that the kernel’s scheduler classes (the policies that decide who runs next, ordered as stop > deadline > rt > fair > idle) are linked in the right priority order. Wrong order = panic.
  • Allocates one runqueue per possible CPU.
  • Sets up the root task group and bandwidth-control plumbing that cgroups will later use to say things like “this group can only use 50% of a CPU.”
  • Quietly turns init_task into the boot CPU’s idle task. Every CPU eventually needs an idle task—a placeholder for when there’s nothing else to run. The thread we’ve been riding all along gets relabeled as PID 0, the idle task, and that’s what it’ll fully become at the end of boot. In the colony metaphor: the advance team has just been issued their long-term assignment. They’ll keep building for now, but their final job is to be the standby maintenance crew. They just don’t know it yet.

After all that, the scheduler flips the “open for business” flag. (Caveat: at this point it can switch tasks on the boot CPU, but it can’t yet load-balance across multiple CPUs—that comes much later in Phase 6 with sched_init_smp().) The subtle consequence: everything that comes after this point in start_kernel() is the idle task pretending to be init code.
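The class-priority rule from the first bullet can be sketched as a simple walk down the list (heavily simplified; names are illustrative, and the real scheduler iterates linked class structures, not an enum):

```c
/* Sketch of pick-next by scheduler class: classes are consulted in
 * fixed priority order — stop, deadline, rt, fair, idle — and the
 * first class with a runnable task wins. If nobody has work, the
 * idle task runs. */
enum sched_class { STOP, DEADLINE, RT, FAIR, IDLE, NR_CLASSES };

static int nr_runnable[NR_CLASSES]; /* tasks queued per class */

static enum sched_class pick_next_class(void)
{
    for (int c = STOP; c < IDLE; c++)
        if (nr_runnable[c] > 0)
            return (enum sched_class)c;
    return IDLE; /* nothing else to run */
}
```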

On top of the scheduler, a few more concurrency primitives need to come online before we can do anything ambitious.

RCU, Workqueues, and Tracing

workqueue_init_early();
rcu_init();
trace_init();

workqueue_init_early() brings up the workqueue subsystem—the kernel’s “run this function later, in a normal thread context” mechanism. It’s how code defers work that can’t run right now (like cleanup after an interrupt). Don’t confuse it with the runqueue from earlier: runqueues hold tasks the scheduler picks from; workqueues hold pending function calls waiting to be executed by background kernel threads (kthreads). The interesting bit: this call sets up the queues but doesn’t actually hire any workers yet—those come online later in boot. Until then, scheduled work just piles up like maintenance tickets at a help desk that hasn’t hired anyone, and gets chewed through the moment workers arrive.

rcu_init() (with a few related siblings) brings up RCU (Read-Copy-Update), one of the kernel’s main synchronization tools. The idea: on data that’s read constantly but updated rarely (the list of network devices, the routing table, etc.), traditional locks would make every reader pay a cost. RCU lets readers proceed completely lock-free—they just read the current version—while writers build a new copy, swap a pointer to publish it, and wait until all in-progress readers are done before freeing the old version. Readers always see either the old or the new, never a half-updated state, and they pay nothing on the fast path.
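The publish half of that pattern looks roughly like this in C11 (a toy illustration, not the kernel's rcu_assign_pointer/rcu_dereference API; the "wait for readers, then free the old version" step is omitted):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the RCU publish pattern: the writer builds a complete new
 * version off to the side, then publishes it with one atomic pointer
 * swap. Readers dereference the pointer with no lock and always see
 * either the old or the new version whole, never a half-update. */
struct config { int timeout_ms; };

static _Atomic(struct config *) current_config;

static void publish_config(int timeout_ms)
{
    struct config *fresh = malloc(sizeof(*fresh));
    fresh->timeout_ms = timeout_ms;       /* fully initialize first... */
    atomic_store(&current_config, fresh); /* ...then publish atomically */
    /* Real RCU would now wait for pre-existing readers to finish
     * before freeing the old version; omitted here. */
}

static int read_timeout(void)
{
    struct config *c = atomic_load(&current_config); /* lock-free read */
    return c ? c->timeout_ms : 0;
}
```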

Finally, trace_init() finishes wiring up the kernel’s tracing infrastructure—the system that lets you observe what’s happening inside the kernel at runtime (“log every call to this function,” “tell me when a process is scheduled in or out”). After it returns, tracing is ready to use.

Now we get to the part the whole boot sequence has been carefully orchestrated around: interrupts and time.

Interrupts and Time

This is one of the most carefully ordered passages in the whole function—the kernel needs interrupts and a working sense of time to do basically anything else, and each step here unblocks the next.

First come the interrupt calls (early_irq_init, init_IRQ). An interrupt is the hardware’s way of saying “drop what you’re doing, something happened”: a key was pressed, a packet arrived, a disk finished a read. The kernel allocates the per-interrupt bookkeeping and lets the architecture talk to its interrupt controller (the local APIC on x86, the GIC on ARM) so it knows how to route things. By the time these return, the hardware could deliver interrupts—but the kernel still has them globally masked, so nothing actually fires yet.

Then the time stack comes up. tick_init() arms the periodic heartbeat the kernel uses to wake CPUs up (with optional tickless mode where idle CPUs skip beats to save power). timers_init() and hrtimers_init() set up the two flavors of timers: ordinary millisecond-ish timers (network timeouts, etc.) and high-resolution nanosecond timers (precise sleeps). After these, any code that wants to schedule “do this in 500 ms” can do so. softirq_init() sets up tasklets, an older deferred-work mechanism that’s mostly been superseded by workqueues but still has plenty of drivers depending on it.

Then the kernel learns what time it is. timekeeping_init() reads the real-time clock (the battery-backed clock on the motherboard) and anchors both the wall-clock and monotonic clocks to it—after this call, “what time is it?” returns a sensible answer for the first time. time_init() lets the architecture register a higher-quality clock source if it has one (like the TSC on x86) so the kernel can switch to it.

Finally, random_init() brings up the random number generator, mixing together all the early entropy the kernel has been quietly collecting (hardware sources like RDRAND, a bootloader-provided seed, the timing of early boot events, etc.). The stack-canary setup right after this section will lean on it.

With the interrupt machinery wired, time working, and the RNG online, there’s one piece of deferred protection we can finally finish before flipping the switch on interrupts.

Setting Up the Stack Canary

Right after the RNG, the kernel arms KFENCE (kfence_init())—a low-overhead memory-safety detector that complements KASAN. Then boot_init_stack_canary() finishes the stack-corruption protection we left half-done at the start of start_kernel(). The stack canary is a value the compiler pushes onto every function’s stack frame on entry and re-checks on exit; if a buffer overflow has clobbered the frame in between, the canary won’t match and the kernel panics. Up until now it held a predictable value (no RNG, no real secret); now we ask the working RNG for a proper random value and install it as the real canary.
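The canary mechanism can be simulated with an explicit frame layout (a sketch of the concept only; the compiler actually emits this in every protected function's prologue and epilogue, and the names here are mine):

```c
#include <stdint.h>

/* Sketch of the stack canary: a secret value sits between the local
 * buffer and the saved return address. An overflow that runs off the
 * buffer clobbers the canary before it can reach the return address,
 * and the epilogue check catches the mismatch. */
static uint64_t stack_chk_guard_sketch; /* filled from the RNG at boot */

struct frame {
    char buf[16];
    uint64_t canary; /* between the locals and the return address */
};

static void frame_enter(struct frame *f)
{
    f->canary = stack_chk_guard_sketch; /* prologue: plant the canary */
}

static int frame_leave_ok(const struct frame *f)
{
    return f->canary == stack_chk_guard_sketch; /* epilogue: still intact? */
}
```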

With the kernel hardened, every prerequisite for actually taking an interrupt is finally in place.

Turning Interrupts On

A few last pieces slot in just before the flip of the switch: the perf-counters/tracepoints subsystem (perf_event_init()), legacy profiling, and the cross-CPU IPI mechanism that lets one CPU ask another to run a function (call_function_init()).

And then, the moment we’ve been queueing up for since the start of Phase 4: local_irq_enable(). Every prior subsystem was set up with interrupts disabled; now that the IRQ controller is wired, timers can fire, RCU is alive, and the timekeeper has a clock, we can finally accept them. The colony just turned on its alarm system and its loudspeakers. From here on, things can interrupt us.

The Real Console

Remember the early printk from Phase 3? That was a minimal serial driver. Now console_init() brings up the real thing, walking each registered console driver in turn. Every message that’s been piling up in the printk ring buffer since the very first pr_notice gets drained to the console here—which is why your boot log appears in a sudden burst rather than a steady trickle. The colony’s main bulletin board just went up, and somebody’s frantically posting twenty minutes of announcements nobody got to see.

Now that we can actually see what the kernel is doing, it’s a good moment to look at how the kernel adapts itself to the specific CPU it ended up running on.

CPU Finalization and Self-Patching

Between the console coming up and the CPU finalization, a handful of smaller things happen: each CPU gets a small private memory cache to speed up allocations, the ACPI subsystem comes online (the part of the firmware that tells the kernel about power management, devices, and so on), and timekeeping is finalized. Somewhere in there, calibrate_delay() produces the famous BogoMIPS number you’ll see in the boot log—a rough measurement of how fast the CPU is.

After all that, the interesting one: arch_cpu_finalize_init(). On x86, this does the full CPUID-driven feature discovery, chooses the right Spectre/Meltdown/MDS/etc. mitigations for your specific CPU, sets up the FPU, and calls alternative_instructions().

That last call is wild. The kernel patches every ALTERNATIVE macro site in its own code, swapping in instructions that match this CPU’s actual feature set. After this call, the kernel has been runtime-rewritten to be optimal for this exact CPU. Got AVX-512? Some memcpy paths get rewritten to use it. Don’t? They get rewritten to a fallback. The same vmlinux file boots optimally on dozens of different CPU generations because of this.

In the colony metaphor: the prefab buildings in our supply containers came designed generic. Now we’re walking around swapping out standard heaters for cold-climate ones, standard window seals for high-pressure ones—every building gets specialized to the actual planet.

With the CPU tuned to its full potential, the last big chunk of start_kernel() is making sure every other major subsystem has somewhere to allocate from.

Slab Caches for Processes, Files, Namespaces, Networking

The last big block of start_kernel() is dozens of small calls that mostly do the same thing: each one creates the slab caches (the pre-cut memory pools we set up in mm_core_init()) for a specific subsystem, so that subsystem can start allocating its own structures. Each department of the colony stocking its own warehouse.

A few of these are load-bearing enough to call out:

  • fork_init() (with siblings like cred_init, signals_init, pagecache_init) creates the cache for task_struct—the giant struct that represents a running process—plus the supporting caches every new process drags along (credentials, signal handlers, kernel stacks, page cache). Until this point, the kernel had nowhere to put a new process; now it can finally create them. The HR department is open.
  • vfs_caches_init() brings up the VFS, the kernel’s virtual filesystem layer. After this, the kernel can mount filesystems—even before any disk driver exists, it has a tiny in-memory rootfs to hang things off. proc_root_init() then brings /proc to life, the synthetic filesystem that lets you peek at every process from userspace.
  • cgroup_init() finalizes cgroups (process grouping with CPU/memory limits), security_init() does the same for LSM (full bring-up of SELinux/AppArmor/etc.), and net_ns_init() along with siblings (pidfs_init, nsfs_init, cpuset_init, mem_cgroup_init) wires up the namespace plumbing that lets containers exist. The networking stack itself comes later via initcalls.

By the end of this block, all the major kernel subsystems are real. Memory works. Time works. Scheduling works. Filesystems can be mounted. Processes can be forked. The colony has streets, buildings, power, comms, and a registry of citizens. What it doesn’t have yet is actual citizens—any tasks beyond the boot thread—which is the cue for the handoff.

Phase 5: From Single-Threaded to Multitasking

So far we’ve been a single advance team running through a checklist. But Linux is a multitasking OS, and a colony with one person isn’t really a colony. The transformation happens in rest_init() (also in init/main.c), called at the very end of start_kernel(). It never returns—because by the time it’s done, the boot thread itself has become something else.

Spawning the First Real Processes

rest_init() creates two kernel threads:

  • PID 1 (kernel_init)—a kernel thread now, but destined to exec() into userspace and become the init process, the ancestor of every userspace process on the system. The colonial governor we’ve been building toward (eventually /sbin/init, systemd, OpenRC, etc.).
  • PID 2 (kthreadd)—the kernel thread daemon. Whenever any kernel code calls kthread_create(), the request gets routed to kthreadd, which spawns the new thread. This is the dispatch office of the colony, handling every “we need a new kernel worker” request from now until shutdown.

There’s a subtle ordering trick here. kernel_init is spawned first so it gets PID 1, then kthreadd. But kernel_init can’t actually start working until kthreadd is alive, since almost everything it does will eventually need to create kernel threads. The kernel handles this with a synchronization primitive called a completion: kernel_init immediately blocks on wait_for_completion(&kthreadd_done), and once kthreadd is ready, rest_init signals it through. PIDs come out right, and nobody starts working until the dispatcher is open.

(Side note: PID 1 is initially pinned to the boot CPU. It only becomes free to roam once sched_init_smp() runs in Phase 6.)

The Idle Loop, and the Quiet Disappearance of the Boot Thread

After spawning those two threads, rest_init() calls schedule_preempt_disabled()—it yields the CPU. The scheduler picks one of the new threads (most likely kernel_init) and context-switches to it. When control eventually finds its way back to this CPU as the idle task, cpu_startup_entry(CPUHP_ONLINE) runs do_idle(), the idle loop, forever.

That switch is the magic moment. The original thread of execution—the one that has been running since startup_64 in raw assembly, through setup_arch, start_kernel, every subsystem init, all the way to here—has just permanently become the boot CPU’s idle task (PID 0). It only runs again when literally nothing else wants the CPU.

Our advance team just clocked off and became the standby maintenance crew. The civilians take over. From here on, kernel_init is doing the work.

rest_init() splits the boot thread into three: PID 1 kernel_init (finishes init and exec()s /sbin/init), PID 2 kthreadd (spawns all future kernel threads), and PID 0 idle task (the original boot thread, now the boot CPU’s idle).

But the work isn’t quite done yet—kernel_init has a long checklist before it can hand the keys to userspace.

Phase 6: Finishing Up and Launching User Space

kernel_init (PID 1, still in kernel mode) has a final to-do list before it exec()s into userspace and becomes the real init process. Most of the work happens in kernel_init_freeable().

First job on that list: bring the rest of the hardware to life.

Hiring the Workers and Waking Up the Other CPUs

First, workqueue_init() finishes bringing up the workqueue subsystem—the worker kthreads (kworker/...) finally come online and start chewing through the backlog of deferred work that’s been piling up since the workqueue subsystem came up earlier. The help desk that’s been collecting tickets just opened for business.

Then it’s time to wake up the rest of the CPU cores. Up to now only the boot CPU has been running—the others have been parked since firmware powered them up. smp_init() sends inter-processor interrupts (IPIs) to wake them, and each one goes through its own simplified boot (page tables, APIC, idle task) before joining the idle loop.

But just having more CPUs isn’t enough—the scheduler also needs to know which CPUs share caches, NUMA nodes, or hyperthreading siblings, so it can move tasks around intelligently. That’s what sched_init_smp() builds: the scheduling domains the load balancer uses to decide where each task should run. This is when real multitasking truly begins.

In the colony: the rest of the colonists were in cryosleep on the dropship. Now we thaw them out, hand out assignments, and set up the dispatch that decides who works where.

Loading the Drivers and Filesystems (initcalls)

Before the bulk of init runs, page_alloc_init_late() finishes any deferred page setup that was skipped to keep early boot fast (on huge-memory machines this can take noticeable time).

Then do_initcalls() runs through thousands of initcalls—functions across the source tree that each initialize some subsystem (a driver, a filesystem, a network protocol). This is where most of the kernel’s bulk runs, and where the boot log fills up with messages like:

PCI: Using ACPI for IRQ routing
ata1: SATA max UDMA/133
...
ext4 filesystem driver registered

Every department finally opens its doors—PCI, SATA, ext4, network protocols—and starts taking customers. Initcalls aren’t all run at once: they’re grouped into ordered levels (early, core, subsys, fs, device, late, etc.), so drivers can register at the right phase and depend on others (e.g., the PCI bus is up before disk drivers look for hardware).

After initcalls, wait_for_initramfs() waits for the initramfs to finish unpacking, and console_on_rootfs() opens /dev/console and wires it up as stdin/stdout/stderr—that’s how the userspace init ends up with a controlling terminal automatically.

Mounting the Root Filesystem

Then prepare_namespace() mounts whatever you specified with root=. On modern distros you’re almost always using an initramfs (initial RAM filesystem): the kernel unpacks it into a tmpfs, runs /init inside, and that script loads any drivers it needs, mounts the real root filesystem, and pivots to it. Without an initramfs, the kernel just mounts the root filesystem directly.

This is the colony unsealing the supply warehouse—up to now we’ve been using only what we landed with; the long-term inventory is finally available.

There’s a bit of housekeeping to do before we hand control off.

Throwing Out the Setup Crew’s Toolboxes

All that init code—setup_arch, every __init function, the whole setup pipeline—is now dead weight. The functions were tagged with __init precisely so they could be placed in a separate memory section and freed after boot. free_initmem() reclaims that section, typically a few megabytes:

Freeing unused kernel memory: 2048K

The advance team’s setup gear—scaffolding, build-phase tools, construction-only signage—gets packed up, and the space they took becomes usable real estate.

Locking the Doors Behind Us

Right after freeing init memory, mark_readonly() flips rodata to actually read-only at the page-table level. The next call, pti_finalize(), finalizes PTI (Page Table Isolation)—the mitigation for the Meltdown vulnerability that keeps kernel page tables almost entirely separate from user-process page tables, so userspace can’t leak kernel memory through speculative side channels.

Why now? Because during boot the kernel was patching its own code (remember alternative_instructions() from earlier in Phase 4?) and writing tables that have to stay writable until they’re finished. You can’t lock the doors while you’re still installing them—now that they’re installed, we throw the deadbolts.

And now we reach the call the entire boot has been leading up to.

Executing Init: The Final Handoff

This is the moment everything has been building toward. The kernel tries to launch the init program by walking a list of candidate paths in order:

  1. rdinit= from the command line (typically /init inside the initramfs)
  2. init= from the command line—if this was set explicitly and the program fails to run, the kernel panics; the user asked for something specific and won’t be silently overridden
  3. The compiled-in CONFIG_DEFAULT_INIT, if any
  4. /sbin/init, then /etc/init, then /bin/init
  5. /bin/sh as a last-ditch fallback

When one of those succeeds, the kernel goes through a small helper called run_init_process() which in turn calls kernel_execve()—the in-kernel version of the same execve(2) syscall userspace uses, just bypassing the syscall boundary since the arguments are already in kernel memory. This replaces the kernel thread with the userspace program. The PID 1 kernel thread is gone; in its place is the userspace init process. Same PID, same task, but now it’s running unprivileged code.

If every path fails, the kernel panics with “No working init found.”

But it almost always succeeds—and the moment it does, the system is fully booted. Userspace takes over. The colonial governor steps out of the kernel’s control and starts spawning services, mounting filesystems, launching login managers. The kernel’s job is done. It’s now just infrastructure, sitting underneath everything that comes next.

That was a lot of ground to cover, so before we close out, let’s recap the whole journey at a high level.

Summary

The Linux kernel boot process turns bare metal into a working operating system through six carefully ordered phases:

Phase 1 (Assembly): Decompress the kernel and randomize its base (KASLR), climb the CPU into 64-bit Long Mode, verify the CPU’s features, fix the page tables to handle the load-vs-compile address mismatch, and jump into C.

Phase 2 (Early C): Clear BSS, set up safety stubs (KASAN’s placeholder shadow, a minimal IDT), save the bootloader’s parameters, and patch the CPU’s microcode.

Phase 3 (setup_arch): Detect CPU features, parse the firmware memory map, learn what machine we’re on (UEFI, hypervisor, ACPI early pass), build the kernel direct map of all physical RAM, and finalize KASAN’s real shadow.

Phase 4 (start_kernel): Bring up memory (mm_core_init), the scheduler, RCU, time and the IRQ stack (then enable interrupts), the real console, self-patch the kernel for this exact CPU, and the slab caches every other subsystem needs.

Phase 5 (rest_init): Spawn PID 1 (kernel_init, the future init process) and PID 2 (kthreadd, the kernel-thread dispatcher). The original boot thread becomes the boot CPU’s idle task and never executes init code again.

Phase 6 (kernel_init): Wake up the other CPUs, run all initcalls (drivers, filesystems), mount the root filesystem, free __init memory, lock down rodata and finalize PTI, and finally exec() the userspace init program. The advance team has handed the keys to the civilians, and Linux is alive.

In the next article in the series, we’ll look at how userspace actually talks to the kernel from this point on—through system calls, the bridge between unprivileged programs and the machinery we just spent this whole article setting up.
