Monday, April 23, 2007

Linux Kernel Linkage

I have to confess that I didn't properly understand the Linux kernel's (or perhaps the general) linkage convention until last Sunday. :(

I had long found it strange that kernel stack traces on the x86 and
x86_64 architectures often contain unexpected function call entries,
as in the example below. The entries in red in this stack trace cannot
be found on the call path, no matter how closely you read the source code.

crash> bt
PID: 2115 TASK: ffff81007f0ff7d0 CPU: 0 COMMAND: "sendmail"
#0 [ffff810078269950] schedule at ffffffff80260ab2
#1 [ffff810078269a30] schedule_timeout at ffffffff80261416
#2 [ffff810078269a80] do_select at ffffffff80211369
#3 [ffff810078269b70] proc_alloc_inode at ffffffff802f48dc
#4 [ffff810078269b80] alloc_inode at ffffffff80225746
#5 [ffff810078269bb0] inotify_d_instantiate at ffffffff802e6a70
#6 [ffff810078269bd0] d_rehash at ffffffff80240566
#7 [ffff810078269bf0] proc_lookup at ffffffff802277cd
#8 [ffff810078269c20] proc_root_lookup at ffffffff80254ddd
#9 [ffff810078269c40] do_lookup at ffffffff8020cade
#10 [ffff810078269c70] dput at ffffffff8020cff4
#11 [ffff810078269c90] __link_path_walk at ffffffff8020a196
#12 [ffff810078269cf0] vsnprintf at ffffffff80219e0f
#13 [ffff810078269d00] link_path_walk at ffffffff8020e776

#14 [ffff810078269d90] core_sys_select at ffffffff802d8dd8
#15 [ffff810078269e00] __next_cpu at ffffffff8033bf39
#16 [ffff810078269e10] nr_running at ffffffff802861c5
#17 [ffff810078269e30] loadavg_read_proc at ffffffff802f7be7
#18 [ffff810078269e60] lock_kernel at ffffffff802627fb
#19 [ffff810078269e80] de_put at ffffffff802f480b

#20 [ffff810078269f20] sys_select at ffffffff80216185
#21 [ffff810078269f80] system_call at ffffffff8025c00e

Finally, I understood the reason.

The code below is excerpted from 'arch/x86_64/kernel/traps.c' in the RHEL5 Update 0 kernel (based on linux-2.6.18). The function 'dump_trace' is the one that prints stack traces; note the odd wording of the comment on lines 269 to 272. It means that 'dump_trace' cannot determine the boundaries of stack frames precisely. Instead, it treats the stack area as if it were an array of pointers and prints an entry in the stack trace whenever a value looks like the address of an in-kernel function. :(
242 void dump_trace(struct task_struct *tsk, struct pt_regs *regs, unsigned long * stack,
243 struct stacktrace_ops *ops, void *data)
244 {
245 const unsigned cpu = smp_processor_id();
246 unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
247 unsigned used = 0;
248
[snip]
259 /*
260 * Print function call entries within a stack. 'cond' is the
261 * "end of stackframe" condition, that the 'stack++'
262 * iteration will eventually trigger.
263 */
264 #define HANDLE_STACK(cond) \
265 do while (cond) { \
266 unsigned long addr = *stack++; \
267 if (kernel_text_address(addr)) { \
268 /* \
269 * If the address is either in the text segment of the \
270 * kernel, or in the region which contains vmalloc'ed \
271 * memory, it *may* be the address of a calling \
272 * routine; if so, print it so that someone tracing \
273 * down the cause of the crash will be able to figure \
274 * out the call path that was taken. \
275 */ \
276 ops->address(data, addr); \
277 } \
278 } while (0)
279
280 /*
281 * Print function call entries in all stacks, starting at the
282 * current stack address. If the stacks consist of nested
283 * exceptions
284 */
285 for (;;) {
286 char *id;
287 unsigned long *estack_end;
288 estack_end = in_exception_stack(cpu, (unsigned long)stack,
289 &used, &id);
290
291 if (estack_end) {
292 if (ops->stack(data, id) < 0)
293 break;
294 HANDLE_STACK (stack < estack_end);
295 ops->stack(data, "<EOE>");
296 /*
297 * We link to the next stack via the
298 * second-to-last pointer (index -2 to end) in the
299 * exception stack:
300 */
301 stack = (unsigned long *) estack_end[-2];
302 continue;
303 }

304 if (irqstack_end) {
[snip]
323 }
324 break;
325 }
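
To make the heuristic concrete, here is a minimal user-space sketch of the same idea. It is not the kernel's code: the text range bounds, the names looks_like_text and scan_stack, and the contents of sample_stack are all made up for illustration, and the real kernel_text_address() checks more than a single fixed range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds of the "kernel text" region in this toy model. */
#define TEXT_START UINT64_C(0xffffffff80200000)
#define TEXT_END   UINT64_C(0xffffffff80400000)

/* Stand-in for kernel_text_address(): here, just a range check. */
int looks_like_text(uint64_t addr)
{
    return addr >= TEXT_START && addr < TEXT_END;
}

/*
 * Walk the stack one word at a time, as HANDLE_STACK does, and record
 * every word that falls inside the text range. Nothing here knows where
 * a stack frame begins or ends.
 */
size_t scan_stack(const uint64_t *stack, size_t nwords,
                  uint64_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < nwords && found < max_out; i++) {
        if (looks_like_text(stack[i]))
            out[found++] = stack[i];
    }
    return found;
}

/* Made-up stack contents: two live return addresses plus noise. */
const uint64_t sample_stack[5] = {
    UINT64_C(0xffffffff80260ab2), /* live return address */
    UINT64_C(0x0000000000000042), /* ordinary local variable */
    UINT64_C(0xffffffff802f48dc), /* pretend leftover from an earlier call */
    UINT64_C(0xffff810078269a80), /* pointer into the stack (data) */
    UINT64_C(0xffffffff80211369), /* live return address */
};
```

Calling scan_stack(sample_stack, 5, hits, 8) reports three entries, including the leftover one, because a stale word that happens to point into kernel text is indistinguishable from a real return address under this check.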
The reason comes from the linkage convention of Linux (x86 and x86_64, at least).

On those architectures, a callee does NOT save any registers other than the ones it clobbers. That includes the frame pointer (%ebp or %rbp), which is not maintained when the kernel is built without frame pointers. Only return addresses are saved automatically, by the 'call' instruction.

That's why, unlike on other architecture/operating system combinations such as SPARC/Solaris, it is difficult to trace back through the caller functions reliably from the stack area on Linux/x86 (and x86_64). :(
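
For contrast, here is a toy model of what a stack walker can do when every function does keep a frame pointer: each frame records a link to its caller's frame next to the return address, so the walker follows the links instead of guessing. The struct frame layout and the name walk_frames are assumptions made for this sketch, not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified frame record: the saved frame pointer of the caller,
 * followed by the return address pushed by 'call'.
 */
struct frame {
    const struct frame *prev; /* caller's frame; NULL at the outermost frame */
    uint64_t ret_addr;        /* return address into the caller */
};

/*
 * Follow the chain of saved frame pointers, collecting one return
 * address per frame. Stale words elsewhere on the stack are never
 * looked at, so they cannot show up in the result.
 */
size_t walk_frames(const struct frame *fp, uint64_t *out, size_t max_out)
{
    size_t n = 0;

    while (fp != NULL && n < max_out) {
        out[n++] = fp->ret_addr;
        fp = fp->prev;
    }
    return n;
}
```

The price is one register permanently tied up as the frame pointer plus a few extra instructions in every function, which is the trade-off against performance.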

Of course, this must be for performance reasons. Still, I think it makes troubleshooting difficult. :(

1 comment:

Anonymous said...

There are options in the kernel configuration that keep the frame pointer, which allows for accurate stack backtraces.

Using -fomit-frame-pointer is very common because x86 has so few architectural registers
(though in reality there are many more registers and a renaming engine inside modern CPUs)