Sunday, May 13, 2007

Linux Context Switch



Since I wrote this entry, I have been wondering about the address at the top
of the stack trace of a process sleeping in schedule(), as shown below:
crash> bt
PID: 2086 TASK: ffff81007f108080 CPU: 1 COMMAND: "sendmail"
#0 [ffff810078291950] schedule at ffffffff80260ab2 <== Here!
#1 [ffff810078291a30] schedule_timeout at ffffffff80261416
#2 [ffff810078291a80] do_select at ffffffff80211369
[snip]

crash> rd ffff810078291950 2
ffff810078291950: ffff810078291a28 0000000000000031 (.)x....1.......

My point is that '0xffffffff80260ab2' is not equal to the value stored
at '0xffff810078291950' in the above case. This is different from the other
stack frames: in the above example, the values stored at stack frames #1
and #2 are exactly the return addresses shown in the rightmost column
of each line.

Q1.

First of all, what does the address '0xffffffff80260ab2' mean?
Assuming the address is the return address of a near call instruction
(five bytes: one opcode byte plus a 32-bit relative displacement),
I disassembled from 0xffffffff80260ab2 minus five bytes, and got the following:


crash> dis ffffffff80260aad 16
0xffffffff80260aad : callq 0xffffffff80262898 <__kprobes_text_start>
0xffffffff80260ab2 : mov %gs:0x0,%rsi
[snip]

The address right after the call, 0xffffffff80260ab2, carries the symbol
name 'thread_return', presumably inside schedule(). Here, schedule() is the
central function that decides which process to run next.

Q2.

The next question is: why 'thread_return'?

I looked into the crash-4.0-2.13 source tree and found the reason.
Here is the call graph of the 'bt' command.

bt_cmd (./kernel.c:1351 or ./kernel.c:1359)
  back_trace ()
    mach_dep->get_stack_frame (./kernel.c:1518)
      x86_64_get_stack_frame (x86_64.c:2296)
        x86_64_get_pc (x86_64.c:2510)
        # Here, if thread_struct does not have %rip as its member,
        # x86_64_get_pc returns 'thread_return' as %rip.
        # That's the case of kernel 2.6.11 or later.

In short, the 'bt' command of crash always reports the same address for a
process which is NOT on a physical CPU.
Stack frame #0 above is the place where the process left the running state
and handed the CPU over to the next process; that is, where a context switch
occurred. So the address shown must be the instruction pointer (%rip) at
which the switched-out process resumes execution when it gets a CPU again.
Since the kernel naturally has several things to do after switching from
one process to another, this address can differ from the return address of
a usual function call.

Q3.

My third question is: why is there no 'thread_return' address at the top of
stack frame #0?

We can understand the reason by looking into schedule().
Here is another call graph, this time of schedule().
schedule()
  context_switch() # an inline architecture independent C function
    switch_to() # a macro in assembly language
      callq __switch_to # an architecture dependent C function
Actually, 'switch_to' is defined as follows in the case of x86_64
(include/asm-x86_64/system.h). As shown below, the stack pointer
%rsp is switched to the next process's stack before '__switch_to' is called.
23 #define switch_to(prev,next,last) \
24 asm volatile(SAVE_CONTEXT \
25 "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
26 "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \
27 "call __switch_to\n\t" \
28 ".globl thread_return\n" \
29 "thread_return:\n\t" \
30 "movq %%gs:%P[pda_pcurrent],%%rsi\n\t" \
31 "movq %P[thread_info](%%rsi),%%r8\n\t" \
32 LOCK_PREFIX "btr %[tif_fork],%P[ti_flags](%%r8)\n\t" \
33 "movq %%rax,%%rdi\n\t" \
34 "jc ret_from_fork\n\t" \
35 RESTORE_CONTEXT \
[snip]

The call instruction on line 27 above does not push its return address
('thread_return') onto the top of the previous process's stack, because the
kernel has already switched the stack pointer from the previous process's
stack (RSP saved on line 25) to the next process's stack (RSP restored on
line 26) just before the call. The return address lands on the next
process's stack instead.

Thus, thread_return does not appear in the stack frame of the previous process.

Q4.

The last question is: what is 'thread_return'?
The name is self-explanatory, and it appears on lines 28 and 29 of the
'switch_to' definition above: line 28 makes the symbol global and line 29
defines the label. It's just the re-entry point for a process on returning
from '__switch_to().'

I hope this article is useful for people.
