Gofunclatency

本文介绍怎么实现一个可用的gofunclatency脚本。

起因

最近对bcc产生了一些兴趣，而我工作语言是golang，因此打算实现一个gofunclancy。有人会问bcc tools里面不是自带了一个funclatency吗？实际上这个工具不能直接应用于golang。我们接下来将从原生的funclatency的原理以及实现讲起，然后看下golang是哪里不适合这个工具，并逐步提出解决方案。

特别心急的小伙伴可以直接看我的实现源码：gofunclatency，中度心急的可以看每个小节的倒数一两段的总结，不急的可以从头到尾看一遍，帮忙找找错漏。

环境

软件	版本
操作系统	Arch Linux
go	1.12.7
gcc	9.3.0

funclatency实现原理探索

以一个简单的C程序作为范例：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


#include <stdio.h>
#include <stdlib.h>

int wait4input(void){
    int ch;

    ch = getchar();
    while('\n'!=getchar());

    return ch;
}

void attack(int i)
{
    i++;
}

int main()
{
    printf("waiting for gdb\n");
    wait4input();
    printf("gdb ready\n");
    for (int i = 0; i < 1000000; i++) {
        attack(i);
    }
    return 0;
}

编译的时候记得禁止任何优化，否则attack会被内联：

1

gcc hello.c -O0 -o hello.bin

接着打开一个窗口A运行

1

./hello.bin

另外一个窗口B执行

1

gdb -p `pgrep hello.bin`

在窗口B给i++那行打个断点，然后continue，再在窗口A随便输入点东西让程序继续。

1
2
3
4
5
6
7


(gdb) b hello.c:15
Breakpoint 1 at 0x55c71f396170: file hello.c, line 15.
(gdb) c
Continuing.

Breakpoint 1, attack (i=0) at hello.c:15
15	    i++;

完成上面的步骤后，在使用funclatency之前，我们先看看原来的代码是怎么样的：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


(gdb) disass
Dump of assembler code for function attack:
   0x000055d574cb7169 <+0>:	push   %rbp
   0x000055d574cb716a <+1>:	mov    %rsp,%rbp
   0x000055d574cb716d <+4>:	mov    %edi,-0x4(%rbp)
=> 0x000055d574cb7170 <+7>:	addl   $0x1,-0x4(%rbp)
   0x000055d574cb7174 <+11>:	nop
   0x000055d574cb7175 <+12>:	pop    %rbp
   0x000055d574cb7176 <+13>:	retq   
End of assembler dump.

我们反汇编了attack函数，可以看到经典的保存rbp指针，将rsp指针赋予rbp指针的函数栈操作。而edi是用于存储第一个函数的寄存器，我们把这个函数放到临时的栈位置上，然后加1，这和i++的语义是一样的。最后是恢复rbp指针和返回上级函数。在断点，也就是上面小箭头的位置上，我们可以知道，栈上保存了rbp原来的值，以及返回地址。返回地址应该是main函数调用完attack函数后的下一个指令的地址，我们验证一下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


(gdb) x/8xg $rsp
0x7ffce76e1b40:	0x00007ffce76e1b60	0x000055d574cb71af
0x7ffce76e1b50:	0x00007ffce76e1c50	0x0000000000000000
0x7ffce76e1b60:	0x0000000000000000	0x00007f0fb550a023
0x7ffce76e1b70:	0x00007f0fb57005c0	0x00007ffce76e1c58
(gdb) disass main
Dump of assembler code for function main:
   0x000055d574cb7177 <+0>:	push   %rbp
   0x000055d574cb7178 <+1>:	mov    %rsp,%rbp
   0x000055d574cb717b <+4>:	sub    $0x10,%rsp
   0x000055d574cb717f <+8>:	lea    0xe7e(%rip),%rdi        # 0x55d574cb8004
   0x000055d574cb7186 <+15>:	callq  0x55d574cb7030 <puts@plt>
   0x000055d574cb718b <+20>:	callq  0x55d574cb7149 <wait4input>
   0x000055d574cb7190 <+25>:	lea    0xe7d(%rip),%rdi        # 0x55d574cb8014
   0x000055d574cb7197 <+32>:	callq  0x55d574cb7030 <puts@plt>
   0x000055d574cb719c <+37>:	movl   $0x0,-0x4(%rbp)
   0x000055d574cb71a3 <+44>:	jmp    0x55d574cb71b3 <main+60>
   0x000055d574cb71a5 <+46>:	mov    -0x4(%rbp),%eax
   0x000055d574cb71a8 <+49>:	mov    %eax,%edi
   0x000055d574cb71aa <+51>:	callq  0x55d574cb7169 <attack>
   0x000055d574cb71af <+56>:	addl   $0x1,-0x4(%rbp)
   0x000055d574cb71b3 <+60>:	cmpl   $0xf423f,-0x4(%rbp)
   0x000055d574cb71ba <+67>:	jle    0x55d574cb71a5 <main+46>
   0x000055d574cb71bc <+69>:	mov    $0x0,%eax
   0x000055d574cb71c1 <+74>:	leaveq 
   0x000055d574cb71c2 <+75>:	retq   
End of assembler dump.

可以看到0x000055d574cb71af确实是main函数调用完attack后的下一个指令的地址。接着，我们看下用funclatency测量attack函数的耗时后，程序会出现什么变化，打开窗口C，执行：

1
2


[root@archlinux perf]# funclatency `realpath hello.bin`:attack
Tracing 1 functions for "/data/workspace/perf/hello.bin:attack"... Hit Ctrl-C to end.

然后在窗口B的gdb终端里再看一下汇编代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x000055d574cb7171 in attack (i=2) at hello.c:15
15	    i++;
(gdb) disass
Dump of assembler code for function attack:
   0x000055d574cb7169 <+0>:	int3   
   0x000055d574cb716a <+1>:	mov    %rsp,%rbp
   0x000055d574cb716d <+4>:	mov    %edi,-0x4(%rbp)
   0x000055d574cb7170 <+7>:	addl   $0x1,-0x4(%rbp)
   0x000055d574cb7174 <+11>:	nop
   0x000055d574cb7175 <+12>:	pop    %rbp
   0x000055d574cb7176 <+13>:	retq   
End of assembler dump.

我们先continue一下，进入下一个attack循环断点，发现原来的代码位置0x000055d574cb7169的push %rbp指令变成了int3。int3是一个软中断，用于给gdb等程序作为断点使用，这里是被funclatency加入了一个断点。

funclatency既然是计算函数延迟，不会只在函数开始的时候断点，结束的时候也应该有个断点。看管们别急，我们再看一下刚才说的栈返回地址的地方：

1
2
3
4
5


(gdb) x/8xg $rsp
0x7ffce76e1b40:	0x00007ffce76e1b60	0x00007fffffffe000
0x7ffce76e1b50:	0x00007ffce76e1c50	0x0000000200000000
0x7ffce76e1b60:	0x0000000000000000	0x00007f0fb550a023
0x7ffce76e1b70:	0x00007f0fb57005c0	0x00007ffce76e1c58

我们发现，应该返回的地址0x000055d574cb71af变成了0x00007fffffffe000。这个地址里面是什么呢？我们尝试gdb打印一下：

1
2


(gdb) x/1xb 0x00007fffffffe000
0x7fffffffe000:	Cannot access memory at address 0x7fffffffe000

居然看不到，看不到也没关系，我们看下这块内存是干啥的：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


[root@archlinux perf]# cat /proc/`pgrep hello.bin`/maps
55d574cb6000-55d574cb7000 r--p 00000000 08:12 14554273                   /data/workspace/perf/hello.bin
55d574cb7000-55d574cb8000 r-xp 00001000 08:12 14554273                   /data/workspace/perf/hello.bin
55d574cb8000-55d574cb9000 r--p 00002000 08:12 14554273                   /data/workspace/perf/hello.bin
55d574cb9000-55d574cba000 r--p 00002000 08:12 14554273                   /data/workspace/perf/hello.bin
55d574cba000-55d574cbb000 rw-p 00003000 08:12 14554273                   /data/workspace/perf/hello.bin
55d57530f000-55d575330000 rw-p 00000000 00:00 0                          [heap]
7f0fb54e3000-7f0fb5508000 r--p 00000000 08:02 3935586                    /usr/lib/libc-2.31.so
7f0fb5508000-7f0fb5654000 r-xp 00025000 08:02 3935586                    /usr/lib/libc-2.31.so
7f0fb5654000-7f0fb569f000 r--p 00171000 08:02 3935586                    /usr/lib/libc-2.31.so
7f0fb569f000-7f0fb56a2000 r--p 001bb000 08:02 3935586                    /usr/lib/libc-2.31.so
7f0fb56a2000-7f0fb56a5000 rw-p 001be000 08:02 3935586                    /usr/lib/libc-2.31.so
7f0fb56a5000-7f0fb56ab000 rw-p 00000000 00:00 0 
7f0fb56d5000-7f0fb56d7000 r--p 00000000 08:02 3935574                    /usr/lib/ld-2.31.so
7f0fb56d7000-7f0fb56f7000 r-xp 00002000 08:02 3935574                    /usr/lib/ld-2.31.so
7f0fb56f7000-7f0fb56ff000 r--p 00022000 08:02 3935574                    /usr/lib/ld-2.31.so
7f0fb5700000-7f0fb5701000 r--p 0002a000 08:02 3935574                    /usr/lib/ld-2.31.so
7f0fb5701000-7f0fb5702000 rw-p 0002b000 08:02 3935574                    /usr/lib/ld-2.31.so
7f0fb5702000-7f0fb5703000 rw-p 00000000 00:00 0 
7ffce76c2000-7ffce76e3000 rw-p 00000000 00:00 0                          [stack]
7ffce77d1000-7ffce77d5000 r--p 00000000 00:00 0                          [vvar]
7ffce77d5000-7ffce77d7000 r-xp 00000000 00:00 0                          [vdso]
7fffffffe000-7ffffffff000 --xp 00000000 00:00 0                          [uprobes]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

可以看到，这块内存是在[uprobe]这个区域，根据get_trampoline_vaddr，这是一个跳板区域，在函数返回之后，先执行这个跳板区域的程序，然后再返回原来应该返回的地址。

至此，我们知道了funclatency是：

在函数开始的地方加了个int3软中断，执行一段程序；
修改栈中保存的返回地址，改为[uprobes]里面的一个跳板区域，执行一段特殊程序后正常返回原地址。

第一点就是uprobe的原理，第二点就是uretprobe的原理。这里我们暂不探讨kprobe和kretprobe。

我们已经初步知道所谓断点的机制，下面我们将通过阅读funclatency源码，了解程序在断点中注入了什么(下面都是摘抄片段，有需要看完整源码请看funclatency)：

1
2
3
4


b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry",
                    pid=args.pid or -1)
b.attach_uretprobe(name=library, sym_re=pattern,
                    fn_name="trace_func_return", pid=args.pid or -1)

其中library为程序路径或者库名，pattern是函数的正则表达式，我们往这个程序的这些函数里面注入uprobe和uretprobe。uprobe里面是执行了一个trace_func_entry函数：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


BPF_HASH(start, u32);
STORAGE

int trace_func_entry(struct pt_regs *ctx)
{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid;
    u32 tgid = pid_tgid >> 32;
    u64 ts = bpf_ktime_get_ns();

    FILTER
    ENTRYSTORE
    start.update(&pid, &ts);

    return 0;
}

其中bpf_get_current_pid_tgid是获取进程的pid和tgid，这是内核的概念，内核的pid相当于用户空间的线程id，内核的tgid才是用户空间的进程id，为了避免混乱，下面我们分别以用户角度简称线程id和进程id。可以看到，我们以线程id作为键，存下了开始的纳秒时间(bpf_ktime_get_ns)。

然后我们看下uretprobe的trace_func_return函数干了啥：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


int trace_func_return(struct pt_regs *ctx)
{
    u64 *tsp, delta;
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid;
    u32 tgid = pid_tgid >> 32;

    // calculate delta time
    tsp = start.lookup(&pid);
    if (tsp == 0) {
        return 0;   // missed start
    }
    delta = bpf_ktime_get_ns() - *tsp; //重点关注
    start.delete(&pid);
    FACTOR

    // store as histogram
    STORE

    return 0;
}

可以看到，我们先用当前的线程id查到开始时间，然后用现在时间减去开始时间得到delta，以纳秒计算的时间差，这就是这个函数的延迟。有人可能会有疑惑，那delta没看到哪里用呀？其实STORE是一个宏，负责存储：

1
2
3


bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')

我们把值存到了一个dist这个直方图类型里面，最后是打印出来：

1

dist.print_log2_hist(label)

至此，funclatency的uprobe部分在我们眼里再无秘密。总结一下就是，funclatency相当于在函数开始的地方加了一个uprobe，也就是替换第一个指令为int3，进入中断处理函数，在中断处理函数中用线程id作为键，存了开始的时间t1；然后加个uretprobe，也就是修改栈上保存的返回地址，指向一个跳板区域，跳板区域里面用线程id查找到开始时间t1, 用现在的时间t2-t1，得到延迟时间t'，最后打印时间分布直方图。

原生funclatency的原理我们很清楚了，下面我们将探索funclatency不适用于go的原因。

go的线程id之困

老规矩，先来个小程序提神醒脑，提出问题：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


package main

/*
#include <sys/types.h>
extern pid_t gettid(void);
*/
import "C"

import (
	"fmt"
	"time"
)

func bark(prefix string) {
	fmt.Printf("[%s] tid is %d\n", prefix, C.gettid())
	time.Sleep(time.Second)
	fmt.Printf("[%s] tid is %d\n", prefix, C.gettid())
}

func main() {
	go bark("b")
	bark("a")
}

多跑几次

1
2
3
4
5
6
7
8


[root@archlinux perf]# go run gotid.go
[a] tid is 480820
[b] tid is 480822
[a] tid is 480823
[root@archlinux perf]# go run gotid.go
[a] tid is 480923
[b] tid is 480926
[a] tid is 480927

我们发现，线程id居然在函数a中间变了，这在普通的多线程C/C++函数里面难以想象，这也导致了funclatency脚本以tid作为键存储函数开始时间变得不可行。很自然的，为了测量函数的延迟，我们需要找另一个键，一个能表示这个函数的这次运行的唯一的键。比如goroutine id。那么，怎么获取goroutine id呢？(以下简称goid)

遗憾的是，经过一番查找，目前go官方没有暴露任何获得goid的方法，但是由于go是开源的，我们可以看源码。经过看源码，我们发现goid存储在G结构体中：

1
2
3
4
5


type g struct {
...省略N行无关代码...
	goid           int64
...省略N行无关代码...
}

而在runtime包里面，获取当前的g是通过如下方法：

1
2
3
4


// getg returns the pointer to the current g.
// The compiler rewrites calls to this function into instructions
// that fetch the g directly (from TLS or from the dedicated register).
func getg() *g

难以置信，不知道是不是老眼昏花，笔者我居然找不到这个函数的实现，看来只能是按照这个函数的注释，是编译器把这个调用直接转成指令了。不过既然能编译出来，那我们找个调用了getg函数的比较短的父函数反汇编一下。在这里我们找了runtime.showframe函数：

1
2
3
4
5
6
7
8
9


// showframe reports whether the frame with the given characteristics should
// be printed during a traceback.
func showframe(f funcInfo, gp *g, firstFrame bool, funcID, childID funcID) bool {
	g := getg()
	if g.m.throwing > 0 && gp != nil && (gp == g.m.curg || gp == g.m.caughtsig.ptr()) {
		return true
	}
	return showfuncinfo(f, firstFrame, funcID, childID)
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48


(gdb) disass 'runtime.showframe'
Dump of assembler code for function runtime.showframe:
   0x0000000000449df0 <+0>:	mov    %fs:0xfffffffffffffff8,%rcx
   0x0000000000449df9 <+9>:	cmp    0x10(%rcx),%rsp
   0x0000000000449dfd <+13>:	jbe    0x449ea8 <runtime.showframe+184>
   0x0000000000449e03 <+19>:	sub    $0x28,%rsp
   0x0000000000449e07 <+23>:	mov    %rbp,0x20(%rsp)
   0x0000000000449e0c <+28>:	lea    0x20(%rsp),%rbp
   0x0000000000449e11 <+33>:	mov    %fs:0xfffffffffffffff8,%rax
   0x0000000000449e1a <+42>:	mov    0x30(%rax),%rax
   0x0000000000449e1e <+46>:	cmpl   $0x0,0xf4(%rax)
   0x0000000000449e25 <+53>:	jle    0x449e43 <runtime.showframe+83>
   0x0000000000449e27 <+55>:	mov    0x40(%rsp),%rcx
   0x0000000000449e2c <+60>:	test   %rcx,%rcx
   0x0000000000449e2f <+63>:	je     0x449e43 <runtime.showframe+83>
   0x0000000000449e31 <+65>:	cmp    %rcx,0xc0(%rax)
   0x0000000000449e38 <+72>:	jne    0x449e98 <runtime.showframe+168>
   0x0000000000449e3a <+74>:	mov    $0x1,%eax
   0x0000000000449e3f <+79>:	test   %al,%al
   0x0000000000449e41 <+81>:	jne    0x449e89 <runtime.showframe+153>
   0x0000000000449e43 <+83>:	mov    0x30(%rsp),%rax
   0x0000000000449e48 <+88>:	mov    %rax,(%rsp)
   0x0000000000449e4c <+92>:	mov    0x38(%rsp),%rax
   0x0000000000449e51 <+97>:	mov    %rax,0x8(%rsp)
   0x0000000000449e56 <+102>:	movzbl 0x48(%rsp),%eax
   0x0000000000449e5b <+107>:	mov    %al,0x10(%rsp)
   0x0000000000449e5f <+111>:	movzbl 0x49(%rsp),%eax
   0x0000000000449e64 <+116>:	mov    %al,0x11(%rsp)
   0x0000000000449e68 <+120>:	movzbl 0x4a(%rsp),%eax
   0x0000000000449e6d <+125>:	mov    %al,0x12(%rsp)
   0x0000000000449e71 <+129>:	callq  0x449ec0 <runtime.showfuncinfo>
   0x0000000000449e76 <+134>:	movzbl 0x18(%rsp),%eax
   0x0000000000449e7b <+139>:	mov    %al,0x50(%rsp)
   0x0000000000449e7f <+143>:	mov    0x20(%rsp),%rbp
   0x0000000000449e84 <+148>:	add    $0x28,%rsp
   0x0000000000449e88 <+152>:	retq   
   0x0000000000449e89 <+153>:	movb   $0x1,0x50(%rsp)
   0x0000000000449e8e <+158>:	mov    0x20(%rsp),%rbp
   0x0000000000449e93 <+163>:	add    $0x28,%rsp
   0x0000000000449e97 <+167>:	retq   
   0x0000000000449e98 <+168>:	mov    0xc8(%rax),%rdx
   0x0000000000449e9f <+175>:	cmp    %rcx,%rdx
   0x0000000000449ea2 <+178>:	sete   %al
   0x0000000000449ea5 <+181>:	nop
   0x0000000000449ea6 <+182>:	jmp    0x449e3f <runtime.showframe+79>
   0x0000000000449ea8 <+184>:	callq  0x451470 <runtime.morestack_noctxt>
   0x0000000000449ead <+189>:	jmpq   0x449df0 <runtime.showframe>
End of assembler dump.

哎，这哗啦啦的一堆汇编代码，搞得像电视里面特工的电脑画面一样。那么一堆，那么getg的汇编代码在哪里？别急，我们稍微扫一眼汇编，runtime.showfuncinfo并没有被内联进来，这大大降低了我们的阅读难度；其次g.m.throwing > 0这个真是令人感动，我们只要找到立即数0就好，果然，一眼就发现cmpl $0x0,0xf4(%rax)，那么0xf4(%rax)明显是throwing，我们合理推测%rax里面是g.m的地址，看代码验证一下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


type m struct {
	g0      *g     // goroutine with scheduling stack
	morebuf gobuf  // gobuf arg to morestack
	divmod  uint32 // div/mod denominator for arm - known to liblink

	// Fields not known to debuggers.
	procid        uint64       // for debuggers, but offset not hard-coded
	gsignal       *g           // signal-handling g
	goSigStack    gsignalStack // Go-allocated signal handling stack
	sigmask       sigset       // storage for saved signal mask
	tls           [6]uintptr   // thread-local storage (for x86 extern register)
	mstartfn      func()
	curg          *g       // current running goroutine
	caughtsig     guintptr // goroutine running during fatal signal
	p             puintptr // attached p for executing go code (nil if not executing go code)
	nextp         puintptr
	oldp          puintptr // the p that was attached before executing a syscall
	id            int64
	mallocing     int32
    throwing      int32
...此处省略N行代码...
}

根据内存对齐规则，我们算了一下throwing的偏移，确实是0xf4。那么根据再前一行代码mov 0x30(%rax),%rax，我们猜测这里%rax里面存的是g，一样的办法，我们检查g结构体中m的偏移：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


type g struct {
	// Stack parameters.
	// stack describes the actual stack memory: [stack.lo, stack.hi).
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	// stackguard1 is the stack pointer compared in the C stack growth prologue.
	// It is stack.lo+StackGuard on g0 and gsignal stacks.
	// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to liblink

	_panic         *_panic // innermost panic - offset known to liblink
	_defer         *_defer // innermost defer
    m              *m      // current m; offset known to arm liblink
...此处省略N行代码...
}

确实是0x30，那我们很容易知道mov %fs:0xfffffffffffffff8,%rax，这句代码就是getg。%fs是x86的一个段寄存器，主要用来存储TLS（Thread Local Storage），就是线程本地变量。但是，对于funclatency程序来说，看上去我们能获取的只有struct pt_regs变量：

1
2
3
4


int trace_func_entry(struct pt_regs *ctx)
{
    ...省略...
}

结构体定义在<uapi/asm/ptrace.h>

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long rbp;
	unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long rax;
	unsigned long rcx;
	unsigned long rdx;
	unsigned long rsi;
	unsigned long rdi;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
	unsigned long orig_rax;
/* Return frame for iretq */
	unsigned long rip;
	unsigned long cs;
	unsigned long eflags;
	unsigned long rsp;
	unsigned long ss;
/* top of stack page */
};

没有fs寄存器，但是经过查找，我们发现内核thread_struct结构体里面有：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


// arch/x86/include/asm/processor.h
struct thread_struct {
...省略N行无关代码...
	unsigned long		fsbase;
...省略N行无关代码...
}

// include/linux/sched.h
struct task_struct {
...省略N行无关代码...
	/* CPU-specific state of this task: */
	struct thread_struct		thread;
...省略N行无关代码...
}

对于golang来说，根据go调度，参考golang的runtime包内的代码，我们可以找出fs每次启动一个新线程设置fs寄存器的路径是：

1

startm->newm->newm1->newosproc->clone->settls->[系统调用arch_prctl(ARCH_SET_FS, ...)]

而系统调用arch_prctl，会修改task_struct->thread.fsbase。那这个问题就简单了，我们可以写一个获取goid的函数：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


static u64 get_goid()
{
    struct task_struct *t = (struct task_struct *)bpf_get_current_task();
    void* fsbase = (void*)t->thread.fsbase;
    void* g;
    bpf_probe_read(&g, sizeof(g), fsbase-8);
    u64 goid;
    bpf_probe_read(&goid, sizeof(goid), g+152);
    return goid;
}

其中fsbase-8就是%fs:0xfffffffffffffff8，我们先读取g的指针，然后g+152是goid在g中的偏移。这里有一些风险，go以后的版本里面goid的位置可能不对，需要使用者注意并处理

目前为止，x86-64键的问题我们算是比较好地解决了。总而言之，go里面，一个函数运行过程中可能会切换到其它线程运行，不能以线程id作为键关联开始和结束时间。我们这里选择了goid，并探讨了在x86-64系统中获得goid的方法。

go的栈扩展之殇

虽然实现了goid的获取，的确让人松了一口气，但是革命尚未成功，问题并不止一个。先来一段程序引入这小节的问题：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


package main

import "fmt"

func joke(i int) {
	var arr [2048]int
	arr[i] = i
	fmt.Println(arr)
}

func main() {
	for i := 0; i < 3; i++ {
		joke(i)
	}
}

很简单的程序，简单到编译器都内联了，我们禁止内联：

1

go build -gcflags=-l stack.go

我们先试着跑一下原来的funclatency脚本，试一下测量main.joke函数的延迟：

1

./funclatency '/data/workspace/perf/stack/stack:main.joke'

然后启动我们的stack例子程序：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72


[root@archlinux stack]# ./stack                   
runtime: unexpected return pc for main.joke called from 0x7fffffffe000
stack: frame={sp:0xc00009cf70, fp:0xc00009cf78} stack=[0xc00009c000,0xc00009d000)
000000c00009ce70:  0000000000000000  0000000000000000 
000000c00009ce80:  0000000000000000  0000000000000000 
000000c00009ce90:  0000000000000000  0000000000000000 
000000c00009cea0:  0000000000000000  0000000000000000 
000000c00009ceb0:  0000000000000000  0000000000000000 
000000c00009cec0:  0000000000000000  0000000000000000 
000000c00009ced0:  0000000000000000  0000000000000000 
000000c00009cee0:  0000000000000000  0000000000000000 
000000c00009cef0:  0000000000000000  0000000000000000 
000000c00009cf00:  0000000000000000  0000000000000000 
000000c00009cf10:  0000000000000000  0000000000000000 
000000c00009cf20:  0000000000000000  0000000000000000 
000000c00009cf30:  0000000000000000  0000000000000000 
000000c00009cf40:  0000000000000000  0000000000000000 
000000c00009cf50:  0000000000000000  0000000000000000 
000000c00009cf60:  0000000000000000  0000000000000000 
000000c00009cf70: <00007fffffffe000 >0000000000000000 
000000c00009cf80:  0000000000000000  000000c000042790 
000000c00009cf90:  00000000004298fc <runtime.main+524>  000000c000080000 
000000c00009cfa0:  0000000000000000  000000c000080000 
000000c00009cfb0:  0000000000000000  0000000000000000 
000000c00009cfc0:  0000000000000000  000000c000000180 
000000c00009cfd0:  0000000000000000  0000000000451181 <runtime.goexit+1> 
000000c00009cfe0:  0000000000000000  0000000000000000 
000000c00009cff0:  0000000000000000  0000000000000000 
fatal error: unknown caller pc

runtime stack:
runtime.throw(0x4ba8d3, 0x11)
	/usr/local/go/src/runtime/panic.go:617 +0x72
runtime.gentraceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc000000180, 0x0, 0x0, 0x7fffffff, 0x4bff80, 0x7ffea3d65f78, 0x0, ...)
	/usr/local/go/src/runtime/traceback.go:275 +0x1cd1
runtime.copystack(0xc000000180, 0x1000, 0x7fab4e901001)
	/usr/local/go/src/runtime/stack.go:881 +0x25b
runtime.newstack()
	/usr/local/go/src/runtime/stack.go:1050 +0x2fd
runtime.morestack()
	/usr/local/go/src/runtime/asm_amd64.s:429 +0x8f

goroutine 1 [copystack]:
runtime: unexpected return pc for main.joke called from 0x7fffffffe000
stack: frame={sp:0xc00009cf70, fp:0xc00009cf78} stack=[0xc00009c000,0xc00009d000)
000000c00009ce70:  0000000000000000  0000000000000000 
000000c00009ce80:  0000000000000000  0000000000000000 
000000c00009ce90:  0000000000000000  0000000000000000 
000000c00009cea0:  0000000000000000  0000000000000000 
000000c00009ceb0:  0000000000000000  0000000000000000 
000000c00009cec0:  0000000000000000  0000000000000000 
000000c00009ced0:  0000000000000000  0000000000000000 
000000c00009cee0:  0000000000000000  0000000000000000 
000000c00009cef0:  0000000000000000  0000000000000000 
000000c00009cf00:  0000000000000000  0000000000000000 
000000c00009cf10:  0000000000000000  0000000000000000 
000000c00009cf20:  0000000000000000  0000000000000000 
000000c00009cf30:  0000000000000000  0000000000000000 
000000c00009cf40:  0000000000000000  0000000000000000 
000000c00009cf50:  0000000000000000  0000000000000000 
000000c00009cf60:  0000000000000000  0000000000000000 
000000c00009cf70: <00007fffffffe000 >0000000000000000 
000000c00009cf80:  0000000000000000  000000c000042790 
000000c00009cf90:  00000000004298fc <runtime.main+524>  000000c000080000 
000000c00009cfa0:  0000000000000000  000000c000080000 
000000c00009cfb0:  0000000000000000  0000000000000000 
000000c00009cfc0:  0000000000000000  000000c000000180 
000000c00009cfd0:  0000000000000000  0000000000451181 <runtime.goexit+1> 
000000c00009cfe0:  0000000000000000  0000000000000000 
000000c00009cff0:  0000000000000000  0000000000000000 
main.joke(0x0)
	/data/workspace/perf/stack/stack.go:5 +0x10e fp=0xc00009cf78 sp=0xc00009cf70 pc=0x48739e

这么简单的程序居然崩溃了，但我们只要不用funclatency，这个程序就不会崩溃。这个在生产环境中是无法接受的，也有悖于ebpf程序活体x光的名声，这相当于一个医生诊断一个病人，结果病人被诊断死了。废话少说，赶紧查查。根据go打印的内容，另外我们还知道问题出在go将旧栈复制到新栈的过程中：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


//文件在$GOROOT/src/runtime/stack.go

func copystack(gp *g, newsize uintptr, sync bool) {
...此处省略N行代码...
	// Adjust pointers in the new stack.
	gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, adjustframe, noescape(unsafe.Pointer(&adjinfo)), 0)
...此处省略N行代码...
}

//文件在GOROOT/src/runtime/traceback.go
func gentraceback(pc0, sp0, lr0 uintptr, gp *g, skip int, pcbuf *uintptr, max int, callback func(*stkframe, unsafe.Pointer) bool, v unsafe.Pointer, flags uint) int {
...此处省略N行代码...
			flr = findfunc(frame.lr)
			if !flr.valid() {
				// This happens if you get a profiling interrupt at just the wrong time.
				// In that context it is okay to stop early.
				// But if callback is set, we're doing a garbage collection and must
				// get everything, so crash loudly.
				doPrint := printing
				if doPrint && gp.m.incgo && f.funcID == funcID_sigpanic {
					// We can inject sigpanic
					// calls directly into C code,
					// in which case we'll see a C
					// return PC. Don't complain.
					doPrint = false
				}
				if callback != nil || doPrint {
					print("runtime: unexpected return pc for ", funcname(f), " called from ", hex(frame.lr), "\n")
					tracebackHexdump(gp.stack, &frame, lrPtr)
				}
				if callback != nil {
					throw("unknown caller pc")                          //错误在这里
				}
			}
...此处省略N行代码...
}

copystack通过gentraceback调整栈变量的值，但是findfunc函数找不到uretprobe修改的返回地址代表的函数，也就是上面代码的frame.lr,go打印了这个地址0x7fffffffe000，因此崩溃。所以golang在发生扩展栈操作的时候，使用funclatency脚本测量函数延迟分布是有风险的。有没有更安全的方法，在这种情况下不会导致程序崩溃。

经过一番查找，我们发现issue1320，这里有人提出了一种新的思路，给函数的ret指令设置uprobe来代替uretprobe。这里有两个问题：

不能用于进行了尾调用优化的函数，但是go目前还没；
需要在脚本中找到指令ret，这对使用了变长指令集的x86_64来说，需要一些功夫。

第1个问题庆幸的是不存在，但是需要留意有这么个未来的可能的坑。第2个我们可以借助外部工具如gdb解决。

文章目录

起因

环境

funclatency实现原理探索

go的线程id之困

go的栈扩展之殇