Tuesday, October 30, 2007

Shanghai Tech Day

Last week I went to Shanghai Tech Day. It was my first time in Shanghai, and my impression is that it is a very... very modern city. It has all the hallmarks of a big modern city: skyscrapers, fashionable girls, terrible traffic, and a subway rush hour.

Here are some photos I took in Shanghai: the Bund, the Oriental Pearl, and the subway.

When I write in English, I tend to lose sight of the main point. Yes, this should be a post about Sun Tech Day, not street photography in Shanghai. My presentation at Shanghai Tech Day was "OpenSolaris networking for developers", a slide deck put together by Nicolas and aimed at software developers.

In this forty-minute presentation I covered SCTP, SDP, Crossbow, kernel sockets, Clearview, and Nemo... that's a lot of topics. Guess which one got the most attention? Not Crossbow, because enough other speakers had already covered that project. The answer is kernel sockets. It seems software developers are very interested in this project and want to know what they can do with it, since they are already familiar with socket applications. I admit this was a bit of a surprise to me; maybe I should add more slides on it for the Beijing Tech Day this week.
 

Monday, July 02, 2007

Presentation at Chinese Academy of Sciences

Last week I gave a presentation to a group of students from the Chinese Academy of Sciences (CAS). It was part of the OpenSolaris programming contest held in China, and it covered Solaris network programming and the FireEngine architecture in Solaris 10. Most of the students had little experience with Solaris, but I believe many of them know a lot about Linux and Windows.

The presentation lasted about two hours, which makes it the longest one I have ever given. During it I gave a demo on STREAMS using the upmod from Sasha's blog. Using truss(1), dtrace(1M), and mdb(1), I showed the students how to push a module onto a stream and pop it off, how to extract data from an mblk, and how flow control works.

The demo caught the students' attention, and there were many questions after it and after the presentation. Most of them were hearing about STREAMS for the first time. (STREAMS is described in some of the famous UNIX books such as APUE and UNP, but they had simply skipped those parts because they don't have a System V-derived UNIX environment.) So most of the questions were about STREAMS and how FireEngine addresses its drawbacks.

I hope the presentation helps more Chinese students become comfortable with Solaris. After all, UNIX is not widely used on the desktop in China.

Monday, May 28, 2007

The Packet Event Framework Open Sourced

A quick question would be "How soon?". Well, right after I get an account on the Sun Download Center, plus the time to upload the big BFU archives and source tarballs. So it will be really soon, unless there is so much traffic on the internet that my cable connection becomes as slow as a 56Kbps modem. Then you will see the files listed here. Of course, you need to follow some instructions here to download and install them on your favorite OpenSolaris distribution, and you need to run a modified version of Iperf or Ping-Pong to see the benchmark results. If you want to dive a little deeper, read the man pages here and try to find and understand the corresponding parts of the source code.

I have been working on PEF for a while, and we have finally achieved this first milestone. It took us a long time to reach this point, and there is still a long way to go. Among all the obstacles we have encountered, performance issues are the nastiest. Even now there are several performance curves pinned up in my cubicle, and they constantly remind me of what we have achieved, where we were, and where we need to go.

So what's next? We will focus on the API part to allow third-party modules to integrate into PEF. In fact, we are seeking input from both inside and outside Sun to help design these APIs. If you have thoughts about PEF and enthusiasm for networking architectures, feel free to drop me a note here or send me an email at eric dot yu dot sun dot com.

Monday, June 05, 2006

A Short Overview of PEF (Packet Event Framework)

The FireEngine networking stack is widely known as a cutting-edge technology shipped in Solaris 10, and a technical white paper is available in the OpenSolaris community. The stack is still moving forward, and the Packet Event Framework is one of those innovations.

PEF Event List

PEF extends the FireEngine packet classification architecture. It allows different protocol processing functions to be linked together as a sequence of events; each connection has its own event list, and each event in the list is executed by the framework. With this optimized event list, we can see improvements in the following areas:
  • Code locality: The current TCP/IP stack contains some very large functions that run the whole TCP state machine, for example tcp_rput_data(). This leads to poor code locality because every packet has to walk through these long functions. With PEF we can split such functions into fine-grained events and alter the event list on the fly.
  • Networking observability: It is easy to insert an event into a PEF event list to trace packets passing through the stack, and inserting and removing such observing events is lightweight.
  • STREAMS interaction: FireEngine switched the Solaris networking stack from a message-passing (STREAMS-style) interface to a function-call (BSD-style) interface. PEF goes a step further: it removes the putnext() in the TCP RX code path and wraps strrput() into a PEF event.

Now that we know enough about PEF events, let's look at a real-life event list from the current project. In the TCP RX code path the current networking stack has the following protocol processing functions: ip_tcp_input(), tcp_rput_data(), and strrput(), which hides behind a putnext(). PEF wraps these three functions into a coarse-grained event list, that is, one event per layer.
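
To make the idea of a per-connection event list more concrete, here is a minimal sketch in C. It is not taken from the PEF source: every name in it (packet_t, event_list_t, event_list_insert() and so on) is hypothetical, and the three "layer" events are empty stand-ins for the real IP, TCP, and stream head processing. It only shows the shape of the technique: a list of function pointers that a small framework loop runs in order, and that can gain or lose a tracing event on the fly.

```c
#include <stddef.h>

/* Hypothetical types; none of these names come from the PEF source. */
typedef struct packet {
        int hops;        /* how many layer events have handled the packet */
        int trace_hits;  /* how many times the tracing event has run */
} packet_t;

typedef void (*pef_event_fn_t)(packet_t *);

#define MAX_EVENTS 8

typedef struct event_list {
        pef_event_fn_t ev[MAX_EVENTS];  /* events, in execution order */
        size_t count;                   /* number of valid entries */
} event_list_t;

/* Stand-ins for the per-layer processing steps (IP, TCP, stream head). */
void ip_event(packet_t *p)      { p->hops++; }
void tcp_event(packet_t *p)     { p->hops++; }
void strhead_event(packet_t *p) { p->hops++; }

/* An observability event that can be inserted and removed at run time. */
void trace_event(packet_t *p)   { p->trace_hits++; }

/* Insert fn at position pos. Returns 0 on success, -1 if full or out of range. */
int
event_list_insert(event_list_t *el, size_t pos, pef_event_fn_t fn)
{
        size_t i;

        if (el->count >= MAX_EVENTS || pos > el->count)
                return (-1);
        for (i = el->count; i > pos; i--)
                el->ev[i] = el->ev[i - 1];
        el->ev[pos] = fn;
        el->count++;
        return (0);
}

/* Remove the event at position pos. Returns 0 on success, -1 if out of range. */
int
event_list_remove(event_list_t *el, size_t pos)
{
        size_t i;

        if (pos >= el->count)
                return (-1);
        for (i = pos; i + 1 < el->count; i++)
                el->ev[i] = el->ev[i + 1];
        el->count--;
        return (0);
}

/* The "framework": run every event in the connection's list, in order. */
void
event_list_run(const event_list_t *el, packet_t *p)
{
        size_t i;

        for (i = 0; i < el->count; i++)
                el->ev[i](p);
}
```

A caller would initialize an event_list_t with the three layer events, call event_list_run() for each packet, and use event_list_insert() with trace_event to start observing the connection without touching the layer functions themselves.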

Network stack parallelism on CMT

CMT introduces multiple cores and multiple hardware threads per CPU core, and it can significantly increase system throughput through parallel processing. However, it has an impact on the current FireEngine networking stack, especially on single-connection throughput. The current stack uses a per-CPU synchronization mechanism called the vertical perimeter (aka squeue), which ensures data locality by processing the same connection on the same CPU whenever possible, so a single connection can make only limited use of parallel processing.

PEF explores the parallelism of the network stack on CMT. The events in a PEF event list are executed one by one, each within an arbitrary vertical perimeter, so we can achieve different parallel models by different means. One idea is to dedicate some network tasks (such as TCP protocol processing) to a given CPU core or hardware thread by assigning the proper vertical perimeter to the PEF event. For example, if we have four events in a PEF event list, and each event has a vertical perimeter bound to a different CPU core, then the functions inside these events will be executed on different CPU cores in a pipeline fashion.
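
As a rough illustration of why a pipeline could help single-connection throughput, the sketch below compares two idealized timing models in C. It rests on assumptions that are not from the PEF project: every event costs exactly one time unit, handing a packet from one core to the next is free, and the function names are made up for this example. Real hardware will behave less favorably, since the handoff costs cache misses and synchronization.

```c
/*
 * Idealized model only: each event takes one time unit and moving a
 * packet between cores costs nothing. Both names are hypothetical.
 */

/* One CPU runs all events for a packet before starting the next packet. */
unsigned long
serial_time(unsigned long packets, unsigned long events)
{
        return (packets * events);
}

/*
 * Each event runs on its own core. The first packet needs `events`
 * time units to pass through the whole list; after that, one packet
 * completes per time unit.
 */
unsigned long
pipelined_time(unsigned long packets, unsigned long events)
{
        if (packets == 0)
                return (0);
        return (events + packets - 1);
}
```

Under these assumptions, 1000 packets through a four-event list take 4000 time units on one CPU and 1003 time units in the pipeline, so the benefit approaches the number of events only when the handoff between cores is cheap.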

Monday, May 29, 2006

Show up again with PEF

It has been a long silence since my last post, so I'd like to give an update here.

I have been with Sun for a year and two months. After fighting with some STREAMS bugs, I'm now working on project PEF (Packet Event Framework). It is an interesting project, and I'm having a lot of fun with the Solaris TCP/IP stack every day.

PEF has a Chinese codename, "FengHuoTai" (烽火台), which is the watchtower on the Great Wall. With FengHuoTai, ancient Chinese warriors could deliver war messages quickly and efficiently. We need to move more packets, faster and more efficiently, and that's why we named the project "FengHuoTai". I should also mention that the I-team of this project is in China, and Brutus is the project leader.

Friday, July 15, 2005

An MDB tip on stack backtraces

When you set a breakpoint at a function entry point in MDB and show the stack backtrace with the $c dcmd, you may find that it is not always correct. Here is a small tip.

You may want to set a breakpoint like this:

# mdb -K
kmdb: target stopped at:
kmdbmod`kaif_enter+7:   popfl
[1]> tcp_zcopy_check:b

When the kernel hits this function, you can check the stack backtrace as follows:

kmdb: stop at ip`tcp_zcopy_check
kmdb: target stopped at:
ip`tcp_zcopy_check:     pushl  %ebp
[1]> $c
ip`tcp_zcopy_check(cc6eb4a0, 2, ffff, 800, 4, cdce8edc)
ip`svr4_optcom_req+0x64e(cc6eb4a0, cd88afc0, cbeca010, fecc4048)
ip`tcp_wput_proto+0x179(cc059e00, cd88afc0, c1942e00)
ip`squeue_enter+0x335(c1942e00, cd88afc0, f68ab44c, cc059e00, 1c)
ip`tcp_wput+0x244(cc6eb4a0, cd88afc0)
putnext+0x298(cc6eb4a0, cd88afc0)
strput+0x19c(cc6e5d00, cd88afc0, 0, c2336b94, 0, 0)
kstrputmsg+0x219(cde5f940, 0, 0, ffffffff, 0, 2c4)
sockfs`sotpi_setsockopt+0x5c6(cc4334f8, ffff, 800, c2336c98, 4)
sockfs`sosendfile64+0x1e6(cc4d57e8, cc4d5200, c2336cd0, c2336e24)
sendvec64+0xfb(cc4d57e8, 8047d98, 1, 8047dac, 4)
sendfilev+0x163()
sys_call+0x1a2()

I don't think this stack backtrace is correct, because tcp_zcopy_check() takes only one argument and svr4_optcom_req() never calls it!

Now let's check the function entry point. Almost every function on x86 begins with the following instructions:

[1]> tcp_zcopy_check::dis
ip`tcp_zcopy_check:             pushl  %ebp
ip`tcp_zcopy_check+1:           movl   %esp,%ebp
ip`tcp_zcopy_check+3:           subl   $0x8,%esp
[...]

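That is the homework every function does. At the entry point, %ebp still points to the caller's stack frame, so the debugger walks the stack as if the caller's frame belonged to the current function: it shows the caller's arguments and skips the caller itself. After executing the above instructions, %esp and %ebp point to the right places in the current stack frame, and then you'll see the correct stack backtrace.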

[1]> ::step over
kmdb: target stopped at:
ip`tcp_zcopy_check+1:   movl   %esp,%ebp
[1]> ::step over
kmdb: target stopped at:
ip`tcp_zcopy_check+3:   subl   $0x8,%esp
[1]> ::step over
kmdb: target stopped at:
ip`tcp_zcopy_check+6:   pushl  %ebx
[1]> $c
ip`tcp_zcopy_check+6(cc05a1c0)
ip`tcp_opt_set+0x276(cc6eb4a0, 2, ffff, 800, 4, cdce8edc)
ip`svr4_optcom_req+0x64e(cc6eb4a0, cd88afc0, cbeca010, fecc4048)
ip`tcp_wput_proto+0x179(cc059e00, cd88afc0, c1942e00)
ip`squeue_enter+0x335(c1942e00, cd88afc0, f68ab44c, cc059e00, 1c)
ip`tcp_wput+0x244(cc6eb4a0, cd88afc0)
putnext+0x298(cc6eb4a0, cd88afc0)
strput+0x19c(cc6e5d00, cd88afc0, 0, c2336b94, 0, 0)
kstrputmsg+0x219(cde5f940, 0, 0, ffffffff, 0, 2c4)
sockfs`sotpi_setsockopt+0x5c6(cc4334f8, ffff, 800, c2336c98, 4)
sockfs`sosendfile64+0x1e6(cc4d57e8, cc4d5200, c2336cd0, c2336e24)
sendvec64+0xfb(cc4d57e8, 8047d98, 1, 8047dac, 4)
sendfilev+0x163()
sys_call+0x1a2()

Now the backtrace is correct!
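
The same effect can be reproduced in a toy model. The C sketch below is only an illustration under simplifying assumptions, not how mdb is implemented: the stack is a small array of words, each function takes a single argument, a "return address" is just the id of the calling function, and all names (cpu_t, call(), prologue(), backtrace()) are made up for this example. FN_A, FN_B, and FN_C play the roles of svr4_optcom_req, tcp_opt_set, and tcp_zcopy_check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy call chain: A calls B, B calls C. FN_NONE marks the end of the stack. */
enum { FN_NONE = 0, FN_A = 1, FN_B = 2, FN_C = 3 };

#define STACK_WORDS 32

typedef struct cpu {
        uint32_t mem[STACK_WORDS];  /* stack memory, one word per slot */
        uint32_t esp;               /* stack pointer (word index) */
        uint32_t ebp;               /* frame pointer (word index) */
        uint32_t pc;                /* id of the function now executing */
} cpu_t;

void push(cpu_t *c, uint32_t v) { c->mem[--c->esp] = v; }

/* Push one argument and the "return address", then jump to the callee. */
void
call(cpu_t *c, uint32_t callee, uint32_t arg)
{
        push(c, arg);
        push(c, c->pc);
        c->pc = callee;
}

/* The function prologue: pushl %ebp; movl %esp,%ebp */
void
prologue(cpu_t *c)
{
        push(c, c->ebp);
        c->ebp = c->esp;
}

/* Stop the machine at the entry of C, before C's prologue has run. */
void
setup_at_entry_of_c(cpu_t *c)
{
        memset(c, 0, sizeof (*c));
        c->esp = STACK_WORDS;
        call(c, FN_A, 0xa);
        prologue(c);
        call(c, FN_B, 0xb);
        prologue(c);
        call(c, FN_C, 0xc);
}

/* Walk the saved-%ebp chain, recording each function and its first argument. */
size_t
backtrace(const cpu_t *c, uint32_t *fns, uint32_t *args, size_t max)
{
        uint32_t pc = c->pc, fp = c->ebp;
        size_t n = 0;

        while (pc != FN_NONE && n < max) {
                fns[n] = pc;
                args[n] = c->mem[fp + 2];  /* 8(%ebp): first argument */
                pc = c->mem[fp + 1];       /* 4(%ebp): return address */
                fp = c->mem[fp];           /* 0(%ebp): caller's saved %ebp */
                n++;
        }
        return (n);
}
```

Calling backtrace() right after setup_at_entry_of_c() yields only two frames, C shown with B's argument and then A, which is the same kind of wrong answer as in the first $c output. Calling prologue() once and walking again yields C, B, and A with their own arguments.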

So the conclusion is: step one more instruction and think more about what you have seen; there will be a reasonable explanation behind it.

Tuesday, June 21, 2005

How to Write a KMOD

This is a re-post, because I deleted the original article by accident.

Writing a KMDB module is slightly different from writing an MDB one, especially when it comes to building the binary, so here I try to illustrate the difference. Here's a simple kmod that contains only one dcmd:

$ cat simple_trace.c
#include <sys/mdb_modapi.h>

static int
simple_trace(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
{
        /* print a greeting; a real dcmd would inspect addr and argv */
        mdb_printf("Hello KMDB\n");
        return DCMD_OK;
}

static const mdb_dcmd_t dcmds[] = {
        {"simple_trace", "wait...", "Hello world", simple_trace },
        { NULL }
};

static const mdb_modinfo_t modinfo = {
        MDB_API_VERSION, dcmds, NULL
};

const mdb_modinfo_t *
_mdb_init(void)
{
        return &modinfo;
}

This dcmd does nothing but print "Hello KMDB" on the console, but it is a good starting point for writing a real kmod.

As you can see, the source code looks no different from that of an MDB module, but the Makefile is different:

$ cat Makefile
OBJS = simple_trace.o
KMODULE = simple_trace

CFLAGS +=  -D_KERNEL -D_KMDB

CC=cc
LD=ld

all: $(OBJS)
        $(LD) -dy -r -Nmisc/kmdbmod -o $(KMODULE) $(OBJS)

.KEEP_STATE:

%.o: %.c
        $(CC) $(CFLAGS) -c $<

clean:
        rm -f $(OBJS)

The -N option adds a NEEDED entry for misc/kmdbmod to the .dynamic section of the binary, which makes it loadable by KMDB. You can now load this module into KMDB:

# mdb -K

Welcome to kmdb
Loaded modules: [ crypto ptm ufs unix krtld sppp nca uhci lofs genunix ip
logindmux usba specfs nfs random sctp ]
[1]> ::load /tmp/simple_trace
Loaded modules: [ simple_trace ]
[0]> ::simple_trace
Hello KMDB
[0]>

Then check what ld did to the binary file:

$ elfdump simple_trace
......
Dynamic Section:  .dynamic
     index  tag            value
       [0]  NEEDED         0x1             misc/kmdbmod
       [1]  FLAGS          0x4             [ TEXTREL ]
       [2]  FLAGS_1        0               0

Tuesday, June 14, 2005

OpenSolaris is alive!

That's great!

Wednesday, June 08, 2005

Hello World in KMDB module

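Recently, while tracing the FireEngine code with KMDB, I kept feeling that something was inconvenient, though I couldn't say exactly what. My first reaction was that I could perhaps write a kmod to simplify some of the more complicated debugging steps. With that goal in mind, I started writing one.

Although there is plenty of material online, Google turned up nothing about kmods! What I could find was only about MDB (and if it isn't on Google, I assume it doesn't exist), so I'm writing down the basic steps for building a kmod here as a small Hello World tutorial.

The code is below. It provides only one very simple dcmd, because my goal is to show how to build a kmod: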

$ cat simple_trace.c
#include <sys/mdb_modapi.h>

static int
simple_trace(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
{
        /* print a greeting; a real dcmd would inspect addr and argv */
        mdb_printf("Hello KMDB\n");
        return DCMD_OK;
}

static const mdb_dcmd_t dcmds[] = {
        {"simple_trace", "wait...", "Hello world", simple_trace },
        { NULL }
};

static const mdb_modinfo_t modinfo = {
        MDB_API_VERSION, dcmds, NULL
};

const mdb_modinfo_t *
_mdb_init(void)
{
        return &modinfo;
}

Build it with the following Makefile:

$ cat Makefile
OBJS = simple_trace.o
KMODULE = simple_trace

CFLAGS += -D_KERNEL -D_KMDB

CC=cc
LD=ld

all: $(OBJS)
        $(LD) -dy -r -Nmisc/kmdbmod -o $(KMODULE) $(OBJS)

.KEEP_STATE:

%.o: %.c
        $(CC) $(CFLAGS) -c $<

clean:
        rm -f $(OBJS)

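A sharp reader will see at a glance that the difference in the end is simple: a kmod gets some special handling in its .dynamic section. The Solaris kernel build does a few other things as well, but they aren't required, so I've skipped them.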

$ elfdump simple_trace
......
Dynamic Section:  .dynamic
     index  tag            value
       [0]  NEEDED         0x1             misc/kmdbmod
       [1]  FLAGS          0x4             [ TEXTREL ]
       [2]  FLAGS_1        0               0

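Sigh, it turned out to be this simple, yet I had to read all the way into the Makefile of the IP kmod to figure it out. Anyway, enough chatter; back to reading intern résumés.

Update: here is the result of running it: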

# mdb -K

Welcome to kmdb
Loaded modules: [ crypto ptm ufs unix krtld sppp nca uhci lofs genunix ip
logindmux usba specfs nfs random sctp ]
[1]> ::load /tmp/simple_trace
Loaded modules: [ simple_trace ]
[0]> ::simple_trace
Hello KMDB
[0]>

Thursday, June 02, 2005

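Tracing TCP Packets with DTrace

DTrace is a very powerful tool provided by Solaris 10. Its manual contains the line "use it, learn it, love it", which shows how much appeal it has for UNIX programmers. Here is a small script I wrote for practice a while ago while studying the FireEngine code. It traces the function calls Solaris makes when a TCP connection is being established, and with small changes it can be used in many other situations.

DTrace's scripting language is called D. Its syntax is very similar to C, but it would be a big mistake to treat the two as the same thing. As with other scripting languages, the first line of the file names the interpreter: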

#! /usr/sbin/dtrace -s

The next line tells DTrace to indent the output to show the function call flow; the syntax looks much like a #pragma in C:

#pragma D option flowindent

BEGIN holds the statements that run before any other probe fires. If you are familiar with awk(1), it is easy to understand.

BEGIN
{
PROTO_TCP = 6;
IP_HEADER = 20;
TH_SYN = 2;
}

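This initializes a few constants: the protocol number for TCP in the IP header, the length of the IP header (assuming no IP options), and the value of the TCP SYN flag bit.

Let me spend a sentence or two on the structure of a DTrace program. A DTrace script can define one or more probe clauses with the structure shown below. The predicate is best understood as a predicate in the sense of predicate logic; the C++ Boost library also has predicates, which are similar in concept but very different in form. For a given probe description, the statements between the braces are executed when the predicate's condition holds.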

probe descriptions
/ predicate /
{
action statements
}

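The probes below start tracing at the entry of ip_rput and stop when ip_rput returns. The predicate is more involved: it checks whether the packet is a TCP packet with the SYN flag set.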

fbt::ip_rput:entry
/((ipha_t *)args[1]->b_rptr)->ipha_protocol == PROTO_TCP &&
(((tcph_t *)&args[1]->b_rptr[IP_HEADER])->th_flags[0] & TH_SYN) == TH_SYN/
{
self->traceme = 1;
printf("%s(%d, %x, %d)\n", probefunc, arg0, arg1, arg2);
}
fbt::ip_rput:return
/self->traceme/
{
self->traceme = 0;
exit(0);
}

fbt:::
/self->traceme/
{}
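
For readers who find the predicate hard to parse, the same check can be sketched in plain C over a raw packet buffer. This is an illustration only: the function name is_tcp_syn() is made up, and like the D script it assumes an IPv4 header of exactly 20 bytes with no options. The byte offsets used (protocol at offset 9 of the IPv4 header, flags at offset 13 of the TCP header) come from the standard header layouts.

```c
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP 6     /* IP protocol number for TCP */
#define IP_HEADER 20    /* IPv4 header length, assuming no IP options */
#define TH_SYN    0x02  /* SYN bit in the TCP flags byte */

/*
 * Return 1 if buf holds an IPv4 packet that carries a TCP segment
 * with the SYN flag set, and 0 otherwise. Hypothetical helper that
 * mirrors the predicate of the ip_rput entry probe.
 */
int
is_tcp_syn(const uint8_t *buf, size_t len)
{
        /* Need the IP header plus the first 14 bytes of the TCP header. */
        if (buf == NULL || len < IP_HEADER + 14)
                return (0);

        /* The protocol field sits at byte offset 9 of the IPv4 header. */
        if (buf[9] != PROTO_TCP)
                return (0);

        /* The flags byte sits at offset 13 of the TCP header. */
        return ((buf[IP_HEADER + 13] & TH_SYN) == TH_SYN);
}
```

Note that a SYN+ACK segment also satisfies this test, just as it satisfies the D predicate, because only the SYN bit is examined.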

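Run this script on a machine, then telnet to it, and you will immediately see the result: nearly the complete function call flow.

It is worth pointing out that, powerful as DTrace is, this script can only trace function calls; for finer-grained tracing... you still need mdb, of course. Next time I have a moment I may write an example of tracing with mdb.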

Tuesday, May 31, 2005

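Officially Starting This Blog

I signed up for this space shortly after joining Sun, but I've left it neglected ever since, which is embarrassing. Starting today, I'll probably post the problems I run into, the techniques I learn, and the lessons I draw from them here. I hope the writing won't be so dull that nobody reads it.

First, a self-introduction: I recently graduated from Beihang University (BUAA) and now work in the Solaris Core Technologies Group at Sun ERI. I'm surrounded by a group of excellent engineers, and I'm working on something very interesting, the Solaris kernel. For me that is a wonderful thing.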

About

yu
