Using lmbench, and the Number and Cost of CPU Context Switches

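I. Introduction
Evaluating a system's performance involves a range of metrics, each with its own test methods and tools. To keep results fair and authoritative, mature commercial benchmark software is usually the first choice. When the goal is only a simple comparison between systems, or between function libraries, a good open-source tool can do the job just as well. This article uses lmbench to give a brief introduction to overall system performance testing.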

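II. The Benchmark Software

lmbench is a simple, portable suite of micro-benchmarks written in ANSI C for UNIX/POSIX systems. Broadly, it measures two key characteristics: latency and bandwidth. It is intended to give system developers insight into the basic cost of key operations.

About the software:

lmbench is an open-source, multi-platform benchmark for evaluating overall system performance. It can test file reads and writes, memory operations, process creation and teardown overhead, networking, and more, and it is simple to run.
Because lmbench runs on many platforms, it can be used to compare systems of the same class and show their relative strengths and weaknesses; by linking against different libraries, it can also compare library performance. More importantly, as open-source software it provides a test framework: a tester who needs more than the stock tests can get there with small changes to the source. For example, it currently measures only process creation, termination, and process-switch overhead, but with some code changes it could measure the same things at thread level.
Download:
www.bitmover.com/lmbench (latest version: 3.0-a9)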

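Main functions of lmbench:

* Bandwidth benchmarks

— Cached file read

— Memory copy

— Memory read

— Memory write

— Pipe

— TCP

* Latency benchmarks

— Context switching

— Networking: connection establishment, pipe, TCP, UDP, and RPC hot potato

— File system creates and deletes

— Process creation

— Signal handling

— System call overhead

— Memory read latency

* Miscellaneous

— Processor clock rate calculation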

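Main features of lmbench:

— Portability test for operating systems

The benchmarks are written in C and are fairly portable (although they build most easily with GCC). This is useful for producing like-for-like comparisons between systems.

— Adaptive tuning

lmbench is useful for provoking a reaction. When results show BloatOS to be four times slower than all of its competitors, resources tend to get allocated to fix the problem.

— Database of results

The results database includes runs from most of the major computer workstation vendors.

— Memory latency results

The memory latency test shows the latency of every (data) cache in the system, such as the level 1, 2, and 3 caches, as well as main memory and TLB miss latency. The cache sizes can also be read off a properly plotted set of results. Hardware people tend to like this one. This benchmark has uncovered bugs in operating system paging policies.

— Context switching results

Many people seem to love context switch numbers. This benchmark makes a point of not quoting only the "in cache" figure. It varies both the number and the size of the processes, and presents the results so that the user can see when the working set no longer fits in the cache. It also gives the real cost of a cold-cache context switch.

— Regression testing

Sun and SGI have used these benchmarks to find and fix performance problems.

Intel used them during development of the P6.

They are used in Linux performance tuning.

— New benchmarks

The source is small, readable, and easy to extend. Its pieces can be recombined in different ways to measure other things, for example a networking measurement built on library routines that handle connection establishment, server shutdown, and so on.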

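III. Testing

I ran two sets of tests: one on my PC and one on the ARM720T-based SEP4020 platform.

(1)     Testing on the PC
Test platform: HP Compaq, Fedora 7, Linux 2.6.21
1.  Make sure a C compiler is installed; install one first if it is not.
2.  Copy the lmbench source archive lmbench-3.0-a9.tgz to /root/test on the Fedora machine and unpack it in place.
3.  cd lmbench-3.0-a9, then type make results on the command line to build and start the tests.
4.  If the build succeeds, a series of prompts appears; the answers configure the run and are saved as a configuration script. Later runs use that script, so the same configuration can be reused for repeated tests. Apart from the memory range to test (for example "MB [default 371]"; on machines with a lot of memory avoid choosing too large a value, or the run will take a very long time) and whether to mail the results, the defaults are fine for almost every prompt.
5.  lmbench runs all the test items according to the configuration file. Under the results directory it creates a subdirectory named after the system type, system name, and operating system type, and stores the result file (system name plus a sequence number) there.
6.  When the run finishes, make see produces the report: it exports the data files under results/i686-pc-linux-gnu/ to the report file results/summary.out, which can then be read to see the results.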

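(2)    Testing on the SEP4020
Test platform: SEP4020 evb1.5, Linux 2.6.16
1.  Make sure the cross compiler arm-linux-gcc is installed on the host; install it first if it is not.
2.  Copy the lmbench source archive lmbench-3.0-a9.tgz to /root/test on the Fedora machine and unpack it in place.
3.  cd lmbench-3.0-a9, then type make CC=arm-linux-gcc OS=arm-linux to build the test programs. When the build finishes, an arm-linux directory appears under /root/test/lmbench-3.0-a9/bin; it contains the test binaries for the target. Because the target platform has no make command, a separate run script is needed. Name it run_all.sh, place it under scripts, and give it this content: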

#!/bin/sh

echo run the lmbench on sep4020 arm-linux

env OS=arm-linux ./config-run

env OS=arm-linux ./results
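4.  Copy the whole lmbench-3.0-a9 directory into the target's NFS root, open the target's serial console, and run ./run_all.sh in /lmbench-3.0-a9/scripts.

If cross compilation produced no errors, a series of prompts appears; the answers configure the run and are saved as a configuration script. Later runs use that script, so the same configuration can be reused for repeated tests. Apart from the memory range to test (for example "MB [default 19]"; on machines with a lot of memory avoid choosing too large a value, or the run will take a very long time) and whether to mail the results, the defaults are fine for almost every prompt.
5.  lmbench runs all the test items according to the configuration file. Under the results directory it creates a subdirectory named after the system type, system name, and operating system type, and stores the result file (system name plus a sequence number) there.
6.  When the run finishes, go to /nfs/lmbench-3.0-a9 in the Fedora 7 virtual machine and type make see to generate the report. It exports the per-platform data files under results/ (for example results/i686-pc-linux-gnu/) to the report file results/summary.out, which can then be read to see the results.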

IV. Test Results and Explanation

make[1]: Entering directory `/nfs/lmbench-3.0-a9/results'

                 L M B E N C H  3 . 0   S U M M A R Y
                 ------------------------------------
                 (Alpha software, do not distribute)

Basic system parameters
------------------------------------------------------------------------------
Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                     pages line   par   load
                                                           bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
192.168.0  Linux 2.6.16               arm-linux   85    60     8 1.0000    1
192.168.0  Linux 2.6.27               arm-linux   86    63    16 1.0000    1
192.168.0  Linux 2.6.16               arm-linux   86    63    16 1.0000    1
192.168.0  Linux 2.6.16               arm-linux   86    63    16 1.0000    1
192.168.0  Linux 2.6.16               arm-linux   86    63    16 1.0000    1
localhost Linux 2.6.21-       i686-pc-linux-gnu 1817     8   128 1.3300    1
localhost Linux 2.6.21-       i686-pc-linux-gnu 1864     8   128 1.2900    1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null      open slct sig  sig  fork exec sh
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
192.168.0  Linux 2.6.16   85 2.04 8.44 187. 2064      21.0 81.2 9655 42.K 63.K
192.168.0  Linux 2.6.27   86 2.69 8.44 266. 5338      20.7 94.7 10.K 44.K 73.K
192.168.0  Linux 2.6.16   86 2.03 8.34 185. 5100      20.7 85.9 9468 63.K 121K
192.168.0  Linux 2.6.16   86 2.03 8.72 185. 19.K      20.7 84.9 9556 53.K 72.K
192.168.0  Linux 2.6.16   86 2.04 8.33 185. 5321      20.7 80.5 9395 42.K 101K
localhost Linux 2.6.21- 1817 1.11 1.26 3.08 5.17 10.2 1.70 2.85 674. 1922 5177
localhost Linux 2.6.21- 1864 1.09 1.26 2.98 5.05 8.94 1.48 3.27 1083 2086 6119

Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host                 OS  intgr intgr  intgr  intgr  intgr
                          bit   add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------ 
192.168.0  Linux 2.6.16   11.6 8.6900   52.1 1489.3  255.9
192.168.0  Linux 2.6.27   11.5 8.5800   52.2 1469.2  252.6
192.168.0  Linux 2.6.16   11.5 8.5400   52.2 1472.0  252.9
192.168.0  Linux 2.6.16   11.5 8.6200   52.0 1472.8  251.9
192.168.0  Linux 2.6.16   11.5 8.6400   52.2 1472.5  254.5
localhost Linux 2.6.21- 0.5600 0.2800 0.2000   20.6   10.9
localhost Linux 2.6.21- 0.6100 0.2700 0.1700   20.0 9.8600

Basic uint64 operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host                 OS int64  int64  int64  int64  int64
                         bit    add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------ 
192.168.0  Linux 2.6.16    23.         691.6 4295.6 3895.0
192.168.0  Linux 2.6.27    23.         685.4 4192.8 4074.3
192.168.0  Linux 2.6.16    23.         683.0 4199.0 4082.1
192.168.0  Linux 2.6.16    23.         680.7 4202.6 4082.9
192.168.0  Linux 2.6.16    23.         686.9 4235.7 4080.3
localhost Linux 2.6.21-  0.690        0.6200   34.5   41.4
localhost Linux 2.6.21-  0.660        0.6100   36.8   40.2

Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host                 OS  float  float  float  float
                         add    mul    div    bogo
--------- ------------- ------ ------ ------ ------ 
192.168.0  Linux 2.6.16 6902.1 7781.9  12.1K  42.2K
192.168.0  Linux 2.6.27 6911.0 6568.4  11.6K  43.0K
192.168.0  Linux 2.6.16 6757.4 7578.5  11.9K  43.5K
192.168.0  Linux 2.6.16 6763.1 7611.3  11.7K  43.5K
192.168.0  Linux 2.6.16 6759.3 7640.4  11.9K  43.5K
localhost Linux 2.6.21- 1.6600 2.7900   21.7   20.6
localhost Linux 2.6.21- 1.6300 2.7200   20.9   20.1

Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host                 OS  double double double double
                         add    mul    div    bogo
--------- ------------- ------  ------ ------ ------ 
192.168.0  Linux 2.6.16 9955.5  10.6K  22.8K  79.8K
192.168.0  Linux 2.6.27 9157.0 9909.4  20.6K  79.4K
192.168.0  Linux 2.6.16 9793.3  10.3K  22.4K  79.8K
192.168.0  Linux 2.6.16 9703.9  10.4K  22.2K  79.9K
192.168.0  Linux 2.6.16 9746.9  10.3K  22.3K  79.7K
localhost Linux 2.6.21- 1.6900 2.7900   21.2   20.6
localhost Linux 2.6.21- 1.6300 2.8800   21.0   20.2

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
192.168.0  Linux 2.6.16  164.8  120.0  311.9  165.3  162.5   165.9   151.1
192.168.0  Linux 2.6.27  247.5  196.1  198.4  238.0  254.9   262.9   291.2
192.168.0  Linux 2.6.16  164.4  118.5  115.2  161.1  156.4   164.4   164.3
192.168.0  Linux 2.6.16  167.2  116.6  119.6  166.9  161.9   171.3   158.1
192.168.0  Linux 2.6.16  172.5  117.4  114.3  161.3  147.6   163.8   127.5
localhost Linux 2.6.21-   11.0   11.6   11.7   15.3   19.2    16.8    25.1
localhost Linux 2.6.21-   10.2   11.4   11.3   14.3   20.9    17.4    26.0

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
192.168.0  Linux 2.6.16 164.8 482.3 925.                             
192.168.0  Linux 2.6.27 247.5 770.7 1069                             
192.168.0  Linux 2.6.16 164.4 477.4 917.                             
192.168.0  Linux 2.6.16 167.2 472.9 926.                             
192.168.0  Linux 2.6.16 172.5 474.9 913.                             
localhost Linux 2.6.21-  11.0  28.3 50.8  45.9  55.2  48.2  59.8 126.
localhost Linux 2.6.21-  10.2  32.1 55.7  36.7  49.2  40.2  53.1 113.

*Remote* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host                 OS   UDP  RPC/  TCP   RPC/ TCP
                               UDP         TCP  conn
--------- ------------- ----- ----- ----- ----- ----
192.168.0  Linux 2.6.16                             
192.168.0  Linux 2.6.27                             
192.168.0  Linux 2.6.16                             
192.168.0  Linux 2.6.16                             
192.168.0  Linux 2.6.16                             
localhost Linux 2.6.21-                             
localhost Linux 2.6.21-

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                        Create Delete Create Delete Latency Fault  Fault  selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
192.168.0  Linux 2.6.16 6410.3 6135.0  37.0K 6896.6  5112.0 3.124    36.8 280.8
192.168.0  Linux 2.6.27  18.9K  71.4K  55.6K  28.6K   16.2K  15.9    54.2 194.3
192.168.0  Linux 2.6.16  22.7K  15.4K 1000.K  47.6K  4926.0 5.213    37.1 284.2
192.168.0  Linux 2.6.16  31.2K  29.4K  41.7K  50.0K  4907.0 1.087    36.0 277.1
192.168.0  Linux 2.6.16  33.3K  25.0K  58.8K 9434.0  5108.0 9.428    37.1 285.6
localhost Linux 2.6.21-  112.0   12.4   88.5  130.8  7413.0 2.360 5.98870 4.635
localhost Linux 2.6.21-   36.1   19.0  181.2  138.4  9006.0 2.134   482.1 4.148

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
192.168.0  Linux 2.6.16 10.2 11.2        13.1   32.8   19.1   17.8 32.8  72.9
192.168.0  Linux 2.6.27 8.96 11.4        12.9   32.6   19.1   17.7 32.7  71.2
192.168.0  Linux 2.6.16 10.2 11.2        13.0   32.8   19.0   17.8 32.7  71.2
192.168.0  Linux 2.6.16 10.2 11.2        12.9   32.9   19.0   17.8 32.9  71.6
192.168.0  Linux 2.6.16 10.2 11.2        12.9   32.9   19.0   17.8 32.7  71.6
localhost Linux 2.6.21- 1153 436. 640. 1742.8 3463.7 1239.0 1116.5 3502 1589.
localhost Linux 2.6.21- 1194 451. 744. 1742.3 3443.5 1217.8 1159.0 3357 1555.

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
--------- -------------   ---   ----   ----    --------    --------    -------
192.168.0  Linux 2.6.16    85   33.6  293.4       296.6       856.8    No L2 cache?
192.168.0  Linux 2.6.27    86   35.2  293.8       309.8       863.1    No L2 cache?
192.168.0  Linux 2.6.16    86   35.4  293.7       310.3       861.4    No L2 cache?
192.168.0  Linux 2.6.16    86   35.4  293.7       309.9       863.6    No L2 cache?
192.168.0  Linux 2.6.16    86   35.4  293.6       308.2       860.2    No L2 cache?
localhost Linux 2.6.21-  1817 1.6620 7.9160        98.6       191.7
localhost Linux 2.6.21-  1864 1.7240 7.7130       104.3       205.4
make[1]: Leaving directory `/nfs/lmbench-3.0-a9/results'

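Key parameters explained:

In the tables above, rows whose host is localhost come from the virtual machine I used, and rows whose host is 192.168.0 come from the SEP4020. The parameters in each category are described below.

(1) Basic system parameters
Tlb pages: number of pages in the TLB (Translation Lookaside Buffer)
Cache line bytes: size of a cache line in bytes
Mem par: memory hierarchy parallelism
Scal load: number of lmbench instances run in parallel
(2) Processor, Processes (processor and process operation times)
Null call: a trivial system call (getting the process ID)
Null I/O: a trivial I/O operation (average of a null read and a null write)
Stat: getting a file's status
Open clos: opening a file and closing it immediately
Slct TCP: select() on TCP descriptors
Sig inst: installing a signal handler
Sig hndl: catching and handling a signal
Fork proc: fork a process that exits immediately
Exec proc: fork, then call execve, then exit
Sh proc: fork, then run a shell, then exit
(3) Basic integer/float/double operations
Omitted.

(4) Context switching (context switch time)
2p/16K: 2 processes passing a token, each working on 16 KB of data

(5) *Local* Communication latencies (local communication latency: data is sent over each mechanism and read back immediately on the same machine)
Pipe: pipe communication
AF UNIX: Unix domain sockets
UDP: UDP
RPC/UDP: RPC over UDP
TCP: TCP
RPC/TCP: RPC over TCP
TCP conn: establishing a TCP connection and closing the descriptor
(6) File & VM system latencies (file and memory latencies)
File Create & Delete: creating and deleting a file
Mmap Latency: memory mapping
Prot Fault: protection fault
Page Fault: page fault
100fd selct: time for select() on 100 file descriptors
(7) *Local* Communication bandwidths (local communication bandwidth)
Pipe: pipe
AF UNIX: Unix domain sockets
TCP: TCP communication
File reread: re-reading a file
Mmap reread: re-reading a memory-mapped file
Bcopy (libc): memory copy using the libc routine
Bcopy (hand): memory copy using a hand-written loop
Mem read: memory read
Mem write: memory write
(8) Memory latencies (memory operation latency)
L1: level-1 cache
L2: level-2 cache
Main Mem: main memory (sequential access)
Rand Mem: random memory access latency
Guesses:
If L1 and L2 are about the same, it shows "No L1 cache?"
If L2 and Main Mem are about the same, it shows "No L2 cache?"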
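
The Guesses rule can be sketched in a few lines of Python. This is only an illustration: the function name `guess_cache` and the 10% tolerance are assumptions made for this example, not the threshold lmbench's own summary script uses. The latencies are copied from the Memory latencies table above.

```python
def guess_cache(l1_ns, l2_ns, main_ns, tolerance=0.10):
    """Return a hint in the style of the Guesses column.

    Two latencies count as "about the same" when they differ by at most
    `tolerance` relative to the larger one. The 10% default is an
    assumption for illustration, not lmbench's real threshold.
    """
    def close(a, b):
        return abs(a - b) <= tolerance * max(a, b)

    if close(l1_ns, l2_ns):
        return "No L1 cache?"
    if close(l2_ns, main_ns):
        return "No L2 cache?"
    return ""


# Latencies in nanoseconds: L1, L2, main memory.
sep4020_hint = guess_cache(33.6, 293.4, 296.6)   # first SEP4020 row
pc_hint = guess_cache(1.6620, 7.9160, 98.6)      # first PC row
print(repr(sep4020_hint), repr(pc_hint))
```

On the SEP4020 rows the "L2" and main-memory latencies are nearly identical, which is why the summary flags a missing L2 cache there, while the PC rows show three clearly separated levels.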

The Number and Cost of CPU Context Switches


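What is a CPU context switch?

Linux today is largely preemptive: the CPU gives each task a slice of service time, and when that time slice expires the current task's state has to be saved and the next task's state loaded. This process is called a context switch. Time slicing makes it possible for many tasks to share one CPU, but saving and restoring state costs performance. So how can we see how many context switches happen, how long they take, and what they cost?

How do you get the number of context switches?

Just run vmstat; the cs column, near the end of the output, shows the number of context switches per second. This is a system-wide figure. To look at a particular process, use pidstat.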

$ vmstat 1 100

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

0 0 88 233484 288756 1784744 0 0 0 23 0 0 4 1 94 0 0

4 0 88 233236 288756 1784752 0 0 0 0 6202 7880 4 1 96 0 0

2 0 88 233360 288756 1784800 0 0 0 112 6277 7612 4 1 95 0 0

0 0 88 232864 288756 1784804 0 0 0 644 5747 6593 6 0 92 2 0
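
As a minimal sketch of reading that column programmatically, the Python below finds `cs` by its header name and collects one value per sample. The helper name `context_switch_counts` is made up for this example, and the sample text is the vmstat output shown above.

```python
def context_switch_counts(vmstat_text):
    """Extract the per-sample `cs` values from vmstat text output."""
    counts = []
    cs_index = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "cs" in fields:
            # Column-name header line: remember where `cs` sits.
            cs_index = fields.index("cs")
        elif (cs_index is not None and len(fields) > cs_index
              and all(f.isdigit() for f in fields)):
            counts.append(int(fields[cs_index]))
    return counts


SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 88 233484 288756 1784744 0 0 0 23 0 0 4 1 94 0 0
4 0 88 233236 288756 1784752 0 0 0 0 6202 7880 4 1 96 0 0
2 0 88 233360 288756 1784800 0 0 0 112 6277 7612 4 1 95 0 0
0 0 88 232864 288756 1784804 0 0 0 644 5747 6593 6 0 92 2 0
"""

counts = context_switch_counts(SAMPLE)
# The first vmstat report is an average since boot, so leave it out.
steady = counts[1:]
print(counts, sum(steady) / len(steady))
```

Looking the column up by name rather than by fixed position keeps the sketch working if a vmstat version adds or removes columns.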

Running pidstat prints CPU statistics for all active processes since system boot:

linux:~ # pidstat

Linux 2.6.32.12-0.7-default (linux) 06/18/12 _x86_64_

11:37:19 PID %usr %system %guest %CPU CPU Command

……

11:37:19 11452 0.00 0.00 0.00 0.00 2 bash

11:37:19 11509 0.00 0.00 0.00 0.00 3 dd

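11:37:19: time at which pidstat sampled the data

PID: process ID

%usr: share of CPU time the process spent in user mode

%system: share of CPU time the process spent in kernel mode

%CPU: total share of CPU time used by the process

CPU: the core the process is running on

Command: the command that started the process

Note: by default pidstat reports statistics accumulated from system boot up to the moment it runs, so a process that is using a lot of CPU right now may still show 0 in the output.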
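
The plain pidstat output above shows CPU usage only; `pidstat -w` is the mode that reports per-process context switches. On Linux the same counters are also exposed in `/proc/<pid>/status` as `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches`. The sketch below parses that file; the helper names and the sample excerpt (including its numbers) are made up for illustration.

```python
import os


def parse_ctxt_switches(status_text):
    """Return (voluntary, nonvoluntary) counts from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.endswith("ctxt_switches"):
            values[key.strip()] = int(rest.split()[0])
    return (values.get("voluntary_ctxt_switches"),
            values.get("nonvoluntary_ctxt_switches"))


def read_ctxt_switches(pid="self"):
    """Read the counters for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())


# Illustrative excerpt in the format of /proc/<pid>/status; values are invented.
SAMPLE = (
    "Name:\tbash\n"
    "Pid:\t11452\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t12\n"
)
sample_counts = parse_ctxt_switches(SAMPLE)
print(sample_counts)

if os.path.exists("/proc/self/status"):
    print(read_ctxt_switches())  # counters of this Python process
```

Sampling these counters twice and dividing the difference by the interval gives a per-second rate for the process.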

 

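Where does the cost of a context switch come from?

When the context-switch rate is too high, the CPU behaves like a porter shuttling between the registers and the run queue: more time goes to switching threads than to the threads doing real work. The direct costs are saving and restoring CPU registers and executing the scheduler's code. The indirect cost comes from data shared between the caches of multiple cores.

What causes a context switch?

On a preemptive operating system there are roughly these causes:

1. The current task's time slice runs out and the scheduler moves on to the next task;

2. The current task blocks on I/O, so the scheduler suspends it and continues with the next task;

3. Several tasks compete for a lock; the current task fails to get it, is suspended by the scheduler, and the next task runs;

4. User code suspends the current task and gives up its CPU time;

5. A hardware interrupt.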
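
The difference between giving up the CPU and being preempted can be observed from inside a process. The sketch below uses Python's standard `resource.getrusage`, whose `ru_nvcsw` and `ru_nivcsw` fields count voluntary and involuntary context switches; the loop sizes are arbitrary, and the actual counts depend on the machine and its load, so no expected output is shown.

```python
import resource
import time


def switch_counts():
    """Return (voluntary, involuntary) context switches of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


before = switch_counts()

for _ in range(20):
    time.sleep(0.001)      # the task gives up the CPU itself (cause 4)
after_sleep = switch_counts()

total = 0
for i in range(2_000_000):
    total += i             # CPU-bound: any switch here is forced on the task
after_spin = switch_counts()

print("voluntary switches during the sleeps:", after_sleep[0] - before[0])
print("involuntary switches during the busy loop:", after_spin[1] - after_sleep[1])
```

On a typical Linux system the sleeps show up as voluntary switches, while the busy loop only accumulates involuntary ones when the scheduler or an interrupt takes the CPU away.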

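How do you measure the time a context switch takes?

I learned about LMbench from Ba Ye's (霸爷) blog (http://blog.yufeng.info/archives/753) and tried it right away in a test environment; results came back quickly. I then installed it on a production machine and ran it there. When it reached the memory tests, the load shot up. Luckily nobody noticed, and restarting the Java process on the machine fixed things. In this area I am purely an application-side developer. Ba Ye says the resulting numbers are something developers of high-performance C code should know by heart; I work in Java, so all I can do is go through the metrics one at a time.

A brief introduction to LMbench

Start with its English tagline: LMbench - Tools for Performance Analysis, a micro-level performance analysis tool.

Official site: http://www.bitmover.com/lmbench/

Download: http://www.bitmover.com/lmbench/lmbench3.tar.gz

What can LMbench do?

It is mainly a set of benchmarks for bandwidth (cached file reads, memory copy, memory reads and writes, pipes, and so on) and latency (context switching, networking, process creation, and so on).

Installing LMbench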

#wget http://www.bitmover.com/lmbench/lmbench3.tar.gz

#tar -zxvf lmbench3.tar.gz

#cd lmbench3

#make

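The build failed partway through; I found the answer on CSDN and reproduce it here.

The error is:

make[2]: *** No rule to make target `../SCCS/s.ChangeSet', needed by `bk.ver'.  Stop.

make[2]: Leaving directory `/home/hero/lmbench3/src'

make[1]: *** [lmbench] Error 2

make[1]: Leaving directory `/home/hero/lmbench3/src'

make: *** [build] Error 2

The fix:

In the lmbench3 directory: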

#mkdir SCCS

#touch ./SCCS/s.ChangeSet

#make

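Interpreting the LMbench results (focusing on thread-switching information this time)

I searched online for a long while and found very little, so I had to fall back on the English documentation under doc.

The test measures context switch time. One context switch includes the time to save the state of one process and restore the state of another.

A typical context-switch benchmark measures only the minimal switch time: it does nothing but switch between processes, with no real work in between.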

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
commonway Linux 2.6.18- 9.2400 4.0200 9.0300 7.5600 8.3800    11.6 6.28000

Times are in microseconds.

How does LMbench measure process-switch time?

The benchmark is a ring of two to twenty processes that are connected with Unix pipes. A token is passed from process to process, forcing context switches. The benchmark measures the time it takes to pass the token two thousand times from process to process. Each hand off of the token has two costs: (a) the context switch, and (b) the cost of passing the token. In order to get just the context switching time, the benchmark first measures the cost of passing the token through a ring of pipes in a single process. This time is defined as the cost of passing the token and is not included in the reported context switch time.

When the processes are larger than the default baseline of "zero" (where zero means just big enough to do the benchmark), the cost of the context switch includes the cost of restoring user level state (cache lines). This is accomplished by having the process allocate an array of data and sum it as a series of integers after receiving the token but before passing the token to the next process. Note that the overhead mentioned above includes the cost of accessing the data but because it is measured in just one address space, the cost is typically the cost with hot caches. So the context switch time does not include anything other than the context switch provided that all the processes fit in the cache. If there are cache misses (as is common), the cost of the context switch includes the cost of those cache misses.

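In short: the benchmark first measures the cost of passing the token on its own, by sending it through a ring of pipes inside a single process. This is defined as the token-passing cost and involves no context switch.

It then passes the token many times around a ring of separate processes, subtracts the token-passing cost, and reports what remains as the context switch time (when there are cache misses, their cost is included in that figure).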
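
The token-passing method can be imitated with a ring of just two processes. The sketch below is not lmbench's code: the function names and the round count are made up for this example, and Python's interpreter overhead makes the absolute numbers far larger than lmbench's C measurements. It needs a Unix-like system with `os.fork`. On a multi-core machine the two processes may run on different cores, in which case the difference reflects wake-up cost rather than a pure context switch.

```python
import os
import time

TOKEN = b"x"


def pipe_overhead(rounds):
    """Per-pass cost of sending the token through a pipe within one process."""
    r, w = os.pipe()
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(w, TOKEN)
        os.read(r, 1)
    elapsed = time.perf_counter() - start
    os.close(r)
    os.close(w)
    return elapsed / rounds


def ring_pass_time(rounds):
    """Per-pass cost of bouncing the token between a parent and a child."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: hand every token straight back, then exit.
        os.close(p2c_w)
        os.close(c2p_r)
        for _ in range(rounds):
            os.read(p2c_r, 1)
            os.write(c2p_w, TOKEN)
        os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, TOKEN)
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(p2c_w)
    os.close(c2p_r)
    # Each round is two hand-offs: parent -> child and child -> parent.
    return elapsed / (2 * rounds)


if __name__ == "__main__":
    rounds = 2000
    overhead = pipe_overhead(rounds)
    per_pass = ring_pass_time(rounds)
    print(f"token passing: {overhead * 1e6:.2f} us")
    print(f"estimated switch cost: {(per_pass - overhead) * 1e6:.2f} us")
```

Subtracting the single-process figure is what isolates the switch itself; without it, the reported time would also include the pipe write and read.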

References:

The blogs of Ba Ye (霸爷) and Zhou Chen (周忱)

http://www.bitmover.com/lmbench/

https://www.usenix.org/legacy/publications/library/proceedings/sd96/full_papers/mcvoy.pdf

http://blog.csdn.net/taozi343805436/article/details/7876087

http://blog.yufeng.info/archives/753

http://rdc.taobao.com/team/jm/archives/1706

http://iamzhongyong.iteye.com/blog/1895728