Performance factors revealed by DPDK's snake test

steeven

Blog category: DPDK

Tags: dpdk numa pmd testpmd performance

A snake test usually bounces packets back and forth between ports, creating a heavy, fully loaded path.

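testpmd is the DPDK tool for checking the performance of two directly connected NICs, with the two sides sending traffic at each other. No hardware? (How come you have nothing at all?) We can still play. Linux veth devices come in pairs like entangled particles, I mean virtual NICs. Once created, they need no bridge or anything else: they are natural best buddies...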

# ip link add ep1 type veth peer name ep2
# ifconfig ep1 up; ifconfig ep2 up
Check ifconfig or ip link: have they appeared?

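For installing and running testpmd, see: http://dpdk.org/doc/quick-start
Running more than one testpmd instance requires --no-shconf.
Hugepages do not seem to be released after repeated runs. Running without them (--no-huge) costs little performance.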

# ./testpmd --no-huge -c7 -n3 --vdev="eth_pcap0,iface=ep1" --vdev=eth_pcap1,iface=ep2 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
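testpmd> start tx_first
testpmd> show port stats all
testpmd> show port stats all
Run show port stats all twice. testpmd reports throughput since the last show, so the first call only sets the baseline.
Rx-pps:       418634
Tx-pps:       436095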
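As far as I know, testpmd computes Rx-pps and Tx-pps from how much the packet counters grew since the previous show port stats call. That is why the command is issued twice. Below is a minimal Python sketch of that calculation. It assumes cumulative counters plus a timestamp per snapshot. PortSnapshot and pps_between are hypothetical names, not testpmd code, and the counter values are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class PortSnapshot:
    """Cumulative port counters read at one moment (hypothetical structure)."""
    rx_packets: int
    tx_packets: int
    timestamp_s: float  # seconds from a monotonic clock


def pps_between(prev: PortSnapshot, cur: PortSnapshot) -> tuple[float, float]:
    """Return (rx_pps, tx_pps) over the interval between two snapshots."""
    dt = cur.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("snapshots must be taken at increasing times")
    rx_pps = (cur.rx_packets - prev.rx_packets) / dt
    tx_pps = (cur.tx_packets - prev.tx_packets) / dt
    return rx_pps, tx_pps


# Illustrative counters, not measurements: 1,000,000 packets received in 2 s.
before = PortSnapshot(rx_packets=1_000, tx_packets=1_000, timestamp_s=10.0)
after = PortSnapshot(rx_packets=1_001_000, tx_packets=901_000, timestamp_s=12.0)
print(pps_between(before, after))  # (500000.0, 450000.0)
```

With only one snapshot there is no interval to divide by. This matches the first show port stats call not producing a useful rate.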

Now create another veth pair and run two groups at the same time:
# ip link add ep3 type veth peer name ep4
# ifconfig ep3 up; ifconfig ep4 up
# ./testpmd1 --no-huge --no-shconf -c70 --vdev="eth_pcap2,iface=ep3" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048

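Running both at once gives about the same performance, because the -c coremasks (7 and 70) put the two instances on different cores. Press 1 in top to see per-CPU usage.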

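So how does performance look when the two pairs are chained in series? Previously traffic flowed EP1<->EP2 and EP3<->EP4. Now testpmd forwards between EP2<->EP3 and EP4<->EP1.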
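The chained setup can be checked on paper. A packet transmitted by one veth end is received by its peer. Each two-port testpmd instance sends whatever it receives on one port out of the other. The Python sketch below models only that forwarding rule and follows a packet until it returns to its starting interface. The helper names are hypothetical.

```python
def symmetric_map(pairs):
    """Build a two-way mapping from (a, b) pairs: a -> b and b -> a."""
    mapping = {}
    for a, b in pairs:
        mapping[a] = b
        mapping[b] = a
    return mapping


def trace_snake(veth_pairs, fwd_pairs, start_tx, max_hops=100):
    """Follow one packet transmitted on start_tx until it comes back there.

    veth_pairs: kernel veth pairs; what one end transmits, the peer receives.
    fwd_pairs:  two-port testpmd instances; a packet received on one port is
                sent out of the other (paired forwarding).
    Returns the interfaces that transmit the packet, in order, or None if the
    packet lands on an interface that no testpmd instance polls.
    """
    peer = symmetric_map(veth_pairs)
    fwd = symmetric_map(fwd_pairs)
    path = []
    tx = start_tx
    for _ in range(max_hops):
        path.append(tx)
        rx = peer[tx]          # the veth peer receives the packet
        if rx not in fwd:      # nobody polls this end: the packet is lost
            return None
        tx = fwd[rx]           # testpmd forwards it out of its other port
        if tx == start_tx:
            return path
    raise RuntimeError("no loop found within max_hops")


veth = [("ep1", "ep2"), ("ep3", "ep4")]
print(trace_snake(veth, [("ep2", "ep3"), ("ep4", "ep1")], "ep1"))  # ['ep1', 'ep3']
```

With only the first instance running, trace_snake(veth, [("ep2", "ep3")], "ep3") returns None. This matches the zero pps seen before the second instance is started.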

# ./testpmd --no-huge --no-shconf -c70 --vdev="eth_pcap1,iface=ep2" --vdev=eth_pcap2,iface=ep3 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
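testpmd> start tx_first
testpmd> show port stats all
At this point every pps value is 0. Packets sent out of one side arrive at a veth peer that nothing is connected to yet. Now connect ep4 and ep1 in another window: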
# ./testpmd --no-huge -c7 -n3 --vdev="eth_pcap0,iface=ep1" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
testpmd> start tx_first
testpmd> show port stats all
testpmd> show port stats all
Rx-pps:       433939
Tx-pps:       423428
It runs. Back in the first window, show port stats also reports traffic, so the snake path is now complete.

Here is the question: why does chaining the two pairs barely change performance?!
# lscpu
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
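top shows the testpmd forwarding cores running on CPUs 1-2 and 5-6. According to lscpu, 1 and 5 are on node1 while 2 and 6 are on node0. Each instance therefore straddles both NUMA nodes, and memory efficiency suffers...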
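Mapping cores to nodes can be automated by parsing the NUMA lines of lscpu. Below is a Python sketch that handles both comma lists and ranges such as 0-11. parse_numa_nodes is a hypothetical helper. On a live system the same data is also available in /sys/devices/system/node/node*/cpulist.

```python
import re

LSCPU_SAMPLE = """\
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
"""


def parse_numa_nodes(lscpu_text: str) -> dict[int, int]:
    """Map CPU id -> NUMA node from the 'NUMA nodeN CPU(s):' lines of lscpu."""
    cpu_to_node: dict[int, int] = {}
    for line in lscpu_text.splitlines():
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if not m:
            continue  # skips every other line, including 'NUMA node(s):'
        node = int(m.group(1))
        for part in m.group(2).split(","):
            if "-" in part:  # a range such as "0-11"
                lo, hi = (int(x) for x in part.split("-"))
                cpus = range(lo, hi + 1)
            else:
                cpus = range(int(part), int(part) + 1)
            for cpu in cpus:
                cpu_to_node[cpu] = node
    return cpu_to_node


nodes = parse_numa_nodes(LSCPU_SAMPLE)
# Forwarding cores seen in top: 1-2 for one instance, 5-6 for the other.
for cores in ([1, 2], [5, 6]):
    print(cores, "-> nodes", sorted({nodes[c] for c in cores}))
```

Both instances print nodes [0, 1], meaning each one's forwarding cores sit on different NUMA nodes.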
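OK, change -c to 15. The coremask is a hexadecimal bitmap (0x15 = binary 10101), so the cores actually used are 4, 2 and 0, all on node0. The ep1-ep2 result improves by nearly 50% (Rx from 418634 to 612871 pps, about +46%):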
Rx-pps:       612871
Tx-pps:       597219
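Because -c is a hexadecimal bitmap of cores, each mask in this post can be decoded mechanically. Below is a small Python sketch; coremask_to_cores is a hypothetical helper. As far as I know, EAL uses the lowest core in the mask as the main lcore by default, and testpmd places its forwarding cores on the rest. The node column relies on this machine's even/odd layout from lscpu and is not a general rule.

```python
def coremask_to_cores(mask: str) -> list[int]:
    """Decode an EAL -c coremask (hex, with or without a 0x prefix) to core ids."""
    value = int(mask, 16)  # int() accepts an optional 0x prefix with base 16
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Masks used in this post; on this box even CPUs are node0, odd CPUs node1.
for m in ("7", "70", "15", "2A", "1500", "150000", "0x2aa"):
    cores = coremask_to_cores(m)
    nodes = sorted({c % 2 for c in cores})  # valid only for this even/odd layout
    print(f"-c {m:>7} -> cores {cores}, NUMA node(s) {nodes}")
```

Decoded this way, 15, 1500 and 150000 stay on node0, while 2A lands entirely on node1.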
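Back to the snake test, with coremasks 15 and 2A. 0x2A selects cores 1, 3 and 5, all on node1, so the two instances now sit on different NUMA nodes. Performance is noticeably lower: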
Rx-pps:       339290
Tx-pps:       336334
With coremasks 15 and 1500 (0x1500 selects cores 8, 10 and 12, also on node0), the result is:
Rx-pps:       540867
Tx-pps:       496891
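That is much better than crossing NUMA nodes, but still below a single veth pair: Rx drops by about 12% and Tx by about 1/6 (17%). Next comes a snake through 3 veth pairs, with the third instance on coremask 150000 (cores 16, 18 and 20, still node0). Surprisingly, the result barely changes compared with two pairs: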
Rx-pps:       511881
Tx-pps:       503456
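The relative drops can be recomputed directly from the figures above. Below is a small Python sketch. The dictionary labels are only descriptive, and the numbers are the Rx-pps/Tx-pps values measured in this post.

```python
# Rx-pps / Tx-pps values quoted in this post (all cores on NUMA node0).
RESULTS = {
    "1 pair, -c15": (612871, 597219),
    "2-pair snake, -c15/1500": (540867, 496891),
    "3-pair snake, +150000": (511881, 503456),
}


def change_vs(results: dict, baseline: str) -> dict:
    """Return {name: (rx_change, tx_change)} as fractions of the baseline run."""
    base_rx, base_tx = results[baseline]
    return {
        name: ((rx - base_rx) / base_rx, (tx - base_tx) / base_tx)
        for name, (rx, tx) in results.items()
    }


for name, (d_rx, d_tx) in change_vs(RESULTS, "1 pair, -c15").items():
    print(f"{name:<24} Rx {d_rx:+.1%}  Tx {d_tx:+.1%}")
```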

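Now suppose we run out of CPUs and the third testpmd instance also runs on coremask 1500, sharing cores 8, 10 and 12 with the second instance. The result is dismal: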
Rx-pps:         1334
Tx-pps:         1334
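testpmd forwarding lcores busy-poll their queues at 100% CPU even with no traffic. Two instances pinned to the same cores are therefore time-sliced against each other by the kernel scheduler. A quick pre-flight check before starting several instances is to make sure no two coremasks intersect. Below is a Python sketch of that check, using hypothetical instance names.

```python
from itertools import combinations


def shared_cores(masks: dict[str, int]) -> dict[tuple[str, str], list[int]]:
    """Return, per pair of instances, the cores that both coremasks select."""
    overlaps = {}
    for (name_a, mask_a), (name_b, mask_b) in combinations(masks.items(), 2):
        common = mask_a & mask_b
        if common:
            overlaps[(name_a, name_b)] = [
                bit for bit in range(common.bit_length()) if (common >> bit) & 1
            ]
    return overlaps


# Hypothetical instance names; masks from the 3-pair snake runs above.
good = {"snake1": 0x15, "snake2": 0x1500, "snake3": 0x150000}
bad = {"snake1": 0x15, "snake2": 0x1500, "snake3": 0x1500}
print(shared_cores(good))  # {}
print(shared_cores(bad))   # {('snake2', 'snake3'): [8, 10, 12]}
```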

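These tests show:
1. Avoid passing data across NUMA nodes whenever possible.
2. When data is relayed between pinned cores like pass-the-parcel, total throughput is set by the slowest application in the chain.
3. Do not share cores between polling processes: context switches and scheduling badly hurt performance.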
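Point 2 can be illustrated with a toy relay model. Each stage moves at most a fixed number of packets per tick from its input queue to the next queue, and whatever does not fit into a full queue is dropped. This simplified Python sketch does not model DPDK itself, and simulate_chain and its parameters are hypothetical. It shows delivered throughput settling at the slowest stage's capacity, wherever that stage sits in the chain.

```python
def simulate_chain(capacities, ticks=1000, queue_limit=16):
    """Toy relay: stage i moves up to capacities[i] packets per tick.

    The source always has packets, so the first queue is refilled to
    queue_limit every tick. Packets that do not fit in a full downstream
    queue are dropped, like a full RX ring. Returns delivered packets/tick.
    """
    queues = [0] * len(capacities)  # packets waiting in front of each stage
    delivered = 0
    for _ in range(ticks):
        queues[0] = queue_limit
        # Process stages back to front so a packet moves at most one hop per tick.
        for i in reversed(range(len(capacities))):
            moved = min(queues[i], capacities[i])
            queues[i] -= moved
            if i + 1 < len(capacities):
                queues[i + 1] = min(queue_limit, queues[i + 1] + moved)
            else:
                delivered += moved
    return delivered / ticks


# Same slowest stage in different positions: output tracks the minimum.
for caps in ([6, 5, 3], [3, 6, 6], [6, 2, 6]):
    print(caps, simulate_chain(caps))
```

The small shortfall from the exact minimum comes from the first few ticks, while the pipeline is still filling.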

========================
Create a bridge br0, add ep1, ep3 and ep5 to it, and use testpmd to test ep2-ep4 across this standard Linux bridge to see how much performance drops:
# brctl addbr br0
# brctl addif br0 ep1; brctl addif br0 ep3; brctl addif br0 ep5
# ifconfig br0 up
# ./testpmd --no-huge --no-shconf -c15 --vdev="eth_pcap1,iface=ep2" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048

Rx-pps:       136157
Tx-pps:       128207
From about 600 kpps down to roughly 130 kpps, less than a quarter... I'll try OVS when I have time.

2016-09-26 21:24
Category: Enterprise Architecture

Comment #1: steeven, 2016-10-11

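--no-shconf: allows running more than one instance at the same time
--no-huge: for when hugepages are not configured...
--port-topology=chained: with more than two ports, forward from each port to the next one in turn, forming a ring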
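As I understand them from the testpmd user guide, the three --port-topology modes map each receive port to a transmit port as follows. Below is a Python sketch of that mapping. forwarding_map is hypothetical, not testpmd code. In paired mode it simply rejects an odd port count rather than mirroring whatever testpmd does there.

```python
def forwarding_map(nb_ports: int, topology: str = "paired") -> dict[int, int]:
    """Return {rx_port: tx_port} for the --port-topology modes (sketch).

    paired:  (0,1), (2,3), ...  each port forwards to its partner
    chained: 0->1, 1->2, ..., last->0
    loop:    each port transmits back out of itself
    """
    if topology == "paired":
        if nb_ports % 2:
            raise ValueError("this sketch requires an even port count for paired")
        return {p: p ^ 1 for p in range(nb_ports)}  # flip the lowest bit
    if topology == "chained":
        return {p: (p + 1) % nb_ports for p in range(nb_ports)}
    if topology == "loop":
        return {p: p for p in range(nb_ports)}
    raise ValueError(f"unknown topology: {topology}")


# 4 eth_ring ports with --port-topology=chained: 0 -> 1 -> 2 -> 3 -> 0
print(forwarding_map(4, "chained"))
```

With exactly two ports, chained and paired produce the same mapping. The difference only shows up with three or more ports.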

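Besides pcap, DPDK can also test with ring-based virtual ports that live entirely in memory: 4 forwarding cores and 4 rings. Coremask 0x2aa selects cores 1, 3, 5, 7 and 9, and one of them is the main lcore.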
./testpmd --no-huge -c 0x2aa -n 4 --vdev=eth_ring0 --vdev=eth_ring1 --vdev=eth_ring2 --vdev=eth_ring3 -- -i --total-num-mbufs=2048 --port-topology=chained --nb-cores=4
Rx-pps:    121887976
Tx-pps:    121887976

On the R720 (also with --no-huge):
Rx-pps:    281685659
Tx-pps:    281685659
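In other words, ignoring the NIC and PCIe cost, in-memory ring ports forward about 70 Mpps per core on the R720 (281.7 Mpps / 4 forwarding cores). Only mbuf pointers move, so that exceeds 200 Gbps per core once packets are larger than about 355 bytes. With 64-byte packets it is about 36 Gbps per core...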
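The per-core Gbps figure depends entirely on the assumed packet size. The ring PMD in testpmd's default io mode only passes mbuf pointers, so its pps is roughly independent of packet length. The Python sketch below converts pps to Gbps for a few frame sizes. pps_to_gbps is a hypothetical helper. The optional 20-byte overhead (preamble, SFD and inter-frame gap) only matters when comparing against a physical Ethernet link.

```python
def pps_to_gbps(pps: float, frame_bytes: int, wire_overhead_bytes: int = 0) -> float:
    """Convert a packet rate to Gbit/s for a given frame size.

    wire_overhead_bytes: use 20 (preamble + SFD + inter-frame gap) when
    comparing with a physical Ethernet link; 0 for in-memory ring ports.
    """
    return pps * (frame_bytes + wire_overhead_bytes) * 8 / 1e9


per_core_pps = 281_685_659 / 4  # R720 result above, 4 forwarding cores
for size in (64, 356, 1518):
    print(f"{size:>5} B -> {pps_to_gbps(per_core_pps, size):6.1f} Gbps per core")

# Smallest frame size at which one core passes 200 Gbps at this packet rate.
print(200e9 / (8 * per_core_pps))
```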
