Analysis of the “VM resize revert failure” problem
Author: 张航东
Version: Kilo 2015.1.1
1. Problem
While testing Kilo 2015.1.1, we intermittently hit an error in the resize-revert function. The VM ended up in the “Error” status because of a “VirtualInterfaceCreateException”.
The error is easy to reproduce with the following steps:
Step 1. Launch 3-5 VMs.
Step 2. Resize these VMs one by one, but do not confirm the resize.
Step 3. Revert them one by one, then repeat Step 2 and Step 3. Eventually some VMs stay in the “reverting” status and finally go to “Error”.
The following traceback, ending in “VirtualInterfaceCreateException”, then appears in “nova-compute.log”:
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 298, in decorated_function
return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 377, in decorated_function
return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 355, in decorated_function
kwargs['instance'], e, sys.exc_info())
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function
return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3868, in finish_revert_resize
block_device_info, power_on)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6788, in finish_revert_migration
raise ex
VirtualInterfaceCreateException: Virtual Interface creation failed
2. Analysis
2.1 The reason in short
In short, nova waits for an event (network-vif-plugged-xxxx, where xxxx is the port id) from neutron, but neutron never sends it because the host recorded in the port (vif) binding does not match the host the VM is reverting to. Nova therefore times out waiting for the event.
In the normal case, nova does not wait for this event at all.
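The waiting behaviour summarised above can be modelled with a small sketch. It is not nova's code: the function names, the exception class and the returned strings are hypothetical stand-ins. It only assumes what this analysis states, namely that an event is awaited for each vif whose active flag is false and that a timeout is fatal unless vif_plugging_is_fatal is disabled.

```python
import threading


class VirtualInterfaceCreateError(Exception):
    """Hypothetical stand-in for nova's VirtualInterfaceCreateException."""


def events_to_wait_for(network_info):
    # Only vifs reported as inactive need a network-vif-plugged event.
    return ["network-vif-plugged-%s" % vif["id"]
            for vif in network_info
            if vif.get("active", True) is False]


def start_guest(network_info, received, timeout, is_fatal=True):
    """Model of the wait that follows guest creation.

    received maps an event name to a threading.Event that is set
    when neutron delivers the notification.
    """
    for name in events_to_wait_for(network_info):
        event = received.setdefault(name, threading.Event())
        if not event.wait(timeout):
            if is_fatal:
                raise VirtualInterfaceCreateError(
                    "Virtual Interface creation failed")
            # Non-fatal mode: carry on without confirmation of the vif.
            return "active (vif not confirmed)"
    return "active"
```

With an active vif the list of awaited events is empty and the function returns at once, which corresponds to the successful sequence. With an inactive vif and no notification, the wait expires and the exception is raised.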
2.2 Resize revert success (Sequence)
Above is the sequence diagram of a successful resize revert. The important steps are:
Step 1.1.1: vif.active is set to true.
Step 1.3.1.1: no event is registered for waiting, because vif.active is not false.
Step 1.3.2.1: after libvirt is called to create the VM, there is no event to wait for, so the process keeps going.
Step 1.4.1.1: neutron changes the host of the port binding in the DB (neutron.ml2_port_bindings).
For example, if we create a VM on host_A, the VM's port is bound to host_A. If we resize the VM from host_A to host_B and neither confirm nor revert, the port is now bound to host_B. When we then revert the VM, the port is bound to host_A again, and that change is made in this step.
2.3 Resize revert failed (Sequence)
Above is the sequence diagram of a failed resize revert. It differs in the following steps:
Step 1.1.1: vif.active is set to false.
Step 1.3.1.1: an event named network-vif-plugged-xxxx is registered for waiting.
Step 1.3.2.1: after libvirt is called to create the VM, nova blocks and waits for the event (network-vif-plugged-xxxx).
Step 2.1.1.1.1: neutron reads the host the port is bound to from the DB (neutron.ml2_port_bindings) and compares it with the host the VM is reverting to. The two do not match (the value in the DB is wrong), so neutron returns immediately and does not send the event nova is waiting for.
Step 1.4.1.1: as mentioned in the “Resize revert success” section, the host of the port binding in the DB is changed here. But this operation is called from “1.4 migrate_instance_finish()”, and migrate_instance_finish() cannot run because nova is blocked waiting for the event.
So nova times out and the error is raised.
The following is the neutron code where the event is not sent:
PS: “port_host” comes from the DB; “host” is an input parameter and comes from the target host (the host the VM reverts to). They are inconsistent, and the value in the DB is the wrong one.
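The host comparison described here can be illustrated with a minimal sketch. It is a simplified model written for this note, not neutron's source: the function name, the dictionary standing in for neutron.ml2_port_bindings and the notify callback are all assumptions.

```python
def update_device_up(port_bindings, port_id, host, notify):
    """Hypothetical model of the check made when an agent reports a port up.

    port_bindings stands in for the neutron.ml2_port_bindings table
    (port id -> bound host). notify is called with the event name only
    when the binding matches the reporting host.
    """
    port_host = port_bindings.get(port_id)  # value read from the DB
    if host and port_host != host:
        # Binding points at another host: return without notifying nova.
        return False
    notify("network-vif-plugged-%s" % port_id)
    return True
```

During a failed revert the DB still holds host_B while the report comes from host_A, so the function returns early and the event nova is blocked on is never produced.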
2.4 Why success and failure differ
From the analysis above, the main difference between success and failure is the value of vif.active (true on success, false on failure).
The following is the sequence diagram of the vif status changes during resize-revert.
Note: the source/target hosts mentioned above are relative to the revert operation. For example, if we create a VM on host_A and resize it to host_B (without confirming), then when we revert the VM, host_B is the source host and host_A is the target host.
There are 3 processes in the sequence diagram above:
1. The revert_resize() function on the source host (host_B).
2. The finish_revert_resize() function on the target host (host_A).
3. The Linuxbridge neutron agent daemon on the source host (host_B). The daemon polls at a 2-second interval by default; the interval can be changed (the polling_interval option) in “/etc/neutron/plugins/linuxbridge/linuxbridge_conf.ini” on the compute host.
Some important steps:
Step 1.1: on the source host, the tap device is removed by libvirt.
Step 1.2: on the target host, the finish_revert_resize() function runs.
Step 1.2.1.1: on the target host, the _build_network_info_model() function gets the vif status through client.list_ports() and sets vif.active to true or false accordingly.
Step 2.1: at the same time, on the source host, the Linuxbridge neutron agent daemon detects that the device (port) has been removed and runs process_network_devices() and treat_devices_removed().
Step 2.1.1.1: on the source host, the Linuxbridge neutron agent daemon sets the vif status to DOWN.
Normally step 1.2.1.1 runs before step 2.1.1.1, because the latter is triggered by the daemon's 2-second polling loop, so resize-revert succeeds.
Occasionally, however, step 2.1.1.1 runs before step 1.2.1.1, and then the error is raised.
One thing is still unreasonable even in the success case: the “Active” status that _build_network_info_model() reads on the target host is actually the vif status on the source host.
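The race between step 1.2.1.1 and step 2.1.1.1 can be made concrete with a small timing model. This is only a sketch under simplifying assumptions: the agent is taken to poll at fixed instants, to mark the port DOWN at the first poll at or after the tap removal, and all names and parameters are hypothetical.

```python
import math


def vif_active_seen_by_nova(tap_removed_at, list_ports_at,
                            polling_interval=2.0, agent_phase=0.0):
    """Return the vif.active value nova would read in this model.

    The agent polls at agent_phase + k * polling_interval (k = 0, 1, ...).
    The port goes DOWN at the first poll at or after the tap removal.
    Nova sees the vif as active only if it lists ports before that poll.
    """
    k = math.ceil((tap_removed_at - agent_phase) / polling_interval)
    port_down_at = agent_phase + max(k, 0) * polling_interval
    return list_ports_at < port_down_at


# Tap removed at t=0.5s, nova lists ports at t=1.0s.
# Next agent poll at t=2.0s: nova still sees the vif as active.
usual_case = vif_active_seen_by_nova(0.5, 1.0, agent_phase=0.0)
# Next agent poll at t=0.6s: the port is already DOWN when nova looks.
unlucky_case = vif_active_seen_by_nova(0.5, 1.0, agent_phase=0.6)
```

Whether the revert succeeds depends only on where the tap removal falls relative to the agent's polling loop, which is consistent with the failure appearing at random.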
3. Solution
3.1 Solution 1 – Set “vif_plugging_is_fatal = false” in nova.conf
At first glance, this may not look like a good way to fix the error.
But in an NFV scenario, I expect customers do not create new VMs frequently; what they care about most is maintaining all existing VMs. In that case resize/migrate/evacuate matter more, and with “vif_plugging_is_fatal = false” we always get an active VM even if its vif is wrong. I think that is better than a VM in Error status.
3.2 Solution 2 – Modify code
We can see in the two sequence diagrams above that nova does nothing in the “setup_networks_on_host()” function (step 1.2 and step 1.2.1 in the sequence diagrams).
We will change this so that it actually sets up the network, that is, changes the host of the port binding. Then, later in the process, neutron gets the correct information (the host of the port binding) from the DB.
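The intent of this change can be sketched as follows. This is not the actual patch: the dictionary stands in for the port binding table and both functions are hypothetical models that reuse the step names from this analysis.

```python
def setup_networks_on_host(port_bindings, port_ids, host):
    """Hypothetical version that rebinds the ports instead of doing nothing."""
    for port_id in port_ids:
        port_bindings[port_id] = host


def neutron_would_notify(port_bindings, port_id, host):
    # Model of neutron's check: the event is sent only when the
    # binding recorded in the DB matches the reporting host.
    return port_bindings.get(port_id) == host


# Before the revert, the port is still bound to host_B.
bindings = {"p1": "host_B"}
before = neutron_would_notify(bindings, "p1", "host_A")

# Rebinding early, on the target host, makes the later check pass.
setup_networks_on_host(bindings, ["p1"], "host_A")
after = neutron_would_notify(bindings, "p1", "host_A")
```

Because the binding is corrected before nova starts waiting, the event no longer depends on migrate_instance_finish() running first.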