Analysis of the “VM resize revert fails” problem


Author: Hangdong Zhang (张航东)

Version: Kilo 2015.1.1

 

1. Problem

When testing Kilo 2015.1.1, we intermittently hit an error in the resize-revert function. The error eventually put the VM into the “Error” status because of a “VirtualInterfaceCreateException”.

 

The error can be reproduced easily with the following steps:

 

Step 1. Launch 3-5 VMs.

Step 2. Resize these VMs one by one, but do not confirm the resize.

Step 3. Revert them one by one, then repeat Step 2 and Step 3. Some VMs will stay in the “reverting” status and eventually go to “Error”.

 

The following traceback, ending in “VirtualInterfaceCreateException”, then appears in “nova-compute.log”:

  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 298, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 377, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 355, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3868, in finish_revert_resize
    block_device_info, power_on)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6788, in finish_revert_migration
    raise ex
VirtualInterfaceCreateException: Virtual Interface creation failed

 

2. Analysis

2.1 The reason in short

In short, the error occurs because nova waits for an event (network-vif-plugged-xxxx<port id>) from neutron, but neutron never sends it because the port (vif) is inconsistent with its binding host. Nova therefore times out while waiting for the event.

In the normal case, nova does not wait for the event at all.

 

2.2 Resize revert success (Sequence)

 

Above is the sequence diagram of a successful resize revert. The important steps are:

•   Step 1.1.1: vif.active is set to true.

•   Step 1.3.1.1: no event is registered for waiting, because vif.active is not false.

•   Step 1.3.2.1: after libvirt is called to create the VM, there is no event to wait for, so the process keeps going.

•   Step 1.4.1.1: neutron changes the host of the port binding in the DB (neutron.ml2_port_bindings).

For example, if we create a VM on host_A, the VM's port is bound to host_A. If we resize the VM from host_A to host_B and neither confirm nor revert, the VM's port is now bound to host_B. When we then revert the VM, its port is bound to host_A again, and that change is made in this step.
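The rule behind steps 1.1.1 and 1.3.1.1 can be sketched as below. This is a simplified, standalone illustration of selecting events only for VIFs that are explicitly inactive; the function name, the dict layout, and the port IDs are assumptions made for the example, not a copy of the Kilo source.

```python
def get_neutron_events(network_info):
    """Return the (event_name, port_id) pairs nova would wait for.

    Only VIFs explicitly marked inactive are expected to go down -> up,
    so only they produce a 'network-vif-plugged' event to wait on.
    """
    return [("network-vif-plugged", vif["id"])
            for vif in network_info
            if vif.get("active", True) is False]


# Hypothetical port IDs, for illustration only.
active_vif = [{"id": "port-1", "active": True}]
inactive_vif = [{"id": "port-1", "active": False}]

print(get_neutron_events(active_vif))    # success case: nothing to wait for
print(get_neutron_events(inactive_vif))  # failure case: one event to wait for
```

With vif.active set to true the list is empty, so the revert proceeds straight to step 1.4.1.1 without blocking.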

 

2.3 Resize revert failed (Sequence)

Above is the sequence diagram of a failed resize revert. It differs from the successful one in these steps:

•   Step 1.1.1: vif.active is set to false.

•   Step 1.3.1.1: an event named network-vif-plugged-xxxx is registered for waiting.

•   Step 1.3.2.1: after libvirt is called to create the VM, nova blocks and waits for the event (network-vif-plugged-xxxx).

•   Step 2.1.1.1.1: neutron reads the host the port is bound to from the DB (neutron.ml2_port_bindings) and compares it with the host the VM is reverting to. Because the two are inconsistent (the DB is wrong), neutron returns immediately and does not send the event nova is waiting for.

•   Step 1.4.1.1: as described in the “Resize revert success” section, the host of the port binding in the DB is changed here. But this operation is called by “1.4 migrate_instance_finish()”, and migrate_instance_finish() never runs, because nova is blocked waiting for the event.

 

This is how the error arises.

 

The following is the neutron code path in which the event is not sent:

 

PS: “port_host” comes from the DB; “host” is an input parameter that comes from the target host (the host the VM reverts to). They are inconsistent, and it is the information in the DB that is wrong.
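The comparison described above can be sketched as follows. This is a minimal standalone illustration, not the neutron source: the in-memory dict stands in for neutron.ml2_port_bindings, the callback stands in for the notification to nova, and the function names, port ID, and host names are assumptions chosen to mirror the “port_host” and “host” variables mentioned in the PS.

```python
# Stand-in for the neutron.ml2_port_bindings table: port ID -> bound host.
# The port ID and host names are hypothetical.
PORT_BINDINGS = {"port-1": "host_B"}


def port_bound_to_host(port_id, host):
    """Compare the host stored in the DB with the host reported by the agent."""
    port_host = PORT_BINDINGS.get(port_id)   # from the DB
    return port_host == host                 # host: input parameter


def update_device_up(port_id, host, notify_nova):
    """Report a device as up; notify nova only if the binding matches."""
    if host and not port_bound_to_host(port_id, host):
        # Binding mismatch: return early, so no event reaches nova.
        return False
    notify_nova("network-vif-plugged", port_id)
    return True


sent = []
# The agent on host_A (the revert target) reports the port as up,
# but the DB still says host_B, so nothing is sent.
update_device_up("port-1", "host_A", lambda name, pid: sent.append((name, pid)))
print(sent)
```

In this sketch the list of sent events stays empty until the binding in the dict is changed to host_A, which is exactly the update that step 1.4.1.1 would have made.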

 

2.4 Why there is difference between success and failure

According to the analysis above, the main difference between success and failure is the value of vif.active: true on success, false on failure.

The following is the sequence diagram of the vif status changes during resize-revert.

 

Note: The source/target hosts mentioned above are relative to the revert operation. For example, we create a VM on host_A and resize it to host_B (without confirming). When we then revert the VM, host_B is the source host and host_A is the target host.


 

There are 3 processes in the sequence diagram above:

1.    The revert_resize() function on the source host (host_B).

2.    The finish_revert_resize() function on the target host (host_A).

3.    The Linuxbridge neutron agent daemon on the source host (host_B). The daemon polls at an interval of 2 seconds by default, which can be set in “/etc/neutron/plugins/linuxbridge/linuxbridge_conf.ini” on the compute host.

 

Some important steps:

•   Step 1.1: on the source host, the tap device is removed by libvirt.

•   Step 1.2: on the target host, the finish_revert_resize() function runs.

•   Step 1.2.1.1: on the target host, the _build_network_info_model() function gets the vif status through client.list_ports() and then sets vif.active to true or false.

•   Step 2.1: at the same time, on the source host, the Linuxbridge neutron agent daemon detects that the device (port) information has changed (the device was removed) and starts process_network_devices() and treat_devices_removed().

•   Step 2.1.1.1: on the source host, the Linuxbridge neutron agent daemon sets the vif status to DOWN.

 

Normally, step 1.2.1.1 runs before step 2.1.1.1, because the latter is triggered by the daemon at its 2-second interval, so resize-revert succeeds.

Occasionally, however, step 2.1.1.1 runs before step 1.2.1.1, and the error is raised.
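The ordering dependence between the two steps can be made concrete with a small model. This is only an illustrative sketch: the function names and the timestamps (in seconds after the revert starts) are hypothetical, and the real services are not driven by such a function.

```python
def vif_active_seen_by_nova(read_time, agent_down_time):
    """Return the vif.active value nova computes on the target host.

    read_time:       when _build_network_info_model() lists the port
                     (step 1.2.1.1)
    agent_down_time: when the agent on the source host sets the port DOWN
                     (step 2.1.1.1)
    """
    port_status = "ACTIVE"                 # status before the revert starts
    if agent_down_time < read_time:
        port_status = "DOWN"               # the agent got there first
    return port_status == "ACTIVE"


def revert_outcome(read_time, agent_down_time):
    """Map the observed vif.active value to the outcome of the revert."""
    active = vif_active_seen_by_nova(read_time, agent_down_time)
    # vif.active == False makes nova wait for an event that never arrives.
    return "success" if active else "error"


print(revert_outcome(read_time=0.5, agent_down_time=1.8))  # usual ordering
print(revert_outcome(read_time=1.9, agent_down_time=1.8))  # agent polled first
```

Because the agent's poll can land anywhere inside its interval, the second ordering is rare but always possible, which matches the random nature of the failure.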

 

There is also something questionable even in the success case: on the target host, the _build_network_info_model() function gets the vif status as “Active”, but at that moment “Active” is the vif status on the source host.

 

 

 

3.  Solution

3.1 Solution 1 – Set “vif_plugging_is_fatal = false” in nova.conf

At first glance, this may not look like a good way to fix the error.

However, in an NFV scenario, customers may not create new VMs frequently; what they care about most is how to maintain all existing VMs. In that case, resize/migrate/evacuate matter more, and with “vif_plugging_is_fatal = false” we always get an active VM, even if its vif is wrong. I think this is better than a VM in the Error state.
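The effect of the option can be sketched as below. This is a simplified model under stated assumptions: it uses threading.Event with a short timeout in place of nova's own event-waiting machinery, the exception class is a local stand-in for nova's VirtualInterfaceCreateException, and the function name and return strings are invented for the example.

```python
import threading


class VirtualInterfaceCreateException(Exception):
    """Local stand-in for nova's exception of the same name."""


def wait_for_vif_plugged(event, timeout, vif_plugging_is_fatal):
    """Wait for the plug event and decide what a timeout means."""
    if event.wait(timeout):
        return "plugged"
    if vif_plugging_is_fatal:
        raise VirtualInterfaceCreateException(
            "Virtual Interface creation failed")
    # Non-fatal: carry on and let the VM become active anyway.
    return "timed out, continuing"


never_sent = threading.Event()   # neutron never sets this in the failure case
print(wait_for_vif_plugged(never_sent, 0.1, vif_plugging_is_fatal=False))
try:
    wait_for_vif_plugged(never_sent, 0.1, vif_plugging_is_fatal=True)
except VirtualInterfaceCreateException as exc:
    print("VM goes to Error:", exc)
```

Note that in the non-fatal case the revert still waits for the full timeout before continuing, so this setting hides the error rather than removing the delay.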

 

3.2 Solution 2 – Modify code

As the two sequence diagrams above show, nova does nothing in the “setup_networks_on_host()” function (step 1.2 and step 1.2.1 in the sequence diagrams).

We will change this so that the function actually sets up the network and changes the host of the port binding. Later in the process, neutron will then get the correct information (host of port binding) from the DB.
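The intent of this change can be sketched as follows. It is a toy model, not the actual patch: the dict stands in for neutron.ml2_port_bindings, the function signatures and port ID are hypothetical, and in a real deployment the rebinding would presumably be a port update through the Neutron API (setting the port's binding:host_id) rather than a dict assignment.

```python
# Stand-in for neutron.ml2_port_bindings: port ID -> bound host.
# The port ID and host names are hypothetical.
bindings = {"port-1": "host_B"}


def setup_networks_on_host(port_ids, host):
    """Proposed behavior: rebind every port of the instance to `host`."""
    for port_id in port_ids:
        bindings[port_id] = host


def neutron_would_send_event(port_id, agent_host):
    """Neutron sends the event only when the binding matches the agent host."""
    return bindings.get(port_id) == agent_host


print(neutron_would_send_event("port-1", "host_A"))  # before the change
setup_networks_on_host(["port-1"], "host_A")         # early in the revert
print(neutron_would_send_event("port-1", "host_A"))  # after the change
```

Moving the rebinding ahead of the wait means the result no longer depends on whether the agent on the source host has already set the port to DOWN.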

 
