OpenStack Upgrade Without Downtime
November 5, 2014 
Takashi Natsume, Software Innovation Center, NTT 
natsume.takashi@lab.ntt.co.jp 
Yankai Liu, Canonical 
yankai.liu@canonical.com
Agenda 
● Introduction 
● Live Upgrade Test Strategy and Plan 
○ Pre-upgrade Investigation 
○ Considerations in Creating Upgrade Procedure 
○ Concrete Upgrade Procedure 
○ Testing 
○ Upgrade Test Results and Issues 
● Summary 
Introduction 
Who We Are: 
Takashi Natsume 
Takashi Natsume has been working for NTT Corporation since April 2013.
He is engaged in the system design of public cloud systems based on
OpenStack and in the functional verification of OpenStack.
Previously, he worked on performance analysis and performance
troubleshooting of systems.
Yankai Liu
Yankai Liu is a Cloud Architect at Canonical, responsible for
cloud architecture design and delivery.
He worked with the NTT team to provide consultancy on the upgrade test
project.
OpenStack Upgrade Overview
With OpenStack releases rolling out quickly, upgrading has become one
of the key operational concerns for deployments. An upgrade can be performed
offline or live.
For production deployments, a live upgrade is desirable in order to:
● Minimize or eliminate downtime
● Keep up with the short OpenStack release cycle [1]
● Stay within maintenance support (the maintenance period is short [2])
● Reduce cost compared with an offline upgrade
In this session, we introduce how NTT designed and tested a live
upgrade from Havana to Icehouse, service by service.
The Goal of NTT Cloud Live Upgrade 
No impact on users’ resource usage
● Users can keep using resources (VMs, virtual volumes, virtual networks)
that have already been created or are running, without any interruption
during the live upgrade.
Examples of interruptions: a VM stopping, or network communication being cut off.
● No performance problems that significantly affect users’ resource
utilization.
No impact on users’ API calls
● During the live upgrade, users can use the OpenStack API services as usual,
with:
○ No errors or failures
○ No incorrect results
○ No performance problems that significantly affect users’ operations
Upgrade Environment and Components
•System environment
• Built a test environment based on the architecture of NTT's production
public cloud system
(See the figure on the next slide.)
•Upgrade components
• OpenStack components
• Nova, Cinder, Glance, Neutron, Keystone, Heat
• Non-OpenStack components such as MySQL, RabbitMQ, the load
balancer (ldirectord) and the OS were NOT included.
•Upgrade version
• stable/havana (2013.2.2) to icehouse-1 (Nova: icehouse-3)
System Architecture Built for Upgrade Testing 
Active/Active: processes that do not retain their state 
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts
Processes that receive REST API requests can be blocked by deploying load balancers in front of them. 
OS: Ubuntu Server 12.04 LTS 
Live Upgrade Test Strategy and Plan
NTT Cloud Live Upgrade Test Strategy and Plan 
Overall Strategy 
● A step-by-step (rolling) upgrade is needed for a live upgrade
● OpenStack components of different versions coexist during the upgrade
Live Upgrade Test Plan
1. Pre-upgrade investigation: items that should be considered in
advance
2. Considerations in creating the detailed procedure
3. Concrete upgrade procedure
4. Testing
5. Upgrade test results and issues
Live Upgrade Test Strategy and Plan 
- Pre-upgrade investigation -
Pre-Investigation for Live Upgrade 
A) Database schema 
• In some cases, the OpenStack database schema differs
between the new version and the old version.
• Investigate the DB schema changes before creating the
upgrade plan.
B) Consistency of APIs between components
C) Consistency of APIs within each component
• REST API 
• RPC API 
Live Upgrade Test Strategy and Plan 
- Considerations in Creating Upgrade Procedure -
Considerations in Creating Upgrade Procedure 
•User resources 
• User resources on the hosts to be upgraded need to be migrated to
another host.
The Order of Upgrade
Decide the upgrade order based on RPC API version compatibility
within the component.
(Figure: Process C, Process B and Process A on separate servers, connected by RPC calls. Legend: RPC call, server, process.)
A callee is upgraded before its caller.
In this case, the upgrade is performed in the
order of process A, process B and process C.
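The callee-before-caller rule can be expressed as a topological sort of the RPC call graph. The following is a minimal sketch, assuming the call chain is C calls B and B calls A (consistent with the stated upgrade order); the function name and process labels are hypothetical.

```python
from graphlib import TopologicalSorter


def upgrade_order(rpc_calls):
    """Return processes ordered so that every callee precedes its callers.

    rpc_calls maps each caller to the set of processes it calls over RPC.
    """
    # TopologicalSorter treats the mapped values as predecessors, so callees
    # are emitted before the callers that depend on them.
    return list(TopologicalSorter(rpc_calls).static_order())


# Assumed call chain: C calls B, and B calls A.
calls = {
    "process_c": {"process_b"},
    "process_b": {"process_a"},
    "process_a": set(),
}
print(upgrade_order(calls))  # ['process_a', 'process_b', 'process_c']
```

If two processes call each other, `TopologicalSorter` raises a `CycleError`, which signals that no safe one-at-a-time order exists for that pair.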
Operations Required for Step-by-Step Upgrade
•Blocking requests
• At the load balancer (ldirectord (LVS))
• Disabling the service (nova-compute, cinder-volume)
•Checking for processing in progress
• Check connections at the load balancer
• e.g. glance-api
• Check child processes
• e.g. nova-novncproxy
•If a graceful shutdown function is available, it
should be used.
• Nova: icehouse-1 or later
• Cinder: icehouse-1 or later
• Neutron: icehouse-2 or later
• Heat: havana-3 or later (we fixed a bug in juno-1)
• Glance: not needed in our environment
• Keystone: not needed in our environment
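The block, check, then stop sequence above can be sketched as a small polling helper. This is an illustration only: `count_in_progress` and `stop_service` are hypothetical callables that an operator would supply (for example, one that counts load balancer connections and one that stops the process); they are not part of any OpenStack API.

```python
import time


def drain_then_stop(count_in_progress, stop_service, timeout=300.0,
                    interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Stop a service only after its in-progress work has drained.

    Assumes new requests are already blocked. Returns True if the service
    was stopped, or False if work was still in progress at the timeout.
    """
    deadline = clock() + timeout
    while count_in_progress() > 0:
        if clock() >= deadline:
            return False  # give up and leave the service running
        sleep(interval)
    stop_service()
    return True
```

Returning `False` instead of stopping at the timeout leaves the decision to the operator, which matches the goal of not failing requests that are still being processed.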
Database Schema 
• Change the database schema at the beginning and at the end of the
procedure
• At the beginning
• Add tables, columns and indexes
• At the end
• Drop tables, columns and indexes
• In the current (community) nova live upgrade procedure, all nova-conductors
are upgraded at the same time.
(New-version and old-version nova-conductors do not run at the same
time.)
• Conversion of data formats should be considered
• We did not need to convert data formats in our trial, so this caused no problems.
• Carefully review the code that defines the database schema.
• For example, in nova
• nova/db/sqlalchemy/migrate_repo/versions/*
• Data conversion may be needed in some cases.
• Adding 'triggers' to database tables?
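The split between changes applied at the beginning and changes applied at the end can be sketched as a simple classification. The operation names and the `(operation, target)` tuple format below are assumptions made for illustration; they are not taken from the nova migration scripts.

```python
# Additive changes are safe while old-version services are still running.
ADDITIVE = {"add_table", "add_column", "add_index"}
# Destructive changes must wait until no old-version service remains.
DESTRUCTIVE = {"drop_table", "drop_column", "drop_index"}


def split_schema_changes(changes):
    """Split (operation, target) pairs into beginning and end phases."""
    at_beginning, at_end = [], []
    for operation, target in changes:
        if operation in ADDITIVE:
            at_beginning.append((operation, target))
        elif operation in DESTRUCTIVE:
            at_end.append((operation, target))
        else:
            # Anything else (e.g. a type change) needs manual review.
            raise ValueError(f"unclassified schema change: {operation}")
    return at_beginning, at_end
```

Raising on unclassified operations is deliberate: changes such as data format conversions do not fit either phase automatically and need the manual review described above.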
Database Schema (cont’d) 
• Avoid holding database locks for a long time
• Online schema change tools can be used
• pt-online-schema-change [3]
• oak-online-alter-table [4]
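As an illustration of how such a tool might be driven from a script, the sketch below builds a pt-online-schema-change command line. It uses only the `--alter`, `--dry-run` and `--execute` options and the `D=...,t=...` DSN form; the helper name and the database, table and column in the usage line are hypothetical, and this is not the procedure used in the test.

```python
def pt_osc_command(database, table, alter, execute=False):
    """Build an argument list for pt-online-schema-change.

    Defaults to --dry-run so that nothing is modified unless the caller
    explicitly asks for --execute.
    """
    return [
        "pt-online-schema-change",
        "--alter", alter,
        f"D={database},t={table}",
        "--execute" if execute else "--dry-run",
    ]


# Hypothetical example: add a column to an example table.
print(" ".join(pt_osc_command("exampledb", "example_table", "ADD COLUMN c1 INT")))
```

The list form can be passed to `subprocess.run` without shell quoting issues.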
HA Configuration 
• From the point of view of live upgrade, an Active/Active
configuration is preferable.
• However, some components cannot always be configured as
Active/Active, which forces Active/Standby:
• cinder-volume (depends on the backend)
• Active/Active can be configured by using Ceph
(see the discussion at https://bugs.launchpad.net/cinder/+bug/1280367)
• However, not all drivers support an Active/Active setup:
https://bugs.launchpad.net/cinder/+bug/1322190
• neutron-server (depends on the plugin)
• neutron-l3-agent/neutron-dhcp-agent
• nova-consoleauth
• heat-engine (although support for multiple engines was implemented in
icehouse-2)
HA Configuration (cont’d) 
•Active/Active case (controller)
• Block the node being upgraded at the load balancer
•Active/Standby case
• Switching between Active and Standby causes the expected service
downtime for the component.
Upgrade Procedure by HA Configuration
Active/Active configuration (repeat on each target host)
1. Block requests/connections to the target host
2. Migrate users’ resources
3. Upgrade the host
4. Unblock
No HA (single) (repeat on each target host)
1. Block requests to the target host
2. Migrate users’ resources
3. Upgrade the host
4. Unblock
Active/Standby configuration (repeat on each target host)
1. Upgrade the ‘Standby’ host
2. Block requests to the ‘Active’ host (if possible)
3. Switch Active/Standby
4. Unblock
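The Active/Active and No HA flows share the same per-host loop, which can be sketched as follows. The four step callables are hypothetical placeholders for site-specific operations (load balancer changes, live migration, package upgrade); this is a sketch of the control flow, not the tooling used in the test.

```python
def rolling_upgrade(hosts, block, migrate, upgrade, unblock):
    """Upgrade hosts one at a time: block, migrate, upgrade, unblock.

    Each step is a callable taking the host name. If a step raises, the
    loop stops and the failing host stays blocked for investigation.
    Returns the list of hosts that completed all four steps.
    """
    completed = []
    for host in hosts:
        block(host)     # stop new requests/connections reaching the host
        migrate(host)   # move users' resources to another host
        upgrade(host)   # install and start the new version
        unblock(host)   # let traffic back in
        completed.append(host)
    return completed
```

Leaving a failed host blocked, rather than unblocking it in a cleanup step, avoids sending user requests to a half-upgraded node.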
Live Upgrade Test Strategy and Plan 
- Concrete upgrade procedure -
Overall Upgrade Procedure 
Live Upgrade Test Strategy and Plan 
- Testing -
Create Test Plans, Test Tools and Test Data
•Background workload during the upgrade test
• The background workload (API requests) covered the patterns of calls
between components, and between processes within components, that occur
in our use case.
• Network communication (ping)
• North-South
• East-West
• A VNC console was kept connected during the upgrade test
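One way to turn the ping workload into a measurable result is to compute the longest interruption from timestamped probe results. The sketch below assumes samples are `(timestamp_seconds, reply_received)` pairs sorted by time; this input format and the function name are assumptions, not the test tooling actually used.

```python
def longest_interruption(samples):
    """Return the longest outage, in seconds, seen in ping samples.

    An outage runs from the first lost probe to the next successful one.
    An outage still open at the end of the samples is not counted,
    because its true length is unknown.
    """
    longest = 0.0
    outage_start = None
    for timestamp, reply_received in samples:
        if not reply_received and outage_start is None:
            outage_start = timestamp
        elif reply_received and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest
```

Running one such series for North-South traffic and one for East-West traffic keeps the two kinds of interruption separately visible.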
Build a Test Environment
• Same configuration as the production environment
• HA configuration (Active/Active, Active/Standby) is required.
• To make the upgrade tests repeatable, we built the environment with
Chef so that it could easily be restored to its pre-upgrade state.
Execute (Test) the Procedure
•Evaluation criteria
• No impact on users’ resources
• Users can keep using resources (VMs, virtual volumes, virtual
networks) that have already been created or are running, without any
interruption.
• No performance problems that significantly affect users’ resource
utilization.
• No impact on users’ API calls
• No errors
• No incorrect results
• No performance problems that significantly affect users’ operations
• Operation steps do not take a long time
• Consistency between the records that OpenStack manages and the actual
resources.
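The last criterion, consistency between managed records and actual resources, amounts to comparing two sets of identifiers. A minimal sketch, assuming the caller has already collected the IDs recorded in the database and the IDs actually present on the hosts (how they are collected is outside this example; the names are hypothetical):

```python
def find_inconsistencies(recorded_ids, actual_ids):
    """Compare resource IDs known to OpenStack with those that really exist."""
    recorded, actual = set(recorded_ids), set(actual_ids)
    return {
        # Recorded in the database but not found on any host.
        "missing": sorted(recorded - actual),
        # Present on a host but unknown to the database.
        "orphaned": sorted(actual - recorded),
    }
```

Both lists being empty after the upgrade is the pass condition for this criterion.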
Live Upgrade Test Strategy and Plan 
- Upgrade Test results and issues -
Identified Issues
•Solved issues
• Heat graceful shutdown issue
• The NTT team fixed it in juno-1
• https://bugs.launchpad.net/heat/+bug/1304244
•Remaining issues
• Errors due to Active/Standby switchover
• Volume resource creation failure (ERROR state)
• Errors due to mismatched RPC API major versions
• From nova-compute to nova-consoleauth
• From nova-novncproxy to nova-consoleauth
• Communication interruption (expected to be resolved in Juno)
• neutron-l3-agent
• Setting ‘admin_state_up’ of the neutron-l3-agent to False solves the
‘scheduling’ issue, but communication is interrupted.
• Interruption of the console connection
• VM live migration/nova-novncproxy upgrade
• Fallback is impossible after the DB schema is changed at the beginning
Lessons Learned
•Clean install
• Some source code directories/files should be removed during
upgrade and fallback; otherwise they cause errors.
• Errors occurred when the OpenStack components’ files were simply overwritten, for example:
• AttributeError: type object 'foo' has no attribute 'bar'
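To find leftovers before overwriting an installation, one option is to list files that exist in the installed tree but not in the new release tree. This is a sketch under the assumption that both trees are available as local directories; the function names are hypothetical, and it only reports candidates rather than deleting anything.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def stale_files(installed_root, new_release_root):
    """List files present in the installed tree but absent from the new one."""
    return sorted(list_files(installed_root) - list_files(new_release_root))
```

Compiled files such as `.pyc` will also appear in the result if the new release tree has not been compiled, so the list should be reviewed rather than removed blindly.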
Summary 
● The goal of the upgrade test was to achieve an upgrade without
downtime, but some issues prevented us from upgrading OpenStack
without downtime.
● During our upgrade test, the following service downtime occurred:
○ Network downtime
■ neutron-l3-agent (expected to be fixed in Juno)
■ There is a trade-off between new vRouter creation failures and VM
communication: either a few minutes during which new vRouters
cannot be scheduled, or a few minutes of communication
interruption for some VMs
○ Downtime for some API requests during Active/Standby switchover
■ Neutron server
■ Heat engine
■ Cinder volume
○ Nova instance console connection interruption
■ Users need to reconnect or get the console URL again.
Suggestions for Communities
• Active/Active HA support in cinder-volume drivers
• At present, some drivers for commercial products prevent an
Active/Active configuration
• Consistency of RPC API major versions
• Rolling upgrade across one version has (limited) support in Nova.
• It should be considered in all core projects.
• If OpenStack components use oslo.messaging, errors caused by differing RPC API
major versions might occur during live upgrade.
• Seamless console connection
• Seamless console migration was discussed at the Juno summit [5]
• Consider live upgrade when deprecating REST API versions
• Active/Active HA support for SDN controllers should be considered when
they are integrated into Neutron as plugins
• Although Ceilometer was not in the test scope, it still has gaps in supporting
Active/Active HA
• Graceful shutdown of all services
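The RPC API major version point can be illustrated with the usual major/minor compatibility rule: a server can handle a request when the major versions match and its minor version is at least the requested one. The function below is a sketch of that convention with a hypothetical name, not the oslo.messaging implementation.

```python
def rpc_version_compatible(server_version, requested_version):
    """Check 'major.minor' compatibility between an RPC server and a caller."""
    server_major, server_minor = (int(p) for p in server_version.split(".")[:2])
    req_major, req_minor = (int(p) for p in requested_version.split(".")[:2])
    # A different major version means an incompatible API, regardless of minor.
    return server_major == req_major and server_minor >= req_minor


print(rpc_version_compatible("2.1", "2.0"))  # True
print(rpc_version_compatible("3.0", "2.5"))  # False
```

The second call shows the failure mode reported above: once one side moves to a new major version, calls from a process still on the old major version are rejected, no matter the upgrade order.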
References
•[1] Release Cycle
• https://wiki.openstack.org/wiki/Release_Cycle
•[2] Releases
• https://wiki.openstack.org/wiki/Releases
•[3] Percona Toolkit
• http://www.percona.com/software/percona-toolkit
•[4] openark kit
• http://code.openark.org/forge/openark-kit
•[5] Improve performance of live migration on KVM
• https://etherpad.openstack.org/p/juno-nova-kvm-live-migration