Openstack upgrade without_down_time_20141103r1

OpenStack Upgrade Without Down Time
November 5, 2014
Takashi Natsume, Software Innovation Center, NTT
natsume.takashi@lab.ntt.co.jp
Yankai Liu, Canonical
yankai.liu@canonical.com

Agenda
● Introduction
● Live Upgrade Test Strategy and Plan
○ Pre-upgrade Investigation
○ Considerations in Creating Upgrade Procedure
○ Concrete Upgrade Procedure
○ Testing
○ Upgrade Test Results and Issues
● Summary

Introduction
Who We Are:
Takashi Natsume
Takashi Natsume has been working for NTT Corporation since April 2013.
He is engaged in the system design of public cloud systems based on
OpenStack and in the functional verification of OpenStack.
Previously, he was engaged in performance analysis and performance
troubleshooting for systems.
Yankai Liu
Yankai Liu is a Cloud Architect at Canonical, responsible for
cloud architecture design and delivery.
He worked with the NTT team as a consultant on the upgrade test
project.

OpenStack Upgrade Overview
With OpenStack releases rolling out quickly, upgrading has become one
of the key operational concerns for deployments. An upgrade can be performed
offline or as a live upgrade.
For the production deployments, live upgrade is desired to achieve these
goals:
● Minimal or no down time
● Keep up with the short release cycle of OpenStack [1]
● Stay within maintenance support (the maintenance period is short [2])
● Reduce cost compared to an offline upgrade
In this session, we introduce how NTT designed and tested a live
upgrade from Havana to Icehouse, service by service.

The Goal of NTT Cloud Live Upgrade
No impact on users’ use of resources
● Users can keep using their existing or running resources (VMs, virtual
volumes, virtual networks) without any interruption during
live upgrade.
Examples of interruptions: a VM stop or a network communication interruption
● No performance problem that significantly affects users’ resource
utilization.
No impact on users’ API calls
● During live upgrade, users can use the OpenStack API services as usual
with:
No errors or failures
No incorrect results
No performance problem that significantly affects users’
operations.

Upgrade environment and components
•System environment
• Built a test environment based on the NTT production public cloud
system architecture
(See the figure on the next page.)
•Upgrade components
• OpenStack components
• Nova, Cinder, Glance, Neutron, Keystone, Heat
• Non-OpenStack components such as MySQL, RabbitMQ, the load
balancer (ldirectord) and the OS were NOT included.
•Upgrade version
• stable/havana (2013.2.2) to icehouse-1 (Nova: icehouse-3)

System Architecture Built for Upgrade Testing
Active/Active: processes that do not retain their state
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts
Processes that receive REST API requests can be blocked by deploying load balancers in front of them.
OS: Ubuntu Server 12.04 LTS

Live Upgrade Test Strategy and Plan

NTT Cloud Live Upgrade Test Strategy and Plan
Overall Strategy
● A step-by-step (rolling) upgrade is needed for live upgrade
● OpenStack components of different versions coexist during the upgrade
Live Upgrade Test Plan
1. Pre-upgrade investigation: items that should be considered in
advance
2. Considerations in creating the detailed procedure
3. Concrete upgrade procedure
4. Testing
5. Upgrade Test results and issues

- Pre-upgrade investigation -

Pre-Investigation for Live Upgrade
A) Database schema
• In some cases, the OpenStack database schemas differ
between the new version and the old version.
• Investigate the DB schema changes before creating the
upgrade plans
B) Consistency of APIs between components
C) Consistency of APIs in each component.
• REST API
• RPC API

- Considerations in Creating Upgrade Procedure -

Considerations in Creating Upgrade Procedure
•User resources
• User resources on hosts to be upgraded need to be migrated to
another host.

The order of upgrade
Decide the upgrade order based on RPC API version compatibility
within the component.
[Figure: Process C, Process B and Process A, each on a server, connected by RPC calls]
A caller is upgraded after its callee.
In this case, the upgrade is performed in the
order process A, process B, process C.
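
The "callee before caller" rule amounts to a topological sort of the RPC call graph. The sketch below is only an illustration: the call graph (process C calls process B, which calls process A) is an assumption read off the stated upgrade order, and the `upgrade_order` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed call graph: each process maps to the processes it calls via RPC.
rpc_calls = {
    "process C": {"process B"},
    "process B": {"process A"},
    "process A": set(),
}


def upgrade_order(calls):
    """Return an order in which every callee precedes its callers."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so callees are emitted before the processes that call them.
    return list(TopologicalSorter(calls).static_order())


print(upgrade_order(rpc_calls))
# ['process A', 'process B', 'process C']
```

If the call graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which signals that no strict callee-first order exists and version compatibility must be handled another way.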

Operations Required for Step by Step Upgrade
•Blockade (blocking requests)
• Load balancer (ldirectord (LVS))
• Disable service (nova-compute, cinder-volume)
•Check processing in progress
• Check connections at the load balancer
• e.g. glance-api
• Check child processes
• e.g. nova-novncproxy
•If a graceful shutdown function is available, it
should be used.
• Nova: icehouse-1 or later
• Cinder: icehouse-1 or later
• Neutron: icehouse-2 or later
• Heat: havana-3 or later(We fixed a bug in juno-1)
• Glance: No need in our environment
• Keystone: No need in our environment
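
After blocking requests, the "check processing in progress" step can be sketched as a polling loop that waits until nothing is in flight before the process is stopped. This is a minimal sketch under assumptions: `count_in_progress` is a hypothetical callable supplied by the operator (for example, one that reads the connection count at the load balancer or counts child processes), and the timeout values are arbitrary.

```python
import time


def wait_until_drained(count_in_progress, timeout=300.0, interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no work is in progress.

    Returns True if the count reached zero before the timeout,
    False otherwise. count_in_progress is a hypothetical callable
    returning the number of in-flight connections or child processes.
    """
    deadline = clock() + timeout
    while True:
        if count_in_progress() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with fake readings: three polls, the last one reports zero.
readings = iter([3, 1, 0])
print(wait_until_drained(lambda: next(readings), sleep=lambda s: None))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.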

Database Schema
• Change the database schema at the beginning and
at the end of the procedure
• The beginning
• Add tables, add columns and add indexes
• The end
• Drop tables, delete columns and delete indexes
• In the current nova live upgrade procedure (community), all nova-conductors
are upgraded at the same time.
(New-version and old-version nova-conductors do not run at the same
time.)
• Conversion of data format should be considered
• We did not need to convert the data format in our trial, so this caused no problem.
• Check the code that defines the database schema thoroughly.
• For example, in nova
• nova/db/sqlalchemy/migrate_repo/versions/*
• Data conversion may be needed in some cases.
• Adding 'triggers' in database tables?
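
The split between schema changes made at the beginning (additive) and at the end (destructive) can be sketched as a simple classification. The operation names, the example table and column names, and the `split_schema_changes` helper are all hypothetical; anything that is neither purely additive nor purely destructive is flagged for manual review, since it may need data conversion.

```python
# Hypothetical operation names, used only for this illustration.
ADDITIVE = {"add_table", "add_column", "add_index"}
DESTRUCTIVE = {"drop_table", "drop_column", "drop_index"}


def split_schema_changes(changes):
    """Split (operation, target) pairs into two phases.

    Additive changes run at the beginning of the procedure, so old and
    new code can both work against the schema; destructive changes run
    at the end, once no old code is left.
    """
    before, after = [], []
    for op, target in changes:
        if op in ADDITIVE:
            before.append((op, target))
        elif op in DESTRUCTIVE:
            after.append((op, target))
        else:
            raise ValueError(f"needs manual review: {op} on {target}")
    return before, after


# Example with made-up table and column names.
changes = [
    ("add_column", "example_table.new_col"),
    ("drop_column", "example_table.old_col"),
    ("add_index", "example_table.new_col_idx"),
]
before, after = split_schema_changes(changes)
print(before)
print(after)
```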

Database Schema (cont’d)
• Avoid locking the database for a long time
• Tools such as the following can be used
• pt-online-schema-change[3]
• oak-online-alter-table[4]

HA Configuration
• From the point of view of live upgrade, an Active/Active
configuration is better.
• But in some cases Active/Active cannot
be configured, so Active/Standby is forced.
• cinder-volume (depends on backends)
• Active/Active can be configured by using Ceph
(Refer to the discussion at https://bugs.launchpad.net/cinder/+bug/1280367)
• However, Active/Active setup is not supported by all the drivers.
https://bugs.launchpad.net/cinder/+bug/1322190
• neutron-server(depends on plugin)
• neutron-l3-agent/neutron-dhcp-agent
• nova-consoleauth
• heat-engine (but a multiple-engine function has been implemented in
icehouse-2.)

HA Configuration (cont’d)
•In the Active/Active case (controller)
• At the load balancer, block the node that is being upgraded
•In the Active/Standby case
• When switching Active/Standby, the component has service down time,
as expected.

Upgrade Procedure by HA Configuration
Active/Active configuration
1. Block requests/connections to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host
No HA (Single)
1. Block requests to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host
Active/Standby configuration
1. Upgrade ‘Standby’ host
2. Block requests to ‘Active’ host (if possible)
3. Switch Active/Standby
4. Unblock
Repeat on each target host
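
The three flows on this slide can be written down as step templates that are repeated for each target host. The sketch below is an illustration only: the dictionary keys, the host names and the `plan_upgrade` helper are hypothetical, and for Active/Standby it assumes the hosts are passed in the order in which each one becomes the standby.

```python
# Step templates transcribed from the three flows; the keys and the
# plan_upgrade helper are hypothetical names used for illustration.
FLOWS = {
    "active/active": [
        "block requests/connections to {host}",
        "migrate users' resources from {host}",
        "upgrade {host}",
        "unblock {host}",
    ],
    "single": [
        "block requests to {host}",
        "migrate users' resources from {host}",
        "upgrade {host}",
        "unblock {host}",
    ],
    "active/standby": [
        "upgrade standby host {host}",
        "block requests to the active host (if possible)",
        "switch active/standby",
        "unblock",
    ],
}


def plan_upgrade(ha_mode, hosts):
    """Return the flat list of steps, repeating the flow on each host."""
    if ha_mode not in FLOWS:
        raise ValueError(f"unknown HA mode: {ha_mode}")
    return [step.format(host=host)
            for host in hosts
            for step in FLOWS[ha_mode]]


for step in plan_upgrade("single", ["compute-1", "compute-2"]):
    print(step)
```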

- Concrete upgrade procedure -

System Architecture Built for Upgrade Testing
Active/Active: processes that do not retain their state
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts
Processes that receive REST API requests can be blocked by deploying load balancers in front of them.
OS: Ubuntu Server 12.04 LTS

- Testing -

Create test plans, test tools and test data
•Background workload during upgrade test
• Background workload (API requests) covered the patterns of calls
between components and between processes within components in our
use case.
• Network communication (ping)
• North-South
• East-West
• Keep a VNC console connected during the upgrade test
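
One way to turn the ping workload into a measurable result is to compute the longest communication interruption from the probe log. This is a sketch under assumptions: the sample format (timestamp in seconds, whether a reply was received) and the `longest_interruption` helper are hypothetical, not the tooling used in the test.

```python
def longest_interruption(samples):
    """Return the longest outage in seconds.

    samples is a time-ordered list of (timestamp_seconds, reply_received)
    pairs. An outage runs from the first lost probe to the next
    successful probe (or to the last sample if it never recovers).
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Made-up probe log: one 2-second outage and one 1-second outage.
log = [(0.0, True), (1.0, False), (2.0, False), (3.0, True),
       (4.0, False), (5.0, True)]
print(longest_interruption(log))
```

Running the same measurement for both North-South and East-West traffic gives one number per path to compare against the evaluation criteria.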

Build a test environment
•Build a test environment
• Same configurations as a production environment
• HA configuration(Active/Active, Active/Standby) required.
• To repeat upgrade testing, we built the environment with Chef
so that it can easily be restored to its initial state.

Execute(Test) the procedure
•Evaluation criteria
• No impact on users’ resources
• Users can keep using their existing or running resources (VMs,
virtual volumes, virtual networks) without any
interruption.
• No performance problem that significantly affects users’ resource
utilization.
• No impact on users’ API calls
• No error
• No ‘wrong’ results
• No performance problem that significantly affects users’ operations
• Operation steps do not take a lot of time
• Consistency between the records that OpenStack manages and the actual
resources.
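
The last criterion, consistency between OpenStack's records and the actual resources, can be checked with a set comparison. The sketch assumes the two ID lists have already been collected (for example, from the database and from the hypervisors); the IDs and the `find_inconsistencies` helper are hypothetical.

```python
def find_inconsistencies(recorded_ids, actual_ids):
    """Compare resource IDs in OpenStack's records with those that exist."""
    recorded, actual = set(recorded_ids), set(actual_ids)
    return {
        # Recorded by OpenStack but not found on the hosts.
        "missing": sorted(recorded - actual),
        # Present on the hosts but unknown to OpenStack.
        "unrecorded": sorted(actual - recorded),
    }


# Made-up IDs for illustration.
result = find_inconsistencies(["vm-1", "vm-2", "vm-3"], ["vm-2", "vm-3", "vm-9"])
print(result)
# {'missing': ['vm-1'], 'unrecorded': ['vm-9']}
```

An upgrade run passes this check when both lists are empty.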

- Upgrade Test results and issues -

Identify issues
•Solved issues
• Heat Graceful shutdown issue
• NTT team fixed it in juno-1
• https://bugs.launchpad.net/heat/+bug/1304244
•Remaining issues
• Errors due to Active/Standby switchover
• Volume Resource creation failure(ERROR state)
• Errors due to mismatch of RPC API major versions
• From nova-compute to nova-consoleauth
• From nova-novncproxy to nova-consoleauth
• Communication interruption (expected to be resolved in Juno)
• Neutron-l3-agent
• Changing ‘admin_state_up’ of neutron-l3-agent to False solves
‘scheduling’ issue, but communication interruption occurred.
• Interruption of the console connection
• VM live migration/nova-novncproxy upgrade
• Impossible to fall back after changing the DB schema at the beginning

Lessons learned
•Clean install
• Some source code directories/files should be removed during
upgrade and fallback. Otherwise they cause errors.
• Errors occurred when OpenStack components’ files were simply overwritten.
• AttributeError: type object 'foo' has no attribute 'bar'

Summary
● The goal of the upgrade test was to upgrade without down
time, but some issues prevented us from upgrading
OpenStack without down time.
● During our upgrade test, the following service down time occurred:
○ Network downtime
■ neutron-l3-agent (expected to be fixed in Juno)
● Trade-off between new vRouter creation failure and VM
communication, e.g. a few minutes of downtime for scheduling
new vRouter creation OR a few minutes of communication
interruption for some VMs
○ Downtime of some API requests during the Active/Standby switchover
● Neutron server
● Heat engine
● Cinder volume
○ Nova instance console connection interruption
■ Users need to reconnect or get the console URL again.

Suggestions for communities
• Cinder-volume drivers Active/Active HA support
• Presently some drivers for commercial products prevent
Active/Active from being configured
• Consistency of RPC API major versions
• Rolling upgrade across one version has (limited) support in Nova.
• It should be considered in all core projects.
• If OpenStack components utilize oslo.messaging, errors caused by RPC API
major version difference might occur during live upgrade.
• Seamless console connection
• There was a discussion at the Juno summit on seamless console migration [5]
• Consider live upgrade when deprecating REST API versions
• SDN controller Active/Active HA support should be considered when
integrating into Neutron as a plugin
• Although Ceilometer was not in the test scope, there are still gaps in its support for
Active/Active HA
• Graceful shutdown of all services
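
The point about RPC API major versions can be illustrated with a compatibility check. This is a minimal sketch, not code from oslo.messaging: it assumes the common "major.minor" convention in which a minor bump is backward compatible and a major bump is not, and the `can_handle` and `parse_version` helpers are hypothetical.

```python
def parse_version(version):
    """Parse a 'major.minor' string into a pair of integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def can_handle(server_version, requested_version):
    """Return True if a server can serve a request at the given version.

    Assumed rule: the major versions must match, and the requested
    minor version must not exceed what the server implements.
    """
    server_major, server_minor = parse_version(server_version)
    req_major, req_minor = parse_version(requested_version)
    return server_major == req_major and req_minor <= server_minor


print(can_handle("2.5", "2.3"))  # True: same major, older minor
print(can_handle("2.5", "1.9"))  # False: major version mismatch
```

Under this rule, a caller and a callee at different major versions cannot talk to each other during a rolling upgrade, which matches the errors listed under the remaining issues.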

References
•[1] Release Cycle
• https://wiki.openstack.org/wiki/Release_Cycle
•[2] Releases
• https://wiki.openstack.org/wiki/Releases
•[3] Percona Toolkit
• http://www.percona.com/software/percona-toolkit
•[4] openark kit
• http://code.openark.org/forge/openark-kit
•[5] Improve performance of live migration on KVM
• https://etherpad.openstack.org/p/juno-nova-kvm-live-migration