Apr 20, 2011

Belated Thanks to MySQL Community

Tungsten Replicator won O'Reilly Application of the Year at the 2011 O'Reilly MySQL Conference, together with Percona's XtraBackup.  Giuseppe Maxia also received an award for Community Contributor of the Year.  Having now worked with Giuseppe for almost half a year I know from personal experience his award is truly deserved.  All in all we had a very good week, especially since the replicator award was a complete surprise.

Things were so busy during and after the MySQL conference it was difficult to write a timely thank-you note. I hope it's not too late to thank the committee now for both awards.

More importantly, I would like to thank the MySQL community as a whole.  Replicated data is the lifeblood of MySQL applications.  There is a long history of replication innovation both within the MySQL engineering team and across the wider community.  Working on replication for MySQL is a bit like building cars for the German Autobahn.  If you can compete here you can compete anywhere.

Apr 14, 2011

Settling in at code.google.com

Tungsten Replicator code is now fully open source and published on code.google.com.  Here is our new home in case you do not yet know it:  http://code.google.com/p/tungsten-replicator.  I hope you will visit our new digs and admire the furniture.

The fact that the replicator is now fully open source under GPL V2 is kind of old news, so I would instead like to talk about something else:  our initial experience setting up the replicator project at code.google.com.  In a nutshell, it has been excellent.   There are several things that stand out.
  1. The site is incredibly easy to use.   You can customize the home page, add members, add external links, etc. quickly and without having to resort to help. 
  2. It has everything we need.  The front page is excellent--clean, but with all the information users need to get started.  Useful features like issue trackers and Google Groups are cleanly integrated.
  3. It is very fast.  
  4. So far it seems to have just the right mix of open and closed for our project.  Anybody can post to the groups or log issues, but only committers on the project have write access to code and the ability to move bugs through issue states.
The only problems I have run into personally involve SVN code access.  For example, say you check out using the http rather than https URL as in:
svn co http://tungsten-replicator.googlecode.com/svn/trunk/builder
If you edit something and try to check in you get a message like the following: 
$ svn commit -m "This does not work"
svn: Commit failed (details follow):
svn: Server sent unexpected return value (405 Method Not Allowed) in response to MKACTIVITY request for '/svn/!svn/act/8d94e398-83ba-46f3-aae2-bd10cb707c4b'
svn: Your commit message was left in a temporary file:
svn:    '/home/rhodges/google/tungsten-replicator/builder/svn-commit.tmp'
This message is definitely in the "not helpful" category.  Perhaps it is some sort of defense against evildoers.  However, this might be Subversion behavior and nothing to do with Google.  If you receive such a message, run svn info to check the SVN URL.  If you see http instead of https you have found the cause.  Unfortunately the cure seems to be to check out again properly in another location, copy in your changes, and then commit.

Site credentials are a more insidious problem.  Android phone users need a Google Gmail account to access updates and download apps.  (At least that's true for my provider.)  Browsers like Firefox do not keep accounts properly separated, so you may run into account confusion when you first get started.  On Mac OS X you can get the wrong account in your keychain, which in turn leads to more confusing error messages.  This is a Google problem.  There is a creeping form of web single sign-on using Google, Facebook, and other accounts as identifiers, with unintended side-effects for work and personal interactions.  It makes you wonder what other problems are out there.

But I digress. As far as the project is concerned the issues look pretty minor.  At this point I would recommend code.google.com wholeheartedly for open source projects.

p.s., Tungsten Replicator 2.0.2 is on the way.  More on that in another post.

Apr 2, 2011

O'Reilly Conference Tungsten Talks and Some Welcome Open Source Progress

The O'Reilly MySQL 2011 conference is coming up fast.  It should be a good conference as it covers the increasingly diverse MySQL community and MySQL alternatives very well.   As usual, there are some painful choices about which talks to attend.  I'm doing two talks myself that I hope you have on your list:
  • Curing Replication Deprivation with Tungsten -- A tutorial together with my colleague Ed Archibald.  It covers everything you ever wanted to know about using parallel replication, handling multi-master/multi-source topologies, replicating to PostgreSQL/Oracle, etc.  We will have a short section at the end about how to build full clusters with Tungsten Enterprise.  Giuseppe Maxia is threatening to join and do some of his famous demos.  Apparently doing a tutorial on replication in the morning is not enough to tire him out. 
  • Preparing for the Big Oops: How to Build Disaster Recovery Sites for MySQL -- Survey of methods as well as things to have in mind when building a disaster recovery site.  This will cover everything I can think of, not just Tungsten.  
If you do not get your fill from our tutorial, Ed Archibald will also be doing a talk on Tungsten Enterprise explaining how you can build database-as-a-service offerings using Tungsten.

Finally, the open source news.  We have been working on getting Tungsten Replicator fully open sourced including features like parallel replication, cool multi-master capabilities, and our fast disk log.  Getting there required jumping through a few hoops internally, but I'm happy to say the jumping is done.  As soon as we finish a couple of merges between branches we will post a full copy of Tungsten Replicator 2.0 on our new home at code.google.com under a GPL V2 license.  

It will take us a while to get used to the new bug tracker, wikis, and forums based on Google Groups, but we should be starting to settle in by 7 April when Giuseppe presents Advanced MySQL Replication for the Masses at the MySQL SFO meetup.  Come visit us!  

Mar 30, 2011

Slouching towards Multi-Master Conflict Resolution

This title is not nearly as snappy as Yeats' line from "The Second Coming," but it will have to do.  Conflict resolution has been a holy grail for us on Tungsten--we had people asking for it when MySQL was still MySQL AB, lo these many years ago.   Well, it's finally starting to happen after about a year of work.  

Let's start with a simple multi-master configuration.  To replicate bi-directionally between two masters, you typically run a Tungsten Replicator on each master with two replication services.  The first, a local service, reads that master's log.  The second, a remote service, acts as a slave of the other master.  It looks like the following picture:


One of the big problems in multi-master replication is to avoid loops.  Let's say you update something in the SJC master.  The transaction replicates to NYC and appears in the log.  Then it wants to replicate back to SJC and appear in the log there.  If we don't do anything, the poor transaction will loop forever.

Tungsten solves this problem using a filter named BidiRemoteSlaveFilter that runs on slave pipelines.  It has a simple job, which is to identify where transactions originated and keep local transactions that are returning from the other master from being applied again.  We use a variety of tricks to "tag" SQL in a way that allows us to deduce the origin--the most common are comments added to SQL statements or specially formatted row updates that we add to user transactions.  As long as you set things up properly and don't break some simple rules you can now replicate bi-directionally between 2 or more masters.
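To make the idea concrete, here is a toy Python sketch of origin-based loop suppression.  The event structure and field names are invented for illustration; this is the spirit of the BidiRemoteSlaveFilter, not its actual implementation.

```python
# Toy sketch of origin-based loop suppression (illustrative names only).

def filter_returning_events(events, local_service):
    """Drop events that originated locally and are looping back."""
    kept = []
    for event in events:
        # Each replicated event is assumed to carry a tag naming the
        # service where it was first logged (in practice deduced from
        # SQL comments or specially formatted row updates).
        if event["source_service"] == local_service:
            continue  # a local transaction returning from the other master
        kept.append(event)
    return kept

events = [
    {"seqno": 1, "source_service": "sjc", "sql": "UPDATE t SET x = 1"},
    {"seqno": 2, "source_service": "nyc", "sql": "UPDATE t SET y = 2"},
]
# On the sjc master's remote (slave) service, only the nyc event is applied.
print([e["seqno"] for e in filter_returning_events(events, "sjc")])  # → [2]
```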

This brings us to conflict resolution.  Conflict resolution prevents incompatible transactions from colliding with each other.  The BidiRemoteSlaveFilter does a simple form of conflict resolution by preventing transactions from the local service from looping back and being applied again.  However, it's a very short hop to filters that address application conflicts.

Here's a simple example that is next on my list to implement.  It is not unusual to split customers in multi-tenant applications across sites so that they have active updates on only one site and backup copies on the others.  You could imagine a filter that works off a simple file like the following:
[sjc]
   sharks
   athletics
[nyc]
   mets
   yankees
This tells the filter that transactions with shard IDs sharks and athletics are allowed on the sjc master from the local application.  However, if the nyc master updates these, we will reject the updates when the nyc remote service tries to apply them and generate an error message in the log, or perhaps put them in a special log file for later inspection.  You could even generate a dummy update on the local master that would result in the sjc data being sent back over to correct the nyc DBMS information via replication.
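Here is a minimal Python sketch of such a system-of-record filter, assuming the file format shown above.  The function names and event fields are invented for illustration, not part of Tungsten:

```python
import configparser

# Shard-ownership file following the format in the example above.
OWNERSHIP_FILE = """\
[sjc]
sharks
athletics
[nyc]
mets
yankees
"""

def load_ownership(text):
    # Each section is a site; bare keys are the shards it owns.
    cp = configparser.ConfigParser(allow_no_value=True)
    cp.read_string(text)
    return {site: set(cp[site]) for site in cp.sections()}

def should_apply(shard_id, origin, local_site, ownership):
    """Reject remote updates to shards this site owns."""
    owner = next((s for s, shards in ownership.items() if shard_id in shards),
                 None)
    if owner == local_site and origin != local_site:
        return False  # conflict: another site updated a locally owned shard
    return True

ownership = load_ownership(OWNERSHIP_FILE)
# On sjc, a 'sharks' update arriving from nyc is rejected...
print(should_apply("sharks", "nyc", "sjc", ownership))  # → False
# ...while nyc's own shards replicate in normally.
print(should_apply("mets", "nyc", "sjc", ownership))    # → True
```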

What we have just done is implemented conflict resolution for a system-of-record approach to multi-site data management.   There are many types of conflicts as well as ways to manage them.  Tungsten Replicator filters have a lot of potential to implement other schemes as well.  Filters are pluggable, so there is a convenient escape hatch if you need to do specialized rules of your own.  Meanwhile, there is plenty of scope for Tungsten development to provide useful conflict resolution mechanisms.

The Yeats poem I referred to at the beginning of the article is one of my all-time favorites.  Here are the last two lines:
And what rough beast, its hour come round at last,
Slouches towards Bethlehem to be born?
This could certainly describe a lot of software projects.   However, Tungsten is not like that at all.  We Tungsten engineers wear white lab jackets with pocket protectors and have shiny clipboards to take notes on our breakthroughs.  We rarely slouch.

P.s. Speaking of Tungsten engineers my colleague Giuseppe Maxia and I will be doing a webinar on multi-master replication on Thursday March 31st.  It will be mostly technical with only a small amount of marketing fluff.  As usual Giuseppe has cooked up a cool demo.  Sign up at www.continuent.com if you would like to find out more.

Mar 22, 2011

Parallel Replication Using Shards Is the Only Workable Approach for SQL

There have been a couple of recent blog articles (here and here) asking for parallel replication based on something other than schemas.  These articles both focus on the problem of parallelizing updates within a single MySQL schema.  I read these with great interest, not least because they both mentioned Tungsten (thanks!) and also found that our schema-based parallelization approach is too limited.  It is therefore worth a short article explaining exactly what the Tungsten approach is and why we chose it.

First of all, Tungsten does not exactly use schema-based parallel replication.  Tungsten is actually based on what I call the serialized shard model of replication.  We assign global transaction IDs to all transactions, which means that for any particular set of transactions we can always figure out the correct serialization and apply them in the right order.  This is true even if the transactions travel across independent replication paths or if we have master failover.

Second, we assign a shard ID to all transactions.  Shards are independent streams of transactions that execute correctly when applied by themselves in serial order.  Shards are typically independent, which means transactions in different shards can execute in parallel without deadlocking or corrupting data.  This is the case when each shard contains data for a single customer in a multi-tenant application.  We also have a notion of "critical shards," which are shards that contain global data, such as shared currency rate tables.  Updates in critical shards cause full serialization across all shards.  
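The shard-to-channel mapping and critical-shard serialization can be sketched in a few lines of Python.  The CRC hash and channel count here are assumptions for illustration, not Tungsten's actual algorithm:

```python
import zlib

# Illustrative sketch of the serialized shard model: transactions in the
# same shard always map to the same apply channel, and critical shards
# force full serialization.

CRITICAL_SHARDS = {"currency_rates"}  # shared global data

def assign_channel(shard_id, channels=6):
    if shard_id in CRITICAL_SHARDS:
        return None  # critical: drain all channels, then apply alone
    return zlib.crc32(shard_id.encode()) % channels

# Same shard, same channel: each shard's serial order is preserved while
# distinct shards apply in parallel.
print(assign_channel("customer_42") == assign_channel("customer_42"))  # → True
print(assign_channel("currency_rates"))  # → None (fully serialized)
```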

You can define shards in a variety of ways, but as a practical matter identifying durable shards inside individual MySQL schemas is hard for most applications, especially if there are constraints between tables or you have large transactions.   Many SQL applications tend to make most of their updates to a small number of very large tables, which makes finding stable dividing lines even more difficult.  Schemas are therefore a natural unit of sharding, and Tungsten uses these by default.

Schema-based sharding seems pretty limiting, but for current SQL databases it is really the only approach that works.  Here are some important reasons that give you a flavor of the issues.

* Restart.  To handle failures you need to mark the exact restart point on each replication apply thread or you will either repeat or miss transactions.  This requires precise and repeatable serialization on each thread, which you get with the serialized shard model.

* Deadlocks.  If there are conflicts between updates you will quickly hit deadlocks.  This is especially true because one of the biggest single-thread replication optimizations is block commit, where you commit dozens of successive transactions at once--it can raise throughput by 100% in some cases.  Deadlocks on the other hand can reduce effective throughput to zero in pathological cases.  Shard-based execution avoids deadlocks.

* Ordering.  SQL gives you a lot of ways to shoot yourself in the foot through bad transaction ordering.  You can't write to a table before creating it.  You can't delete a row before it is inserted.  Violating these rules does not just lead to invalid data but also causes errors that stop replication.  The workarounds are either unreliable and slow (conflict resolution) or impractical for most applications (make everything an insert).  To avoid this you need to observe serialization very carefully.

* Throughput.  SQL transactions in real systems vary tremendously in duration, which tends to result in individual long transactions blocking simpler parallelization schemes that use in-memory distribution of updates.  In the Tungsten model we can solve this by letting shard progress vary (by hours potentially), something that is only possible with a well-defined serialization model that deals with dependencies between parallel update streams.  I don't know of another approach that deals with this problem.
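The block commit optimization mentioned in the Deadlocks point above is simple to sketch.  The batch size and the apply/commit callables below are illustrative assumptions, not Tungsten code:

```python
# Rough sketch of block commit: group many replicated transactions into a
# single database commit to amortize commit/fsync overhead.

def apply_with_block_commit(transactions, apply_fn, commit_fn, block_size=50):
    pending = 0
    for tx in transactions:
        apply_fn(tx)
        pending += 1
        if pending >= block_size:
            commit_fn()  # one commit covers a whole block of transactions
            pending = 0
    if pending:
        commit_fn()  # flush the final partial block

applied, commits = [], []
apply_with_block_commit(range(120), applied.append, lambda: commits.append(1))
print(len(commits))  # → 3 commits instead of 120 one-per-transaction commits
```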

If you mess up the solution to any of the foregoing problems, chances are good you will irreparably corrupt data, which leads to replication going completely off the rails.  Then you reprovision your slave(s).  The databases that most need parallel replication are very large, so this is a multi-hour or even multi-day process.  It makes for unpleasant calls with customers when you tell them they need to do this.

I don't spend a lot of time worrying that Tungsten parallel replication is not well suited to the single big schema problem.  So far, the only ways I can think of making it work scalably require major changes to the DBMS or the applications that use it.  In many cases your least costly alternative may be to use SSDs to boost slave I/O performance.

My concerns about Tungsten's model lie in a different area.  The serialized shard model is theoretically sound--it has essentially the same semantics as causally dependent messaging in distributed systems.  However, if we fail to identify shards correctly (and don't know we failed) we will have crashes and corrupt data.  I want Tungsten either to work properly or tell users it won't work and degrade gracefully to full serialization.  If we can't do one of these two for every conceivable sequence of transactions that's a serious problem.

So, to get back to my original point, serialized shards are the best model for parallel replication in SQL databases as we find them today.  I suspect if you look at some of the other incipient designs for parallel replication on MySQL you will find that they follow this model in the end if not at first.  In fact, I would think the next step is to add MySQL features that make sharded replication more effective.  The Drizzle team seems to be thinking along these lines already.

Mar 20, 2011

Tuning Tungsten Parallel Replication Performance

Last month my colleague Giuseppe Maxia described how to operate Tungsten parallel replication.  Since then we have been doing a good bit of benchmarking on both synthetic and real production loads.  In this article I would like to follow up with some tips on how you can goose up parallel replication performance.  These apply to Tungsten Replicator 2.0.1, which you can find here.

The first way to get good performance with Tungsten is to have the right workload. As explained in an earlier article on this blog, Tungsten parallel replication works by replicating independent databases (aka shards) in parallel.  Here is a picture that summarizes what is going on.


If you have a lot of schemas, if the updates are distributed evenly across schemas, and if you don't have many dependencies between schemas that require full serialization, parallel replication can speed things up significantly for I/O-bound workloads.  For example, Tungsten runs three times faster than MySQL native replication on large datasets when the slave is catching up to the master following mysqld restart. 

Catch-up is a famous slave lag case and one where Tungsten can be quite helpful.  (I think we will be faster in the future, but this is a good start.)  Nevertheless, there's a chance you'll need to do a bit of tuning to see such benefits yourself.

Tungsten currently uses a structure called a parallel queue to enable parallelization.  The parallel queue typically sits at the end of a replicator pipeline in front of the parallel apply threads, as shown in the following handy diagram.

One key to getting decent parallel replication performance is to watch the parallel queue in operation.  In Tungsten Replicator 2.0.1 we introduced a new status command, trepctl status -name stores, that goes a long way toward diagnosing how well parallel replication is performing.  Here's a typical example using a 6-channel queue store.

$ trepctl status -name stores
Processing status command (stores)...
NAME                VALUE
----                -----
criticalPartition : -1
discardCount      : 0
eventCount        : 3217
maxSize           : 1000
name              : parallel-queue
queues            : 6
serializationCount: 1
serialized        : false
stopRequested     : false
store.queueSize.0 : 0
store.queueSize.1 : 480
store.queueSize.2 : 310
store.queueSize.3 : 1000
store.queueSize.4 : 739
store.queueSize.5 : 407
storeClass        : com.continuent.tungsten.enterprise.replicator.store.ParallelQueueStore
storeSize         : 2936
syncEnabled       : true
syncInterval      : 100
Finished status command (stores)...

The two most important things to look at are the distribution of transactions across queues and serialization.  Let's start with transaction distribution.  In this particular example we were running a parallel queue with 6 channels but only 5 databases.  The distribution therefore looks pretty good.  One queue is empty but the others have a fairly even distribution of transactions.

Notice that one queue has exactly 1000 transactions.  In Tungsten Replicator 2.0.1 the parallel queue has a maximum size parameter (maxSize), which is set to 1000 for this example run.  Once an individual queue hits the maxSize limit, the entire parallel queue blocks.  It is not uncommon to see one queue blocking in this way if the replicator is catching up, which is exactly what is happening here.  In fact, if the queues are all empty it is possible Tungsten is somehow not supplying transactions to the queue fast enough.  That is not a problem here.
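The blocking behavior of maxSize is easy to model with bounded queues.  This toy Python model is an assumption about the mechanics, not the actual ParallelQueueStore code; sizes and the event shape are illustrative:

```python
import queue

# Toy model of the parallel queue's maxSize limit. Each channel is a
# bounded queue fed by a single distributor; when any one channel fills,
# the distributor's put() blocks, which stalls the whole store.

MAX_SIZE = 1000
channels = [queue.Queue(maxsize=MAX_SIZE) for _ in range(6)]

# Simulate a catch-up burst that lands entirely on channel 3.
for seqno in range(MAX_SIZE):
    channels[3].put(seqno, block=False)

print(channels[3].full())  # → True: the next put() on this channel blocks
```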

Bad workloads on the other hand tend to have a lot of transactions in one or two queues and few or none in all the rest. The following is an example of a possibly bad distribution.

$ trepctl status -name stores
Processing status command (stores)...
NAME                VALUE
----                -----
...
store.queueSize.0 : 0
store.queueSize.1 : 4
store.queueSize.2 : 3
store.queueSize.3 : 972
store.queueSize.4 : 0
store.queueSize.5 : 15
...
Finished status command (stores)...

If you see such skewed distributions persistently, you may want to adjust the queue partitioning using the shard.list file.  The default parallel queue partitioning algorithm hashes shards into channels.  This does not always give optimal performance, for example if your shards mostly happen to hash into the same channel.  The other possibility is that the workload is just badly distributed across databases.
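The difference between default hashing and an explicit assignment can be sketched as follows.  The override map stands in for shard.list entries, and both its format and the CRC hash are invented for illustration:

```python
import zlib

# Hedged sketch of channel selection: default hashing with an explicit
# override map standing in for shard.list entries.

EXPLICIT_ASSIGNMENTS = {"bigdb": 0}  # pin a hot shard to its own channel

def channel_for(shard, channels=6):
    if shard in EXPLICIT_ASSIGNMENTS:
        return EXPLICIT_ASSIGNMENTS[shard]
    return zlib.crc32(shard.encode()) % channels

print(channel_for("bigdb"))  # → 0, regardless of where it would hash
```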

You can decide whether the workload or partitioning is at fault using the trepctl status -name shards command.  Here's an example.

$ ./trepctl status -name shards
Processing status command (shards)...
NAME                VALUE
----                -----
appliedLastEventId: 000007:0000000000000384;20
appliedLastSeqno  : 1471201
appliedLatency    : 0.0
eventCount        : 6
shardId           : #UNKNOWN
stage             : d-pq-to-dbms
NAME                VALUE
----                -----
appliedLastEventId: 000005:0000000326365895;41
appliedLastSeqno  : 1470999
appliedLatency    : 0.0
eventCount        : 311605
shardId           : db1
stage             : d-pq-to-dbms
NAME                VALUE
----                -----
appliedLastEventId: 000005:0000000326512277;95
appliedLastSeqno  : 1471200
appliedLatency    : 0.0
eventCount        : 298522
shardId           : db2
stage             : d-pq-to-dbms
...

This shows that the distribution of transactions between the db1 and db2 databases is pretty even.  If you have many databases with roughly even values in the eventCount parameter, the workload is well suited for parallelization.  In that case you may want to assign shards explicitly in the shard.list file if you don't like the distribution in the parallel queue.

Meanwhile, the previous example shows another potential problem.  We also see counts for #UNKNOWN, which is a special shard ID that means "I could not tell what schema this is."  #UNKNOWN transactions can occur if Tungsten cannot parse a SQL statement properly or a transaction updates multiple schemas.  In either case, Tungsten serializes the parallel queue.

However it occurs, serialization is a performance killer because it means we have to block the parallel queue until all parallel transactions complete, execute one or more transactions serially, and then reopen the parallel queue.  You can see how often this is happening from the serializationCount value on trepctl status -name stores.  For many workloads a serializationCount value that is more than a few percent of the number in eventCount means the entire transaction stream is effectively serialized.
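As a quick rule of thumb, you can compute the serialization percentage directly from the trepctl output.  The numbers below come from the store status shown earlier; the 320 figure is a hypothetical unhealthy value for comparison:

```python
# Quick sanity check: if serializationCount is more than a few percent of
# eventCount, the stream is effectively serial.

def serialization_pct(serialization_count, event_count):
    return 100.0 * serialization_count / event_count

print(round(serialization_pct(1, 3217), 2))    # → 0.03 (healthy)
print(round(serialization_pct(320, 3217), 2))  # → 9.95 (effectively serial)
```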

If the serialization is occurring due to #UNKNOWN shards, you may be able to improve things using a new value in the replicator service properties file that was added in version 2.0.1. It controls whether we assign the shard ID using the default schema even if Tungsten cannot tell from the SQL command what you are doing.

# Policy for shard assignment based on default database.  If 'stringent', use
# default database only if SQL is recognized.  For 'relaxed' always use the
# default database if it is available.
replicator.shard.default.db=relaxed

Setting the parameter to relaxed can help quite a bit if the problem is due to unusual SQL that confuses the parser.  On one workload we were able to reduce the serializationCount from about 10% of transactions to 0% in this way.  We then saw the expected speed-up from parallel replication.

Mar 7, 2011

Understanding Tungsten Replication Services

If you follow Giuseppe Maxia's Datacharmer blog you have seen several recent articles on Tungsten Replicator.  Giuseppe and I work closely together on replication at Continuent, and I have promised a set of articles on replication internals to complement the practical introduction Giuseppe has provided.  In this first article I will describe replication services, which are message processing flows that run in the Tungsten Replicator.

Unlike many replication engines, Tungsten Replicator can run multiple replication services concurrently.  There is a central management interface that allows you to start new replication services without disturbing services that are already running.  Each replication service also has its own management interface so that you can put the loaded service online, offline, etc. without disturbing other replication work.  As Tungsten is written in Java, the management interfaces are based on JMX, a standard administrative interface for Java applications.

Here is a simple diagram that shows a Tungsten Replicator with two replication services named fra and nyc that replicate from separate DBMS masters in Frankfurt and NYC into a single slave in San Francisco.  You can immediately see the power of replication services--a single Tungsten Replicator process can simultaneously replicate between several locations.  Replication services are an important building block for the type of complex setups that Giuseppe Maxia discussed in his blog article on Replication Topologies.


Users who are handy with Java can write their own programs to manipulate the JMX interfaces directly.  If not, there is the trepctl utility, which is supplied with the Tungsten Replicator and works off the command line. 

If the Tungsten Replicator architecture reminds you of a Java application server, you are absolutely right.  Java VMs have a relatively large resource footprint compared to ordinary C programs, so it is typically more efficient to put multiple applications in a single VM rather than running a lot of individual Java processes.  Tungsten replication services follow the same design pattern, except that instead of serving web pages they replicate database transactions.  

Let's now look a little more deeply at how Tungsten Replicator organizes replication services.  Each replication service runs a single pipeline, which is Tungsten parlance for a configurable message flow.  (For more on pipelines, read here.)  When the service starts, it loads an instance of a Java class called OpenReplicatorManager that handles the state machine for replication (online, offline, etc.) and provides the management interfaces for the service.  The OpenReplicatorManager instance in turn depends on a number of external resources from the file system and DBMS. 

Here is another diagram showing how Tungsten Replicator organizes all of the various parts for services.  Services need a configuration file for the pipeline, as well as various bits of disk space to store transaction logs and replication metadata.  The big challenge is to ensure things do not accidentally collide.


This layout seems a bit complex at first but is reasonably simple once you get used to it.  Let's start with service configuration using our fra service as an example. 

Service configuration files are stored in the tungsten-replicator/conf directory.  There are up to two files for each service.  The static-fra.properties file defines all properties of the service: the pipeline organization as well as settings like the replication role or master address that may change during operation.  The dynamic-fra.properties file contains overrides to selected properties.  For instance, if you switch the replication role from slave to master as part of a failover operation, the new role goes in the dynamic-fra.properties file.  Tungsten Replicator reads the static file first, then applies the overrides when it starts the service.
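The static-then-dynamic load order is easy to sketch.  The property names and values below are illustrative, not taken from a real Tungsten configuration:

```python
# Sketch of the static-then-dynamic load order, using plain Java-style
# .properties parsing.

def parse_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

static_props = parse_properties("replicator.role=master\nservice.name=fra")
dynamic_props = parse_properties("replicator.role=slave")  # failover override

effective = {**static_props, **dynamic_props}  # dynamic wins over static
print(effective["replicator.role"])  # → slave
```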

Next, we have the Tungsten transaction logs, also known as the Transaction History Log (THL).  This is a list of all transactions to be replicated along with metadata like global transaction IDs and shard IDs.  THL files for each service are normally stored in the logs directory at the same level as the Tungsten release directory itself.  There is a separate directory for each service, for example logs/fra.

Next we have the Tungsten relay logs.  These are binlogs downloaded from a MySQL master DBMS, from which the replication service creates the Tungsten transaction logs.  Not every replication service uses these.  They are required when the MySQL master is on another host, or when the binlogs are on an NFS-mounted file system, which Tungsten does not parse very efficiently yet.  Tungsten relay logs use the same pattern as the THL--everything is stored under relay-logs with a separate subdirectory for each service, for example relay-logs/fra.

Finally, there is metadata in the DBMS itself.  Each replication service has a database that it uses to store restart points for replication (table trep_commit_seqno) as well as heartbeats and consistency checks (tables heartbeat and consistency, respectively).  The name of this database is tungsten_servicename, as in tungsten_fra.

Setting up services is difficult to do manually, so Tungsten Replicator 2.0 has a program named configure-service that defines new replication services and removes old ones by deleting all traces, including the database catalogs.  You can find out all about installation and starting services by looking at the Tungsten Replicator 2.0 Installation and Configuration Guide, which is located here.

Services have been part of Tungsten Replicator for a while, but we have only recently begun to talk about them more widely as part of the release of Tungsten Replicator 2.0.0 in February 2011, especially as we start to do more work with multi-master topologies.  One of the comments we get is that replication services make Tungsten seem complicated and therefore harder to use, especially compared with MySQL replication, which is relatively easy to set up.  That's a fair criticism.  Tungsten Replicator is really a very configurable toolkit for replication and does far more than MySQL replication or just about any other open source replicator, for that matter.  Like most toolkits, the trade-off for power is complexity.

We are therefore working on automating as much of the configuration as possible, so that you can set up even relatively complex topologies with just a couple of commands.  You'll see more of this as we make additional replicator releases (version 2.0.1 will be out shortly) and push features fully into open source.   Meanwhile, if you have comments on Tungsten 2.0.0 please feel free to post them back to us.

Jan 30, 2011

Virtual IP Addresses and Their Discontents for Database Availability

Virtual IP addresses or VIPs are commonly used to enable database high availability.   A standard failover design uses an active/passive DBMS server pair connected by replication and watched by a cluster manager.  The active database listens on a virtual IP address; applications use it for database connections instead of the normal host IP address. Should the active database fail, the cluster manager promotes the passive database server and shifts the floating IP address to the newly promoted host.  Application connections break and then reconnect to the VIP again, which points them to the new database.
VIP-Based Database Availability Design
Virtual IP addresses are enticing because they are completely transparent to applications--no changes to database API behavior, no changes to connection strings, etc.  While virtual IP addresses seem simple, they depend on arcane TCP/IP behavior that is not especially well understood and not always consistent across different TCP/IP implementations.  Unless properly managed, virtual IP addresses can induce problems that run the gamut from simple lack of availability to split-brain, which can in turn lead to unrepairable data corruption. 

This blog article describes in some detail how virtual IP addresses work, specifically the behavior of Address Resolution Protocol (ARP) which is a core part of the TCP/IP stack that maps IP numbers to hardware addresses and permits VIPs to move.  We will then dig into how split-brain arises as a consequence of ARP behavior.  Finally we will look at ways to avoid split-brain or at least make it less dangerous when it occurs.

Note: the examples below use MySQL, in part because mysqld has a convenient way to show the server host name.  However you can set up the same test scenarios with PostgreSQL or most other DBMS implementations for that matter. 

What is a Virtual IP Address? 

Network interface cards (NICs) typically bind to a single IP address in TCP/IP networks.   However, you can also tell the NIC to listen on extra addresses using the Linux ifconfig command.  Such addresses are called virtual IP addresses or VIPs for short.

Let's say for example you have an existing interface eth0 listening on IP address 192.168.128.114.  Here's the configuration of that interface as shown by the ifconfig command:
saturn# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 08:00:27:ce:5f:8e  
          inet addr:192.168.128.114  Bcast:192.168.128.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fece:5f8e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10057681 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8902384 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6941125495 (6.9 GB)  TX bytes:6305062533 (6.3 GB)
You can allow the eth0 interface to accept traffic for another address 192.168.128.130 using the following simple command. 
saturn# ifconfig eth0:0 192.168.128.130 up
This command tells the TCP/IP stack to accept packets directed to IP address 192.168.128.130 as well as the original address 192.168.128.114.   This means that if you are running a MySQL server on the host, users can connect using either of the following commands:
mysql -utungsten -psecret -h192.168.128.114 (or)
mysql -utungsten -psecret -h192.168.128.130
Finally, you can move VIPs from host to host very easily by dropping them on the first host and binding to them on a second host, as shown in the following example:
saturn# ifconfig eth0:0 192.168.128.130 down
...
neptune# ifconfig eth0:0 192.168.128.130 up
Meanwhile, virtual IP addresses behave in every other way like standard IP addresses.  You can put them in DNS, applications can connect to them, etc.  VIP-based high availability is therefore a minimally invasive implementation for most applications--about the only requirement is that applications need to reconnect if their connection breaks.
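The reconnect requirement can be handled with a simple retry loop on the application side.  The helper below is a hypothetical sketch; in real use the probe command passed as $1 would be something like the mysql invocations shown above, and most client libraries offer equivalent built-in reconnect behavior.

```shell
#!/bin/bash
# Hypothetical reconnect helper: retry a connection probe until it
# succeeds, giving up after 30 attempts. In practice $1 might be
# "mysql -utungsten -psecret -h192.168.128.130 -e 'select 1'".
retry_until_ok() {
  local tries=0
  until $1 > /dev/null 2>&1; do
    tries=$((tries + 1))
    # Give up eventually rather than spin forever.
    [ $tries -ge 30 ] && { echo "gave up after $tries tries"; return 1; }
    sleep 1    # back off while the VIP moves to the new host
  done
  echo "connected after $tries retries"
}

# Demonstration with a probe that always succeeds:
retry_until_ok true
```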

How Virtual IP Addresses Work

To understand the weaknesses of virtual IP addresses for database high availability, it helps to understand exactly how the TCP/IP stack actually routes messages between IP addresses, including those that correspond to VIPs.   The following diagram shows moving parts of the TCP/IP stack in a typical active/passive database set-up with a single client host.  In this diagram host saturn currently has virtual IP address 192.168.128.130. Neptune is the other database host.  Mercury is the client host.


Applications direct TCP/IP packets using IP addresses, which in IPV4 are four byte numbers.  The IP destination address is written into the header by the IP layer of the sending host and read by the IP layer on the receiving host.

However, this is not enough to get packets from one host to another.  At the hardware level within a single LAN, network interfaces use MAC addresses to route messages over physical connections like Ethernet.   MAC addresses are 48-bit addresses that are typically burned into the NIC firmware or set in a configuration file if you are running a virtual machine.  To forward a packet from host mercury to saturn, the link layer writes in the proper MAC address, in this case 08:00:27:ce:5f:8e.  The link layer on host saturn accepts this packet, since it corresponds to its MAC address.  It forwards the packet up into the IP layer for acceptance and further processing.

Yet how does host mercury figure out which MAC address to use for particular IP addresses?  This is the job of the ARP cache, which maintains an up-to-date mapping between IP and MAC addresses.  You can view the ARP cache contents using the arp command, as shown in the following example.
mercury# arp -an
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Note that the ARP cache does not have a mapping for the VIP address 192.168.128.130.  Let's say we now make a client connection to the MySQL server at the other end of the VIP address on mercury:
# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33826
...
mysql>
To route traffic, host mercury gets the IP address to MAC address mapping using an ARP request.  You can watch this happen in real time using the tcpdump command to track ARP traffic.
mercury# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:37:43.755081 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:37:43.755360 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46
Now if you look at the ARP cache again on host mercury, you will see a proper mapping for the VIP in mercury's ARP cache:
# arp -an
? (192.168.128.130) at 08:00:27:ce:5f:8e [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Now if you go back and look at the picture (or still recall the details), you will notice that the MAC address maps to the NIC on host saturn.  This is exactly what we expect since saturn is listening on the corresponding virtual IP address 192.168.128.130. 

Virtual IP Addresses and Split-Brain

Most real problems with VIPs appear when you try to move them.  The reason is simple:  TCP/IP does not stop you from having multiple hosts listening on the same virtual IP address.  For instance, let's say you issue the following command on host neptune without first dropping the virtual IP address on saturn.  
neptune# ifconfig eth0:0 192.168.128.130 up
Let's further clear the ARP mapping for the virtual IP on mercury using the handy arp -d command and reconnect to MySQL.
mercury# arp -d 192.168.128.130
mercury# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 294
...
mysql>
So far so good--we logged into MySQL and everything appears normal.  But in fact it is not normal at all.  If you run tcpdump and watch the ARP requests during this login, you see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:59:32.643518 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:59:32.643768 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
09:59:32.643793 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46 
This is not just bad--it's very bad.  Both saturn and neptune responded to mercury's ARP request!  Mercury can pick only one for the mapping; which one it picks depends on timing as well as the exact implementation of the TCP/IP stack.  In other words we have a race condition and the winner is essentially random.

You can demonstrate the randomness for yourself with a simple experiment.  Let's create a test script named mysql-arp-flush.sh, which clears the ARP cache entry for the VIP and then connects to MySQL, all in a loop.  
#!/bin/bash
for i in {1..5}; 
do 
  arp -d 192.168.128.130
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
If you run the script you'll see results like the following.  Note the random switch to Neptune on the fourth connection.  
# ./mysql-arp-flush.sh
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | neptune |
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
At this point you have successfully created a split-brain.  If you use database replication and both databases are open for writes, as would be the default case with MySQL replication, Tungsten, or any of the PostgreSQL replication solutions like Londiste, your applications will randomly connect to each DBMS server.  Your data will quickly become irreparably mixed up.  All you can do is hope that the problem will be discovered quickly. 
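One cheap way to discover it is to watch ARP traffic for the telltale duplicate replies.  Below is a minimal sketch of a detector you could feed with `tcpdump -n arp` output; the field positions assume the tcpdump output format shown above and may differ across tcpdump versions.

```shell
#!/bin/bash
# Hypothetical split-brain detector: scan tcpdump ARP output and flag
# any IP address claimed by more than one MAC address.
detect_duplicate_arp() {
  awk '/ARP, Reply/ {
         ip = $4; mac = $6; sub(/,$/, "", mac)
         if (seen[ip] && seen[ip] != mac)
           printf "WARNING: %s claimed by %s and %s\n", ip, seen[ip], mac
         else
           seen[ip] = mac
       }'
}

# Demonstration with the duplicate replies captured earlier;
# in real use you would pipe live tcpdump output into the function.
detect_duplicate_arp <<'EOF'
09:59:32.643768 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
09:59:32.643793 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46
EOF
```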

A Half-Hearted Solution using Gratuitous ARP

You might think that it would be handy if the ARP protocol provided a way to get around split-brain problems by invalidating client host ARP caches.  In fact, there is such a feature in ARP--it's called gratuitous ARP.  While useful, it is not a solution for split-brain issues.  Let's look closely to see why. 

Gratuitous ARP works by sending an unsolicited ARP response to let hosts on the LAN know that an IP address mapping has changed.  On Linux systems you can issue the arping command as shown below to generate a gratuitous ARP response. 
neptune# arping -q -c 3 -A -I eth0 192.168.128.130
This tells host neptune to send 3 ARP reply messages with its MAC address for the VIP address.  If we look at tcpdump output again, we see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:02:27.154279 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:28.159291 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:29.162403 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
Linux hosts that receive the response will generally then update their ARP caches, though as we will see, there are some important exceptions.   But first, we need to show the effect of gratuitous ARP on MySQL connections.  Let's start with the following ARP cache contents on host mercury.  This shows an existing mapping to the MAC address of host neptune, which is what we expect from the previous arping command on neptune. 
mercury# arp -an
? (192.168.128.130) at 08:00:27:68:cd:7d [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Next, we run a loop that connects to MySQL and prints the host name every second.  The loop code is shown below and stored in a script named mysql-no-arp-flush.sh.  Unlike the previous script this does not release the ARP cache mapping between connections to MySQL.
#!/bin/bash
for i in {1..30}; 
do 
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
While the test script is running, we run an arping command from saturn. 
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
What we see in the MySQL output is the following.  Once the gratuitous ARP is received, mercury switches its connection from neptune to saturn and stays there, at least for the time being.
mercury# ./mysql-no-arp-flush.sh
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
There is one more interesting property of gratuitous ARP responses.  If you issue one during a session, it will cause client sessions to switch between hosts without waiting for a timeout.  Here's an example.  First login with MySQL and see which host we are on.
mercury# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33853
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | neptune | 
+---------------+---------+
1 row in set (0.00 sec)
Now issue an arping command on saturn using a separate window.
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
Finally, go back and check the host name again in the MySQL session.   The session switches over to the other server, which you see at the client level as a lost connection followed by a reconnect.
mysql> show variables like 'hostname'; 
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    33854
Current database: *** NONE ***

+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | saturn |
+---------------+--------+
1 row in set (0.00 sec)
So is gratuitous ARP the solution to virtual IP split-brain?  It announces that there is a mapping change, which can make failover work much more quickly.  This is useful in its own right.  However, it does not solve split-brain. 

First, not all TCP/IP stacks even recognize gratuitous ARP responses.  Second, gratuitous ARP only takes effect on hosts that actually have a current mapping in their ARP cache.  Other hosts will wait until they actually need the mapping and then issue a new ARP request.  Finally, ARP mappings automatically time out after a few minutes, at which point the host issues a new ARP request--which, as in the two preceding cases, brings us right back to the split-brain scenario we were trying to cure.

Avoiding Virtual IP Split-Brains

Avoiding a VIP-induced split-brain is not a simple problem.  The best approach is a combination of sound cluster management, amelioration, and paranoia. 

Proper cluster management is the first line of defense.  VIPs are an example of a unique resource in the system that only one host may hold at a time.   An old saying that has been attributed to everyone from Genghis Khan to Larry Ellison sums up the problem succinctly:
It is not enough to succeed.  All others must fail.  
The standard technique to implement this policy is called STONITH, which stands for "Shoot the other node in the head."  Basically it means that before one node acquires the virtual IP address the cluster manager must make every effort to ensure no other node has it, using violent means if necessary.   Moving the VIP thus has the following steps.
  1. The cluster manager executes a procedure to drop the VIP on all other hosts (for example using ssh or by cutting off power).  Once these procedures are complete, the cluster manager executes a command to assign the VIP to the new owner. 
  2. Isolated nodes automatically release their VIP.  "Isolated" is usually defined as being cut off from the cluster manager and unable to access certain agreed-upon network resources such as routers or public DNS servers. 
  3. In cases of doubt, everybody stops.  For most systems it is far better to be unavailable than to mix up data randomly.  
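Step 2 above, in which an isolated node releases its own VIP, can be sketched as a small guard script run periodically on each database host.  Everything here is illustrative: the VIP, the witness address, and the interface name are assumptions, and the probe command is injected as a parameter so the decision logic can be exercised without real network access.

```shell
#!/bin/bash
# Hypothetical isolation guard. If the agreed-upon witness (e.g. a
# router) is unreachable, assume this node is isolated and release
# the VIP rather than risk a split-brain.
VIP=192.168.128.130

check_isolation() {
  # $1: command that probes the witness host
  if $1 > /dev/null 2>&1; then
    echo "witness reachable: keeping $VIP"
  else
    echo "witness unreachable: releasing $VIP"
    # The real action, which requires root:
    # ifconfig eth0:0 $VIP down
  fi
}

# In production this would be something like:
#   check_isolation "ping -c 1 -W 2 192.168.128.1"
# Demonstration with a probe that always succeeds:
check_isolation true
```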
Cluster managers like Tungsten and Pacemaker handle this kind of process very well.   Pacemaker, for example, has a number of specialized hooks to cut power or otherwise use extreme violence to ensure proper fencing of databases.  Tungsten has fewer such hooks but has a much richer set of operations for databases and also has a wide set of connectivity options for HA besides using VIPs.

Incidentally, you want to be very wary about re-inventing the wheel, especially when it comes to DBMS clustering and high availability.  Clustering has a lot of non-obvious corner cases; even the "easy" problems like planned failover are quite hard to implement correctly.  You are almost always better off using something that already exists instead of trying to roll your own solution.

Amelioration is the next line of defense, namely making split-brain situations less dangerous when they actually occur.  Failover using shared disks or non-writable slaves (e.g., with DRBD or PostgreSQL streaming replication) has a degree of protection because it is somewhat harder to have multiple databases open for writes.  It is definitely still possible, however, and the cluster manager is your best bet to prevent it.  When using MySQL with either native or Tungsten replication, by contrast, slave databases are open and therefore susceptible to data corruption unless you ensure they are not writable.

Fortunately, this is very easy to do in MySQL.   To make a database read-only to all accounts other than those with SUPER privilege, just issue the following commands to set the server read-only and verify that the setting is in effect:
neptune# mysql -uroot -e "set global read_only=1"
neptune# mysql -uroot -e "show variables like 'read_only'"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    | 
+---------------+-------+
This protects you not just in cases of actual failover but also from administrative mistakes or software failures that switch the VIP by accident.   Many cluster managers implement read-only slaves:  Tungsten clustering has explicit support for them, and others like MMM and Pacemaker can do the same.

Lastly, paranoia is always a good thing.  You should test the daylights out of clusters that depend on VIPs before deployment, and also check regularly afterwards to ensure there are no unexpected writes on slaves.  Regular checks of logs are a good idea.  Another good way to check for problems in MySQL master/slave setups is to run consistency checks.   Tungsten Replicator has built-in consistency checking designed for exactly this purpose.  You can also run Maatkit mk-table-checksum at regular intervals.   Another best practice is to "flip" masters and slaves on a regular basis to ensure switch and failover procedures work properly.   Don't avoid trouble--look for it!
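As one illustrative way to make such checks routine, you could schedule them from cron.  The entry below is a hypothetical sketch: the paths, the log locations, and the check-slave-writes.sh script are assumptions, so consult the Maatkit documentation for the exact invocation that fits your systems.

```shell
# /etc/cron.d/db-consistency -- illustrative schedule, not a tested config.
# Nightly Maatkit checksum run against the local master at 3:00 am:
0 3 * * *  root  mk-table-checksum h=localhost > /var/log/mk-checksum.log 2>&1
# Weekly scan of slave logs for unexpected writes (hypothetical script):
0 4 * * 0  root  /usr/local/bin/check-slave-writes.sh
```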

Conclusion and Note on Sources

Virtual IP addresses are a convenient way to set up database high availability but can lead to very severe split-brain situations if used incorrectly.   To deploy virtual IP addresses without problems you must first of all understand how they work and second use a sound cluster management approach that avoids split-brain and minimizes the impact if it does occur.  As with all problems of this kind you need to test the implementation thoroughly before deployment as well as regularly during operations.  This will help avoid nasty surprises and corrupt data that are otherwise all but inevitable.

Finally it is worth talking a bit about sources.  I wrote this article because I could not find a single location that explained virtual IP addresses in a way that drew out the consequences of their behavior for database failover.  For background you can look at general networking sources like Radia Perlman's Interconnections, Second Edition or Internetworking with TCP/IP by Douglas Comer.  These are more high-level.  If you get really desperate for details, try the RFCs, for example RFC-826, which is the original specification for ARP.  Some of them are surprisingly good reads even 30 years after the fact.

Jan 25, 2011

Tungsten Replicator Overview Webinar

On Thursday, January 27th at 10am PST I will be doing a webinar on Tungsten Replicator together with my colleague Giuseppe Maxia.  The title is "What MySQL Replication Cannot Do.  And How to Get Around It."  Basically it is a nuts-and-bolts description of Tungsten Replicator capabilities like multi-master replication, failover, parallel apply, and using replication for zero-downtime upgrade.  If you have ever wanted an in-depth look at the Tungsten Replicator, this is a good opportunity. 

During 2010 we implemented an amazing number of new replication features ranging from pipelines early in the year to fast disk logs, multiple replication services per process, bi-directional replication, and parallel apply by the end.  We will be building out all of these in the coming year and releasing increasingly capable features into open source as well. 

This presentation is part of Continuent's regular webinar series which means we will also talk a bit about commercial products and services at the end.  However, it's mostly cool replication stuff.   You can sign up on the Continuent webinar page.  Hope to see you there.

Jan 11, 2011

Fixing Replication with Replication

A couple of days ago I ran into a Tungsten Replicator case where several MySQL tables became corrupted on slaves and needed to be restored from the master.   We identified the tables that had problems fairly quickly using Tungsten Replicator's consistency checks.  However, that led to another problem:  how to restore the slave tables efficiently from the master.  The MySQL server in question processes around 10M transactions per day--there is virtually no downtime.  Though the tables were not large, we could not be sure whether they were in use. 

Fortunately, you can use a simple MySQL trick to get all the rows of a table to replicate through to slaves.  The idea is to dump the table, delete the rows, then reload it again.  The delete and subsequent reload replicate out to slaves, after which everything is consistent again.  Let's say we have a table called tpcb.history that needs to be fixed.  Login with mysql and run the following commands:
BEGIN;
SELECT * FROM tpcb.history 
  INTO OUTFILE '/tmp/tpcb.history.dmp' FOR UPDATE;
DELETE FROM tpcb.history;
LOAD DATA INFILE '/tmp/tpcb.history.dmp' REPLACE
  INTO TABLE tpcb.history FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
COMMIT;
You can do the reload several ways in MySQL, but this particular code has some advantages over other approaches, such as using LOCK TABLES.  First, it uses a transaction, so if something goes wrong the changes roll back and you do not lose your data.  Second, the SELECT ... FOR UPDATE locks your data and ensures serialization.  You can run this while applications are running without problems.

This seems useful enough that I put together a simple script called reload-table.sh with a README and checked them into the Tungsten Replicator codeline on SourceForge.net.  You can refresh the same table shown above using the following command:
[sudo] ./reload-table.sh -u tungsten -p secret -t tpcb.history
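For reference, the heart of such a script can be sketched as a function that emits the SQL shown earlier for an arbitrary table.  This is a hypothetical simplification; the real script in the repository also handles credentials, temp-file cleanup, and error checking.

```shell
#!/bin/bash
# Hypothetical sketch of the reload wrapper's core: emit the
# transactional dump/delete/reload SQL for a qualified table name.
reload_sql() {
  local table="$1"
  local dump="/tmp/${table}.dmp"
  cat <<EOF
BEGIN;
SELECT * FROM ${table} INTO OUTFILE '${dump}' FOR UPDATE;
DELETE FROM ${table};
LOAD DATA INFILE '${dump}' REPLACE INTO TABLE ${table}
  FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
COMMIT;
EOF
}

# Pipe the generated SQL into mysql, e.g.:
#   reload_sql tpcb.history | mysql -utungsten -psecret
reload_sql tpcb.history
```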
I tested the reload using Tungsten 1.3.1 on MySQL 5.1 with statement replication.  However, it would work equally well with row replication.  Moreover, you can do the same trick in MySQL replication, as this involves base replication capabilities that are directly equivalent.  There are a few caveats:  you need to use InnoDB (or another transactional engine), large tables may be a problem, and you would need to tread carefully in cases where tables contain referential constraints.  Finally, it would be wise to save the master table somewhere else before running the script.