
Oct 29, 2014

An Ending and a Beginning: VMware Has Acquired Continuent

As of today, Continuent is part of VMware. We are absolutely over the moon about it.

You can read more about the news on the VMware vCloud blog by Ajay Patel, our new boss. There’s also an official post on our Continuent company blog. In a nutshell the Continuent team is joining the VMware Cloud Services Division. We will continue to improve, sell, and support our Tungsten products and work on innovative integration into VMware’s product line.

So why do I feel exhilarated about joining VMware? There are three reasons. 

1.     Continuent is joining a world-class company that is the leader in virtualization and cloud infrastructure solutions. Even better, VMware understands the value of data to businesses. They share our vision of managing an integrated fabric of standard DBMS platforms, both in public clouds and in local data centers. It is a great home to advance our work for many years to come.

2.     We can continue to support our existing users and make Tungsten even better. I know many of you have made big decisions to adopt Continuent technology that would affect your careers if they turned out badly. We now have more resources and a mandate to grow our product line. We will be able to uphold our commitments to you and your businesses.

3.     It’s a great outcome for our team, which has worked for many years to make Continuent Tungsten technology successful. This includes our investors at Aura in Helsinki, who have been dogged in their support throughout our journey.

Speaking of the Continuent team…I am so proud of what all of you have achieved. Today we are starting a new chapter in our work together. See you at VMware!

Feb 17, 2014

Why Aren't All Data Immutable?

Over the last few years there has been an increasing interest in immutable data management. This is a big change from the traditional update-in-place approach many database systems use today, where new values delete old values, which are then lost. With immutable data you record everything, generally using methods that append data from successive transactions rather than replacing them.  In some DBMS types you can access the older values, while in others the system transparently uses the old values to solve useful problems like implementing eventual consistency.

Baron Schwartz recently pointed out that it can be hard to get decent transaction processing performance based on append-only methods like append-only B-trees.  This is not a very strong argument against immutable data per se.  Immutable data are already in wide use.   It is actually surprising they have not made deeper inroads into online transaction processing, which is widely handled by relational DBMS servers like MySQL and Oracle.

Immutable Data Are Now Economically Feasible

One reason for the popularity of update-in-place approaches is simple: storage used to be really expensive. This is no longer the case.  Many applications can now afford to store the entire DBMS transaction log almost indefinitely. To illustrate, look at storage costs in Amazon Web Services. Applications running in Amazon have API-level access to practically unlimited replicated, long-term storage through services like S3 and Glacier. Amazon conveniently publishes prices that serve as good proxies for storage costs in general.  Using these numbers, I worked up a simple spreadsheet that shows the cost of storing 7 years of transactions for a made-up business application.

To start with, assume our sample app generates one thousand transactions per second at 1,000 bytes per transaction.  This is not exceedingly busy by some standards but is relatively high for business systems that handle human-generated transactions.  The main place you see numbers approaching this level is SaaS businesses that handle many customers on a single system.   Our sample system generates about 205,591 gigabytes of data over seven years.

Xacts/Sec                    1,000
Bytes/Xact                   1,000
Bytes/Sec                    1,000,000
GB Generated in 1 Hour       3.35
GB Generated in 1 Day        80.47
GB Generated in 1 Month      2,447.52
GB Generated in 1 Year       29,370.19
GB Generated in 7 Years      205,591.32
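
For the curious, here is a small Perl sketch that reproduces the volume figures above.  The 1,000 transactions per second and 1,000 bytes per transaction are the assumptions from the example, and a month is averaged to 365/12 days.

#!/usr/bin/perl
# Reproduce the data volume figures above.  Assumes 1,000 xacts/sec at
# 1,000 bytes each; a month is averaged to 365/12 days.
use strict;
use warnings;

my $bytes_per_sec = 1_000 * 1_000;     # xacts/sec * bytes/xact
my $gb            = 1024 ** 3;         # bytes per gigabyte

my @periods = (
  [ 'GB generated in 1 hour',   3_600             ],
  [ 'GB generated in 1 day',    86_400            ],
  [ 'GB generated in 1 month',  86_400 * 365 / 12 ],
  [ 'GB generated in 1 year',   86_400 * 365      ],
  [ 'GB generated in 7 years',  86_400 * 365 * 7  ],
);

foreach my $p (@periods) {
  printf "%-25s %12.2f\n", $p->[0], $bytes_per_sec * $p->[1] / $gb;
}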

Amazon storage costs vary from $0.011/GB/month for Glacier to $0.09/GB/month for S3 with full redundancy. (These are numbers for the US-West region as of 29 December 2013.) Annual storage costs for 7 years of data are pretty hefty if you store uncompressed data. However, if you factor in compression--for example MySQL binlogs tend to compress around 90% in my experience--things start to look a lot better.

Annual cost to store 7 years of data at different levels of compression
                         0%            20%           40%           60%           70%          80%          90%
Glacier                  $27,138.05    $21,710.44    $16,282.83    $10,855.22    $8,141.42    $5,427.61    $2,713.81
S3 Reduced Redundancy    $177,630.90   $142,104.72   $106,578.54   $71,052.36    $53,289.27   $35,526.18   $17,763.09
S3 Standard              $222,038.63   $177,630.90   $133,223.18   $88,815.45    $66,611.59   $44,407.73   $22,203.86
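
You can extend the same arithmetic to check the cost table.  The sketch below uses the two US-West rates quoted above and treats compression as the fraction of raw data eliminated before upload; it is a back-of-the-envelope calculation, not an AWS billing tool.

#!/usr/bin/perl
# Back-of-the-envelope annual cost of storing 7 years of transactions.
# Rates are the US-West $/GB/month prices quoted above; compression is
# the fraction of raw data eliminated before upload.
use strict;
use warnings;

my $raw_gb = 205_591.32;    # 7 years of uncompressed transactions
my @tiers  = ( [ 'Glacier', 0.011 ], [ 'S3 Standard', 0.09 ] );

foreach my $tier (@tiers) {
  foreach my $compression (0, 0.6, 0.9) {
    my $annual = $raw_gb * (1 - $compression) * $tier->[1] * 12;
    printf "%-12s at %3.0f%% compression: \$%10.2f per year\n",
      $tier->[0], $compression * 100, $annual;
  }
}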

The raw costs still look hefty to the untrained eye, but we need to factor in the real expense of operating this type of system.  Here's a typical cost structure for a 3-node cluster (to ensure HA) with labor costs factored in and 7 years of data preserved.  I have put in generously small IT overhead costs, including software development, since the code has to come from somewhere. Under these assumptions long-term storage costs are less than 10% of the yearly cost of operation.

Component                 Cost          Percentage   Notes
3 i2.4xlarge instances    $46,306.68    20.09%       Heavy utilization reserved, 1 yr. term
3 support licenses        $15,000.00    6.51%        Support subscription costs * 3x
Raw dbadmin labor         $12,000.00    5.21%        1 FTE/30 DBMS servers @ 120K per
Software dev/QA           $120,000.00   52.06%       10 FTE/30 DBMS servers @ 120K per
Misc. overhead costs      $15,000.00    6.51%        $5K per server
S3 Storage                $22,203.86    9.63%        7 years of data, 90% compression
Total                     $230,510.54   100.00%

Long-term storage costs for base transaction data can be far lower if any of the following hold:
  • You generate fewer transactions per second or they are smaller.  Many business apps produce far fewer transactions than my example. 
  • You don't keep data for the full 7 years.  Some of the analytic users I work with just keep a couple of years. 
  • You are already paying archiving costs for backups, in which case the additional storage cost becomes a wash if you can stop using a separate backup system.
  • You add more external costs to the picture--running a real business that generates this level of transactions often takes far more people than are shown in my projection. 
In these cases long-term storage costs could be in the 1-2% range as a percentage of IT operating costs. Over time storage costs will decrease--though the rate of decline is hard to predict--so each year the number of systems able to afford preservation of complete transaction histories will correspondingly increase. This is particularly true for business transactions, which tend to be human generated and subject to upper growth limits once businesses are fully automated.  If you push data into Glacier, economically feasible retention periods can run to decades.  This is far longer than most businesses (or more particularly their lawyers) even want to keep information around.

There are still reasons for wanting an update-in-place model for OLTP systems, for example to hold as much of your working set as possible in RAM or on fast SSDs so that response times stay low.  But storage cost alone is no longer a major factor for a wide range of applications.  This development is already affecting data management technology profoundly.  Doug Cutting has pointed out on numerous occasions that the downward cost trajectory of commodity storage was a key driver in the development of Hadoop.

Users Want Immutable Data

Many organizations already keep long transaction histories to feed analytics by loading them into traditional data warehouses based on Teradata, Vertica, and the like.  As soon as a practical method appeared to keep such data more economically, businesses began to adopt it quickly.  That "method" is Hadoop.

Hadoop has a fundamentally different approach to data management from relational and even many NoSQL systems.  For one thing, immutable data are fundamental.  The default processing model is that you write data but rarely change it once written.  To illustrate, the HiveQL SQL dialect does not even have UPDATE or DELETE statements.  Instead, you overwrite entire tables or parts of them to make changes.  This works because Hadoop organizes storage on cheap commodity hardware (HDFS) and provides a workable way to access data programmatically (MapReduce).

Hadoop changes the data management cost model in other ways besides utilizing commodity hardware efficiently.  With Hadoop you don't necessarily define *any* data structures up front.  Instead, you store transactions in native form and write programs to interpret them later on.  If you need structure for efficient queries you add it through MapReduce and perhaps store it as a materialized view to make other queries more efficient.  Hadoop eliminates a lot of the up-front effort (and risk) required to get transactions into a data warehouse.  Instead, it defers those costs until you actually need to run specific analytics.  Moreover, by storing native transaction formats, you can answer new questions years later.  That is a very powerful benefit.

I have been working a lot with Hadoop over the last few months.  It's a bear to use because it consists of a set of loosely integrated and rapidly evolving projects with weak documentation and lots of bugs. Even with these difficulties, the rising level of Hadoop adoption for analytics shows the underlying model has legs and that users want it.  As Floyd Strimling pointed out a while ago on Twitter, this genie is not going back in the bottle.  HDFS is becoming the default storage mechanism for vast quantities of data.

Immutable Data Management Looks Like a Good Bet

One of the basic problems in discussing immutable data management is that there are different kinds of immutable data that persist at different timescales.  Baron has a point that Couchbase, Datanomic, NuoDB, or whatever new DBMS implementation you choose are in some ways recapitulating solutions that existing RDBMS implementations reached long ago.  But I also think that's not necessarily the right comparison when talking about immutable data, especially when you start to think about long retentions.

The fact is that Oracle, MySQL, PostgreSQL, and the like do not utilize distributed commodity storage effectively and they certainly do not enable storage of the long tail transaction histories that many businesses clearly want for analytics.  The best way to do that is to replicate transactions into HDFS and work on them there.  That is hard even for MySQL, which has flexible and economical replication options.  (We are working on making it easier to do at Continuent but that's another article. :)

In my opinion a more useful criticism of the arriviste competitors of traditional OLTP systems is that they don't go far enough with immutable data and risk being outflanked by real-time transaction handling built on top of HDFS. Hadoop real-time work on projects like Apache Spark is for the time being focused on analytics, but OLTP support cannot be far behind.  Moreover, there is a window to build competitors to HDFS that gets smaller as Hadoop becomes more entrenched.  This seems more interesting than building stores that offer only incremental improvements over existing RDBMS implementations.

Immutable data now permeate IT due to decreasing storage costs coupled with requirements for analytic processing. It's like the famous quote from William Gibson:
The future is already here--it's just not very evenly distributed.
If you look at the big picture the arguments for database management based on immutable data seem pretty strong.  It is hard to believe it won't be a persistent trend in DBMS design.  Over the long term mutable data look increasingly like a special case rather than the norm.

Feb 19, 2013

Data Fabric Design Patterns: Fabric Connector

This article is the third in a series on data fabric design and introduces the fabric connector service design pattern.  The previous article in this series introduced the transactional data service design pattern, which defines individual data stores and is the building block for data fabrics based on SQL databases.  The fabric connector builds on transactional data services and is another basic building block of fabric architecture.

Description and Responsibilities

Fabric connectors make a collection of DBMS servers look like a single server.  The fabric connector presents what appears to be a data service API to applications.  It routes each request to an appropriate physical server for whatever task the application is performing, hiding the fact that a data fabric can consist of dozens or even hundreds of servers.  Applications cannot tell the difference between talking to the fabric connector and talking to a real DBMS server.  We call this property transparency.

Here are the responsibilities of a fabric connector.  I will use the phrase proxying to refer to the first of these, and routing responsibilities to refer to the remaining three.  
  1. Expose a data service interface to applications.
  2. Route each application query to an appropriate DBMS server.
  3. Balance load by distributing queries across multiple replicas, if available.
  4. Switch to another server following a failure or if the DBMS becomes unavailable due to maintenance.
The following diagram shows the logical components of a fabric connector.  The fabric connector sits between applications, transactional data services, and a fabric directory service.  These are greyed out, as they are not part of the pattern.
Fabric Connector Design Pattern
Fabric connectors contain two logical components.  The proxy is responsible for routing queries and responses between applications and underlying data services.  This can be a library layer, a separate server process, or a TCP/IP load balancer--anything that provides a transparent indirection layer.  The directory information contains rules to route SQL queries correctly to the actual location of data.  There is a notification protocol that permits connectors to receive updates about the fabric topology and confirm that they have responded to them.

Motivation

Connecting to data is a problem in large systems.  Sharded data sets spread data across multiple services.  Data services have different roles, such as master or slave.  Services fail or go offline for maintenance.  Services change roles, such as a master switching to a slave.  Shards move between services to balance load and use storage more efficiently.  Within short periods of time there may be significant variations in load across data services. Adding routing logic directly to applications in these cases adds complexity and can lead to a tangled mess for administrators.

The fabric connector design pattern encapsulates logic to route connections from the application to DBMS servers.  Hiding connection logic helps keep applications simple.  It allows independent testing and tuning of the connection rules.  That way you can have some assurance the logic actually works.  You can also modify fabric behavior without modifying applications, for example to redistribute load more evenly across replicas.
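
As a purely illustrative sketch (the data structure and helper names below are invented, not part of any product), the encapsulated routing logic boils down to a lookup that maps a shard or schema plus the request type to a physical server, using directory information the connector keeps up to date.

# Illustrative sketch only: route a request to a physical DBMS server
# using directory information (shard locations, server roles, status).
# The directory structure and names are invented for clarity.
sub route_request {
  my ($directory, $shard, $wants_write) = @_;
  my $service = $directory->{shards}{$shard}
    or die "No data service holds shard '$shard'\n";
  my @online = grep { $_->{status} eq 'online' }
               @{ $directory->{services}{$service} };
  # Writes must reach the master; reads may use any online replica.
  @online = grep { $_->{role} eq 'master' } @online if $wants_write;
  die "No online server for shard '$shard'\n" unless @online;
  return $online[0]{host};   # a real connector would also load balance here
}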

Related Design Patterns

The fabric connector design pattern manages single application connections to data services, for example a transactional data service.  Transparency is the leitmotif of this pattern.  It provides needed encapsulation for other data fabric design patterns and is particularly critical for sharded as well as fault tolerant data services.  These will be covered in future articles on data fabric design.

There are also other design patterns for data access.  Here are two that should not be confused with fabric connectors.
  • Federated query.  Federated query splits a SQL query into sub-queries that it routes to multiple underlying data services, then returns the results.  Sharding products like DbShards and shard-query implement this pattern.  It requires complex parsing, query optimization, and aggregation logic to do correctly and has varying levels of transparency. 
  • MapReduce.  MapReduce is a procedure for breaking queries into pieces that can run in parallel across large numbers of hosts by splitting the query into map operations to fetch data followed by reduce operations to aggregate results.  It can work on any distributed data set, not just SQL.  MapReduce implementations often eschew SQL features like joins and also can have a very different programming model from SQL.  Their use is often non-transparent to SQL applications.
Finally, there is a very important pattern for the fabric directory service.  This is a directory service that maintains information about the desired topology of the fabric and its actual state.  It can be implemented in forms ranging from a shared configuration file to network services in a distributed configuration manager like ZooKeeper.

I hope to add more complete descriptions for the latter three design patterns at some point in the future.  For the current article, we will stick to simple connectivity.

Detailed Behavior

Fabric connectors are conceptually simple: route requests from the application to a server, then transfer results back.  Actual behavior can be quite complex. To give some perspective on the problem, here is a short Perl program for a SaaS application that logs order detail information in a table named sale, then reads the same data back.  We will use the sample program to illustrate the responsibilities of this design pattern in detail.

use DBI;
# Connect to server. 
$dbh = DBI->connect("DBI:mysql:test;host=prodg23", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

# Insert order using a transaction. 
$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();

# Select order back with an auto-commit read
$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();
while( $href = $sth->fetchrow_hashref ) {
  print "id      : $$href{id} \n";
  print "order_id: $$href{order_id} \n";
  print "cust_id : $$href{cust_id} \n";
  print "sku     : $$href{sku} \n";
  print "amount  : $$href{amount} \n";
}

# Disconnect from server. 
$dbh->disconnect();
The first responsibility of the fabric connector design pattern is to provide a transparent interface for underlying data services.  That means that our Perl program has to work as written--no extra changes.  Here are just a few things a connector needs to do:
  1. Implement the DBMS connection protocol fully or pass it transparently to an underlying server.  This includes handling authentication handshakes as well as setting base session context like client character sets. 
  2. Handle all standard features of query invocation and response, including submitting queries,  returning automatically generated keys, and handling all supported datatypes in results.  
  3. Respect transaction boundaries so that the INSERT statements on the sales table are enclosed in a transaction in the DBMS and the SELECT statement is auto-commit (i.e., a single-statement transaction.) 
  4. Read back data written to the sales table.  
In addition to handling API protocols, fabric connectors need to avoid slowing down transaction processing as a result of proxying.  Properly written connectors for the most part add minimal overhead, but there are at least two instances where this may not be the case for some implementations (such as network proxies).  The first is establishing connections, a relatively expensive operation that occurs constantly in languages like PHP that do not use connection pools.  The second is short primary key-lookup queries on small datasets, which tend to be memory-resident in the server and hence have quick access.

One common reaction is to see such overhead as a serious problem and avoid the whole fabric connector approach.  Yet the "tax" applications pay for proxying is not the whole story on performance.  Fabric connectors can boost throughput by an order of magnitude by distributing load intelligently across replicas. To understand the real application overhead of a connector you therefore need to measure with a properly sized data set and take into account load-balancing effects.  Test results on small data sets that reside fully in memory with no load balancing tend to be very misleading.  

The remaining fabric connector design pattern responsibilities are closely related:  route requests accurately to the correct service, load-balance queries across replicas within a service, and route around replicas that are down due to maintenance or failure.  We call these routing responsibilities.  They require information about the fabric topology, which is maintained in the connector's directory information.  Here is a diagram of typical directory organization.  
Fabric Directory Service Organization
Let's start with the responsibility to route requests to data services.  A simple fabric connector implementation allows connections using a logical server name, such as group2, which the connector would translate to an actual DBMS server and port, such as prodg23:3306.  A better fabric connector would allow applications to use a customer name like "walmart" that matches what the application is doing.  The connector would look up the location of customer data and connect automatically to the right server and even DBMS schema.  This is especially handy for SaaS applications, which often shard data by customer name or some other simple identifier.

We could then change our program as follows to connect to the local host and look for the "walmart" schema.  Under the covers, the fabric connector will connect to the prodg23 server and use the actual schema for that customer's data. 

use DBI;
# Connect to customer data.  
$dbh = DBI->connect("DBI:mysql:walmart;host=localhost", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

This is a modest change that is very easy to explain and implement.  It is a small price to pay for omitting complex logic to locate the correct server and schema that contains the data for this customer. 

The next responsibility is to distribute load across replicas.   This requires additional directory information, such as the DBMS server role (master vs. slave), current status (online or offline), and other relevant information like slave latency or log position.  There are many ways to use this information effectively.   Here are a few of the more interesting things we can do.
  1. Slave load balancing.  Allow applications to request a read-only connection, then route to the most up-to-date slave.  This works well for applications such as Drupal 7, which is an application for website content management.  Drupal 7 is slave-enabled, which means that it can use separate connections for read-only queries that can run on a replica.  Many applications tuned to work with MySQL have similar features. 
  2. Session load balancing.  Track the log position for each application session and dispatch reads to slaves when they are caught up with the last write of the session.  This is a good technique for SaaS applications that have large numbers of users spread across many schemas.  It is one of the most effectively scaling algorithms for master/slave topologies.     
  3. Partitioning.  Split requests by schema across a number of multi-master data services.  SQL requests for schema 1 go to server 1, requests for schema 2 to server 2, etc.  Besides distributing load across replicas this technique also helps avoid deadlocks, which can become common in multi-master topologies if applications simultaneously update a small set of tables across multiple replicas.  
Recalling our sample program, we could imagine a connector using session load balancing to write the sales table transaction to the master DBMS server, then sending the SELECT to a slave if it happened to be caught up for customer "walmart."  No program changes are required for this behavior.  
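
Here is a minimal sketch of the session load balancing decision, assuming the connector remembers the log position of each session's last write and the directory exposes how far each replica has applied.  The field names are invented for illustration.

# Sketch of session load balancing: a read may go to a slave only if that
# slave has applied the session's last write.  Field names are invented.
sub pick_read_server {
  my ($session, $service) = @_;
  foreach my $slave (grep { $_->{role} eq 'slave' && $_->{status} eq 'online' }
                     @{ $service->{servers} }) {
    return $slave if $slave->{applied_seqno} >= $session->{last_write_seqno};
  }
  return $service->{master};   # no slave is caught up; fall back to the master
}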

The final responsibility is to route traffic around offline replicas.  This gets a bit complicated.  We need not only state information but an actual state model for DBMS servers.  There also needs to be a procedure to tell fabric connectors about a pending change as well as wait for them to reconfigure themselves.  Returning to our sample program, it should be possible to execute the following transaction: 

$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();

then failover to a new master and execute:

$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();
...

To do this properly we need to ensure that the connector responds to updates in a timely fashion.  We would not want to change fabric topology or take a DBMS server offline while connectors were still using it.  The notification protocol that updates connector directory information has to ensure reconfiguration does not proceed until connectors are ready.
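
One way to picture that requirement is a notify-and-acknowledge exchange: publish the pending topology to every connector, wait for each one to confirm it has redirected or drained its connections, and only then proceed.  The sketch below is schematic; real protocols, Tungsten's included, carry considerably more state.

# Schematic notify-and-acknowledge exchange: reconfiguration waits until
# every connector confirms the new topology.  notify() and acknowledged()
# stand in for a real messaging layer.
sub reconfigure {
  my ($connectors, $new_topology) = @_;
  $_->notify($new_topology) for @$connectors;      # step 1: publish
  my $deadline = time() + 30;
  while (grep { !$_->acknowledged($new_topology) } @$connectors) {
    die "Connectors did not acknowledge in time\n" if time() > $deadline;
    sleep 1;                                       # step 2: wait for acks
  }
  return 1;                                        # step 3: safe to proceed
}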

Does every fabric connector have to work exactly this way?  Not at all.  So far, we have only been talking about responsibilities.  There are many ways to implement them.  To start with, fabric connectors do not even need to handle SQL.  This is interesting in two ways.

First, you can skip using the Perl DBI completely and use a specialized interface to connect to the fabric. We will see an example of this shortly.  Second, the underlying store does not even need to be a SQL database at all.  You can use the fabric connector design pattern for other types of stores, such as key-value stores that use the memcached protocol.  This series of articles focuses on SQL databases, but the fabric connector design pattern is very general.

Implementations

Here are a couple of off-the-shelf implementations that illustrate quite different ways to implement the fabric connector design pattern.

1. Tungsten Connector.  Tungsten Connector is a Java proxy developed by Continuent  that sits between applications and clusters of MySQL or PostgreSQL servers.  It implements the MySQL and PostgreSQL network protocols faithfully, so that it appears to applications like a DBMS server.

Tungsten Connector gets directory information from Tungsten clusters.  Tungsten clusters use a simple distributed consensus algorithm to keep directory data consistent across nodes even when there are failures or network outages--connectors can receive topology updates from any node in the cluster through a protocol that also ensures each connector acts on it when the cluster reconfigures itself.  In this sense, Tungsten clusters implement the fabric directory service pattern described earlier.

The directory information allows the connector to switch connections transparently between servers in the event of planned or even some unplanned failovers.   It can also load balance reads automatically using a variety of policies, including the slave load balancing and session load balancing techniques described above.

The big advantage of the network proxy approach is the high level of transparency for all applications.  Here is a sample session with the out-of-the-box mysql utility that is part of MySQL distributions.  In this sample, we check the DBMS host name using the MySQL show variables command.   Meanwhile, a planned cluster failover occurs, followed by an unplanned failover.

mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg23 |
+---------------+---------+
1 row in set (0.00 sec)
(Planned failover to prodg21 to permit upgrade on prodg23)
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg21 |
+---------------+---------+
1 row in set (0.01 sec)
(Unplanned failure to prodg22)
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg22 |
+---------------+---------+
1 row in set (4.82 sec)
As this example shows, the session continues uninterrupted as the location of the server switches.  These changes occur transparently to applications.  The downside is that there is some network overhead due to the extra network hop through the Tungsten Connector, though of course load balancing of reads can more than repay the extra latency cost.  Also, this type of connector is hard to build because of the complexity of the MySQL network API as well as the logic to transfer connections seamlessly between servers.

2. Gizzard.  Gizzard is an open source sharding framework developed by Twitter to manage links between Twitter users.  The proxy part of the design pattern is implemented by middleware servers, which accept requests from clients using Thrift, a language-independent set of tools for building distributed services.  For more on a particular application built on Gizzard, look at descriptions of Twitter's FlockDB service.  Gizzard servers give applications a simple API for data services, which fulfills the proxy responsibility of the fabric connector design pattern.

Gizzard servers get directory information using gizzmo.  Gizzmo is a simple command line tool that maintains persistent copies of the Gizzard cluster topology and takes care of propagating changes out to individual Gizzard servers.  For more on how Gizzmo works, look here.  Using this information, Gizzard servers can locate data, route around down servers, and handle distribution of queries to replicas, which are the final three responsibilities of the fabric connector design pattern.

The Gizzard architecture lacks the generality of the Tungsten Connector, because it requires clients to use a specific interface rather than general-purpose SQL APIs.  It also introduces an extra network hop.  On the other hand, it works extremely well for its intended use case of tracking relationships between Twitter users.  This is because Gizzard deals with a simplified problem and also allows scalability through many Gizzard servers.  Like the Tungsten Connector the network hop expense pays for itself due to the ability to load-balance across multiple replicas.

Gizzard is a nice example of how the fabric connector design pattern does not have to be specifically about SQL.  Gizzard clients do not use SQL, so the underlying store could be anything.  Gizzard is specifically designed to work with a range of DBMS types.

Fabric Connector Implementation Trade-Offs

General-purpose fabric connectors like the one used for Tungsten are hard to implement for a variety of reasons.  This approach is really only practical if you have a lot of resources at your disposal or are doing it as a business venture like Continuent.  You can still roll your own implementations.  The Gizzard architecture nicely illustrates some of the trade-offs necessary to do so.

1. General vs. particular data service interfaces.  Implementing a simple data service interface, for example using thrift, eliminates the complexity of DBMS interfaces like those of MySQL or PostgreSQL.  Rather than a thrift server you can also use a library within applications themselves.  This takes out the network hop.

2. Automatic vs. manual failover.  Automatic failover requires connectors to respond to fabric topology changes in real time, which is a hard problem with a lot of corner cases.  (Look here if you disagree.) You can simplify things considerably by minimizing automated administration and instead orchestrate changes through scripts.

3. Generic vs. application-specific semantics.  Focusing on a particular application allows you to add features that are important for particular use cases.  Gizzard supports shard migration.  To make that tractable to implement, Gizzard requires a simple update model in which transactions can be applied in any order.

These and other simplifications make the fabric connector design pattern much easier to implement correctly.  You can make the same sort of trade-offs for most applications.

Implementations to Avoid

Here are a couple of implementations for the fabric connector design pattern that you should avoid or at least approach warily.

1. Virtual IP Addresses (VIPs).  VIPs allow hosts to listen for traffic on multiple IP addresses.  They are commonly used in failover schemes managed by programs like heartbeat.  VIPs do not have the intelligence to fulfill fabric connector responsibilities like load-balancing queries across replicas.  They are subject to nasty split-brains, a subject I covered in detail as part of an earlier article on this blog.   Finally, VIPs are not available in Amazon and other popular cloud environments.  VIPs do not seem like a good implementation choice for data fabrics.

2. SQL proxies.  There are a number of software packages that solve the problem of proxying SQL queries, such as PgBouncer or MySQL Proxy.  Many of them do this quite well, which means that they fulfill the first responsibility of the fabric connector design pattern.  The problem is that they do not have a directory service.  This means they do not fulfill the next three responsibilities to route queries effectively, at least out of the box.

Unlike VIPs, SQL proxies can be a good starting point for fabric implementations.  You need to add the directory information and notification protocol to make them work.  It is definitely quite doable for specific cases, especially if you make the sort of trade-offs that Gizzard illustrates.

Conclusion and Takeaways

The fabric connector design pattern reduces the complexity of applications by encapsulating the logic required to connect to servers in a data fabric.  There is a tremendous benefit to putting this logic in a separate layer that you can test and tune independently.  Fabric connectors are more common than they appear at first because many applications implement the responsibilities within libraries or middleware servers that include embedded session management.  Fabric connectors do not have to expose SQL interfaces or any other DBMS-specific interface, for that matter.

Fault-tolerant and sharded data service design patterns depend on fabric connectors to work properly and avoid polluting applications with complex logic to locate data.  Products that implement these design patterns commonly include fabric connector implementations as well.  You can evaluate them by finding out how well they fulfill the design pattern responsibilities. 

Off-the-shelf fabric connectors have the advantage that they are more general than something you can develop easily for yourself.  If you decide to write your own fabric connector, you will need to consider some of the trade-offs like reducing automation or simplifying APIs in order to make the problem easier to solve.  Regardless of the approach, you should allow ample time.  The responsibilities are complicated and must be implemented with care.  Fabric connectors that only work 99% of the time are not much use in production environments.

One final point about fabric connectors.  Automated failover can make fabric connectors harder to implement and increase the risk that the fabric connector may write to the wrong replica.  The difficulty of managing connectivity is one of the reasons many data management experts are very cautious about automatic failover.  This problem is tractable in my opinion, but it is definitely a practical consideration in system design.

My next article on data fabrics will cover the fault-tolerant data service design pattern.  This design pattern depends on the fabric connector design pattern to hide replicas.  I hope you will continue reading to find out about it.  

Apr 3, 2012

Solving the Cloud Database Memory Conundrum

Cloud databases have a memory problem.   Continuent has been doing a lot of Amazon deployments lately, and it is becoming apparent that memory utilization in those environments is more than just an inconvenience.  In this article I would like to discuss the memory problem that we see in customer implementations and some new features of Tungsten Enterprise that help alleviate it for MySQL.

The Cloud Memory Problem and Database Arrays

As I discussed in a recent article about prefetch, the amount of RAM allocated to the InnoDB buffer pool is one of the principal determinants of MySQL performance.  The speed difference between using a page in the buffer pool vs. getting it from storage is commonly about 100:1 on RAIDed disk. The other determinant is working set size, i.e., the percentage of pages that need to be memory-resident for adequate performance. Working set size is variable and depends on your query patterns as well as the level of indexing on tables. These two variables set a limit to the amount of data you can manage and still access information quickly. Here is a table that shows the relationship.


Max GB of manageable storage

Buffer Pool Size (GB)   5% resident   10% resident   25% resident   50% resident
15                      300           150            60             30
30                      600           300            120            60
60                      1200          600            240            120
120                     2400          1200           480            240

The largest instance currently available from Amazon EC2 is a Quadruple Extra Large, which offers 68GB of RAM.  Let's assume we allocate 85% of that to the buffer pool, which is about 58GB.   We'll be generous and say 60GB so it matches my table. Assuming our application performs reasonably with 10% of pages resident we can then manage a maximum of 600GB of stored data per server.
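
The arithmetic is easy to sanity-check in a few lines; the working set ratios below are the ones from the table above.

#!/usr/bin/perl
# Maximum manageable storage = buffer pool size / working set ratio.
use strict;
use warnings;

my $buffer_pool_gb = 60;    # roughly 85% of a 68GB Quadruple Extra Large
foreach my $working_set (0.05, 0.10, 0.25, 0.50) {
  printf "%2.0f%% resident: %5.0f GB manageable\n",
    $working_set * 100, $buffer_pool_gb / $working_set;
}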

The fact that EC2 instances are rather small understates the difficulty with RAM in Amazon.  You also need more of it.  Amazon EBS is slower than typical on-premise storage, such as your friendly direct-attached RAID, and has highly variable performance as Baron Schwartz pointed out.  You might therefore be inclined to double the working set ratio to 20%, since I/O is so costly you need a bigger working set to minimize reads from storage.  Quadruple Extra Large instances then max out at 300GB of managed data.  I'm not forgetting SSDs, which are slowly appearing in cloud environments.  They alter the working set ratios but introduce other issues, for example related to cost and capacity.  It's also not clear how much faster they really are in shared environments like Amazon.

This simple storage math makes it obvious that managing data in the cloud requires a mental shift from fewer but larger DBMS servers to groups of much smaller servers, which we can describe as database arrays. Many small SaaS businesses generate a Terabyte of customer data in a year, hence would provision 4 new 300GB servers annually.  Mobile and market automation apps generate many Terabytes per year.  I recently did a capacity calculation for a customer where we hit 50 servers without even considering replicas to ensure availability.   

Database arrays therefore address memory by partitioning storage.   Many applications already partition into shards for other reasons.  In this case, the main question is how the applications find data easily.  Locating "virtualized databases" is a non-trivial problem, especially when you consider that servers may fail, move, or go offline for maintenance.  The more servers you have, the more likely the special cases become.  This is where Tungsten Enterprise comes in. 

Tungsten Enterprise 1.5 Features for Cloud Database Arrays

Tungsten Enterprise builds clusters by taking a set of off-the-shelf DBMS servers linked by replication and making them look like a single virtualized DBMS server to applications.   We call this server a data service.  The key to the single-DBMS illusion is the Tungsten Connector, a fast proxy that sits between applications and database hosts.   It routes connections into the right type of server, for example a master for writes or a slave for reads.  It also allows you to switch masters without stopping applications or losing data.

Tungsten Enterprise already offers several nice properties for cloud operation.  The Connector is software-only and does not use VIPs, which are not supported in Amazon anyway.  Also, the fact that you can switch master and slaves without stopping applications makes it possible to "escape" slow EBS volumes.  That said, previous Tungsten versions limited connections to a single data service per connector, which made database arrays hard to support properly.

Tungsten Enterprise 1.5 adds two features to help with constructing and managing arrays.  First, our colleague Jeff Mace created a great Ruby-based install program called tpm that can install complex cluster topologies from the command line.  If you have used the tungsten-installer program on Tungsten Replicator, you have an idea what is on the way. (If not, Jeff will be talking about installations at the Percona Live MySQL Conference on April 11.)   Efficient command-line installations make it easy to set up and manage a lot of clusters in parallel.

Second, thanks to work by Ed Archibald and Gilles Rayrat, the Connector now supports multi-service connectivity.  Connectors previously kept track of cluster state by connecting to one of a list of managers for the local data service.   We extended this to allow Connectors to track state in multiple data services. We developed the feature to support disaster recovery sites, which are another Tungsten 1.5 feature.  As it turns out, though, multi-service connectivity is extremely useful for database arrays in the cloud.  The following diagram compares single and multi-data service connectivity.


As with a single data service the Connector receives state changes providing location of the data service master and slaves as well as whether each is online or offline.  The Connector uses a simple routing scheme based on logins to route SQL connections to the correct server for whatever it is doing.  Beyond that, Tungsten Enterprise takes care of switching masters and load balancing reads that are necessary to use an array efficiently.  

Setting up Multi-Service Connectivity

Multi-service access is trivial to set up.  This is documented more fully in the forthcoming Tungsten Enterprise 1.5 documentation, but here is the basic idea.

First, each Tungsten Enterprise installation contains a file called dataservices.properties in the cluster-home/conf directory.  The Connector uses this file to locate data service managers in order to get cluster state changes.  To enable multi-service connectivity, just ensure there are entries for the managers of each data service in your array before starting the local Connector service.

cluster1=pr1c1.bigdata.com,pr2c1.bigdata.com,pr3c1.bigdata.com
cluster2=pr1c2.bigdata.com,pr2c2.bigdata.com,pr3c2.bigdata.com
cluster3=pr1c3.bigdata.com,pr2c3.bigdata.com,pr3c3.bigdata.com
...

Second, the Connector uses a file called user.map located in tungsten-connect/conf to define logins.  Edit this file and add a different login for each cluster to user.map.  Save the file and the connector will pick it up automatically.

# user password service
cluster1 secret cluster1
cluster2 secret cluster2
cluster3 secret cluster3
...

That's about it.  Applications need to make one change, namely to use a different login for each data service.  For some applications, that is quite simple and can be done in a few minutes.  I found to my surprise that some customer applications even do it already, which is very convenient.  For others, it may require more extensive changes.   The good news is that improvements are on the way.  
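
For a Perl application like the examples earlier on this page, the change really is just the login on the connect call.  The host, schema, and password below are placeholders for your own environment; the login matches the cluster1 entry in user.map.

use DBI;
# Connect through the local Tungsten Connector using the login that
# user.map associates with the cluster1 data service.  The schema name
# and password are placeholders.
my $dbh = DBI->connect("DBI:mysql:app_db;host=localhost", "cluster1", "secret")
    || die "Could not connect to database: $DBI::errstr";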

What's Next

Tungsten Enterprise 1.5 is a solid step toward enabling large-scale data management using MySQL database arrays.  However, we can do much more.  Obviously, it would be easier if the Connector routed to the right DBMS server using the database name on the connect string or by detecting use commands in SQL.  That's on the way already.

More generally, the whole point of Tungsten is to make off-the-shelf DBMS work better by adding redundancy and optimizing use of resources without forcing users to migrate or rearrange data.  Memory is a key resource in cloud environments and deserves special attention.  Tungsten connectivity looks like a powerful complement to the caching that already occurs within databases.  It's also a natural extension of the work we have been doing to speed up replication.

Looking to the future, we are investigating more intelligent load balancing that sends queries to servers where the pages they need are most likely to be already in the buffer pool.  This type of optimization can double or triple effective buffer pool size in the best cases.  Query caching also offers potential low-hanging fruit.  Finally, management parallelization is critical for array operation.  I foresee a long list of Tungsten improvements that are going to be fun to work on and that go a long way to solving the cloud database memory conundrum.

p.s., Continuent has a number of talks scheduled on Tungsten-related topics at the Percona Live MySQL Conference and the SkySQL MariaDB Solutions Day April 10-13 in Santa Clara.  For more on Tungsten Enterprise and EC2 in particular, check out Ed Archibald's talk on April 11.  Giuseppe Maxia and I will also touch on these issues in our talk at the SkySQL/MariaDB Solutions Day.

Aug 30, 2011

Practical Multi-Master Replication using Shard Filters

Earlier this month I published an article on this blog describing the system of record approach to multi-master replication.  As mentioned in that article my colleagues and I at Continuent have been working on improving Tungsten to make system of record design patterns easier to implement.  This article describes how to set up system of record using Tungsten Replicator shard filters, which are a new feature in Tungsten 2.0.4.  By doing so we will create a multi-master configuration that avoids replication loops and transaction conflicts.  On top of that, it is quite easy to set up.

There are many possible system of record patterns depending on how many schemas are shared and across how many masters.  The following diagram shows three of them.  In contrast to many so-called MySQL multi-master implementations, all masters are live and accept updates.  (By contrast, schemes such as MySQL-MHA make extra masters read-only.  Don't be fooled!)  

For today's exercise we will implement the basic system of record.  Once you understand this you can quickly set up other multi-master scenarios.  

Defining Shard Master Locations

The first step is to tell Tungsten where each shard is mastered.  By mastered we mean it is the one master that receives application updates on that shard, whereas all other masters have copies only or may not even contain the shard at all.  Tungsten uses a variant of CSV (comma-separated format) where the first line contains column names.  You can have any amount of whitespace between entries.  Create a file called shard.map with your favorite editor and type in the following lines.

shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme         sjc     false
pinnacle        nyc     false

The first column is the name of the shard.  This must be unique--because a shard can only live on one master.  The next column is the "home" master for the shard.  This is the one and only master that should receive shard updates.  The third column defines whether the shard is critical and requires full serialization.  It will be linked to parallel replication in a later release.  

It turns out you do not need to add entries for Tungsten catalog schemas such as tungsten_nyc.  Tungsten Replicator will create them automatically.  They are shown here for completeness only.  

Creating Replication Services

Next we need to define services to replicate bi-directionally between DBMS servers and set options to filter shards using the ShardFilter class, which is new in Tungsten 2.0.4.  The shard filter helps ensure that shards replicate from their home masters only and not from other locations.   If you do not know what replication services are, you can find a description of them in this article.  

Multi-master replication is easy to mis-configure, so to prevent accidents we will tell the shard filter to generate an error any time it processes a shard it has never seen before.  The replication service will immediately fail, which signals that we have to update shard definitions.  This is the safest way to implement system of record or any multi-master configuration for that matter.  It is generally easier to restart replication after correcting the configuration than to mix up data, which can lead to major outages.  

The first step is to set replication services for each master.  These read the binlog and make transactions available to slave replication services.  Here are the commands.   Note that the sjc master is on host logos1, while the nyc master is on logos2.  The remaining examples use these names consistently. 

# Define common master settings. 
COMMON_MASTER_OPTS="--datasource-user=tungsten --datasource-password=secret \
 --home-directory=/opt/continuent --svc-parallelization-type=disk \
 --svc-extractor-filters=shardfilter \
 --property=replicator.filter.shardfilter.unknownShardPolicy=error"

# Set up sjc master. 
tools/tungsten-installer --master-slave -a --master-host=logos1 \ 
--cluster-hosts=logos1 --service-name=sjc $COMMON_MASTER_OPTS --start-and-report 

# Set up nyc master. 
tools/tungsten-installer --master-slave -a --master-host=logos2 \
--cluster-hosts=logos2 --service-name=nyc $COMMON_MASTER_OPTS --start-and-report 

The --svc-extractor-filters option adds shard filtering immediately after event extraction.  The unknownShardPolicy=error setting will cause the masters to die if they process an undefined shard.  

Now we can define the remote slave services for sjc and nyc.  These are special slaves that write transactions onto another master as opposed to a normal slave.  We would like slave services to error out on unknown shards as well.  Also (and this is important) we want them to enforce shard homes.  Here are the commands to create the services and start each one.  

COMMON_SLAVE_OPTS="--release-directory=/opt/continuent/tungsten \
--service-type=remote --allow-bidi-unsafe=true --svc-parallelization-type=disk \
--svc-applier-filters=shardfilter \
--property=replicator.filter.shardfilter.unknownShardPolicy=error \
--property=replicator.filter.shardfilter.enforceHome=true"

# Set up sjc remote slave. 
tools/configure-service -C -a --host=logos2 \
--local-service-name=nyc --role=slave \
--datasource=logos2 --master-host=logos1 $COMMON_SLAVE_OPTS sjc
trepctl -host logos2 -service sjc start

# Set up nyc remote slave. 
tools/configure-service -C -a --host=logos1 \
--local-service-name=sjc --role=slave \
--datasource=logos1 --master-host=logos2 $COMMON_SLAVE_OPTS nyc
trepctl -host logos1 -service nyc start

The --svc-applier-filters option adds shard filtering before applying to the DBMS.  The unknownShardPolicy=error setting will cause the slaves to die if they process an undefined shard.  Finally, the enforceHome=true option means that each slave will drop any transaction that lives on a different service from that slave's master.  

At the end of this procedure, your services should be online and ready to run.  Use 'trepctl services' to make sure.

Loading Shard Definitions

To make shard definitions take effect, you must load the shard.map contents into each replication service.  You can do this any time the replicator is running but after loading new definitions you must put the replicator online again.  Here are the commands to load the shard maps onto each of the four replication services.   For each replication service, you must delete the old definitions, reload new ones, and get the replicator to go online again. 

#!/bin/bash
MAP=shard.map
for host in logos1 logos2
do
  for service in sjc nyc
  do
    trepctl -host $host -service $service shard -deleteAll
    trepctl -host $host -service $service shard -insert < $MAP
    trepctl -host $host -service $service offline
    trepctl -host $host -service $service wait -state OFFLINE
    trepctl -host $host -service $service online
  done
done

This looks a little clunky and will be reduced to a single command instead of five in a later release.  I put it in a script to make it quicker to run.  The good news is that there is just one shard map that works for all replication services, regardless of location or role.  

Once you finish this step, you can go to any replication service and list the shards it knows about.  Let's pick a service and demonstrate: 

$ trepctl -host logos1 -service sjc shard -list
shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme sjc false
pinnacle nyc false

With this we are ready to start processing some transactions. 

Multi-Master Operation

At this point we have multi-master replication enabled between hosts logos1 and logos2.  You can try it out.  Let's add the acme database to the sjc master on logos1 as an example. 

mysql -utungsten -psecret -hlogos1 
mysql> create database acme;
mysql> use acme
mysql> create table foo (id int);
mysql> insert into foo values(1);

We can see that all of these commands replicate over to the logos2 server quite easily with the following command: 

mysql -utungsten -psecret -hlogos2 -e 'select * from acme.foo'
+------+
| id   |
+------+
|    1 | 
+------+

That seems pretty simple.  In fact it is.  You can go over to logos2 and enter transactions for pinnacle in the same way.  Data replicate back and forth.  There are no replication loops.  There are also no conflicts.  

Adding a New Shard

So what happens when we add a new shard?  The simplest way to see is to create a database using a schema name that does not exist in the shard map.   Let's try to create a database named superior on the nyc master.  

mysql -utungsten -psecret -hlogos2 -e 'create database superior'

Now check the status of the nyc master replication service.  We see it has failed with an error due to the unknown shard.   (Tungsten parses the create database command and assigns it the shard ID "superior.") 

$ trepctl -host logos2 -service nyc status
Processing status command...
NAME                     VALUE
----                     -----
...
pendingError           : Stage task failed: binlog-to-q
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000157:0000000000002475;1287
pendingErrorSeqno      : 8
pendingExceptionMessage: Rejected event from unknown shard: seqno=8 shard ID=superior
...
state                  : OFFLINE:ERROR...
Finished status command...

This problem is quite easy to fix.  We just open up the shard.map file and add a row for superior so that the file contents look like the following: 

shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme         sjc     false
pinnacle        nyc     false
superior        nyc     false

Reload the shard.map file as shown previously and you will be back in business.  Incidentally, if you do not want the superior database to be replicated to other masters, you can also specify this in the rules.  Just give superior the special master name #LOCAL as in the following example and it will not replicate outside the nyc service. 

superior        #LOCAL  false

In fact, #LOCAL means that any schema named superior will not replicate outside the service in which it is defined.  You can have an unshared schema named superior on every master.  

Where to Next?

The shard support described in this article is now part of Tungsten 2.0.4 and will appear in the official build when it is finally ready.  You can try it out right now using one of our handy nightly builds.  

We plan to build out shard filtering quite a bit from the current base.  One immediate fix is to put in a check so that if an application commits shard updates on the wrong DBMS server, the master replication service on that server will detect it and fail.  This will tell you there's a problem immediately rather than letting you wallow in blissful ignorance while your data become hopelessly mixed up.  We will also simplify the commands to update shards while replicators are online. 

Longer term we will be adding features to propagate shard definitions through replication itself.  Stay tuned for more work in this area.  If you want to help fund work to enable your own applications, please get in contact with me at Continuent.  I can think of at least a dozen ways to make our multi-master support better but it's always nicer to spend the effort on features that enable real systems.  In the meantime, I hope you find multi-master with shard filtering useful and look forward to your feedback.