SAP HANA Distributed In-Memory Database System: Transaction, Session, and Metadata Management
Abstract—One of the core principles of the SAP HANA database system is the comprehensive support of distributed query processing. Supporting scale-out scenarios was one of the major design principles of the system from the very beginning. Within this paper, we first give an overview of the overall functionality with respect to data allocation, metadata caching, and query routing. We then dive into some level of detail for specific topics and explain features and methods not common in traditional disk-based database systems. In summary, the paper provides a comprehensive overview of distributed query processing in the SAP HANA database to achieve scalability for large databases and heterogeneous types of workloads.

I. INTRODUCTION

An efficient and holistic data management infrastructure is one of the key requirements for making the right decisions at an operational, tactical, and strategic level. The SAP HANA database is the core component of SAP's HANA roadmap, laying the foundation to efficiently support all SAP and non-SAP business processes from a data management perspective [1]. In contrast to the traditional architecture of a database system, the SAP HANA database takes a different approach to provide support for a wide range of data management tasks. For example, the system is organized in a main-memory centric fashion to reflect the shift within the memory hierarchy [9] and to consistently provide high performance without any slow disk interactions. Completely transparently for the application, data is organized along its life cycle either in column or row format, providing the best performance for different workload characteristics [10]. Transactional workloads with a high update rate and point queries are routed against a row store; analytical workloads with range scans over large datasets are supported by column-oriented data structures. In addition to high scan performance over columns, the column-oriented representation offers an extremely high potential for compression, making it possible to store even large datasets within the main memory of the database server. Recent developments in the hardware sector economically allow having off-the-shelf servers with 2 TByte of DRAM. The main-memory centric approach therefore turns the classical architectural paradigm upside down: while traditional disk-centric systems try to guess the hot data to be cached in main memory, the SAP HANA approach defaults to having everything in main memory; only "cold" data—usually determined by complex business rules and not by buffer pool replacement strategies working without any knowledge of the application domain and corresponding business objects—can be staged out onto disk infrastructures. This allows SAP HANA to support even very large databases, in terms of both a large number of tables and data volumes sufficient to serve all SAP customers with existing and future applications.

In addition to performance, the SAP HANA database also aims to support business processes from a holistic perspective. For example, the system may hold text documents of products within an order together with structured information about the customer and spatial information about the current delivery route. As outlined in [2], the SAP HANA database provides multiple engines exposing special services. Data entered, for example, in text format can be extracted, semantically enriched, and transformed into structured data for combination with information coming from an engine optimized for graph-structured analytics. Combining heterogeneous datasets seamlessly within a single query processing environment and providing support for the complete life cycle of data on a business object level are some of the unique features of SAP HANA.

Finally, SAP HANA is positioned to act as a consolidation platform for many different use cases, from an application perspective and from a data management perspective. Multiple SAP and non-SAP applications may run on top of one single SAP HANA instance providing the right degree of
scenarios with some very specific optimizations like optimizing the two-phase commit protocol (subsection III-D) or providing sophisticated mechanisms for session management (subsections III-C.1, III-C.2, and III-C.3). As mentioned, the deployment of the system usually reflects the intended use, with a large node benefiting heavy transaction processing and a number of usually smaller nodes serving analytical workloads, where the additional overhead of distributed synchronization represents a relatively small portion of the overall query runtime. Since SAP HANA relies on MVCC as the underlying concurrency control mechanism, the system provides distributed snapshot isolation and distributed locking to synchronize multiple writers. To avoid a centralized lock server as a potential single point of failure, the system relies on a distributed locking scheme with a global deadlock detection mechanism.
C. Distributed Metadata Management

Within an SAP HANA database landscape, a coordinator node stores and manages all the persistent metadata, such as table/view schemas, user information, privileges on DB objects, etc. To satisfy the requirements for consistent metadata access, the metadata object container provides both MVCC-based access and transactional (ACID) updates of its contents. It also provides index-based fast object lookup.
Fig. 2. Distributed metadata management within an SAP HANA landscape

In order to improve access to metadata at worker nodes, the concept of metadata caches enables local access to "remote" metadata in a distributed environment. Figure 2 shows the metadata object container and cache in the coordinator and worker nodes. When a component in a worker node requires access to a metadata object located at the (remote) coordinator, the metadata manager first tries to locate it in the cache. If there is no matching object in the cache, a corresponding retrieval request is sent to the coordinator. The result is placed in the cache, and access is granted to the requesting component. In order to reduce potential round trips to fetch different entries of metadata, the system applies group caching of tightly related metadata; e.g., a cache request for metadata related to a table also returns metadata about columns, existing indexes, etc. within a single request. For consistent query processing, access to the metadata cache is tightly coupled with the transaction management.
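To make the group-caching idea concrete, the following sketch shows how a worker-side metadata cache might batch tightly related objects into one round trip. All names (MetadataCache, fetch_group, the stand-in coordinator) are illustrative assumptions, not the actual HANA interfaces.

```python
# Illustrative sketch of group metadata caching (not actual HANA code).
# A cache miss on a table fetches the table *and* its dependent objects
# (columns, indexes, ...) from the coordinator in a single round trip.

class MetadataCache:
    def __init__(self, coordinator):
        self.coordinator = coordinator   # hypothetical remote interface
        self.objects = {}                # object id -> cached metadata

    def get(self, object_id):
        if object_id not in self.objects:
            # One request returns the object plus tightly related entries,
            # e.g. a table together with its columns and indexes.
            group = self.coordinator.fetch_group(object_id)
            self.objects.update(group)
        return self.objects[object_id]

class FakeCoordinator:
    """Stand-in for the coordinator node's metadata object container."""
    def fetch_group(self, object_id):
        return {object_id: "table T",
                (object_id, "columns"): ["a", "b"],
                (object_id, "indexes"): ["idx_a"]}

cache = MetadataCache(FakeCoordinator())
print(cache.get("T"))                # miss: one round trip fetches the group
print(cache.get(("T", "columns")))   # hit: no further communication needed
```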
D. Distributed Query Compilation and Execution

In order to illustrate the key concepts of distributed query processing within SAP HANA, we will follow a query through different scenarios. In a single-node setup, the client connects to a particular server and starts the query compilation process. Figure 3 shows the different steps.

Fig. 3. Query compilation and execution in a single node scenario

The session layer forwards an incoming query request to the optimizer (1). After consulting metadata (2) and checking the plan cache (3), the query is eventually optimized and compiled. In contrast to other systems, the optimizer embeds substantial metadata into the query plan and returns it to the client. For example, metadata flowing back to the client contains information about the optimal node to actually start and orchestrate the query execution. In the single-node case, the client sends the query plan to the execution component (4), which puts the plan into the (current) query plan cache and starts executing the query in a combination of column and row store. In the case of an update, additional log information is eventually written to the persistency layer to ensure atomicity and durability.

The query flow is a bit more complicated in a multi-node deployment, as illustrated in Figure 4. As before, a client may send a query to a dedicated coordinator node (1). After the initial query compilation, the returned query plan contains information about the node where the query should be executed (4). This recommendation is based on data locality for the particular query and the current system status. The client then sends the query to the recommended (worker) node (5), which re-compiles the query with respect to node-specific properties and statistics (6). As before, the optimization step requires access to metadata, which is either already available at the current node (7) or requested on-the-fly from the coordinator (8). After re-compilation, the query is executed (10) by extracting the plan from the plan cache (11) and passing it to the execution component, which again routes the individual requests to the local storage structures (12) and potentially the local persistency layer (13). Figure 5 illustrates the benefits of using statement re-routing with a simple single-table select. The figure shows three cases: (i) the single-node case; (ii) the multi-node case with statement routing turned on; and (iii) the multi-node case with statement routing turned off. As we can see, cases (i) and (ii) are virtually identical. Case (iii), however, is significantly and consistently slower than the other two cases. In addition to statement routing, this scenario also
Fig. 6. Distributed query spanning multiple nodes
Fig. 7. Speedup for analytical workload (speedup vs. number of nodes for queries #1 and #2; 1, 3, 7, 15, and 31 nodes)
As outlined in the introduction, the SAP HANA database system is designed to deliver extremely good performance for OLAP- and OLTP-style query workloads. We therefore also consider a typical update-heavy workload in the context of a real SAP BW scenario. After loading extracted raw data into the BW system and applying transformations with respect to data cleansing and harmonization requirements, new data is eventually "merged" with the existing state of the fact table, implying a mix of insert, delete, and update operations on the potentially very large fact table. Following this "activation step", the system applies reorganization steps to better reflect the multidimensional model. Again, this "compression step" within the overall SAP BW loading process is a write-intensive step which can nicely be parallelized over multiple nodes.
1) The loading chain type 1 shown in Table I reflects an activation step. As can be seen in Figure 8, the system scales very nicely with the number of nodes, despite the heavy write process.
2) The compression step (loading chain type 2) finally shows excellent scalability, with a speedup of 13.6 in the 15-node case and 25.2 in the 31-node case, and therefore demonstrates the scale-out capability in an optimal setting with respect to the physical database design.
Fig. 8. Speedup for write-intensive workload (speedup vs. number of nodes for loading chains #1 and #2; 1, 3, 7, 15, and 31 nodes)
library. Sometimes, by DDL operations such as table
movement or repartitioning, however, the desired location of a
given query can be changed as for Table T2 in Figure 9. In
Summary such a case, on the next query execution following the DDL
As outlined, distributed query processing in SAP HANA is operation, such inefficiency is detected automatically by
based on four fundamental building blocks. First of all, the comparing the metadata timestamp of the client-side cached
SAP HANA system is working in a shared-nothing fashion information with the server-side latest metadata timestamp
with respect to the nodes' main memory within an SAP
value; the client-cached information is then updated with the latest information.

Fig. 9. Client-side statement routing
How is the optimal target location decided given a client-side statement, then? A server extracts the tables and/or views needed to run the statement (or a stored procedure, if that is the case) and returns the locations (nodes) of the target tables and/or views. The client then does the following: If the number of nodes returned is 1, it routes the query to that node. If the number is greater than 1: if the returned table is a hash-partitioned or round-robin-partitioned table and the query is an INSERT, the client evaluates the input value of the partitioning key and uses the result as the target location; otherwise, it routes the query to the returned nodes in a round-robin fashion.
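A minimal sketch of this decision procedure follows. The function name, the round-robin counter, and the use of Python's built-in hash as a stand-in for the real partitioning function are assumptions for illustration, not the actual HANA client library API.

```python
import itertools

# Sketch of the client-side target selection described above
# (illustrative only; not the actual HANA client library).

_rr_counter = itertools.count()   # round-robin state kept in the client

def pick_target(nodes, statement_kind, partitioning=None, key_value=None):
    """nodes: candidate locations returned by the server at compile time."""
    if len(nodes) == 1:
        return nodes[0]
    if statement_kind == "INSERT" and partitioning in ("hash", "round_robin"):
        if partitioning == "hash" and key_value is not None:
            # Evaluate the partitioning key and use the result as the target.
            return nodes[hash(key_value) % len(nodes)]
    # Otherwise route to the returned nodes in a round-robin fashion.
    return nodes[next(_rr_counter) % len(nodes)]

print(pick_target(["node2"], "SELECT"))                       # single location
print(pick_target(["node1", "node2"], "INSERT", "hash", 42))  # key-based
print(pick_target(["node1", "node2", "node3"], "SELECT"))     # round-robin
```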
The routing decision can be resolved at compile time (which we call static resolution) in some cases, or at run time (which we call dynamic resolution) in other cases. We discuss static resolution optimizations in the rest of this subsection; dynamic resolution optimizations are discussed in the subsection that follows.

Let us consider stored procedures first. How is it decided at which node a stored procedure is executed? Within a stored procedure there may be a variety of queries, and the usage pattern of each query in the procedure is used in making the routing decision. For example, if one query executes multiple times in a procedure, we can be a little smarter: suppose there are 10 queries and one of them is in a loop; in that case, a higher priority is given to the query in the loop when making the decision.

In some cases, analyzing a stored procedure at the server alone may not be sufficient, in which case the server could get some hints from the client code; application programs could embed some domain knowledge into the application code.
With the client-side statement routing mechanism in place, there is no longer a 1-to-1 mapping between a session and a physical connection, or between a transaction and a physical connection. This poses a technical challenge, though, which has to do with session context management. A session context is a property of a session (for example, user name, locale, transaction isolation mode, etc.). Without client-side statement routing, there is only a 1-to-1 mapping between a logical session and a client-server physical connection, which means that the session context information can be stored in the server-side connection. With client-side statement routing, however, it must be shared across multiple client-server physical connections. In HANA, some of the session context information is also cached at the client library side. The cached information can then be used when the client has to make a new physical connection to a different server within the same logical session, without having to contact the initial physical connection of the logical session.

In this subsection, we primarily focused on the cases in which the desired execution location of a query is decided at query compilation time. However, there are other types of queries whose desired location can be decided dynamically at run time. Such cases are described in the following subsection.

B. Dynamic Resolution in Statement Routing

1) Table partitioning: Using the partitioning feature of the SAP HANA database, a table can be partitioned horizontally into disjunctive sub-tables or "partitions", each of which may be placed on a different node of a distributed HANA database system. The problem is how to partition a table optimally so that each partition is shipped to the optimal worker node. Partitioning may be done using one of three strategies: hash, round-robin, and range. Both hash partitioning and round-robin partitioning are used to distribute the partitions equally to worker nodes. Range partitioning can be used to create dedicated partitions for certain values or certain value ranges to be distributed to worker nodes.

Let us consider hash partitioning in more detail. To decide the desirable partition for a given query against a partitioned table, we need to consider the table's partitioning specification (or the partitioning function) and the execution-time input values of its partitioning keys. The partitioning specification can be given to the client library at query compilation, but the execution-time input values can be known only at query execution time. This input value interpretation logic is normally performed by the server-side query processing layer; for this optimization, though, such logic is shipped to the client library as well.

This partitioning optimization technique is used to support client-side routing of various statements such as inserts, selects, updates, and deletes, as long as their WHERE clause includes partitioning key information.
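As a concrete illustration of shipping the partitioning specification to the client, the sketch below resolves the target node of a single-key statement entirely at the client side. The class layout and the use of Python's hash are simplified assumptions; HANA's actual partitioning functions are not shown here.

```python
# Sketch: client-side evaluation of a hash partitioning specification
# (simplified; the real partitioning functions are server-defined).

class HashPartitionSpec:
    def __init__(self, key_column, partition_to_node):
        self.key_column = key_column
        self.partition_to_node = partition_to_node  # partition id -> node

    def resolve(self, row_values):
        # Apply the partitioning function to the execution-time input value.
        key = row_values[self.key_column]
        part = hash(key) % len(self.partition_to_node)
        return self.partition_to_node[part]

# The specification is handed to the client library at compile time ...
spec = HashPartitionSpec("customer_id", {0: "node1", 1: "node2", 2: "node3"})

# ... and evaluated with the execution-time input values, so that an
# INSERT/SELECT/UPDATE/DELETE carrying the partitioning key can be
# routed without first asking a server.
print(spec.resolve({"customer_id": 1001}))
```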
2) Load balancing: Each node in a distributed database system may utilize the key resources, i.e., CPU and memory, differently under different workloads at any given time. For example, one node may be busy with CPU-bound tasks while at least one other node is not busy at all at that point in time. With HANA being a main-memory database system, memory is used not only for processing but also for storage, i.e., for holding table data. Again, for example, one node may be almost out of memory while at least one other node has plenty of available memory. This is a correctness concern as well as a performance concern: we cannot ever get into a situation where a node
fails because it ran out of memory. Therefore, it is important to balance not only the processing load but also the storage load among the different nodes in the system.

In addition to improving affinity between data location and processing location, we can extend client-side statement routing to achieve better load balancing across HANA server nodes. When returning a query result to a client library, a HANA server can return its memory and CPU resource consumption status together with the result, i.e., without making any additional round trip. Alternatively, the client library can periodically collect the resource status of the HANA server nodes. If the client detects that one of the computing nodes does not have enough CPU or memory resources at that point in time, the client library tries to temporarily re-route the current query to other nodes. Furthermore, at query compilation time, the client library can attach the query's expected memory consumption to the query plan. If this information is then cached and attached to the compiled query at the client side, the client library can perform more efficient re-routing.
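The following sketch illustrates how piggybacked resource status might drive such temporary re-routing. The thresholds, field names, and selection rule are invented for illustration; they are not HANA's actual policy.

```python
# Sketch of client-side load-aware re-routing (illustrative thresholds).

class NodeStatus:
    def __init__(self, cpu_load, free_memory_gb):
        self.cpu_load = cpu_load              # piggybacked on query results
        self.free_memory_gb = free_memory_gb

def choose_node(preferred, status, expected_memory_gb, alternatives):
    """Prefer the data-local node; temporarily re-route if it is overloaded."""
    s = status[preferred]
    if s.cpu_load < 0.9 and s.free_memory_gb >= expected_memory_gb:
        return preferred
    # Fall back to the least loaded alternative for this execution only.
    return min(alternatives, key=lambda n: status[n].cpu_load)

status = {"node1": NodeStatus(0.95, 1.0),    # busy and low on memory
          "node2": NodeStatus(0.20, 64.0),
          "node3": NodeStatus(0.50, 32.0)}

# The expected memory consumption was attached to the plan at compile time.
print(choose_node("node1", status, expected_memory_gb=4.0,
                  alternatives=["node2", "node3"]))   # -> node2
```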
C. Distributed Snapshot Isolation (Distributed MVCC)

For distributed snapshot isolation, figuring out how to reduce the overhead of synchronizing the transaction ID or commit timestamp across multiple servers belonging to the same transaction domain has been a challenging problem [3], [4], [5]. The SAP HANA database focuses on optimizing single-node queries, which execute without accessing any other node. If partitioning and table placement are done optimally, most queries can be processed within a single node. For long-running analytical queries that have to span multiple nodes, the overhead incurred by communicating such transactional information is relatively negligible compared to the overall execution time of the query. So, our optimization choice that favors single-node queries makes sense for many applications.

The question is how to figure out in advance whether a transactional snapshot boundary will touch only one node. This is feasible especially under the Read-Committed isolation mode [6], which is the default isolation mode in many SAP applications.
In Read-Committed isolation mode, the MVCC snapshot boundary is the lifetime of the query, which means that the transaction does not need to consider any other query. So, the isolation boundary for the query can be determined at compile time, i.e., at that time it can be figured out exactly which parts of which tables must be accessed to execute the query. When the query finishes, the snapshot finishes its life as well, i.e., the snapshot is meaningful only on the local node while the query is executing. The entire transaction context that is needed for a query to execute is captured in a data structure called the transaction token (described below) and is cached on the local node. For the "Repeatable Read" or "Serializable" isolation level, the transaction token can be reused for the queries belonging to the same transaction, which means that the communication cost with the coordinator node is less important.
Whenever a transaction (in transaction-level snapshot isolation mode) or a statement (in statement-level snapshot isolation mode) starts, it copies the current transaction token into its context (called the snapshot token). The transaction (or statement) then decides which versions should be visible to itself based on the copied snapshot token.

Now we describe distributed snapshot isolation (or distributed MVCC). In our transaction protocol, every transaction started at a worker node would have to access the transaction coordinator to get its snapshot transaction token. This could cause (1) a throughput bottleneck at the transaction coordinator and (2) additional network delay for worker-side local transactions. To remedy these situations we use three techniques: (i) one that enables local read-only transactions to run without accessing the global coordinator; (ii) another that enables local read or write transactions to run without accessing the global coordinator; and (iii) a third that uses "write TID buffering" to enable multi-node write transactions to run with only a single access to the global coordinator. We now describe all three techniques in order.
1) Optimization for worker-node local read transactions or statements: Every update transaction accesses the transaction coordinator to access and update the global transaction token. Read-only statements, on the other hand, just start with their cached local transaction token. This local transaction token is refreshed
• by the transaction token of an update transaction when it commits on the node, or
• by the transaction token of a 'global statement' when it comes in to (or is started at) the node.
If the statement does not need to access any other node, it can just finish with the cached transaction token, i.e., without any access to the transaction coordinator. If it detects that the statement should also be executed on another node, however, it is switched to the 'global statement' type, after which the current statement is retried with the global transaction token obtained from the coordinator.
Single-node read-only statements/transactions thus do not need to access the transaction coordinator at all. This is significant for avoiding a performance bottleneck at the coordinator and for reducing the performance overhead of single-node statements (or transactions).
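A sketch of this statement-type switch: a read statement first runs with the node's cached local transaction token and is retried as a 'global statement' only if execution discovers that another node is involved. The exception-based control flow and all class names are illustrative assumptions, not the actual HANA implementation.

```python
# Sketch of the local-token fast path with switch-to-global retry.

class NeedsRemoteNode(Exception):
    """Raised when execution discovers the statement touches another node."""

class Node:
    def __init__(self, token):
        self.local_transaction_token = token   # refreshed by local commits

class Coordinator:
    def current_transaction_token(self):
        return "global-token-42"               # one coordinator round trip

class Statement:
    def __init__(self, spans_nodes):
        self.spans_nodes = spans_nodes
    def run(self, snapshot_token, scope):
        if scope == "local" and self.spans_nodes:
            raise NeedsRemoteNode()
        return f"rows (snapshot={snapshot_token}, scope={scope})"

def execute_read(statement, node, coordinator):
    # Fast path: start with the cached local token, no coordinator access.
    try:
        return statement.run(node.local_transaction_token, scope="local")
    except NeedsRemoteNode:
        # Switch to the 'global statement' type and retry with a token
        # that is valid across all nodes.
        return statement.run(coordinator.current_transaction_token(),
                             scope="global")

node = Node("local-token-7")
print(execute_read(Statement(spans_nodes=False), node, Coordinator()))
print(execute_read(Statement(spans_nodes=True), node, Coordinator()))
```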
2) Optimization for worker-side local write transactions: Each node manages its own local transaction token independently of the global transaction token. Even an update transaction can just update its own local transaction token if it is a single-node transaction. The difference is that each database record has two TID (transaction ID) columns for MVCC version filtering: one for a global TID and another for a local TID. (In other existing schemes, there is only one TID/commit-ID column.) A local-only transaction reads/updates the local TID. A global transaction, however, reads either the global or the local TID (the global TID if there is a value in the global TID column; otherwise the local TID) and updates both global and local TIDs. So, global transactions carry two snapshot transaction tokens: one for the global transaction token and another for the current worker node's local transaction token.

In the log record, both global and local TIDs are also recorded if it is a global transaction. On recovery, a local transaction's commit can be decided by its own local commit log record. Only for global transactions is it required to check with the global coordinator. Here again, the statement type switch protocol is necessary, as in case 1) above.
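To make the two-column scheme concrete, here is a sketch of the version-filtering rule: a record version carries both TID columns, and a reader with a global snapshot token prefers the global TID when present. This is deliberately simplified—a real engine would also consult commit status, not raw TID ordering alone—and all names are assumptions.

```python
# Sketch of MVCC version filtering with separate global/local TID columns
# (simplified; ignores the commit-status checks a real engine would need).

class RecordVersion:
    def __init__(self, local_tid, global_tid=None):
        self.local_tid = local_tid     # always present
        self.global_tid = global_tid   # set only by global transactions

def visible(version, snapshot):
    """snapshot: dict with the reader's 'local' and optional 'global' token."""
    if snapshot.get("global") is not None and version.global_tid is not None:
        # Global reader vs. globally written version: compare global TIDs.
        return version.global_tid <= snapshot["global"]
    # Otherwise fall back to the worker-local TID.
    return version.local_tid <= snapshot["local"]

v_local = RecordVersion(local_tid=5)                    # local-only writer
v_global = RecordVersion(local_tid=6, global_tid=103)   # global writer

local_reader = {"local": 5, "global": None}
global_reader = {"local": 6, "global": 100}

print(visible(v_local, local_reader))    # True:  5 <= 5
print(visible(v_global, local_reader))   # False: 6 > 5
print(visible(v_global, global_reader))  # False: 103 > 100
```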
3) Optimization for multi-node write transactions: A multi-node write transaction needs a global write TID, as all multi-node transactions do. The optimization described in case 2) above would not help here. It can, however, be handled by the write TID buffering technique, which we describe now.

In a HANA scale-out system, one of the server nodes becomes the transaction coordinator, which manages distributed transactions and controls two-phase commit. Executing a distributed transaction involves multiple network communications between the coordinator node and worker nodes. Each write transaction is assigned a globally unique write transaction ID. A worker-node-first transaction, one that starts at a worker node first, would have to be assigned a TID from the coordinator node as well, which causes an additional network communication. Such a communication might significantly affect the performance of distributed transaction execution. This extra network communication is eliminated to improve worker-node-first write transaction performance.
We solve this problem by buffering such global write TIDs in a worker node. When a request for a TID assignment is made the very first time, the coordinator node returns a range of TIDs, which gets buffered in the worker node. The next transaction which needs a TID gets it from the local buffer, thus eliminating the extra network communication with the coordinator node. A few key challenges are determining the optimal buffer size as a parameter and deciding when to flush the buffered TIDs if they are not being used.
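A sketch of write-TID buffering with a configurable range size; the buffer size is exactly the tuning knob mentioned above, while the class and method names are invented for illustration.

```python
# Sketch of worker-side write-TID buffering (illustrative).

class Coordinator:
    def __init__(self):
        self.next_tid = 1
    def allocate_tid_range(self, size):
        start = self.next_tid
        self.next_tid += size
        return start, self.next_tid - 1     # inclusive range, one round trip

class WorkerTidBuffer:
    def __init__(self, coordinator, buffer_size=100):
        self.coordinator = coordinator
        self.buffer_size = buffer_size      # key tuning parameter
        self.next = 0
        self.last = -1
    def next_write_tid(self):
        if self.next > self.last:           # buffer empty: one network call
            self.next, self.last = \
                self.coordinator.allocate_tid_range(self.buffer_size)
        tid, self.next = self.next, self.next + 1
        return tid

worker = WorkerTidBuffer(Coordinator(), buffer_size=3)
# Four worker-node-first transactions, only two coordinator round trips.
print([worker.next_write_tid() for _ in range(4)])   # [1, 2, 3, 4]
```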
Therefore, by a combination of the optimizations described in this subsection, SAP HANA makes transaction performance largely independent of where a transaction is started and committed, without losing or weakening any transactional consistency.
D. Optimizing Two-Phase Commit Protocol

The two-phase commit (2PC) protocol is widely used to ensure the atomicity of distributed multi-node update transactions. The series of optimization techniques we describe in this subsection are our attempts to reduce the network and log I/O delays during a two-phase commit, thus increasing throughput.

1) Early commit acknowledgement after the first commit phase: Our first optimization is to return the commit acknowledgement early, after the first commit phase [7], [8], as shown in Figure 10.

Right after the first commit phase, once the commit log is written to disk, the commit acknowledgement can be returned to the client. The second commit phase can then be done asynchronously.

Fig. 10. Returning commit ack early after first commit phase

For this optimization, three things are considered.
1) Writing the commit log entries on the worker nodes can be done asynchronously. During crash recovery, some committed transactions can then be classified as in-doubt transactions, which will finally be resolved as committed by checking the transaction's status at the coordinator.
2) If transaction tokens are cached asynchronously on the worker nodes, data may be visible to one transaction but not to the next (local-only) transaction in the same session. This situation can be detected by storing the last transaction token information for each session at the client side. Then, until the second commit phase of the previous transaction is done, the next query can be stalled.
3) If transactional locks are released after sending a commit acknowledgement to the client, a 'false lock conflict' may arise for the next transaction in the same session. This situation can be detected in the same way as (2) above. If it is detected, the transaction can wait for a short time period until the commit notification arrives at the worker node.
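A sketch of the early-acknowledgement flow described above: the ack is sent once the coordinator's commit log record is durable, and the second phase runs in the background. The sequencing follows the text; the threading model, stub classes, and method names are illustrative assumptions.

```python
import threading

# Sketch of returning the commit ack after the first commit phase
# (illustrative sequencing, not the actual HANA commit path).

class Worker:
    def prepare(self, txn): print(f"prepare {txn}")
    def commit(self, txn): print(f"commit {txn}")

class Log:
    def write_commit_record(self, txn): print(f"commit log for {txn} on disk")

class Client:
    def ack(self, txn): print(f"ack {txn} to client")

def commit(txn, coordinator_log, workers, client):
    for w in workers:                         # phase 1: prepare everywhere
        w.prepare(txn)
    coordinator_log.write_commit_record(txn)  # durable commit decision
    client.ack(txn)                           # early ack: client continues
    t = threading.Thread(                     # phase 2 runs asynchronously
        target=lambda: [w.commit(txn) for w in workers])
    t.start()
    return t

commit("T1", Log(), [Worker(), Worker()], Client()).join()
```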
2) Skipping writes of prepare logs: The second optimization is to remove the additional log I/O for writing prepare-commit log entries. In a typical two-phase commit, the prepare-commit log entry is used to ensure that the transaction's previous update logs are written to disk and to identify in-doubt transactions at recovery time.
• Writing the transaction's previous update logs to disk can be ensured without writing any additional log entry, by just comparing the transaction-last-LSN (log sequence number) with the log-disk-last-LSN. If the log-disk-last-LSN is larger than the transaction-last-LSN, the transaction's update logs have already been flushed to disk.
• If we do not write the prepare-commit log entry, we can handle all the uncommitted transactions at recovery time as in-doubt transactions. Their commit status can be decided by checking with the transaction coordinator. So, the size of the in-doubt transaction list can increase, but with less run-time overhead.
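The LSN comparison can be stated in a few lines. A sketch under the stated assumptions (monotonically increasing LSNs and a known transaction-last-LSN); the function and parameter names are invented:

```python
# Sketch: deciding at prepare time whether a prepare-commit log write
# can be skipped, using only an LSN comparison (illustrative).

def prepare_without_prepare_log(txn_last_lsn, log_disk_last_lsn, flush):
    """Return once the transaction's update logs are durable on disk."""
    if log_disk_last_lsn >= txn_last_lsn:
        # All update log records of this transaction are already on disk;
        # no prepare-commit entry is needed.
        return log_disk_last_lsn
    # Otherwise force the log up to the transaction's last LSN.
    return flush(up_to=txn_last_lsn)

# Example: the log disk is already ahead of the transaction.
print(prepare_without_prepare_log(
    txn_last_lsn=120,
    log_disk_last_lsn=150,
    flush=lambda up_to: up_to))   # -> 150, no extra log I/O
```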
3) Group two-phase commit protocol: This is a similar idea to the one described in Subsection III-C.3, but instead of sending commit requests to the coordinator node individually (i.e., one for each write transaction), we can group multiple concurrent commit requests into one and send it to the coordinator node in one shot.

Also, when the coordinator node multicasts a "prepare-commit" request to the multiple related worker nodes of a transaction, we can group the "prepare-commit" requests of multiple concurrent transactions that go to the same worker node.

By this optimization we can get better throughput for concurrent transactions.

Two-phase commit itself cannot be avoided fundamentally, as it ensures global atomicity for multi-node write transactions. However, by combining the optimizations described in this subsection, we can reduce its overhead significantly.
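A sketch of grouping concurrent commit requests into one coordinator call, analogous to group commit for log I/O. When to flush in practice would be driven by timing and queue depth; the queueing policy and all names here are illustrative.

```python
# Sketch: grouping concurrent commit requests to the coordinator
# (illustrative; a real implementation would flush on a timer/queue depth).

class Coordinator:
    def commit_batch(self, txn_ids):
        # One network round trip commits the whole group.
        print(f"coordinator commits {txn_ids} in one shot")

class GroupCommitter:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.pending = []
    def request_commit(self, txn_id):
        self.pending.append(txn_id)     # queue instead of sending directly
    def flush(self):
        if self.pending:
            self.coordinator.commit_batch(self.pending)
            self.pending = []

g = GroupCommitter(Coordinator())
for t in ("T1", "T2", "T3"):            # three concurrent write transactions
    g.request_commit(t)
g.flush()                               # one message instead of three
```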
E. Metadata Cache Management

The system-wide metadata catalogue is kept on the coordinator node. As a worker node processes a query, it has to obtain the necessary metadata information from the coordinator node. To reduce the overhead caused by the network latency of the communication between the coordinator node and a worker node, HANA supports a worker node requesting multiple metadata objects with a single request when beneficial.

When a worker node needs to obtain metadata information from the coordinator node because it does not have that information cached already, it incurs inter-process communication (IPC). An example can be seen in Step 8 of Figure 4. Since each single metadata cache miss causes IPC via network communication, it can be a considerable performance penalty when there are many metadata cache misses for a single query execution. Group caching for metadata is introduced to minimize cache miss penalties of this kind. Instead of checking the cache in an on-demand manner, a worker node first collects the entire set of required metadata object IDs and then sends a single request for all the missing objects. This optimization is particularly good for a query with a complex query plan which requires accessing multiple metadata objects of various kinds, such as tables, indexes, views, and privileges.

IV. SUMMARY

The SAP HANA database is primarily designed to cover three different scaling principles: scale-in, scale-up, and scale-out. In this paper we outlined some of the hard problems in multi-node scenarios, showed the core architectural designs, and gave optimization details for some of the problems. Specifically, we discussed optimizations for query routing in single- and multi-node scenarios, showed optimization techniques for client-side statement routing and the two-phase commit protocol, and gave some insights into caching strategies and techniques for the metadata catalogue of the SAP HANA database.

ACKNOWLEDGMENT

We explicitly want to express our thanks to the SAP HANA team, especially to all the readers of a preliminary version of this paper. We also thank Sang K. Cha for suggesting the idea of writing this paper.

REFERENCES

[1] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA database: data management for modern business applications," SIGMOD Record, vol. 40, no. 4, pp. 45–51, 2011.
[2] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd, "Efficient transaction processing in SAP HANA database: the end of a column store myth," in SIGMOD Conference, 2012, pp. 731–742.
[3] C. Mohan, H. Pirahesh, and R. Lorie, "Efficient and flexible methods for transient versioning of records to avoid locking by read-only transactions," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1992.
[4] H. V. Jagadish, I. S. Mumick, and M. Rabinovich, "Scalable versioning in distributed databases with commuting updates," in Proceedings of the International Conference on Data Engineering (ICDE), 1997.
[5] H. V. Jagadish, I. S. Mumick, and M. Rabinovich, "Asynchronous version advancement in a distributed three-version database," in Proceedings of the International Conference on Data Engineering (ICDE), 1998.
[6] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil, "A critique of ANSI SQL isolation levels," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1995.
[7] C. Mohan, B. Lindsay, and R. Obermarck, "Transaction management in the R* distributed database management system," ACM Transactions on Database Systems, vol. 11, no. 4, 1986.
[8] R. Gupta, J. Haritsa, and K. Ramamritham, "Revisiting commit processing in distributed database systems," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.
[9] J. Gray, "Tape is dead, disk is tape, flash is disk, RAM locality is king," 2006. (http://research.microsoft.com/en-us/um/people/gray/jimgraytalks.htm)
[10] H. Plattner, "A common database approach for OLTP and OLAP using an in-memory column database," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009.