Alibaba Time-Series DB
Chunhui Shen‡§∗ , Qianyu Ouyang‡†∗ , Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su,
Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan,
Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei† , Wei Zhang, Feifei Li
Alibaba Group‡ Zhejiang University§ Tsinghua University†
{tianwu.sch,ouyangqianyu.oyqy,lizi,qingzhi.lzp,longcheng.zlc,yunxing.zyj,suqing.sq}@alibaba-inc.com
{yutianhuan.yth,claude.yy,jianhong.hjh,mingyan.zc,wenbo.wb,zhenghanbang.zhb,xulunfan.xlf}@alibaba-inc.com
{zhikuan.psc,binwu.wb,xiao.hx,liye.li,j.tan,sh.wang}@alibaba-inc.com
[email protected],{zwei,lifeifei}@alibaba-inc.com
thousands of distinct values. The combination of the tags (e.g., more than ten tags per timeseries in Table 1) results in billion-scale timeseries. More than 60% of timeseries are daily active (i.e., have newly arrived data points). Consequently, this requires high data ingestion capacity from the underlying monitoring system, especially when active timeseries are massive. In addition to data ingestion, the large scale of timeseries also complicates query processing. For example, in Sys-B, a single query hits more than a thousand timeseries, whose data points are retrieved for aggregate analysis. Even worse, this number reaches a million in Sys-A.

In practice, a time-series database (TSDB) is used as the backbone of the above monitoring systems to manage metric data and support queries [9]. However, we observe that existing TSDBs are highly inefficient at handling data ingestion and queries over massive timeseries. Besides, they all lack support for machine learning (ML) functions, requiring prohibitive effort to implement complex ML algorithms on time series (e.g., anomaly detection) and to maintain the corresponding services externally. In a nutshell, existing TSDBs are unable to fully meet the needs of monitoring systems in large-scale Internet services, facing four major challenges as follows:

C1: Low write throughput for massive timeseries. When writing data points into a TSDB, the set of tags for the target timeseries is given to the database as well. A common approach to dealing with these tags is to create a forward index, whose index entry maps the tag set to a timeseries id, i.e., a unique identifier internally used by the TSDB to distinguish timeseries. Since each index entry contains many tags (e.g., over ten in Table 1), the footprint of the forward index easily becomes overwhelming when a large number of timeseries are managed. This causes a high-cardinality problem, which makes the TSDB unable to accommodate the entire index in memory due to cost constraints, leading to low write throughput from memory swapping during index lookups. Existing TSDBs, such as InfluxDB [18] and TimeUnion [40], use conventional cache mechanisms (e.g., Block Cache, MMap) to accelerate on-disk index accesses. However, these mechanisms do not exploit the traits of time series, and hence still achieve unsatisfactory efficiency.

C2: High latency for queries that hit massive timeseries. A TSDB usually processes a query in two steps: first, given tags and time ranges, qualified data points from target timeseries are retrieved from the storage; second, computations are performed on these data points. In a large monitoring system, the first step usually hits a huge number of timeseries (e.g., reaching a million in Table 1). We notice that the hit timeseries are usually further grouped by a certain tag for subsequent computation. However, existing TSDBs cannot efficiently obtain the tags of the hit timeseries from a large number of index entries. For the second step, the computational frameworks in existing TSDBs are not well parallelized. For example, TimescaleDB [23] cannot process data points in different partitions in parallel when asked to group data by a non-primary tag.

C3: Lack of advanced time series analysis capability. In a real-world service, the workload may vary dynamically over time. Hence, for the underlying monitoring system, rule-based analysis on metric data usually fails to recognize performance issues precisely. As a solution, practitioners have turned to machine learning algorithms for time series analysis in order to improve the precision of detecting and localizing performance issues. However, existing TSDBs have not fully integrated ML-based time series analysis. Consequently, users have to employ an external AI platform to handle tasks such as algorithm development, model training and inference. This not only complicates the overall architecture, but also introduces additional latency and data synchronization problems. Although some databases have supported ML-based data analysis [16, 28, 30], they do not optimize the execution process of ML algorithms for time series data, leading to poor performance.

C4: Inefficient adaptability to scaling time series management. The number of timeseries in the monitoring system is continuously increasing as the business grows, where the quantities of both micro-services and machines expand along with more fine-grained metrics being monitored [35]. The underlying TSDB is required to continuously scale up to cope with such demands. However, existing TSDBs usually need to redistribute data when scaling out a new node, which is prohibitively expensive in both resources and time. One major reason behind this is that the compute and storage resources are tightly coupled. Currently, distributed TSDBs [4, 37] often have a shared-nothing architecture, where each node exclusively manages its own memory and disk space. When adding nodes to the cluster, all nodes suffer from high I/O pressure due to massive data migration. Although some TSDBs [15, 40] deploy a shared storage, this shared storage acts more as a cold storage layer to reduce storage costs, rather than improving scaling efficiency.

To address the above challenges, we present Lindorm TSDB, a distributed time-series database that is designed as a powerful backbone for large-scale monitoring systems with massive monitoring metrics. It sustains high write throughput when massive active timeseries exist. It also supports fast queries and ML-based analysis over massive timeseries. In addition, Lindorm TSDB is able to retain stable performance even when it encounters node failures or scaling. Our major contributions are summarized as follows:

• We design Lindorm TSDB, a distributed TSDB combining shared-nothing architecture and shared storage. It contains a cluster of compute nodes and a reliable shared storage, which are logically separated from each other. It partitions data into shards according to their time and tags, facilitating parallel data query and write. In a single shard, the optimized index structure and cache strategy further improve performance. (Target challenges C1/C2/C4; detailed in Section 4.)

• We design an efficient pipelined execution engine for Lindorm TSDB to support common and important types of queries on time series data. The execution engine not only parallelizes the computation across different shards, but also optimizes the computation across multiple timeseries within one shard. On top of that, users can directly use SQL to perform a variety of queries. (Target challenge C2; detailed in Section 4.4.)

• We design Lindorm ML, an integrated machine learning component inside Lindorm TSDB. It enables users to analyze data with anomaly detection and time series forecasting algorithms through SQL, eliminating the effort of operating data and models externally. More importantly, it takes advantage of Lindorm TSDB's data processing capability to achieve higher performance. (Target challenge C3; detailed in Section 5.)
• We conduct extensive experiments on a popular benchmark to verify the effectiveness of Lindorm TSDB and its major components. We compare it with two widely-used open-source TSDBs, InfluxDB and TimescaleDB. The results show that Lindorm TSDB achieves higher write throughput as well as lower query latency than these baselines. (Detailed in Section 6.)

2 PRELIMINARIES

2.1 Data Model

Table 2: Example of Lindorm TSDB's data model

Lindorm TSDB models metric data as time-series data in schematized tables. We make the data model consistent with the relational data model so that users can easily understand it and fit it into existing systems. There are three types of columns in each table: tags, fields and timestamp, as illustrated in Table 2. Tags describe different attributes of the data source that generates the metric data. A tag is a key-value pair (e.g., ⟨hostname, host-a⟩). At each timestamp (e.g., 1670398200), a data source produces various types of metric data (e.g., cpu_user and cpu_sys), which we refer to as fields. A timeseries is uniquely identified by one field and all associated tags, i.e., cpu_user and cpu_sys above are two timeseries. A timeseries contains a sequence of data points from the same field, where each data point is a pair of ⟨timestamp, field value⟩. For example, in Table 2, the cpu_user values, timestamps and tags from the first and third rows form a timeseries. Here cpu_user is the field, [⟨hostname, host-a⟩, ⟨region, ap-1⟩, ⟨datacenter, ap-1a⟩] is the tag list, and the data points are ⟨1670398200, 10⟩ and ⟨1670398210, 11⟩. When writing data to Lindorm TSDB, the field, tags and the target table name are required. If the combination of the given field and tags is not present in the table, Lindorm TSDB creates a new timeseries.

2.2 Query Types

Figure 1: Time series query types

As illustrated in Figure 1, monitoring systems mainly issue three types of queries: latest value queries, downsampling queries, and inter-timeseries aggregate queries. A latest value query returns the last data point of each timeseries, which is important for real-time status monitoring of systems. A downsampling query groups the data points by a given time window in each timeseries, e.g., every three data points in Figure 1, and then the aggregated value such as sum and average for each window is returned. An inter-timeseries aggregate query groups and aggregates data points in all hit timeseries by specified columns, e.g., hostname and timestamp in Table 2, which is similar to the "group by" operation in relational databases.

In practice, downsampling queries and inter-timeseries aggregate queries are often used in combination. Taking Table 2 as an example, we may be interested in querying the averages of cpu_user in each region for every 10 minutes within the last 24 hours.
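For illustration, the write in Table 2 and the combined query above can be expressed directly in SQL. The following statements are only hedged sketches: the table name cpu, the explicit start time standing in for "the last 24 hours", and the composition of SAMPLE BY (introduced in Section 4.4) with GROUP BY are our assumptions rather than verbatim Lindorm TSDB examples.

INSERT INTO cpu (hostname, region, datacenter, timestamp, cpu_user) VALUES ('host-a', 'ap-1', 'ap-1a', 1670398200, 10);
SELECT region, avg(cpu_user) FROM cpu WHERE timestamp >= '2023-01-01 12:00' SAMPLE BY '10min' GROUP BY region;

Each write carries the full tag set, the timestamp and one or more field values, which is exactly the information Lindorm TSDB needs to locate or create the corresponding timeseries.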
3 LINDORM TSDB OVERVIEW

Recall that Lindorm TSDB is designed to address the four challenges discussed in Section 1. Figure 2 shows Lindorm TSDB's overall architecture, which contains four major components, i.e., TSProxy, TSCore, Lindorm ML and Lindorm DFS. Among them, both TSProxy and TSCore can be scaled horizontally.
indexes (e.g., Table 3) are first maintained in the memory of the TSCore they belong to, and later persisted to the shared Lindorm DFS for storage. Lindorm DFS is a distributed file system that provides an HDFS-compatible interface. It leverages Alibaba Cloud's storage infrastructure, i.e., ESSD [12] cloud disks and Object Storage Service [13]. This overall architecture (Section 4.1) combines both shared-nothing and shared-storage designs. It can therefore sustain both horizontal scalability and the ability for each TSCore to access any data, ensuring elasticity and high availability at the same time. In addition, the multidimensional sharding strategy (on timeseries identifier and time) avoids data migration when shards change dynamically, which effectively mitigates system performance degradation during node scaling (addressing Challenge C4).

In a shard, indexes need to be updated whenever a new timeseries is created. For fast lookups and maintenance, keeping all indexes in memory is an ideal choice. However, when the number of timeseries becomes massive, indexes consume a huge amount of space, which is known as the high-cardinality problem that makes memory bloat. To solve this problem (addressing Challenge C1), Lindorm TSDB uses a structure similar to the Log-Structured Merge tree (LSM-tree) to periodically flush in-memory indexes into the shared storage and merge them later. With this hybrid storage scheme, we query an index item by first looking it up in memory. If it misses in memory, we then access the shared storage. Since access to the shared storage is significantly slower, we apply a tailored cache policy to speed it up (Section 4.3). Moreover, considering that many historical timeseries are inactive, we use a time partitioning mechanism to boost memory utilization (Section 4.3).

To allow users to easily query timeseries, Lindorm TSDB supports SQL syntax. As introduced in Section 2.2, a single SQL statement often involves multiple timeseries and conducts aggregate operations in two dimensions, i.e., by time and by tags. Since data points from the same timeseries reside on the same shard and different timeseries reside on different shards, we propose a pipelined execution engine (Section 4.4) that supports computation push-down (addressing Challenge C2). This pipelined execution engine pushes the query down to all shards where hit timeseries are located, and completes the scanning of multiple timeseries in parallel. It then aggregates values from the shards back to the TSCore, and assembles the partial results from each TSCore into the final result. During this process, once aggregated values are computed, we can skip loading and transferring massive original data points, saving considerable memory and network resources. To further speed up the aggregation within one timeseries, we employ a pre-downsampling mechanism (Section 4.4) that reduces retrievals and computations on original timeseries data.

Apart from the above designs, we also propose Lindorm ML (Section 5), which integrates machine learning algorithms (e.g., anomaly detection, time series forecasting) inside Lindorm TSDB. Lindorm ML combines the data governance capabilities of a database with the data analysis capabilities of machine learning algorithms. It allows users to directly train machine learning models inside the database via SQL, and to use these models to make online inferences. All data and model computations stay in the database in both phases. In addition, we utilize Lindorm TSDB's features such as timeseries layout and query push-down to achieve batched, distributed parallel and near-data training and inference optimizations, thus enabling efficient time series data analysis (addressing Challenge C3).

4 SYSTEM DESIGN

4.1 Distributed Architecture

Lindorm TSDB exploits a distributed architecture that combines the best of shared-nothing and shared-storage. In particular, the shared-nothing architecture makes the database horizontally scalable, while the cloud-native shared storage gives the database elasticity and high availability. The time series data is physically stored in the reliable shared storage. When scaling the TSDB (e.g., adding or removing a node), the downtime can be minimized since no data migration is required, improving the quality of service.

Our logical sharding strategy shards time series data along two dimensions: time and timeseries identifier (the identifier is uniquely determined by a set of tags and one field). For a data point, we first determine the shard group assignment based on its timestamp. A shard group contains multiple shards that all manage data points from the same time range (t0, t1]. Data points are then routed to shards in the group based on their identifiers' hash values. When the number of database nodes changes (e.g., scales out), the number of shards needs to change as well, i.e., a new shard group will be automatically generated. This design avoids the massive data migration caused by data redistribution. As shown in Figure 3, when the number of shards increases at time T, a new shard group is created to manage all data generated after T, while all previous shard groups remain unchanged. In this way, historical data points stay in their original shards, so they do not need migration. We observe that monitoring systems rarely query and write historical data, and it is worthwhile to keep the distribution of historical data unchanged in exchange for stable system performance.

Figure 3: Lindorm TSDB sharding (arrows with different colors mean different timeseries)

Next, we discuss how Lindorm TSDB organizes data physically. The newly ingested data of a shard is first written to the Write-Ahead Log (WAL) on the shared storage, and then to the memory on the TSCore. Periodically, the data in memory is flushed to the shared storage.
The mapping relations between shards and TSCores are stored in Apache ZooKeeper [2] as metadata. In this manner, compute and storage can be separated. TSCores only need to read from and write to the shared storage, and each node is able to access all timeseries as well as the metadata. If one TSCore fails, other active nodes can instantly take over its requests. In this case, the metadata needs to be updated, and then the unflushed data in the failed node's memory is restored on the active node using the WAL.

With the help of the distributed architecture above, both query and write requests can be executed in parallel on multiple TSCores, bringing high efficiency. When ingesting data, each data point is routed to the corresponding TSCore, and then written to the shard. In a query, we first determine whether the query can be routed to certain shards according to the query conditions, e.g., the query carries a primary key or a complete tag set. Otherwise, the query is broadcast to all TSCores, each of which executes the query on all shards it manages.

4.2 TSM Storage Engine

Lindorm TSDB employs an LSM-Tree-like (Log-Structured Merge Tree) storage optimized for time series data, which we call TSM (Time-Structured Merge Tree) [19], as shown in Figure 4. By taking the characteristics of time series data into account, TSM optimizes the WAL writing, memory organization, compression algorithm, and compaction policy over a standard LSM.

The compaction policy is designed to deal with TSD files of different sizes. The compaction ensures that data belonging to the same timeseries and time period only resides in a single TSD file, which reduces the number of TSD files to be scanned during a query. During compaction, TSD files and indexes are dropped if their TTLs (Time-To-Live) are set and have expired. In addition, we mark cold TSD files based on their timestamps so that Lindorm DFS can automatically transfer them to a cheaper storage medium in the compaction process.

Time series customized compression. We use dictionary encoding, Delta-of-Delta, XOR, ZigZag, RLE and other compression algorithms to compress timeseries, achieving up to a 15× compression ratio. Recall that in our data model (Section 2.1), a timeseries is identified by a combination of field and tags, and a write request may write multiple timeseries with the same tags. Hence, fields and tags of different timeseries contain a large amount of redundant information. Note that values from the same timeseries often change smoothly over time, which makes compression effective. We use different compression methods for different data. Lock-free compression is applied to in-memory data to improve memory utilization. WAL logs are compressed by dictionary compression in batches to reduce I/O and improve throughput. In persistent TSD files, data points from the same timeseries over a continuous period are composed into a data chunk, which is internally compressed using Delta-of-Delta, XOR, ZigZag, and RLE.

Table 3: Forward and inverted indexes (key ⇒ value)

Forward index:
  hostname=host-a&region=ap-1 ⇒ 1
  hostname=host-b&region=ap-1 ⇒ 2
  1 ⇒ hostname=host-a&region=ap-1
  2 ⇒ hostname=host-b&region=ap-1
Inverted index:
  hostname=host-a ⇒ 1
  hostname=host-b ⇒ 2
  region=ap-1 ⇒ 1, 2
be created in the Memtable’s forward index. After that, each tag SQL with a relational-like data model. It extends the syntax for time
in the new timeseries is updated in the Memtable’s inverted index. series queries while still being compatible with ANSI SQL standard.
When the flush is triggered, Both forward and inverted indexes in As discussed in Section 2.2, in monitoring systems, users are
the Memtable will be written to the shared storage, generating new often less interested in individual data points, but more on the
FwdIdx files and InvIdx files, respectively. aggregated analysis of multiple data points, e.g., the average metrics
To speed up the index lookups on disk, we perform a series of within a minute. Lindorm TSDB extends the standard SQL based
optimizations. First, the index files are merged in the background on Apache Calcite [10] with one new syntax sample by for the
to reduce the total number of files. Second, we add a bloom filter to downsampling query:
each file, through which unrelated files can be filtered out quickly. SELECT max(cpu_user) WHERE hostname='host-a'
The bloom filters are cached in memory to further speed up the AND timestamp >= '2023-1-1 12:00' sample by '10min'
file filtering. Besides, we use a block cache to cache index files in
memory to reduce storage accesses. Write optimization. The time series data ingestion process can
Compared to inverted indexes, forward indexes are accessed be characterized as a bulk repetition of simple INSERT SQL state-
much more frequently. During the write process, Forward indexes ments. We observe that parsing SQL directly using Calcite results in
are looked up to determine the existence of timeseries. In the inter- very low write throughput, because the SQL parser and execution
timeseries aggregate query, we also need to obtain the tags of plan generator in Calcite consume a lot of CPU cycles. To opti-
the timeseries from forward indexes. As a result, the efficiency of mize the performance of above two parts, we design a fast path for
searching forward indexes is crucial. Hence, in addition to block write processing, as depicted in Figure 5. The vast majority of write
cache, we design an additional layer of cache for the forward index, statements are very simple, containing only three elements: tag
called seriescache. While the block cache is used to cache file data, set, timestamp, and field value. It is very easy to parse them even
the seriescache only stores the mapping between timeseries IDs without the sophisticated parser in Calcite. Therefore, we have im-
and tags that are accessed recently, consuming less space. The plemented a small parser that only handles simple write statements,
block cache and seriescache both use the LRU policy. In those cases and it is only responsible for parsing out the time series related
that the tag lengths vary much or are too long, seriescache may information. This parser is invoked first upon Lindorm TSDB re-
occupy a lot of memory. Fortunately, we can optionally use the MD5 ceives a SQL statement. If the parsing is successful, the data points
values instead of the original tags to reduce the memory footprint. are bypassed Calcite and sent directly to the execution engine, oth-
We observe that in real-world monitoring systems, MD5-encoding erwise it will continue to go through Calcite as normal. We observe
seriescache can cache up to 5× of items than the original version. that the write throughput in fast path mode is 15× higher than that
When looking up inverted indexes, we need to conduct intersec- in Calcite path mode. In addition, SQL prepare statement can be
tion operations on the posting lists. For example, when the query used for batch write optimization in clients. Our tests have shown
conditions are hostname=‘host-a’ and region=‘ap-1’, we first that by combining the fast path and prepare statement execution,
find the posting lists corresponding to these two conditions, which we can achieve 20× of write throughput improvement.
are {2} and {1,2}, respectively. Then, we get the intersection of two Extending Calcite or customized implementation Original Calcite
lists, which is {2} here. We use RoaringBitmap [26] as the data struc-
Avatica Server
ture for the posting list. Compared with integer type timeseries IDs,
bitmap saves much space and supports fast set operations. Lindorm SQL Meta
Calcite Insert
Lindorm TSDB supports standard SQL syntax, as well as extended DownSample Query
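As a hedged illustration of the statements the fast path targets, a simple write carries only a tag set, a timestamp and a field value; the table and column names below are assumptions consistent with Table 2, not verbatim examples from Lindorm TSDB.

INSERT INTO cpu (hostname, region, datacenter, timestamp, cpu_user) VALUES ('host-a', 'ap-1', 'ap-1a', 1670398200, 10);
INSERT INTO cpu (hostname, region, datacenter, timestamp, cpu_user) VALUES (?, ?, ?, ?, ?);

The second form shows the same write as a client-side prepared statement, which clients can bind and batch to obtain the roughly 20× throughput improvement reported above.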
In order to improve the execution efficiency of aggregate queries, Lindorm TSDB takes advantage of the distributed storage of time series data and optimizes it from two aspects: pre-downsampling and a pipelined execution engine.

Pre-downsampling. A naive approach for the downsampling query 'sample by t' is to scan each related data point, divide the points into different t-time windows according to their timestamps, and then compute the aggregated value for each window at query time. Pre-downsampling instead performs such window aggregation ahead of time (the computation occurs during compaction, as discussed in our lessons learned), so that queries can read the pre-aggregated values and avoid retrievals and computations on the original timeseries data.

Pipelined execution engine. Lindorm TSDB implements a pipelined execution engine in both TSProxy and TSCore, based on which the query push-down feature is implemented, allowing some computations to be distributed and parallelized among multiple TSCore nodes. In addition, the pipelined execution engine can also process queries in parallel between multiple shards within one TSCore and between multiple timeseries to further improve query performance. Take the following query as an example:

SELECT device_id, region, time, last(temperature) AS temperature FROM sensor SAMPLE BY 5m;
Figure 7: Lindorm TSDB pipelined execution engine

TSDB node, without the participation of Lindorm ML, thus reusing the high availability and scalability of the TSDB services naturally. In addition, the management of model data and metadata is shared between Lindorm ML and TSDB nodes through the underlying distributed file system and ZooKeeper. Lindorm ML utilizes TSDB's distributed storage and querying of time series data to propose a model partitioning design and implementation. A user-created logical model actually consists of many physical models, which correspond to different timeseries data. These physical models are divided into model partitions according to the partitioning of the timeseries. This design makes it possible to use the query push-down technology of the TSDB execution engine to push down model training and inference, further enabling distributed parallel, near-data training and inference optimization.

5.1 In-Database Training

Figure 9 depicts the Lindorm ML training procedure. TSProxy on the TSDB node receives the user's CREATE MODEL command and, after partially decoding the SQL syntax, delivers it to the Lindorm ML node in the same cluster. The SQL query is then fully parsed by Lindorm ML. The CREATE MODEL statement executes in two steps: First, a model management module generates the model's metadata, including the model name, task, algorithm, and so on, and persists it in ZooKeeper; then, a train() internal SQL function call is sent back to the TSProxy that performs the model's training process. By design, the training function, as the TrainingOp operator of the TSDB pipeline, can be pushed down to the TSCore nodes for distributed execution. Before entering the training operator TrainingOp, the pipelined execution engine processes the data in two steps: the SeriesScanOp operator extracts relevant features and the PreProcessingOp operator performs the necessary data preprocessing. In Lindorm TSDB, the SeriesScanOp, PreProcessingOp and TrainingOp operators all process each individual timeseries separately, thereby naturally satisfying the input data requirement of time series machine learning algorithms (e.g., anomaly detection and forecasting).

The model management module ModelManager in the Lindorm ML plugin manages the model partitioning, persists the trained model data to Lindorm DFS, and updates the model metadata stored in ZooKeeper (e.g., the training progress and evaluation metrics).

When the training operators are pushed down to multiple TSCore nodes for execution, the physical models trained from the timeseries on one TSCore node naturally form a partition. The advantage of this approach is that it is easy to adapt to the scenarios of adding/deleting and failover of TSDB nodes. If the training operator is not pushed down but executed on TSProxy, the physical models will not be partitioned. In this case, multiple timeseries are cached in the training operator, thus enabling batch training and improving efficiency. In summary, utilizing the distributed storage of TSDB data and the operator push-down technique enables Lindorm ML's batch, distributed parallel and near-data training optimization.
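To make the interface concrete, the following is a hedged sketch of a CREATE MODEL statement as described above. The clause names (the FROM subquery, TASK, ALGORITHM) as well as the model and table names are our assumptions rather than Lindorm TSDB's documented syntax; only CREATE MODEL, the internal train() step, and the OneShotSTL algorithm are taken from the text.

CREATE MODEL cpu_anomaly_model
  FROM (SELECT hostname, timestamp, cpu_user FROM cpu WHERE region='ap-1')
  TASK anomaly_detection
  ALGORITHM 'OneShotSTL';

Because the subquery is executed by the pipelined engine, each timeseries selected here would be trained into its own physical model, partitioned by the TSCore node that owns it.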
Algorithm support. We support popular statistical and deep-learning-based time series anomaly detection and forecasting algorithms (e.g., ARIMA [6], DeepAR [33], TFT [27]) provided by open-source algorithm packages. Further, we support our in-house online algorithms for real-time anomaly detection [17]. All these algorithms are uniformly managed by the Lindorm ML plugin on the TSDB node.
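The inference and metadata interfaces discussed in Sections 5.2 and 5.3 below can be sketched in the same spirit. The anomaly_detect() function name and "SHOW MODELS" are taken from the text; passing the model name as a function argument and the remaining table and column names are assumptions.

SELECT hostname, timestamp, anomaly_detect(cpu_user, 'cpu_anomaly_model') FROM cpu WHERE region='ap-1';
SHOW MODELS;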
5.2 In-Database Inference

Unlike the training process, inference can be done entirely on the TSDB node. The SQL inference function called by the user (e.g., anomaly_detect()) is also a TSDB pipelined execution engine operator that can be pushed down and executed. Similarly, before the inference operator InferenceOp obtains its input data, the data is first processed by the timeseries scan operator SeriesScanOp and the preprocessing operator PreProcessingOp at the pipelined execution engine layer. The inference operator InferenceOp also calls the Lindorm ML plugin, which finds the corresponding model and algorithm from the model metadata according to a user-specified model name. When loading a model, the model partition corresponding to the input data is found according to the same timeseries routing rules. As with the training process, batch, distributed parallel and near-data inference optimization can be achieved when multiple TSCore nodes are involved in the inference function.

5.3 Model Metadata Management

The metadata of models is maintained in ZooKeeper, to be consistent with the way that the TSDB manages the metadata of tables. We have extended the implementations of Schema and Table in the SQL layer, so that the model metadata can be queried as if it were a table. We also encapsulate the syntactic sugar "SHOW MODEL(S)" statements to simplify usage.

6 EVALUATIONS

We evaluate Lindorm TSDB in four aspects. We first compare Lindorm TSDB with two popular open-source TSDBs on write (Section 6.2) and query (Section 6.3) performance. Then, we evaluate the efficiency of Lindorm ML on time-series machine learning tasks (Section 6.4). Finally, we study the contributions of the main components in Lindorm TSDB to the overall performance (Section 6.5).

• TimescaleDB-3: three-node TimescaleDB.
• Lindorm-1: single-node Lindorm TSDB.
• Lindorm-3: three-node Lindorm TSDB.

Configurations. For InfluxDB, we tune its cache limits to get the best performance. Specifically, we set its cache-max-memory-size to 16g, cache-snapshot-memory-size to 4g, and GOGC to 30. For TimescaleDB, we adjust the configuration for TimescaleDB-3 according to its official guidelines to achieve the best performance. We deploy an additional access node for TimescaleDB-3.

6.2 Writing Performance Evaluation

We evaluate the write throughput, i.e., the number of data points ingested into the database per second, of each database at different timeseries scales. We use the DevOps data generated by the Time Series Benchmark Suite, TSBS [3], as the insertion test data. In particular, TSBS generates 101 timeseries for each host to represent different types of system or application metrics, e.g., CPU usage, number of disk I/Os, number of nginx requests, etc. Each timeseries contains about 11 tags.

In TSBS, we adjust the number of timeseries generated by changing the number of hosts, host_scale. The number of timeseries equals host_scale * 101. To improve write performance, we set a large write batch for each database, i.e., 10000, and we also set the number of workers to the number of cores, i.e., 16 for single-node databases and 48 for three-node databases.
encoding method for timeseries tags makes the seriescache able to cache the tags and IDs of numerous timeseries that have recent data writes. Thus, when host_scale increases to 100000, the number of accesses to the on-disk index does not increase much in Lindorm TSDB.

6.3 Query Performance Evaluation

In the query evaluation, we adjust the DevOps data generation in TSBS and collect 1 timeseries for each host, so the total number of timeseries equals host_scale. In this way, one query can hit more timeseries at the same host_scale. For each query, we restart the databases, repeat it 5 times with different filter conditions, and report the average latency.

Table 4 describes the three query patterns that we have mentioned in Section 2.2. Q1 and Q2 use region as the filter tag, e.g., WHERE region=ap-1, and hit host_scale/9 timeseries in each query. There is no tag selector in Q3, and thus Q3 queries all timeseries in the TSDB.

Table 4: Three query patterns

Q1 - Latest value: the last data points of timeseries in 1 region.
Q2 - Downsampling: aggregate on each timeseries in 1 region per 5 minutes for 1 hour.
Q3 - Inter-timeseries aggregate: aggregate on all timeseries in each region per 5 minutes for 2 hours.
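For reference, the three patterns in Table 4 can be sketched in Lindorm TSDB's SQL dialect roughly as follows. These are hedged illustrations only: the table and field names follow the TSBS DevOps convention used elsewhere in this paper, the explicit time ranges stand in for "1 hour" and "2 hours", and the exact composition of SAMPLE BY with tag filters and GROUP BY is an assumption.

SELECT hostname, last(cpu_user) FROM cpu WHERE region='ap-1';
SELECT hostname, max(cpu_user) FROM cpu WHERE region='ap-1' AND timestamp >= '2023-01-01 12:00' AND timestamp < '2023-01-01 13:00' SAMPLE BY '5min';
SELECT region, avg(cpu_user) FROM cpu WHERE timestamp >= '2023-01-01 12:00' AND timestamp < '2023-01-01 14:00' SAMPLE BY '5min' GROUP BY region;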
Table 5: Q1’s query latency (ms) lected by higher level, significantly improving the efficiency. When
the query hits 1M timeseries, both single-node and three-node Lin-
InfluxDB Lindorm TimescaleDB dorm TSDB outperform other TSDBs by a large margin. It is worth
Host Scale
1-node 1-node 3-node 1-node 3-node noting that single-node TimescaleDB outperforms three-node ver-
10000 22 44 53 237 210
sion at small scales. by checking the query execution process in
100000 145 90 95 1530 1610 three-node TimescaleDB, we find that the query tasks on partitions
1000000 2083 464 284 211689 13452 are executed serially. It is probably because region tag is not set as
the partition key, which is hostname. We run queries where the data
Table 5 shows the results for Q1, the latest value query. At smaller are aggregated by hostname and find that computations in partions
host scales (10,000 and 100,000), InfluxDB and Lindorm TSDB per- are parallel, which verifies our hypothesis. When the timeseries
form closely. This is because Lindorm TSDB needs to push down scale becomes very large, the memory of single-node TimescaleDB
the query and collect results from all shards or nodes through RPC. was not enough for such large amount of data, so the performance
The time consumed by RPC is not negligible when the total la- drops heavily.
tency is low. But at the large scale such as 1M, the query latency
of InfluxDB is 4.48× as high as that of single-node Lindorm TSDB 6.4 Advanced Time-Series Analysis Evaluation
and 7.33× as high as that of three-node Lindorm TSDB. Because We evaluate the efficiency of Lindorm ML in performing time-series
Lindorm TSDB can push the query down to the storage engine anomaly detection tasks. We still use the data generation approach
and can scan multiple timeseries parallelly to get their last data in Section 6.3 to prepare data for machine learning tasks. Each
points. TimescaleDB is not able to utilize the index on timestamp timeseries contains two consecutive segments of data for training
in the latest value query hitting multiple timeseries [24], resulting and inference, both of which are one-day long.
in particularly low efficiency. In evaluation, we create training and inference tasks at different
For the downsampling query whose results are in Table 6, it timeseries scales (10,000 and 100,000) via SQL provided by Lindorm
requires more data points computed than Q1. And for an aggrega- ML, where we run OneShotSTL [17] as anomaly detection algorithm.
tion query such as Q2 and Q3, the number of returned values is Meanwhile, we run the same algorithm outside Lindorm TSDB
much smaller than the number of data points involved in the query. for training and inference as the baseline. Specifically, we first read
Therefore, Lindorm TSDB’s streaming optimization in the execu- data from Lindorm TSDB and then apply OneShotSTL to them. We
tion engine reduces a lot of memory footprint and data transfer record the time spent in each way respectively.
consumption. The larger the timeseries scale, the more significant As shown in Table 8, compared to performing machine learning
the advantage of Lindorm TSDB over other TSDBs. Compared with externally, Lindorm ML consumes about half the time for both
As shown in Table 8, compared to performing machine learning externally, Lindorm ML consumes about half the time for both training and inference at different scales. This is because Lindorm ML reduces the time-consuming transmission of the raw data. In addition, various optimizations in the pipelined execution engine also improve the efficiency of the machine learning computations.

Table 8: Efficiency of time-series anomaly detection

Table 11: Write throughput (M/s) of Lindorm TSDBs with different numbers of nodes

Host Scale | 2-node | 4-node | 6-node
10000      | 5.05   | 11.55  | 19.78
100000     | 5.08   | 11.14  | 19.01
1000000    | 4.64   | 10.99  | 18.06
finishes replaying all records in the WAL. To address this, we designed an asynchronous WAL replaying mechanism, which allows the shard to start serving write requests immediately after it is started. The read service is enabled only after the replaying is completed, to ensure data consistency. This guarantees high availability of write requests as a priority after shard migration. With this feature enabled, the write service interruption time drops from minutes to seconds.

• The adoption of a schematized multi-field data model and the support of SQL syntax not only helps users understand the time series data model and simplifies its usage, but also facilitates troubleshooting for DBAs. For example, we can use SQL "explain" to see whether the entire execution plan meets expectations. Additionally, it allows for easy integration with third-party ecosystems.

• Enabling the pre-downsampling feature effectively reduces query latency by 80% in our businesses, at the cost of an 8% increase in storage space. This cost is manageable with the storage tiering feature in Lindorm DFS. Also, since the computation occurs during compaction, the additional CPU usage is minimal, at less than 5%. Compared to instant computing at query time or using features like Continuous Query [21], the resource consumption is significantly lower.

• In its early versions, Lindorm TSDB did not have a pipelined execution engine. When a large amount of data was queried, all of the data had to be read out at once and cached in memory for calculation. This led to memory exhaustion with FullGC and affected the service, making it difficult to meet the needs of supporting important business operations, such as the dashboard of Alibaba's Global Shopping Festivals. The newly designed pipelined execution engine solves this problem and improves performance by at least 10×.

• In monitoring scenarios, the latest value query is often used to check the health status of the system. This requires high QPS and low latency. To address this, we have designed a cache specifically for this query. The latest value of each timeseries is cached when queried and is updated when new data points are written to that timeseries. After implementing this cache, query response time was reduced by 85%.
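The two operational habits above can be sketched in SQL. EXPLAIN and last() are used as in standard SQL and as shown earlier in this paper; the table and column names are assumptions for illustration.

EXPLAIN SELECT avg(cpu_user) FROM cpu WHERE region='ap-1' SAMPLE BY '5min';
SELECT last(cpu_user) FROM cpu WHERE hostname='host-a';

The first statement lets a DBA check whether aggregation is pushed down as expected; the second is the latest value query that the dedicated cache serves directly once the value has been cached.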
8 RELATED WORK

Time-series databases. There are many previous works focusing on time-series databases. OpenTSDB [8] uses HBase [1], a key-value database, to store time-series data points, where each data point is an individual data row with a rowkey. This leads to low data compression ratio and access efficiency. InfluxDB [18] develops the TSM storage architecture based on LSM, greatly improving the write throughput. But it lacks optimization in query execution (e.g., InfluxDB does not perform computation on multiple timeseries in one data partition in parallel). TimescaleDB [23] is a Postgres-based TSDB. It mainly relies on partitioning technology for parallel data ingestion and query. But its performance drops significantly when executing queries hitting multiple timeseries. QuestDB [32] is a column-oriented TSDB showing high single-node write performance, but it does not offer distributed deployment and scalability. Timon [11], BTrDB [5] and Peregreen [38] propose novel data structures for storing data points of the same timeseries. They can have fast response times for aggregate queries on data of a single timeseries across a long time range, which is not common in monitoring systems (see Table 1). Gorilla [31] proposes the delta-of-delta timestamps and XOR'd floating point values, which are widely used in existing TSDBs for data compression. TimeUnion [40] and ByteSeries [35] mitigate the high-cardinality problem by compressing the inverted index in memory, but they ignore the acceleration of access to the index on disk. There are also TSDBs designed for the Internet of Things (IoT) scenario, such as Db2 Event Store [15] and IoTDB [39]. They are not efficient for complex tag queries in monitoring systems. To tackle the increasing timeseries scale in monitoring systems, more and more TSDBs [4, 37] are deployed in a distributed way. They use shared-nothing architectures and suffer from performance degradation due to data migration in the case of node scaling.

In-database machine learning. To the best of our knowledge, no existing TSDBs integrate machine learning functions. The existing systems [7, 16, 25, 30, 34] that support in-database machine learning are limited to the relational data model. Although Oracle ML [30], Azure Data Explorer [28] and BigQuery ML [16] allow applying time series forecasting and anomaly detection to time-series data, they do not optimize the computation based on the characteristics of time-series data. Lindorm ML is inspired by SQL Server's Raven [25], which reuses the ONNX Runtime [29] inference engine for cross-optimization on relational and linear algebra. We also utilize open-source inference engines for specific ML computations.

9 CONCLUSION

In this paper, we first summarize the data scales and common query patterns in large-scale monitoring systems. Then we present Lindorm TSDB, a distributed time-series database that is designed for handling massive timeseries in large-scale monitoring systems. Lindorm TSDB combines a shared-nothing architecture and shared storage to scale nodes efficiently as the number of timeseries in systems increases. Lindorm TSDB adopts an optimized index structure with caches and a novel pipelined execution engine to achieve high write throughput and efficient processing of queries hitting a large number of timeseries. For better detection and diagnosis of system performance issues, Lindorm TSDB enables users to analyze data with ready-to-use anomaly detection and time series forecasting algorithms through SQL.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable suggestions and helpful opinions. We would also like to thank Yong Lin, Wei Zou, Songzheng Ma, Dengke He, Yaguang Li, Yuan Cui, Xiang Wang, Wenlong Yang, Yang Liu, Qingyi Meng, Xing Jin and Youdong Zhang, who contributed significantly to the development of Lindorm TSDB.

REFERENCES
[1] 2023. Apache HBase. https://hbase.apache.org/. Last accessed: 2023-07-07.
[2] 2023. Apache ZooKeeper. https://zookeeper.apache.org/. Last accessed: 2023-07-07.
[3] 2023. Time Series Benchmark Suite. https://github.com/timescale/tsbs. Last accessed: 2023-07-07.
[4] Colin Adams, Luis Alonso, Benjamin Atkin, John Banning, Sumeer Bhola, Rick Buskens, Ming Chen, Xi Chen, Yoo Chung, Qin Jia, et al. 2020. Monarch: Google's planet-scale in-memory time series database. Proceedings of the VLDB Endowment 13, 12 (2020), 3181–3194.
[5] Michael P Andersen and David E Culler. 2016. BTrDB: Optimizing storage system design for timeseries processing. In 14th USENIX Conference on File and Storage Technologies (FAST 16). 39–52.
[6] Adebiyi A. Ariyo, Adewumi O. Adewumi, and Charles K. Ayo. 2014. Stock Price Prediction Using the ARIMA Model. In Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation (UKSIM '14). IEEE Computer Society, USA, 106–112. https://doi.org/10.1109/UKSim.2014.67
[7] Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J Green, Monish Gupta, Sebastian Hillig, et al. 2022. Amazon Redshift re-invented. In Proceedings of the 2022 International Conference on Management of Data. 2205–2217.
[8] The OpenTSDB Authors. 2021. OpenTSDB. http://opentsdb.net/. Last accessed: 2023-07-07.
[9] The OpenTelemetry Authors. 2023. OpenTelemetry. https://opentelemetry.io/. Last accessed: 2023-07-07.
[10] Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J Mior, and Daniel Lemire. 2018. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221–230.
[11] Wei Cao, Yusong Gao, Feifei Li, Sheng Wang, Bingchen Lin, Ke Xu, Xiaojie Feng, Yucong Wang, Zhenjun Liu, and Gejin Zhang. 2020. Timon: A timestamped event database for efficient telemetry data processing and analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 739–753.
[12] Alibaba Cloud. 2023. Alibaba Cloud ESSDs. https://www.alibabacloud.com/help/en/elastic-compute-service/latest/essds. Last accessed: 2023-07-07.
[13] Alibaba Cloud. 2023. Alibaba Cloud OSS. https://www.alibabacloud.com/product/object-storage-service. Last accessed: 2023-07-07.
[14] Alibaba Cloud. 2023. Alibaba ECS. https://www.alibabacloud.com/product/ecs. Last accessed: 2023-07-07.
[15] Christian Garcia-Arellano, Hamdi Roumani, Richard Sidle, Josh Tiefenbach, Kostas Rakopoulos, Imran Sayyid, Adam Storm, Ronald Barber, Fatma Ozcan, Daniel Zilio, et al. 2020. Db2 Event Store: a purpose-built IoT database engine. Proceedings of the VLDB Endowment 13, 12 (2020), 3299–3312.
[16] Google. 2023. BigQuery ML. https://cloud.google.com/bigquery/docs/bqml-introduction. Last accessed: 2023-07-07.
[17] Xiao He, Ye Li, Jian Tan, Bin Wu, and Feifei Li. 2023. OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting. Proc. VLDB Endow. 16, 6 (2023), 1399–1412.
[18] InfluxData Inc. 2023. InfluxDB. https://docs.influxdata.com/influxdb/v2.6/. Last accessed: 2023-07-07.
[19] InfluxData Inc. 2023. InfluxDB TSM. https://docs.influxdata.com/influxdb/v1.3/concepts/storage_engine/. Last accessed: 2023-07-07.
[20] InfluxData Inc. 2023. InfluxQL. https://docs.influxdata.com/influxdb/v1.8/query_language/. Last accessed: 2023-07-07.
[21] InfluxData Inc. 2023. InfluxQL Continuous Queries. https://docs.influxdata.com/influxdb/v1.8/query_language/continuous_queries/. Last accessed: 2023-07-07.
[22] TimeScale Inc. 2020. TimescaleDB vs InfluxDB. https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/. Last accessed: 2023-07-07.
[23] TimeScale Inc. 2023. TimeScaleDB. https://www.timescale.com. Last accessed: 2023-07-07.
[24] TimeScale Inc. 2023. TimeScaleDB does not use index in the last(). https://docs.timescale.com/api/latest/hyperfunctions/last/. Last accessed: 2023-07-07.
[25] Konstantinos Karanasos, Matteo Interlandi, Doris Xin, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Supun Nakandal, Subru Krishnan, Markus Weimer, et al. 2019. Extending relational query processing with ML inference. arXiv preprint arXiv:1911.00231 (2019).
[26] Daniel Lemire, Gregory Ssi-Yan-Kai, and Owen Kaser. 2016. Consistently faster and smaller compressed bitmaps with Roaring. Software: Practice and Experience 46, 11 (2016), 1547–1569.
[27] Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764.
[28] Microsoft. 2023. Azure Data Explorer. https://azure.microsoft.com/en-us/products/data-explorer. Last accessed: 2023-07-07.
[29] Microsoft. 2023. ONNX Runtime. https://onnxruntime.ai/. Last accessed: 2023-07-07.
[30] Oracle. 2023. Oracle Machine Learning for SQL. https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmcon/time-series.html. Last accessed: 2023-07-07.
[31] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8, 12 (2015), 1816–1827.
[32] QuestDB. 2023. QuestDB. https://questdb.io/. Last accessed: 2023-07-07.
[33] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
[34] Maximilian Schüle, Frédéric Simonis, Thomas Heyenbrock, Alfons Kemper, Stephan Günnemann, and Thomas Neumann. 2019. In-database machine learning: Gradient descent and tensor algebra for main memory database systems. BTW 2019 (2019).
[35] Xuanhua Shi, Zezhao Feng, Kaixi Li, Yongluan Zhou, Hai Jin, Yan Jiang, Bingsheng He, Zhijun Ling, and Xin Li. 2020. ByteSeries: an in-memory time series database for large-scale monitoring systems. In Proceedings of the 11th ACM Symposium on Cloud Computing. 60–73.
[36] solid IT. 2023. DB-Engines Ranking of Time Series DBMS. https://db-engines.com/en/ranking/time+series+dbms. Last accessed: 2023-07-07.
[37] TDengine. 2023. TDengine. https://tdengine.com/. Last accessed: 2023-07-07.
[38] Alexander A Visheratin, Alexey Struckov, Semen Yufa, Alexey Muratov, Denis Nasonov, Nikolay Butakov, Yury Kuznetsov, and Michael May. 2020. Peregreen - modular database for efficient storage of historical time series in cloud environments. In Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference. 589–601.
[39] Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin A McGrail, Peng Wang, et al. 2020. Apache IoTDB: Time-series Database for Internet of Things. Proceedings of the VLDB Endowment 13, 12 (2020), 2901–2904.
[40] Zhiqi Wang and Zili Shao. 2022. TimeUnion: An Efficient Architecture with Unified Data Model for Timeseries Management Systems on Hybrid Cloud Storage. In Proceedings of the 2022 International Conference on Management of Data. 1418–1432.