
Lindorm TSDB: A Cloud-native Time-series Database for Large-scale Monitoring Systems

Chunhui Shen‡§∗ , Qianyu Ouyang‡†∗ , Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su,
Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan,
Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei† , Wei Zhang, Feifei Li
Alibaba Group‡ Zhejiang University§ Tsinghua University†
{tianwu.sch,ouyangqianyu.oyqy,lizi,qingzhi.lzp,longcheng.zlc,yunxing.zyj,suqing.sq}@alibaba-inc.com
{yutianhuan.yth,claude.yy,jianhong.hjh,mingyan.zc,wenbo.wb,zhenghanbang.zhb,xulunfan.xlf}@alibaba-inc.com
{zhikuan.psc,binwu.wb,xiao.hx,liye.li,j.tan,sh.wang}@alibaba-inc.com
[email protected],{zwei,lifeifei}@alibaba-inc.com

ABSTRACT

Internet services supported by large-scale distributed systems have become essential for our daily life. To ensure the stability and high quality of services, diverse metric data are constantly collected and managed in a time-series database to monitor the service status. However, when the number of metrics becomes massive, existing time-series databases are inefficient in handling high-rate data ingestion and queries hitting multiple metrics. Besides, they all lack support for machine learning functions, which are crucial for sophisticated analysis of large-scale time series. In this paper, we present Lindorm TSDB, a distributed time-series database designed for handling monitoring metrics at scale. It sustains high write throughput and low query latency with massive active metrics. It also allows users to analyze data with anomaly detection and time series forecasting algorithms directly through SQL. Furthermore, Lindorm TSDB retains stable performance even during node scaling. We evaluate Lindorm TSDB under different data scales, and the results show that it outperforms two popular open-source time-series databases on both writes and queries, while executing time-series machine learning tasks efficiently.

PVLDB Reference Format:
Chunhui Shen, Qianyu Ouyang, Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su, Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan, Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei, Wei Zhang, Feifei Li. Lindorm TSDB: A Cloud-native Time-series Database for Large-scale Monitoring Systems. PVLDB, 16(12): 3715 - 3727, 2023. doi:10.14778/3611540.3611559

∗ These authors contributed equally to this work.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 16, No. 12 ISSN 2150-8097. doi:10.14778/3611540.3611559

1 INTRODUCTION

Nowadays, a large-scale service is usually built atop tens of thousands of micro-service applications and physical machines, such that sustaining the reliability of the service becomes extremely challenging. To address this issue, monitoring systems play an indispensable role: they constantly collect massive and diverse metric data to monitor the status of the entire service. They provide real-time analysis on the metric data to identify performance issues (e.g., via diagnosis and alerting) or to prevent such issues by triggering actions in advance (e.g., resource scale-up).

The metric data handled by monitoring systems is inherently a type of time-series data, where a metric (e.g., a machine's CPU usage) is modeled as a timeseries. A timeseries consists of a sequence of data points collected over time, and each data point contains a timestamp and a field value. Each timeseries is attached a set of tags, which collectively describe different attributes of a metric. For example, a CPU usage metric usually contains three tags: datacenter, region, and hostname. In a large-scale service, massive timeseries are generated from a variety of data sources: performance indicator metrics (e.g., CPU, memory and network usage) come from each host or container; application-oriented metrics (e.g., request rate and response time) come from each micro-service. Typically, an e-commerce service contains billions of timeseries and generates hundreds of millions of data points every second. At such a scale, it is extremely challenging both to write the data points into the monitoring system and to analyze them in real time.

To help characterize the traits of metric data and understand the difficulty of handling them, we study two application performance monitoring systems from our real-world businesses. Table 1 lists the timeseries workload statistics from these systems, namely Sys-A and Sys-B. Both systems collect metric data at very high rates, 200M and 150M points per second, respectively. Meanwhile, tag cardinalities are large in both systems, while some tags have thousands of distinct values. The combination of the tags (e.g., more than ten tags per timeseries in Table 1) results in billion-scale timeseries. More than 60% of the timeseries are daily active (i.e., have newly arrived data points). Consequently, high data ingestion capacity is required of the underlying monitoring system, especially when active timeseries are massive. In addition to data ingestion, the large scale of timeseries also complicates query processing. For example, in Sys-B, a single query hits more than a thousand timeseries, whose data points are retrieved for aggregative analysis. Even worse, this number reaches a million in Sys-A.

Table 1: Performance indicator monitoring workloads in two real-world monitoring systems

                                  Sys-A              Sys-B
No. of tags per timeseries        14                 13
No. of total timeseries           ≥ 1 billion        0.5 billion
No. of daily active timeseries    0.6 billion        0.4 billion
Data point sampling interval      15s ∼ 2h           15s ∼ 2h
No. of data points per second     200 million        150 million
Time range in a query             80% in [1h, 6h]    80% in [1h, 3h]
No. of timeseries a query hits    0.1 ∼ 6 million    1 ∼ 20 thousand
In practice, a time-series database (TSDB) is used as the backbone of the above monitoring systems to manage metric data and support queries [9]. However, we observe that it is highly inefficient for existing TSDBs to handle data ingestion and queries over massive timeseries. Besides, they all lack support for machine learning (ML) functions, requiring prohibitive efforts to implement complex ML algorithms on time series (e.g., anomaly detection) and maintain corresponding services externally. In a nutshell, existing TSDBs are unable to fully meet the needs of monitoring systems in large-scale Internet services, facing four major challenges as follows:

C1: Low write throughput for massive timeseries. When writing data points into a TSDB, the set of tags for the target timeseries is given to the database as well. A common approach to dealing with these tags is to create a forward index, whose index entry maps the tag set to a timeseries id, i.e., a unique identifier internally used by the TSDB to distinguish timeseries. Since each index entry contains many tags (e.g., over ten in Table 1), the footprint of the forward index easily becomes overwhelming when a large number of timeseries are managed. This causes a high-cardinality problem, which makes the TSDB unable to accommodate the entire index in memory due to cost constraints, leading to low write throughput from memory swapping during index lookups. Existing TSDBs, such as InfluxDB [18] and TimeUnion [40], use conventional cache mechanisms (e.g., Block Cache, MMap) to accelerate on-disk index accesses. However, these mechanisms do not exploit the traits of time series, and hence still achieve unsatisfactory efficiency.

C2: High latency for queries that hit massive timeseries. A TSDB usually processes a query in two steps: first, given tags and time ranges, qualified data points from target timeseries are retrieved from the storage; second, computations are performed on these data points. In a large monitoring system, the first step usually hits a huge number of timeseries (e.g., reaching a million in Table 1). We notice that hit timeseries are usually further grouped by a certain tag for subsequent computation. However, existing TSDBs cannot efficiently obtain the tags of the hit timeseries from a large number of index entries. For the second step, the computational frameworks in existing TSDBs are not well parallelized. For example, TimescaleDB [23] cannot process data points in different partitions in parallel when asked to group data by a non-primary tag.

C3: Lack of advanced time series analysis capability. In a real-world service, the workload may vary dynamically over time. Hence, for the underlying monitoring system, rule-based analysis on metric data usually fails to recognize performance issues precisely. As a solution, practitioners have turned to machine learning algorithms for time series analysis in order to improve the precision of detecting and localizing performance issues. However, existing TSDBs have not fully integrated ML-based time series analysis. Consequently, users have to employ an external AI platform to handle tasks such as algorithm development, model training and inference. This not only complicates the overall architecture, but also introduces additional latency and data synchronization problems. Although some databases have supported ML-based data analysis [16, 28, 30], they do not optimize the execution process of ML algorithms for time series data, leading to poor performance.

C4: Inefficient adaptability to scale time series management. The number of timeseries in the monitoring system continuously increases as the business grows, where the quantities of both micro-services and machines expand along with more fine-grained metrics being monitored [35]. The underlying TSDB is required to continuously scale up to cope with such demands. However, existing TSDBs usually need to redistribute data when scaling out a new node, which is prohibitive in the consumption of both resources and time. One major reason behind this is that the compute and storage resources are tightly coupled. Currently, distributed TSDBs [4, 37] often have a shared-nothing architecture, where each node exclusively manages its own memory and disk space. When adding nodes to the cluster, they all suffer from high I/O pressure due to massive data migration. Although some TSDBs [15, 40] deploy a shared storage, this shared storage acts more as a cold storage layer to reduce storage costs, rather than improving scaling efficiency.

To address the above challenges, we present Lindorm TSDB, a distributed time-series database that is designed as a powerful backbone for large-scale monitoring systems with massive monitoring metrics. It sustains high write throughput when massive active timeseries exist. It also supports fast queries and ML-based analysis over massive timeseries. In addition, Lindorm TSDB is able to retain stable performance even when it encounters node failures or scaling. Our major contributions are summarized as follows:

• We design Lindorm TSDB, a distributed TSDB combining shared-nothing architecture and shared storage. It contains a cluster of compute nodes and a reliable shared storage, which are logically separated from each other. It partitions data into shards according to their time and tags, facilitating parallel data query and write. In a single shard, the optimized index structure and cache strategy further improve performance. (Target challenges C1/C2/C4; detailed in Section 4.)

• We design an efficient pipelined execution engine for Lindorm TSDB to support common and important types of queries on time series data. The execution engine not only parallelizes the computation into different shards, but also optimizes the computation across multiple timeseries within one shard. On top of that, users can directly use SQL to perform a variety of queries. (Target challenge C2; detailed in Section 4.4.)

• We design Lindorm ML, an integrated machine learning component inside Lindorm TSDB. It enables users to analyze data with anomaly detection and time series forecasting algorithms through SQL, eliminating the effort of operating data and models externally. More importantly, it takes advantage of Lindorm TSDB's data processing capability to achieve higher performance. (Target challenge C3; detailed in Section 5.)

• We conduct extensive experiments on a popular benchmark to verify the effectiveness of Lindorm TSDB and its major components. We compare it with two widely-used open-source TSDBs, InfluxDB and TimescaleDB. The results show that Lindorm TSDB is able to achieve higher write throughput as well as lower query latency compared to these baselines. (Detailed in Section 6.)
2 PRELIMINARIES

2.1 Data Model

Lindorm TSDB models metric data as time-series data in schematized tables. We make the data model consistent with the relational data model so that users can easily understand it and fit it into existing systems. There are three types of columns in each table: tags, fields and timestamp, as illustrated in Table 2. Tags describe different attributes of the data source that generates the metric data. A tag is a key-value pair (e.g., ⟨hostname, host-a⟩). At each timestamp (e.g., 1670398200), a data source produces various types of metric data (e.g., cpu_user and cpu_sys), and we refer to them as fields. A timeseries is uniquely identified by one field and all associated tags, i.e., cpu_user and cpu_sys above are two timeseries. A timeseries contains a sequence of data points from the same field, where each data point is a pair of ⟨timestamp, field value⟩. For example, in Table 2, the cpu_user values, timestamps and tags from the first and third rows form a timeseries. Here cpu_user is the field, [⟨hostname, host-a⟩, ⟨region, ap-1⟩, ⟨datacenter, ap-1a⟩] is the tag list, and the data points are ⟨1670398200, 10⟩ and ⟨1670398210, 11⟩.

Table 2: Example of Lindorm TSDB's data model (hostname, region and datacenter are tags; cpu_user and cpu_sys are fields)

hostname   region   datacenter   timestamp    cpu_user   cpu_sys
host-a     ap-1     ap-1a        1670398200   10         4
host-b     ap-1     ap-1a        1670398200   20         11
host-a     ap-1     ap-1a        1670398210   11         5
host-b     ap-1     ap-1a        1670398210   21         12

When writing data to Lindorm TSDB, the field, tags and the target table name are required. If the combination of the given field and tags is not present in the table, Lindorm TSDB creates a new timeseries.
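As a concrete illustration of the write model above, the sketch below ingests the first row of Table 2. The INSERT shape follows the relational-style model of Section 2.1, but the exact dialect is an assumption for illustration; "cursor" stands for any DB-API-style cursor connected to the database.

```python
# Illustrative only: the INSERT dialect is an assumed shape, not documented syntax.
def ingest_first_row(cursor):
    # A write carries the target table, the full tag set, a timestamp and field values.
    # If a (field, tag set) combination is new, Lindorm TSDB creates a new timeseries.
    cursor.execute(
        "INSERT INTO cpu (hostname, region, datacenter, timestamp, cpu_user, cpu_sys) "
        "VALUES ('host-a', 'ap-1', 'ap-1a', 1670398200, 10, 4)"
    )
```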

2.2 Query Patterns

When querying Lindorm TSDB, filtering conditions that consist of target fields, tag selectors and a time range should be provided:

SELECT max(cpu_user), sum(cpu_sys)
WHERE hostname='host-a' AND timestamp >= '2023-1-1 12:00'

All timeseries that match the tag selectors will be selected, and the data points in the specified time ranges are retrieved for subsequent computations.

In monitoring systems, the vast majority of queries can be divided into three categories according to their computation patterns: latest-value query, downsampling query, and inter-timeseries aggregate query. Figure 1 depicts how these three types of queries are performed, where four timeseries are hit by the queries. For brevity, the figure shows only a subset of the results of the downsampling query and the inter-timeseries aggregate query. A latest-value query retrieves the data point with the latest timestamp for each timeseries, which is important for real-time status monitoring of systems. A downsampling query groups the data points by a given time window in each timeseries, e.g., every three data points in Figure 1, and then the aggregated value such as sum or average for each window is returned. An inter-timeseries aggregate query groups and aggregates data points in all hit timeseries by specified columns, e.g., hostname and timestamp in Table 2, which is similar to the "group by" operation in relational databases.

Figure 1: Time series query type

In practice, downsampling queries and inter-timeseries aggregate queries are often used in combination. Taking Table 2 as an example, we may be interested in querying the average of cpu_user in each region for every 10 minutes within the last 24 hours.
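The combined pattern above can be expressed in a single statement. The sketch below is only indicative: the 'sample by' clause follows the extension introduced later in Section 4.4, while combining it with GROUP BY and the relative time predicate are assumptions about the dialect rather than documented syntax.

```python
# Hedged sketch of the Section 2.2 example: average cpu_user per region,
# in 10-minute windows, over the last 24 hours. "cursor" is any DB-API cursor.
query = (
    "SELECT region, timestamp, avg(cpu_user) FROM cpu "
    "WHERE timestamp >= now() - INTERVAL '24' HOUR "
    "GROUP BY region sample by '10min'"
)

def region_cpu_average(cursor):
    cursor.execute(query)  # downsampling + inter-timeseries aggregation in one query
    return cursor.fetchall()
```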
3 LINDORM TSDB OVERVIEW

Recall that Lindorm TSDB is designed to address the four challenges discussed in Section 1. Figure 2 shows Lindorm TSDB's overall architecture, which contains four major components, i.e., TSProxy, TSCore, Lindorm ML and Lindorm DFS. Among them, both TSProxy and TSCore can be scaled horizontally.

Figure 2: Lindorm TSDB Overview

Lindorm TSDB partitions data on TSProxy into different shards according to two dimensions: timeseries identifier and time. Each shard can be viewed as an independent storage engine exclusively managed by a single TSCore. A TSCore manages multiple shards and is responsible for executing data ingestion and query requests on these shards. Data that belongs to the same timeseries within a period is stored on the same shard, which facilitates query push-down optimization (Section 4.4). In a shard, data and corresponding indexes (e.g., Table 3) are first maintained in the memory of the TSCore they belong to, and later persisted to the shared Lindorm DFS for storage. Lindorm DFS is a distributed file system that provides an HDFS-compatible interface. It leverages Alibaba Cloud's storage infrastructure, i.e., ESSD [12] cloud disks and Object Storage Service [13]. This overall architecture (Section 4.1) combines both shared-nothing and shared-storage designs. It can therefore sustain both horizontal scalability and the ability for each TSCore to access any data, ensuring elasticity and high availability at the same time. In addition, the multidimensional sharding strategy (on timeseries identifier and time) avoids data migration when shards change dynamically, which effectively mitigates system performance degradation during node scaling (addressing Challenge C4).

In a shard, indexes need to be updated whenever a new timeseries is created. For fast lookups and maintenance, keeping all indexes in memory is an ideal choice. However, when the number of timeseries becomes massive, indexes consume a huge amount of space, which is known as the high cardinality problem that makes memory bloat. To solve this problem (addressing Challenge C1), Lindorm TSDB uses a structure similar to a Log Structured Merge tree (LSM-tree) to periodically flush in-memory indexes into the shared storage and merge them later. With this hybrid storage scheme, we query an index item by first looking it up in memory. If it misses in memory, we then access the shared storage. Since access to the shared storage is significantly slower, we apply a tailored cache policy to speed it up (Section 4.3). Moreover, considering that many historical timeseries are inactive, we use a time partitioning mechanism to boost memory utilization (Section 4.3).

To allow users to easily query timeseries, Lindorm TSDB supports SQL syntax. As introduced in Section 2.2, a single SQL statement often involves multiple timeseries and conducts aggregate operations in two dimensions, i.e., by time and by tags. Since data points from the same timeseries reside on the same shard and different timeseries reside on different shards, we propose a pipelined execution engine (Section 4.4) that supports computation push-down (addressing Challenge C2). This pipelined execution engine pushes down the query to all shards where hit timeseries are located, and completes the scanning of multiple timeseries in parallel. It then aggregates values back from the shards to the TSCore, and assembles the partial results from each TSCore into the final results. During this process, once aggregated values are computed, we can skip loading and transferring massive original data points, saving considerable memory and network resources. To further speed up the aggregation within one timeseries, we employ a pre-downsampling mechanism (Section 4.4) that reduces retrievals and computations on original timeseries data.

Apart from the above designs, we also propose Lindorm ML (Section 5), which integrates machine learning algorithms (e.g., anomaly detection, time series forecasting) inside Lindorm TSDB. Lindorm ML combines the data governance capabilities of a database and the data analysis capabilities of machine learning algorithms. It allows users to directly train machine learning models inside the database via SQL, and to use these models to make online inferences. All data and model computations stay in the database in both phases. In addition, we utilize Lindorm TSDB's features such as the timeseries layout and query push-down to achieve batched, distributed parallel and near-data training and inference optimizations, thus enabling efficient time series data analysis (addressing Challenge C3).

4 SYSTEM DESIGN

4.1 Distributed Architecture

Lindorm TSDB exploits a distributed architecture that combines the best of shared-nothing and shared-storage. In particular, the shared-nothing architecture makes the database horizontally scalable, while the cloud-native shared storage gives elasticity and high availability to the database. The time series data is physically stored in the reliable shared storage. When scaling the TSDB (e.g., adding or removing a node), the downtime can be minimized since no data migration is required, improving the quality of service.

Our logical sharding strategy shards time series data according to two dimensions: time and timeseries identifier (the identifier is uniquely determined by a set of tags and one field). For a data point, we first determine the shard group assignment based on its timestamp. A shard group contains multiple shards that all manage data points from the same time range (t0, t1]. Data points are then routed to shards in the group based on their identifiers' hash values. When the number of database nodes changes (e.g., scales out), the number of shards needs to change as well, i.e., a new shard group will be automatically generated. This design avoids the massive data migration caused by data redistribution. As shown in Figure 3, when the number of shards increases at time T, a new shard group is created to manage all data generated after T, while all previous shard groups remain unchanged. In this way, the historical data points stay in their original shards, so that they do not need migration. We observe that monitoring systems rarely query and write historical data, and it is worthwhile not to change the distribution of historical data in exchange for stable system performance.

Figure 3: Lindorm TSDB sharding (arrows with different colors mean different timeseries)
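The routing rule of Section 4.1 can be condensed into a few lines. The following is a simplified sketch under assumed details: the hash function, the in-memory group list and the concrete time ranges are placeholders, not Lindorm's actual implementation.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ShardGroup:
    start_ts: int        # group covers data points with timestamp in (start_ts, end_ts]
    end_ts: int
    shards: list

@dataclass
class Router:
    groups: list = field(default_factory=list)

    def add_group(self, group: ShardGroup):
        # Scaling out only appends a new group for future data; existing groups
        # (and the placement of historical data) are left untouched.
        self.groups.append(group)

    def route(self, series_id: str, ts: int) -> int:
        # 1) pick the shard group by timestamp, 2) pick the shard by hashing the identifier
        group = next(g for g in self.groups if g.start_ts < ts <= g.end_ts)
        h = int(hashlib.md5(series_id.encode()).hexdigest(), 16)
        return group.shards[h % len(group.shards)]

router = Router()
router.add_group(ShardGroup(0, 1000, shards=[1, 2]))          # before scale-out: 2 shards
router.add_group(ShardGroup(1000, 10**18, shards=[1, 2, 3]))  # after time T = 1000: 3 shards
print(router.route("cpu_user,hostname=host-a,region=ap-1", 1500))
```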
Next, we discuss how Lindorm TSDB organizes data physically. The newly ingested data of a shard is first written to the Write Ahead Log (WAL) on the shared storage, and then to the memory on the TSCore. Periodically, the data in memory is flushed to the shared storage. The mapping relations between shards and TSCores are stored in Apache ZooKeeper [2] as metadata. In this manner, compute and storage can be separated. TSCores only need to read from and write to the shared storage, and each node is able to access all timeseries as well as the metadata. If one TSCore fails, other active nodes can instantly take over its requests. In this case, the metadata needs to be updated, and then the unflushed data in the failed node's memory is restored on the active node using the WAL.

With the help of the distributed architecture above, both query and write requests can be executed in parallel on multiple TSCores, bringing high efficiency. When ingesting data, each data point is routed to the corresponding TSCore, and then written to the shard. For a query, we first determine whether it can be routed to certain shards according to the query conditions, e.g., whether the query carries a primary key or a complete tag set. Otherwise, the query is broadcast to all TSCores, each of which executes the query on all shards managed by it.

4.2 TSM Storage Engine

Lindorm TSDB employs an LSM-Tree-like (Log Structured Merge Tree) storage optimized for time series data, which we call TSM (Time Structured Merge Tree) [19], as shown in Figure 4. By taking the characteristics of time series data into account, TSM optimizes the WAL writing, memory organization, compression algorithms, and compaction policy over a standard LSM design.

Figure 4: Lindorm TSDB TSM storage engine

Similar to most LSM-based storages, data in TSM is first written to the append-only WAL to ensure durability and high write throughput. Then, it is written to the Memtable in memory, ready for subsequent accesses. When the Memtable accumulates to a certain threshold, a flush is triggered to persist it into storage according to the policy: a forward index file (FwdIdx file), an inverted index file (InvIdx file) and a time series data file (TSD file). All the files (i.e., FwdIdx files, InvIdx files and TSD files) are periodically compacted into new files in the background. A TSD file contains a batch of data chunks (containing timeseries), and it can quickly locate the timeseries in data chunks according to a timeseries ID. When a query arrives, the set of timeseries IDs that meet the query conditions is first retrieved from the InvIdx files. TSD files are then quickly filtered according to the query time range, and target data chunks are located by the qualified timeseries IDs.

Compaction. TSD file compaction happens in the background according to certain policies. We use a level compaction strategy to deal with TSD files of different sizes. The compaction ensures that data belonging to the same timeseries and time period only resides in a single TSD file, which reduces the number of TSD files to be scanned during a query. During compaction, TSD files and indexes are dropped if their TTLs (Time-To-Live) are set and have expired. In addition, we mark cold TSD files based on their timestamps so that Lindorm DFS can automatically transfer them to a cheaper storage medium during the compaction process.

Time series customized compression. We use dictionary encoding, Delta-of-Delta, XOR, ZigZag, RLE and other compression algorithms to compress timeseries, achieving up to a 15× compression ratio. Recall that in our data model (Section 2.1), a timeseries is identified by a combination of field and tags, and a write request may write multiple timeseries with the same tags. Hence, fields and tags of different timeseries contain a large amount of redundant information. Note that values from the same timeseries often change smoothly over time, which makes compression effective. We use different compression methods for different data. Lock-free compression is applied to in-memory data to improve memory utilization. WAL logs are compressed with dictionary compression in batches to reduce I/O and improve throughput. In persistent TSD files, data points from the same timeseries over a continuous period are composed into a data chunk, which is internally compressed using Delta-of-Delta, XOR, ZigZag, and RLE.
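To see why timestamp columns compress so well, here is a minimal delta-of-delta encoder with a ZigZag mapping, assuming a regularly sampled series. Lindorm's actual chunk format (bit packing, XOR encoding of float values, RLE) is more elaborate; this sketch only illustrates the core idea.

```python
def zigzag(n: int) -> int:
    # Map signed deltas to small unsigned ints: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
    return (n << 1) ^ (n >> 63)

def encode_timestamps(timestamps):
    # For regularly sampled series, second-order deltas are almost always zero,
    # which is what makes the encoded stream tiny after bit packing.
    out, prev, prev_delta = [timestamps[0]], timestamps[0], None
    for ts in timestamps[1:]:
        delta = ts - prev
        dod = delta if prev_delta is None else delta - prev_delta
        out.append(zigzag(dod))
        prev, prev_delta = ts, delta
    return out

# 15-second sampling with one slightly late point: mostly small values after encoding.
print(encode_timestamps([1670398200, 1670398215, 1670398230, 1670398246, 1670398260]))
```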
4.3 Index Optimization

Recall that for fast data ingestion, Lindorm TSDB creates forward indexes in each shard to maintain the mapping between tag sets and timeseries IDs. At the same time, in order to speed up the lookup of timeseries at query time, inverted indexes are created to maintain the mapping from each tag to the set of timeseries IDs that contain the tag. Table 3 shows forward and inverted indexes that contain two timeseries, where hostname and region are tag keys.

Table 3: Forward and inverted indexes (key ⇒ value)

forward index                        inverted index
hostname=host-a&region=ap-1 ⇒ 1      hostname=host-a ⇒ 1
hostname=host-b&region=ap-1 ⇒ 2      hostname=host-b ⇒ 2
1 ⇒ hostname=host-a&region=ap-1      region=ap-1 ⇒ 1, 2
2 ⇒ hostname=host-b&region=ap-1

In monitoring systems, massive short-time-span timeseries are created due to the creation and destruction of containers. These timeseries soon become inactive and lead to index inflation. To resolve this issue, we partition the data in a shard according to time. Hence, each time partition has its own indexes, which only manage those timeseries written within that time period. In addition, we observe that recent timeseries are more favored by queries. When there are too many time partitions, we provide a lazy loading mode that loads only the latest time partition with high priority, with the historical ones loaded asynchronously. This significantly reduces the service interruption time caused by the partition loading process.

When writing a timeseries, we first search its tag set in the in-memory forward index, and then in the on-disk index files. If the timeseries does not exist, a new timeseries ID with the tag set is created in the Memtable's forward index. After that, each tag in the new timeseries is updated in the Memtable's inverted index. When a flush is triggered, both forward and inverted indexes in the Memtable are written to the shared storage, generating new FwdIdx files and InvIdx files, respectively.

To speed up the index lookups on disk, we perform a series of optimizations. First, the index files are merged in the background to reduce the total number of files. Second, we add a bloom filter to each file, through which unrelated files can be filtered out quickly. The bloom filters are cached in memory to further speed up the file filtering. Besides, we use a block cache to cache index files in memory to reduce storage accesses.

Compared to inverted indexes, forward indexes are accessed much more frequently. During the write process, forward indexes are looked up to determine the existence of timeseries. In the inter-timeseries aggregate query, we also need to obtain the tags of the timeseries from forward indexes. As a result, the efficiency of searching forward indexes is crucial. Hence, in addition to the block cache, we design an additional layer of cache for the forward index, called the seriescache. While the block cache is used to cache file data, the seriescache only stores the mapping between timeseries IDs and tags that are accessed recently, consuming less space. The block cache and seriescache both use the LRU policy. In cases where the tag lengths vary a lot or are too long, the seriescache may occupy a lot of memory. Fortunately, we can optionally use MD5 values instead of the original tags to reduce the memory footprint. We observe that in real-world monitoring systems, the MD5-encoded seriescache can cache up to 5× more items than the original version.

When looking up inverted indexes, we need to conduct intersection operations on the posting lists. For example, when the query conditions are hostname='host-b' and region='ap-1', we first find the posting lists corresponding to these two conditions, which are {2} and {1, 2}, respectively. Then, we take the intersection of the two lists, which is {2} here. We use RoaringBitmap [26] as the data structure for the posting lists. Compared with lists of integer timeseries IDs, the bitmap saves much space and supports fast set operations.
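A minimal sketch of this lookup follows, with the tag values of Table 3 and plain Python integers standing in for the posting lists; the real engine uses RoaringBitmap [26] and compressed on-disk index files.

```python
# Posting lists as bitsets: bit (i - 1) set means timeseries id i matches the tag.
inverted_index = {
    ("hostname", "host-a"): 0b01,   # id 1
    ("hostname", "host-b"): 0b10,   # id 2
    ("region",   "ap-1"):   0b11,   # ids 1 and 2
}

def lookup(conditions):
    # AND semantics: intersect the posting bitmaps of all tag predicates.
    result = ~0
    for cond in conditions:
        result &= inverted_index.get(cond, 0)
    # decode set bits back into timeseries ids
    return [i + 1 for i in range(result.bit_length()) if (result >> i) & 1]

print(lookup([("hostname", "host-b"), ("region", "ap-1")]))   # -> [2]
```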
4.4 SQL Execution Engine

Lindorm TSDB supports standard SQL syntax, as well as extended syntax for the downsampling query to simplify usage. It optimizes the execution of data ingestion by using a fast path based on the characteristics of the time series write pattern. To optimize the efficiency of the downsampling query, it adds a pre-downsampling mechanism to the data writing process, which aggregates the original timeseries in the time dimension in advance. Lindorm TSDB also exploits the fact that timeseries are organized in groups, and proposes a pipelined execution engine that computes in a timeseries-wise manner and supports query push-down.

Lindorm SQL. Many time series databases are equipped with dedicated query languages to handle time series data, such as InfluxDB's InfluxQL [20] and OpenTSDB's [8] HTTP API. Compared to these highly customized query languages, SQL has the advantage of ease of use and a rich ecosystem. As a standard language for databases, most developers can use SQL proficiently without extra learning effort. Hence, Lindorm TSDB chooses to fully support SQL with a relational-like data model. It extends the syntax for time series queries while still being compatible with the ANSI SQL standard.

As discussed in Section 2.2, in monitoring systems users are often less interested in individual data points than in the aggregated analysis of multiple data points, e.g., the average metrics within a minute. Lindorm TSDB extends standard SQL based on Apache Calcite [10] with one new syntax, sample by, for the downsampling query:

SELECT max(cpu_user) WHERE hostname='host-a'
AND timestamp >= '2023-1-1 12:00' sample by '10min'

Write optimization. The time series data ingestion process can be characterized as a bulk repetition of simple INSERT SQL statements. We observe that parsing SQL directly using Calcite results in very low write throughput, because the SQL parser and execution plan generator in Calcite consume a lot of CPU cycles. To optimize the performance of these two parts, we design a fast path for write processing, as depicted in Figure 5. The vast majority of write statements are very simple, containing only three elements: tag set, timestamp, and field value. It is very easy to parse them even without the sophisticated parser in Calcite. Therefore, we have implemented a small parser that only handles simple write statements and is only responsible for parsing out the time-series-related information. This parser is invoked first when Lindorm TSDB receives a SQL statement. If the parsing is successful, the data points bypass Calcite and are sent directly to the execution engine; otherwise the statement continues to go through Calcite as normal. We observe that the write throughput in fast path mode is 15× higher than that in Calcite path mode. In addition, SQL prepared statements can be used for batch write optimization on clients. Our tests have shown that by combining the fast path and prepared statement execution, we can achieve a 20× write throughput improvement.

Figure 5: Lindorm TSDB write path optimization

Query optimization. In monitoring systems, the execution process of a typical time-series aggregate query can be divided into three steps:

(1) Find the timeseries that meet the predicates.
(2) Perform a 'sample by' operation on each timeseries to obtain the aggregated values of each timeseries on the time windows.
(3) Perform a 'group by' operation on all the aggregated values.

In order to improve the execution efficiency of the aggregate query, Lindorm TSDB takes advantage of the distributed storage of time series data and optimizes from two aspects: pre-downsampling and the pipelined execution engine.

Pre-downsampling. A naive approach for the downsampling query 'sample by t' is to scan all related data points, divide them into different t-time windows according to timestamp, and then compute the aggregated value for each window. The complexity of this approach is linear in the number of data points. When dealing with high-frequency sampling, it has to scan a considerable number of original data points. To solve this problem, we use a pre-downsampling mechanism when writing data points. Pre-downsampling means that the downsampled values are calculated during writing and then stored in the database, so that the aggregated values can be extracted directly without calculation. For example, at write time, the database simultaneously computes the sum of data points every 1, 10, and 60 minutes and stores them. When the user issues a 'sample by 10min' query, the database can return the result directly without scanning the original data points. If the user performs a 'sample by 30min' query, which is not within the existing sampling rates, the database can also compute the 30min aggregated value using three consecutive 10min aggregated values. Compared to scanning the raw data, pre-downsampling eliminates data scanning and computation to a huge extent.

To minimize the impact on write throughput, pre-downsampling is not performed when the data is written to the Memtable. It only happens when the Memtable is flushed to the shared storage or when TSD files are merged at compaction. Access to the original data is very convenient on these occasions, and the computation can be highly efficient. In addition, the number of pre-downsampled files is much smaller than the number of original data files, which further improves the query efficiency. Currently, we support a collection of common operators, e.g., count, first, last, min, max and sum.
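The following sketch shows the idea with a single operator (sum): per-window aggregates are computed once, at flush or compaction time, and a 'sample by' window that is not pre-computed is assembled from finer pre-computed windows. It is a toy model under that assumption, not the on-disk format Lindorm uses.

```python
from collections import defaultdict

def pre_downsample(points, window_s):
    # Run at Memtable flush or TSD compaction: one (window start -> sum) entry
    # per window instead of the raw points.
    agg = defaultdict(float)
    for ts, value in points:
        agg[ts - ts % window_s] += value
    return dict(agg)

def sample_by(window_s, stored):
    # Answer 'sample by' from stored aggregates; e.g., a 30-minute query is
    # assembled from three consecutive 10-minute sums.
    base = max(w for w in stored if window_s % w == 0)
    out = defaultdict(float)
    for start, value in stored[base].items():
        out[start - start % window_s] += value
    return dict(out)

points = [(t, 1.0) for t in range(0, 3600, 15)]       # 15s samples over one hour
stored = {600: pre_downsample(points, 600),           # 10-minute sums
          3600: pre_downsample(points, 3600)}         # 60-minute sums
print(sample_by(1800, stored))                        # 30-minute windows from 10-minute sums
```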
Pipelined execution engine. To take advantage of our timeseries-optimized storage and to optimize queries in monitoring scenarios (e.g., sample by and group by), we propose a pipelined execution engine below the SQL layer and above the storage layer, as shown in Figure 6. This engine is designed as an operator pipeline, with the lowermost layer being a timeseries scan operator responsible for finding the specified timeseries in the storage engine, and the uppermost layer implementing an adapter for Calcite that provides a row-iterator interface. Timeseries flow between pipeline operators in the form of multiple rows. At query time, the query statement goes through Calcite's syntax parsing in the SQL layer, bypasses the original Calcite executor (entering our customized simple executor), and finally goes through the entire pipelined execution engine driven by the row iterator to read data from the storage engine. In the pipeline of the execution engine, various timeseries operators can be extended and defined. The difference between these operators and those in the SQL layer is that they compute the data in the time series dimension rather than in the row dimension, and can therefore serve as optimizations for batch computation. As the name of the pipelined execution engine suggests, the data is streamed through all the operators in the pipeline and released as soon as it is processed by each operator, avoiding data dwell and reducing memory usage. In addition, we have embedded the pipelined execution engine in both TSProxy and TSCore, based on which the query push-down feature is implemented, allowing some computations to be distributed and parallelized among multiple TSCore nodes. The pipelined execution engine can also process queries in parallel across multiple shards within one TSCore and across multiple timeseries, to further improve query performance.

Figure 6: Lindorm TSDB SQL query overview

Figure 7 shows the internals of the pipelined execution engine. As can be seen, the timeseries scan operator, located at the bottom of the pipelined execution engine, takes data input from the lower layers. The data input can either be network RPC (from TSProxy to TSCore) or storage IO from the storage engine (due to query push-down). The upstream operators of the timeseries scan operator can be divided into two categories according to whether downsampling is required, including the commonly used downsampling-type aggregation (DSAgg) and interpolation (Filling) operators, and non-downsampling operators such as obtaining the rate of change (Rate) and obtaining the difference (Delta). Further upstream of these operators, other operators for cross-timeseries aggregation are implemented to meet the needs of a wide variety of time series processing.

Figure 7: Lindorm TSDB pipelined execution engine
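As a rough mental model of Figures 6 and 7, the sketch below chains three operators with Python generators: a scan operator, a per-timeseries downsampling operator, and a cross-timeseries aggregation operator. The operator names mirror the figures; the generator-based streaming only illustrates the release-as-you-go behaviour and is not the actual implementation.

```python
def series_scan_op(shard):
    # Storage IO or RPC in the real engine; here, an in-memory shard.
    for series_id, tags, points in shard:
        yield series_id, tags, points

def ds_agg_op(upstream, window, agg=sum):
    # "Single time series process": downsample each timeseries independently.
    for series_id, tags, points in upstream:
        buckets = {}
        for ts, v in points:
            buckets.setdefault(ts - ts % window, []).append(v)
        yield series_id, tags, {w: agg(vs) for w, vs in sorted(buckets.items())}

def agg_op(upstream, group_tag):
    # "Cross time series process": group the per-window aggregates by a tag.
    grouped = {}
    for _sid, tags, windows in upstream:
        for w, v in windows.items():
            key = (tags[group_tag], w)
            grouped[key] = grouped.get(key, 0) + v
    return grouped

shard = [(1, {"region": "ap-1"}, [(0, 1), (10, 2), (70, 3)]),
         (2, {"region": "ap-1"}, [(0, 4), (65, 5)])]
print(agg_op(ds_agg_op(series_scan_op(shard), window=60), group_tag="region"))
```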
5 LINDORM ML

In this section, we introduce Lindorm ML, a machine learning component integrated into Lindorm TSDB to enable advanced time series analysis. It leverages SQL syntax extensions to provide Lindorm TSDB with sophisticated algorithms for anomaly detection and forecasting on timeseries. Figure 8 illustrates the simplicity of using Lindorm ML, where users can still interact with Lindorm TSDB through SQL. First, users can train a machine learning model, e.g., an anomaly detector, by issuing an extended CREATE MODEL statement together with predicates that filter the data from Lindorm TSDB. Then, they can use another extended SQL syntax to perform inference with the trained model.

Figure 8: Lindorm ML overview

As an internal service in the database, Lindorm ML accepts the model training requests forwarded by the TSDB node and takes on the main control logic to drive the model training process in the database. The inference service is provided directly by the TSDB node, without the participation of Lindorm ML, thus naturally reusing the high availability and scalability of the TSDB services. In addition, the management of model data and metadata is shared between Lindorm ML and TSDB nodes through the underlying distributed file system and ZooKeeper. Lindorm ML utilizes TSDB's distributed storage and query of time series data in a model partitioning design: a user-created logical model actually consists of many physical models, which correspond to different timeseries. These physical models are divided into model partitions according to the partitioning of the timeseries. This design makes it possible to use the query push-down technology of the TSDB execution engine for model training and inference push-down, further enabling distributed parallel, near-data training and inference optimizations.

Algorithm support. We support popular statistical and deep-learning-based time series anomaly detection and forecasting algorithms (e.g., ARIMA [6], DeepAR [33], TFT [27]) provided by open-source algorithm packages. Further, we support our in-house online algorithms for real-time anomaly detection [17]. All these algorithms are uniformly managed by the Lindorm ML plugin on the TSDB node.

5.1 In-Database Training

Figure 9 depicts the Lindorm ML training procedure. The TSProxy on the TSDB node receives the user's CREATE MODEL command and, after partially decoding the SQL syntax, delivers it to the Lindorm ML node in the same cluster. The SQL query is then fully parsed by Lindorm ML. The CREATE MODEL statement executes in two steps: first, a model management module generates the model's metadata, including the model name, task, algorithm, and so on, and persists it in ZooKeeper; then, a train() internal SQL function call is sent back to the TSProxy, which performs the model's training process. By design, the training function, as the TrainingOp operator of the TSDB pipeline, can be pushed down to the TSCore nodes for distributed execution. Before entering the training operator TrainingOp, the pipelined execution engine processes the data in two steps: the SeriesScanOp operator extracts relevant features and the PreProcessingOp operator performs the necessary data preprocessing. In Lindorm TSDB, the SeriesScanOp, PreProcessingOp and TrainingOp operators all process each individual timeseries separately, thereby naturally satisfying the input-data requirements of time series machine learning algorithms (e.g., anomaly detection and forecasting).

The model management module ModelManager in the Lindorm ML plugin manages the model partitioning, persists the trained model data to Lindorm DFS, and updates the model metadata stored in ZooKeeper (e.g., the training progress and evaluation metrics).

When the training operators are pushed down to multiple TSCore nodes for execution, the physical models trained from the timeseries on one TSCore node naturally form a partition. The advantage of this approach is that it easily adapts to scenarios of adding/removing and failover of TSDB nodes. If the training operator is not pushed down but executed on the TSProxy, the physical models are not partitioned. In this case, multiple timeseries are cached in the training operator, enabling batch training and improving efficiency. In summary, utilizing the distributed storage of TSDB data and the operator push-down technique enables Lindorm ML's batched, distributed parallel and near-data training optimizations.

Figure 9: Lindorm ML In-Database Training
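To make the workflow tangible, the sketch below issues the two statements described above and in Section 5.2. CREATE MODEL and the anomaly_detect() inference function are the constructs named in the paper; the clause layout (TASK/ALGORITHM/AS SELECT, USING MODEL) is an assumed shape for illustration only.

```python
# Hedged sketch of the Lindorm ML workflow via SQL; the exact clause syntax is an
# assumption, only CREATE MODEL and anomaly_detect() come from the paper.
train_sql = (
    "CREATE MODEL cpu_detector "
    "TASK anomaly_detection ALGORITHM OneShotSTL "
    "AS SELECT cpu_user FROM cpu "
    "WHERE region = 'ap-1' AND timestamp >= '2023-01-01 00:00'"
)

infer_sql = (
    "SELECT hostname, timestamp, anomaly_detect(cpu_user) USING MODEL cpu_detector "
    "FROM cpu WHERE region = 'ap-1' AND timestamp >= '2023-01-02 00:00'"
)

def run(cursor):
    cursor.execute(train_sql)   # training is pushed down to TSCore nodes (Section 5.1)
    cursor.execute(infer_sql)   # inference runs entirely on the TSDB nodes (Section 5.2)
    return cursor.fetchall()
```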
5.2 In-Database Inference

Unlike the training process, inference can be done entirely on the TSDB node. The SQL inference function called by the user (e.g., anomaly_detect()) is also a TSDB pipelined execution engine operator that can be pushed down and executed. Similarly, before the inference operator InferenceOp obtains its input data, the data is first processed by the timeseries scan operator SeriesScanOp and the preprocessing operator PreProcessingOp at the pipelined execution engine layer. The inference operator InferenceOp also calls the Lindorm ML plugin, which finds the corresponding model and algorithm from the model metadata according to the user-specified model name. When loading a model, the model partition corresponding to the input data is found according to the same timeseries routing rules. As with the training process, batched, distributed parallel and near-data inference optimization can be achieved when multiple TSCore nodes are involved in the inference function.

5.3 Model metadata management

The metadata of models is maintained in ZooKeeper, consistent with the way that the TSDB manages the metadata of tables. We have extended the implementations of Schema and Table in the SQL layer, so that the model metadata can be queried as if it were a table. We also provide the syntactic sugar "SHOW MODEL(S)" statements to simplify usage.

6 EVALUATIONS

We evaluate Lindorm TSDB in four aspects. We first compare Lindorm TSDB with two popular open-source TSDBs on write (Section 6.2) and query (Section 6.3) performance. Then, we evaluate the efficiency of Lindorm ML on time-series machine learning tasks (Section 6.4). Finally, we study the contributions of the main components in Lindorm TSDB to the overall performance (Section 6.5).

6.1 Experiment Setup

We conduct experiments on five Alibaba Cloud Elastic Compute Service (ECS) [14] servers, with efficient cloud disks (ESSD) [12] mounted as the disk for each server. We deploy the TSDBs on four servers, each of which has 16 cores and 64GB RAM. The fifth server runs as a client to generate writes and queries, and has 32 cores and 128GB RAM.

Comparison databases. For the end-to-end performance comparison, we choose two representative open-source TSDBs, InfluxDB and TimescaleDB, as baselines. InfluxDB is a very popular TSDB, ranking first in the DB-Engines Ranking [36]. TimescaleDB is an open-source TSDB with both standalone and distributed versions available. Meanwhile, benchmark results show that TimescaleDB has excellent performance [22]. When evaluating the functions of the main components in Lindorm TSDB, we turn off the push-down optimization and the seriescache respectively, and then study how Lindorm TSDB performs. Finally, we also compare the performance of Lindorm TSDB with different numbers of nodes to verify its horizontal scalability. In summary, there are five TSDBs deployed on the ECS servers:

• InfluxDB: single-node InfluxDB.
• TimescaleDB-1: single-node TimescaleDB.
• TimescaleDB-3: three-node TimescaleDB.
• Lindorm-1: single-node Lindorm TSDB.
• Lindorm-3: three-node Lindorm TSDB.

Configurations. For InfluxDB, we tune its cache limits to get the best performance. Specifically, we set its cache-max-memory-size to 16g, cache-snapshot-memory-size to 4g, and GOGC to 30. For TimescaleDB, we adjust the configuration for TimescaleDB-3 according to its official guidelines to achieve the best performance. We deploy an additional access node for TimescaleDB-3.

6.2 Writing Performance Evaluation

We evaluate the write throughput, i.e., the number of data points ingested into the database per second, of each database at different timeseries scales. We use the DevOps data generated by the Time Series Benchmark Suite (TSBS) [3] as the insertion test data. In particular, TSBS generates 101 timeseries for each host to represent different types of system or application metrics, e.g., CPU usage, number of disk IOs, number of nginx requests, etc. Each timeseries contains about 11 tags.

In TSBS, we adjust the number of generated timeseries by changing the number of hosts, host_scale. The number of timeseries equals host_scale * 101. To improve the write performance, we set a large write batch for each database, i.e., 10000, and we also set the number of workers to the number of cores, i.e., 16 for single-node databases and 48 for three-node databases.

Figure 10: Write throughput at different timeseries scales

Figure 10 shows the write throughput of each TSDB at different timeseries scales, where each timeseries contains 12 hours of data with a data interval of 1 minute. The results show that both single-node and three-node Lindorm TSDB outperform the other baselines. At the largest scale (i.e., 100000 hosts and 10M timeseries), three-node Lindorm TSDB has about 10× higher write throughput than the other TSDBs. The first reason is that Lindorm TSDB partitions the timeseries according to the tags, allowing multiple timeseries to be written in parallel at the same time. Secondly, Lindorm TSDB creates the seriescache for the forward index, which facilitates determining the ID of the timeseries specified by the tags in a write command.

As the timeseries scale increases, Lindorm TSDB shows much lower performance degradation than TimescaleDB, because the MD5 encoding of timeseries tags makes the seriescache able to cache the tags and IDs of numerous timeseries that have recent data writes. Thus, when host_scale increases to 100000, the number of accesses to the on-disk index does not increase much in Lindorm TSDB.
encoding method for timeseries tags make the seriescache able to Table 6: Q2’s query latency (ms)
cache the tags and IDs of numerous timeseries that have recent
InfluxDB Lindorm TimescaleDB
data writes. Thus, when host_scale increases to 100000, the number Host Scale
of accesses to the index in disk does not increase much in Lindorm 1-node 1-node 3-node 1-node 3-node
TSDB. 10000 72 89 91 53 67
100000 1046 177 190 502 471
6.3 Query Performance Evaluation 1000000 15261 1165 934 51916 10012
In query evaluation, we adjust the DevOps data generation in TSBS
and collect 1 timeseries for each host, where the total number of three-node Lindorm TSDB, InfluxDB has 4.5× and 15.3× higher
timeseries equals host_scale. In this way, one query can hit more query latencies at the scales of 100000 and 1M hosts respectively.
timeseries at the same host_scale. For each query, we restart the And three-node TimescaleDB has 1.5× and 9.7× higher latencies.
databases, repeat 5 times with different filter conditions, and present
the average latency. Table 7: Q3’s query latency (ms)
Table 4 describes the three query patterns that we have men-
InfluxDB Lindorm TimescaleDB
tioned in Section 2.2. Q1 and Q2 use region as the filter tag, e.g., Host Scale
WHERE region=ap-1, and hit ℎ𝑜𝑠𝑡_𝑠𝑐𝑎𝑙𝑒/9 timeseries in each query. 1-node 1-node 3-node 1-node 3-node
There is no tag selector in Q3 and thus Q3 queries on all timeseries 10000 559 175 164 91 427
in the TSDB. 100000 9437 1390 809 898 4296
1000000 111815 21177 6884 43630 30651
Table 4: Three query patterns
The results of Q3 query are shown in Table 7. In a Q3 query,
Query Description InfluxDB and Lindorm TSDB need to find region values for all
Q1 - Latest value The last data points of timeseries in 1 region.
hit timeseries in order to group them. Lindorm TSDB has the se-
aggregate on each timeseries in 1 region per
Q2 - Downsampling riescache to optimize the process of searching tag values in the
5 minutes for 1 hour.
Q3 - inter-timeseries aggregate on all timeseries in each region per 5
forward index. In addition, Lindorm TSDB is able to push down the
aggregate minutes for 2 hours. downsampling operator together with the inter-timeseries aggre-
gate operator. This allows the data points to be aggregated by time
window and tags in each shard and TsCore node before they are col-
Table 5: Q1’s query latency (ms) lected by higher level, significantly improving the efficiency. When
the query hits 1M timeseries, both single-node and three-node Lin-
InfluxDB Lindorm TimescaleDB dorm TSDB outperform other TSDBs by a large margin. It is worth
Host Scale
1-node 1-node 3-node 1-node 3-node noting that single-node TimescaleDB outperforms three-node ver-
10000 22 44 53 237 210
sion at small scales. by checking the query execution process in
100000 145 90 95 1530 1610 three-node TimescaleDB, we find that the query tasks on partitions
1000000 2083 464 284 211689 13452 are executed serially. It is probably because region tag is not set as
the partition key, which is hostname. We run queries where the data
Table 5 shows the results for Q1, the latest value query. At smaller are aggregated by hostname and find that computations in partions
host scales (10,000 and 100,000), InfluxDB and Lindorm TSDB per- are parallel, which verifies our hypothesis. When the timeseries
form closely. This is because Lindorm TSDB needs to push down scale becomes very large, the memory of single-node TimescaleDB
the query and collect results from all shards or nodes through RPC. was not enough for such large amount of data, so the performance
The time consumed by RPC is not negligible when the total la- drops heavily.
tency is low. But at the large scale such as 1M, the query latency
of InfluxDB is 4.48× as high as that of single-node Lindorm TSDB 6.4 Advanced Time-Series Analysis Evaluation
and 7.33× as high as that of three-node Lindorm TSDB. Because We evaluate the efficiency of Lindorm ML in performing time-series
Lindorm TSDB can push the query down to the storage engine anomaly detection tasks. We still use the data generation approach
and can scan multiple timeseries parallelly to get their last data in Section 6.3 to prepare data for machine learning tasks. Each
points. TimescaleDB is not able to utilize the index on timestamp timeseries contains two consecutive segments of data for training
in the latest value query hitting multiple timeseries [24], resulting and inference, both of which are one-day long.
in particularly low efficiency. In evaluation, we create training and inference tasks at different
For the downsampling query whose results are in Table 6, it timeseries scales (10,000 and 100,000) via SQL provided by Lindorm
requires more data points computed than Q1. And for an aggrega- ML, where we run OneShotSTL [17] as anomaly detection algorithm.
tion query such as Q2 and Q3, the number of returned values is Meanwhile, we run the same algorithm outside Lindorm TSDB
much smaller than the number of data points involved in the query. for training and inference as the baseline. Specifically, we first read
Therefore, Lindorm TSDB’s streaming optimization in the execu- data from Lindorm TSDB and then apply OneShotSTL to them. We
tion engine reduces a lot of memory footprint and data transfer record the time spent in each way respectively.
As shown in Table 8, compared to performing the machine learning externally, Lindorm ML consumes about half the time for both training and inference at both scales. This is because Lindorm ML avoids the time-consuming transmission of the raw data. In addition, various optimizations in the pipelined execution engine also improve the efficiency of the machine learning computations.

Table 8: Efficiency of time-series anomaly detection

Host Scale | Training Time (s), Lindorm ML | Training Time (s), outside | Inference Time (s), Lindorm ML | Inference Time (s), outside
10000      | 19.69                         | 36.72                      | 19.89                          | 36.37
100000     | 198.53                        | 431.66                     | 206.11                         | 391.89

6.5 Ablation Study
We study the contributions of the main modules in Lindorm TSDB by evaluating Lindorm TSDB with different configurations:
(1) Turn off the push-down optimization in the pipelined streaming execution engine.
(2) Turn off the seriescache for the forward index.
In addition to experiments on the above configurations, we also evaluate the write throughput in three cases to investigate the adaptability and scalability: a node scaling event, a node failure event, and deployments with different cluster sizes. We use the data generation method in Section 6.3, i.e., one timeseries on each host.

Table 9: Ablation study on the push-down optimization

Host Scale | Q3 query latency (ms), with push-down | Q3 query latency (ms), w/o push-down
10000      | 900                                   | 2256
100000     | 7525                                  | 25569
1000000    | 94082                                 | 322840

To investigate the effectiveness of the push-down optimization in Lindorm TSDB's pipelined streaming execution engine, we run the Q3 query on three-node Lindorm TSDB with and without the push-down optimization. The results are in Table 9. The query aggregates the data of all timeseries over 8 hours to guarantee a large computational workload. When the push-down optimization is unavailable, Lindorm TSDB has to collect all the data and then finish the inter-timeseries aggregate operation at the proxy level. This leads to about 2× higher query latency.

Table 10: Ablation study on the seriescache

Host Scale | Write throughput (M/s), with cache | Write throughput (M/s), w/o cache | Q3 query latency (ms), with cache | Q3 query latency (ms), w/o cache
1000       | 5.88                               | 4.75                              | 160                               | 189
10000      | 5.28                               | 3.6                               | 383                               | 485
100000     | 4.66                               | 1.4                               | 3549                              | 5235

In Table 10, we explore how the seriescache for the forward index improves the performance of Lindorm TSDB. The results show a very large improvement in write throughput from the seriescache, between 23.8% and 232%. The seriescache also contributes to the efficiency of the Q3 query, where tag values are required for grouping timeseries: with the seriescache, query latencies are reduced by 15.3% to 32.2%.

Table 11: Write throughput (M/s) of Lindorm TSDB with different numbers of nodes

Host Scale | 2-node | 4-node | 6-node
10000      | 5.05   | 11.55  | 19.78
100000     | 5.08   | 11.14  | 19.01
1000000    | 4.64   | 10.99  | 18.06

We compare the write throughputs of distributed Lindorm TSDB deployed on 2, 4, and 6 nodes to study the scalability of data ingestion. The results in Table 11 show over 100% scalability: at all timeseries scales, the per-node write throughput is higher when there are more nodes in Lindorm TSDB. When the number of nodes increases, a single node manages fewer timeseries, which makes the data structures in the storage engine more efficient. For example, most accesses to the index on one node cannot hit the cache when the data of a large number of timeseries are written to that node; if the Lindorm TSDB cluster has more nodes, the index entries that would otherwise be read from disk can be cached across the other nodes.

Figure 11: Write throughput over time when the database cluster status changes. (a) Node failure; (b) node scaling. Both panels plot write throughput (M/s) against time (seconds), with the node failure and node scaling occurring mid-run.

We study Lindorm TSDB's adaptability in two cases, i.e., a single-node failure event and a node scaling event. We run two experiments with stable input traffic and, during data ingestion, manually shut down one TsCore node in the first and add two TsCore nodes in the second. Figure 11 displays the write throughput over time before and after the manual operations. When a node goes down (Figure 11a), the write throughput of Lindorm TSDB drops slightly, by 4%. After that, within 30 seconds, the other healthy nodes take over the data of the failed node from the shared storage and the system performance becomes stable again. When adding new nodes to the database (Figure 11b), there is no significant change in the write throughput, because the data written before the scaling does not need to be migrated, thanks to the sharding strategy that takes time into account.
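The effect of a sharding strategy that takes time into account can be illustrated with a small, hypothetical sketch; the epoch table, hash function, and shard counts below are invented for illustration and are not the routing logic described earlier in the paper. The idea it shows is that if the shard mapping in effect for a time window is frozen once that window has been written, a scale-out only changes where new time windows land, so old data never has to move.

```python
import hashlib

# Hypothetical topology history: (effective_from_timestamp, number_of_shards).
# The second entry models a scale-out; earlier timestamps keep the old mapping.
TOPOLOGY_EPOCHS = [
    (0,             4),
    (1_700_000_000, 6),
]

def shard_for(series_key: str, timestamp: int) -> int:
    """Route a data point using the shard count that was in effect at its timestamp."""
    shards = next(n for start, n in reversed(TOPOLOGY_EPOCHS) if timestamp >= start)
    h = int.from_bytes(hashlib.md5(series_key.encode()).digest()[:8], "big")
    return h % shards

# A point written before the scale-out keeps resolving to its original shard,
# while a point written afterwards may land on one of the new shards.
print(shard_for("host42.cpu.usage", 1_600_000_000))
print(shard_for("host42.cpu.usage", 1_800_000_000))
```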
7 LESSONS LEARNED
Lindorm TSDB serves several large-scale monitoring systems within Alibaba and provides external services on Alibaba Cloud. Lindorm TSDB has undergone many years of iteration across several versions. During this evolution, we have accumulated business observations and system design experience, summarized as follows:
• In distributed databases, node failure is common, and when a node crashes in Lindorm TSDB, a healthy node takes over its shards. However, the new node cannot provide service until it finishes replaying all records in the WAL. To address this, we designed an asynchronous WAL replaying mechanism, which allows a shard to start serving write requests immediately after it is started; the read service is enabled only after the replaying is completed, to ensure data consistency. This prioritizes the high availability of write requests after shard migration. With this feature enabled, the write service interruption time drops from minutes to seconds. (A sketch of this gating idea appears after this list.)
• The adoption of a schematized multi-field data model and the support of SQL syntax not only help users understand the time-series data model and simplify its usage, but also facilitate troubleshooting for DBAs. For example, we can use SQL "explain" to see whether the entire execution plan meets expectations. Additionally, it allows easy integration with third-party ecosystems.
• Enabling the pre-downsampling feature effectively reduces query latency by 80% in our businesses, at the cost of an 8% increase in storage space. This cost is manageable with the storage tiering feature in Lindorm DFS. Also, since the computation occurs during compaction, the additional CPU usage is minimal, at less than 5%. Compared to computing the aggregates at query time or using features like Continuous Query [21], the resource consumption is significantly lower.
• In its early versions, Lindorm TSDB did not have a pipelined execution engine. When a large amount of data was queried, all of the data had to be read out at once and cached in memory for calculation. This led to memory exhaustion and full GC pauses that affected the service, making it difficult to support important business operations such as the dashboard of Alibaba's Global Shopping Festivals. The newly designed pipelined execution engine solves this problem and improves performance by at least 10×.
• In monitoring scenarios, the latest value query is often used to check the health status of a system, which requires high QPS and low latency. To address this, we designed a cache specifically for this query: the latest value of each timeseries is cached when queried and is updated when new data points are written to that timeseries. After implementing this cache, the query response time was reduced by 85%. (A sketch of such a cache also follows this list.)
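A minimal sketch of the write/read gating behind the asynchronous WAL replaying mechanism follows; the class and method names are ours rather than Lindorm TSDB's, and replay is reduced to a background loop over an in-memory list.

```python
import threading

class RecoveringShard:
    """Toy model of a shard taken over after a node failure: writes are accepted
    immediately, while reads are gated until WAL replay has finished."""

    def __init__(self, wal_records):
        self._memtable = {}
        self._replay_done = threading.Event()
        self._wal_records = wal_records

    def start(self):
        threading.Thread(target=self._replay_wal, daemon=True).start()

    def _replay_wal(self):
        for series, ts, value in self._wal_records:       # background replay
            self._memtable.setdefault(series, []).append((ts, value))
        self._replay_done.set()

    def write(self, series, ts, value):                   # allowed right away
        self._memtable.setdefault(series, []).append((ts, value))

    def read(self, series):                                # consistent only after replay
        if not self._replay_done.is_set():
            raise RuntimeError("shard is replaying WAL; reads not yet enabled")
        return self._memtable.get(series, [])
```

A similarly simplified sketch of the latest-value cache from the last bullet (again just the idea, not the production implementation): the cache is filled lazily on query and kept fresh on every write.

```python
class LatestValueCache:
    def __init__(self):
        self._latest = {}                                  # series key -> (timestamp, value)

    def on_write(self, series, ts, value):                 # called on the write path
        current = self._latest.get(series)
        if current is None or ts >= current[0]:
            self._latest[series] = (ts, value)

    def latest(self, series, scan_storage):                # called by the latest value query
        if series not in self._latest:
            self._latest[series] = scan_storage(series)    # one-time fill from storage
        return self._latest[series]
```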
8 RELATED WORK
Time-series database. There are many previous works on time-series databases. OpenTSDB [8] uses HBase [1], a key-value database, to store time-series data points, where each data point is an individual data row with a rowkey; this leads to a low data compression ratio and low access efficiency. InfluxDB [18] develops the TSM storage architecture based on the LSM tree, greatly improving write throughput, but it lacks optimizations in query execution (e.g., InfluxDB does not perform computation on multiple timeseries within one data partition in parallel). TimescaleDB [23] is a Postgres-based TSDB that mainly relies on partitioning for parallel data ingestion and query, but its performance drops significantly when executing queries that hit multiple timeseries. QuestDB [32] is a column-oriented TSDB with high single-node write performance, but it does not offer distributed deployment or scalability. Timon [11], BTrDB [5] and Peregreen [38] propose novel data structures for storing the data points of the same timeseries. They achieve fast response times for aggregate queries on a single timeseries across a long time range, which is not common in monitoring systems (see Table 1). Gorilla [31] proposes delta-of-delta timestamps and XOR'd floating-point values, which are widely used in existing TSDBs for data compression. TimeUnion [40] and ByteSeries [35] mitigate the high-cardinality problem by compressing the inverted index in memory, but they ignore the acceleration of access to the index on disk. There are also TSDBs designed for the Internet of Things (IoT) scenario, such as Db2 Event Store [15] and IoTDB [39]; they are not efficient for the complex tag queries in monitoring systems. To tackle the increasing timeseries scale in monitoring systems, more and more TSDBs [4, 37] are deployed in a distributed way. They use a shared-nothing architecture and suffer from performance degradation due to data migration in the case of node scaling.
data points in the same timeseries. They can have fast response 13, 12 (2020), 3181–3194.

3726
[5] Michael P Andersen and David E Culler. 2016. BTrDB: Optimizing storage system design for timeseries processing. In 14th USENIX Conference on File and Storage Technologies (FAST 16). 39–52.
[6] Adebiyi A. Ariyo, Adewumi O. Adewumi, and Charles K. Ayo. 2014. Stock Price Prediction Using the ARIMA Model. In Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation (UKSIM '14). IEEE Computer Society, USA, 106–112. https://doi.org/10.1109/UKSim.2014.67
[7] Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J Green, Monish Gupta, Sebastian Hillig, et al. 2022. Amazon Redshift re-invented. In Proceedings of the 2022 International Conference on Management of Data. 2205–2217.
[8] The OpenTSDB Authors. 2021. OpenTSDB. http://opentsdb.net/. Last accessed: 2023-07-07.
[9] The OpenTelemetry Authors. 2023. OpenTelemetry. https://opentelemetry.io/. Last accessed: 2023-07-07.
[10] Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J Mior, and Daniel Lemire. 2018. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221–230.
[11] Wei Cao, Yusong Gao, Feifei Li, Sheng Wang, Bingchen Lin, Ke Xu, Xiaojie Feng, Yucong Wang, Zhenjun Liu, and Gejin Zhang. 2020. Timon: A timestamped event database for efficient telemetry data processing and analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 739–753.
[12] Alibaba Cloud. 2023. Alibaba Cloud ESSDs. https://www.alibabacloud.com/help/en/elastic-compute-service/latest/essds. Last accessed: 2023-07-07.
[13] Alibaba Cloud. 2023. Alibaba Cloud OSS. https://www.alibabacloud.com/product/object-storage-service. Last accessed: 2023-07-07.
[14] Alibaba Cloud. 2023. Alibaba ECS. https://www.alibabacloud.com/product/ecs. Last accessed: 2023-07-07.
[15] Christian Garcia-Arellano, Hamdi Roumani, Richard Sidle, Josh Tiefenbach, Kostas Rakopoulos, Imran Sayyid, Adam Storm, Ronald Barber, Fatma Ozcan, Daniel Zilio, et al. 2020. Db2 Event Store: a purpose-built IoT database engine. Proceedings of the VLDB Endowment 13, 12 (2020), 3299–3312.
[16] Google. 2023. BigQuery ML. https://cloud.google.com/bigquery/docs/bqml-introduction. Last accessed: 2023-07-07.
[17] Xiao He, Ye Li, Jian Tan, Bin Wu, and Feifei Li. 2023. OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting. Proc. VLDB Endow. 16, 6 (2023), 1399–1412.
[18] InfluxData Inc. 2023. InfluxDB. https://docs.influxdata.com/influxdb/v2.6/. Last accessed: 2023-07-07.
[19] InfluxData Inc. 2023. InfluxDB TSM. https://docs.influxdata.com/influxdb/v1.3/concepts/storage_engine/. Last accessed: 2023-07-07.
[20] InfluxData Inc. 2023. InfluxQL. https://docs.influxdata.com/influxdb/v1.8/query_language/. Last accessed: 2023-07-07.
[21] InfluxData Inc. 2023. InfluxQL Continuous Queries. https://docs.influxdata.com/influxdb/v1.8/query_language/continuous_queries/. Last accessed: 2023-07-07.
[22] TimeScale Inc. 2020. TimescaleDB vs InfluxDB. https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/. Last accessed: 2023-07-07.
[23] TimeScale Inc. 2023. TimeScaleDB. https://www.timescale.com. Last accessed: 2023-07-07.
[24] TimeScale Inc. 2023. TimeScaleDB does not use index in the last(). https://docs.timescale.com/api/latest/hyperfunctions/last/. Last accessed: 2023-07-07.
[25] Konstantinos Karanasos, Matteo Interlandi, Doris Xin, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Supun Nakandal, Subru Krishnan, Markus Weimer, et al. 2019. Extending relational query processing with ML inference. arXiv preprint arXiv:1911.00231 (2019).
[26] Daniel Lemire, Gregory Ssi-Yan-Kai, and Owen Kaser. 2016. Consistently faster and smaller compressed bitmaps with Roaring. Software: Practice and Experience 46, 11 (2016), 1547–1569.
[27] Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764.
[28] Microsoft. 2023. Azure Data Explorer. https://azure.microsoft.com/en-us/products/data-explorer. Last accessed: 2023-07-07.
[29] Microsoft. 2023. ONNX Runtime. https://onnxruntime.ai/. Last accessed: 2023-07-07.
[30] Oracle. 2023. Oracle Machine Learning for SQL. https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmcon/time-series.html. Last accessed: 2023-07-07.
[31] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8, 12 (2015), 1816–1827.
[32] QuestDB. 2023. QuestDB. https://questdb.io/. Last accessed: 2023-07-07.
[33] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
[34] Maximilian Schüle, Frédéric Simonis, Thomas Heyenbrock, Alfons Kemper, Stephan Günnemann, and Thomas Neumann. 2019. In-database machine learning: Gradient descent and tensor algebra for main memory database systems. BTW 2019 (2019).
[35] Xuanhua Shi, Zezhao Feng, Kaixi Li, Yongluan Zhou, Hai Jin, Yan Jiang, Bingsheng He, Zhijun Ling, and Xin Li. 2020. ByteSeries: an in-memory time series database for large-scale monitoring systems. In Proceedings of the 11th ACM Symposium on Cloud Computing. 60–73.
[36] solid IT. 2023. DB-Engines Ranking of Time Series DBMS. https://db-engines.com/en/ranking/time+series+dbms. Last accessed: 2023-07-07.
[37] TDengine. 2023. TDengine. https://tdengine.com/. Last accessed: 2023-07-07.
[38] Alexander A Visheratin, Alexey Struckov, Semen Yufa, Alexey Muratov, Denis Nasonov, Nikolay Butakov, Yury Kuznetsov, and Michael May. 2020. Peregreen: modular database for efficient storage of historical time series in cloud environments. In Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference. 589–601.
[39] Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin A McGrail, Peng Wang, et al. 2020. Apache IoTDB: Time-series Database for Internet of Things. Proceedings of the VLDB Endowment 13, 12 (2020), 2901–2904.
[40] Zhiqi Wang and Zili Shao. 2022. TimeUnion: An Efficient Architecture with Unified Data Model for Timeseries Management Systems on Hybrid Cloud Storage. In Proceedings of the 2022 International Conference on Management of Data. 1418–1432.
