DZone ScyllaDB Database Systems Trend Report

BROUGHT TO YOU IN PARTNERSHIP WITH

Welcome Letter
By Lauren Forbes, Community Support Manager at DZone

Data is arguably one of the most valuable assets that businesses rely on for success, and it is truly the basis for all reporting both internally and externally. Data provides information that is used in a multitude of ways and can benefit you and your business, including finding solutions to problems, determining better ways of working, increasing efficiency, implementing better strategies — you name it.

With the immense growth we've seen in generated data, we definitely need a way to manage it.

Enter database management systems (DBMSs).

With data being so important, DBMSs are, in turn, equally as important, if not more so. It's one thing to have a bunch of data, but it's another thing entirely to be able to store, organize, read, dissect, and understand that data.

I like to think of it like this: Having data without a DBMS is like having video games without a console or like having a bunch of thoughts without a brain to sort through them. It's hard to have one without the other. Though some of my family would probably argue that my brain doesn't always organize my thoughts like it should! But you get the idea.

There are many different types of database management systems — from document database systems, a kind of non-relational database that is designed to store and query data as JSON-like documents, to relational database systems, which are types of databases that store and provide access to data points that are related to one another. Among the other various types, we've also seen a rise in cloud migration.

In DZone's 2022 Database Systems Trend Report, you'll learn about the different types of database systems as well as considerations when shifting an organization to a cloud database.

Our subject matter experts will also provide information and insight into data management patterns for microservices; a multi-cloud approach to DBMSs; strategies for governing data quality, accuracy, and consistency; and more.

We hope you are able to learn from and put into practice much of what is provided in the report.

Thank you for reading!

Sincerely,

Lauren Forbes

Lauren Forbes, Community Support Manager at DZone


@laurenf on DZone | @laurenforbes26 on LinkedIn

Lauren interacts with DZone members to facilitate a positive experience, whether they are facing
technical issues, need help contributing content, or want to exercise their personal data rights. She
shares resources, escalates technical issues to engineering, and follows up with customers to ensure their
issues and questions are resolved. When not working, Lauren enjoys playing with her cats, Stella and Louie, reading, and
playing video games.



ORIGINAL RESEARCH

Key Research Findings


An Analysis of Results from DZone's 2022 Database
Systems Survey

By Mike Gates, Guest Writer & Former Senior Editor at DZone

From July to August 2022, DZone surveyed software developers, architects, and other IT professionals in order to understand
how persistent data storage and retrieval pathways are designed and evaluated. We also sought to explore trends from data
gathered in previous Trend Report surveys, most notably for our 2021 report on data persistence.

Major research targets included:


1. Distribution of data persistence logic from file system to application levels (i.e., what part of the supra-OS stack
decides how to store and retrieve data)

2. Thought processes of software professionals regarding data persistence, particularly as NoSQL solutions continue
to gain popularity

Methods: We created a survey and distributed it to a global audience of software professionals. Question formats included
multiple choice, free response, and ranking. Survey links were distributed via email to an opt-in subscriber list, pop-ups on
DZone.com, the DZone Core Slack Workspace, LinkedIn, and other social media channels. The survey was open from July 18 to
August 5 and recorded 568 full and partial responses.

In this report, we review some of our key research findings. Many secondary findings of interest are not included here.

Research Target One: Distribution of Data Persistence Logic


Motivations:
1. We wanted to understand the challenges that application developers face when working with data. Particularly, we
wanted to gauge the motivation behind DBMS selection, whether it be relational or of the NoSQL variety.

2. Relatedly, given the wide array of opinions on manual SQL writing and ORM usage, we wanted to check the pulse of our
database community to understand where they stand on the matter.

3. We also wanted to see how data is normalized in relational DBMSs as part of the wider database design process.

MANUAL SQL vs. ORMS


To borrow a phrase, SQL is easy to learn but hard to master. Cobbling together basic JOINs for rudimentary analysis is something non-developers can learn in an afternoon. That being said, executing performant queries across hundreds of thousands of rows is a bit more taxing — to logically puzzle out, to test, and to actually type out.

Enter: the ORM (and similar tools), which can certainly lend a hand. It should come as no surprise that many developers prefer the feel of their favorite programming language over even the most user-friendly flavor of SQL. Alas, the great debate over ORMs vs. manual SQL is beyond the scope of this report, but we wanted to continue the research we began last year.
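As a concrete illustration of the trade-off, the sketch below expresses the same query twice: once as manual SQL (via Python's sqlite3 module) and once through an ORM. SQLAlchemy is assumed here purely as an example, and the table and data are invented.

```python
# A minimal sketch (not from the report) contrasting manual SQL with an ORM query.
import sqlite3

from sqlalchemy import Integer, String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)
    country: Mapped[str] = mapped_column(String)


# Manual SQL: explicit and portable knowledge, but hand-maintained strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO users (name, country) VALUES ('Ada', 'UK'), ('Linus', 'FI')")
print(conn.execute("SELECT name FROM users WHERE country = ? ORDER BY name", ("UK",)).fetchall())

# ORM: the same query expressed in the host language's idioms.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add_all([User(name="Ada", country="UK"), User(name="Linus", country="FI")])
    session.commit()
    stmt = select(User.name).where(User.country == "UK").order_by(User.name)
    print(session.execute(stmt).all())
```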

We wanted to know both how much developers think in relational models directly and how well RDBMS automation works.
Therefore, we repeated this question from our 2021 Data Persistence Trend Report survey:

How often do you write SQL manually?

Results (n=568):



Figure 1

FREQUENCY OF MANUAL SQL WRITING: 2021 vs. 2022

[Paired charts of 2021 and 2022 responses across the categories: Never, A few times per year, A few times per month, Once per week, A few times per week, Every day. Percentages are shown in the original figure.]

Observations:
1. A smaller percentage of respondents reported writing SQL more often than once per week in this year's survey (42.3% vs.
47.6% in 2021). This was more in line with our initial supposition in the 2021 report, although not massively so.

2. The biggest (relative) changes were in the "few times per week," "once per week," and "never" responses. While the other
categories stayed more or less the same, roughly 5 percentage points' worth of respondents shifted from writing SQL
every few days to writing once per week. This may be a sign that more companies are shifting toward ORMs as they
modernize their applications, or, at least, they are moving to systems that don't involve as much manual SQL.

3. Diving into the data a bit more deeply, the gap we saw last year of senior respondents writing manual SQL more
frequently than their junior colleagues seems to have lessened considerably. For the purposes of this report, we defined
senior respondents as those with more than five years of IT experience and junior respondents as those with five or fewer
years of experience. In every category, senior and junior respondents were within 2 percentage points of each other.

4. However, when looking at company size, the results of this year's survey bear out the hypothesis we had last year — generally speaking, workers at larger companies (>1,000 employees) were more likely to write SQL every day than those at companies under that threshold. That being said, there was a brief swap at the levels of a few times per month and once per week, and employees at companies of both sizes were about as likely to write manual SQL a few times per month as every day. You can see the transition points in Figure 2.

Results (n=568):

Figure 2

FREQUENCY OF MANUAL SQL WRITING BY COMPANY SIZE

[Grouped bar chart comparing respondents at companies with >1,000 employees and <1,000 employees across: Never, A few times per year, A few times per month, Once per week, A few times per week, Every day. Values are shown in the original figure.]



Meanwhile, we also took a look at our community's thoughts on ORM usage across their careers. To gauge trends, we asked the
following question with two sets of answers:

Over the course of your professional career, you have used object-relational mappers (ORMs): {More often than I should, Less
often than I should, Just the right amount, No opinion} and {More frequently now than in the past, Less frequently now than
in the past, The same amount now as in the past, I have no idea}

Results (n=568):

Figure 3

ORM EXPERIENCE: PERCEPTION OF CORRECT USAGE IN HINDSIGHT AND CHANGE IN USAGE OVER TIME

[Two charts. Correct usage: More often than I should, Less often than I should, Just the right amount, No opinion. Change over time: More frequently now than in the past, Less frequently now than in the past, The same amount now as in the past, I have no idea. Percentages are shown in the original figure.]

Observations:
1. The results are largely in line with our findings in last year's Data Persistence report. The most significant changes were a
1.5 percentage point increase in the number of respondents who felt they used ORMs more often than they should and a 1.4
percentage point decrease in the number of respondents who felt they used them less often than they should. Regardless,
our reasoning from last year still holds true — "I got it right 40% of the time" sounds bad in a vacuum, but given the context
of software design and the discussion surrounding the object-relational mismatch, it could be much worse.

2. In looking at ORM usage over time, again, our community reported findings in line with last year. Of note, fewer
respondents reported using ORMs more frequently now than in the past (34.0% in 2022 vs. 37.6% in 2021). Meanwhile, the
number of respondents who reported using ORMs less frequently in their work increased (32.4% in 2022 vs. 29.6% in 2021).
It will be interesting to see if the trend continues next year and whether this signifies an overall decline in ORM usage
throughout the industry.

IMPEDANCE MISMATCHES: DOMAIN MODEL vs. READ/WRITE PERFORMANCE


To continue last year's work, we wanted an overview of how often some mismatch between the domain model and physical
execution resulted in poorer performance. Thus, we again asked:

How often have you thought "the DBMS we're using cannot simultaneously (a) effectively model this domain and (b) read/
write performantly?" Consider any definition of "effectively" and "performantly."

Results (n=568):

Figure 4

PREVALENCE OF PERCEIVED MISMATCH BETWEEN MODEL AND RUNTIME IN DBMSs

[Chart of responses across: Never, Rarely, Somewhat often, More often than I'd like, All the time. Percentages are shown in the original figure.]



Observations:
1. There was a fair bit of movement this year compared to what we saw in our Data Persistence report, but the overall
categories kept their general relationships (i.e., "Never" was the least popular response, "Somewhat often" was the most
popular response, etc.). That being said, our community reported seeing mismatches more frequently overall than last
year. In Data Persistence, just over two-thirds (66.8%) of respondents reported seeing mismatches more than rarely. Now,
that number is much higher — over three-quarters (76.4%).

2. Looking into the data a bit more, this could be because of an influx of new developers in the field. According to our
findings, there is an inverse attitude regarding mismatches when taking seniority into consideration (see Figure 5). The
linear trendline for senior professionals is -2.47, whereas the trendline for their junior colleagues is 2.45.

Depending on which end of the continuum you fall, we can empirically conclude that less-experienced workers are
impatient, panicky, and hot-headed, whereas more seasoned workers are stuck in their ways, more tolerant of poor
conditions, and less willing to stand up for themselves.

Results (n=568):

Figure 5

PERCEIVED MISMATCH BETWEEN MODEL AND RUNTIME IN DBMSs: SENIOR vs. JUNIOR

[Side-by-side bar charts for senior and junior respondents across: Never, Rarely, Somewhat often, More often than I'd like, All the time. Values are shown in the original figure.]

RELATIONAL DATABASE DESIGN


We wanted to continue previous research on data normalization. For clarity, we defined the various database normalization
forms as follows:

• Non-normal – flat records, no uniqueness enforced


• First normal form (1NF) – single value per column, uniqueness enforced over the combination of all columns
• Second normal form (2NF) – 1NF plus single-column primary key
• Third normal form (3NF) – 2NF plus no non-primary key column depends on another non-primary key column
• Boyce-Codd normal form (3.5NF) – 3NF plus only one candidate key
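As a rough illustration of what the lower forms mean in practice, the sketch below shows a hypothetical order record stored non-normally and then decomposed toward 3NF. The entities and column names are invented for this example.

```python
# Illustrative only; table and column names are hypothetical.

# Non-normal: repeating values and no enforced uniqueness.
flat_orders = [
    {"order_id": 1, "customer": "Acme", "customer_city": "Oslo", "item": "disk", "qty": 2},
    {"order_id": 1, "customer": "Acme", "customer_city": "Oslo", "item": "cpu", "qty": 1},
]

# Toward 3NF: each fact is stored once, and non-key columns depend only on their key.
customers = {"C1": {"name": "Acme", "city": "Oslo"}}   # city depends only on the customer
orders = {1: {"customer_id": "C1"}}                    # order -> customer
order_lines = [                                        # composite key (order_id, item)
    {"order_id": 1, "item": "disk", "qty": 2},
    {"order_id": 1, "item": "cpu", "qty": 1},
]
```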

In the 2021 Data Persistence Trend Report, we established that a combination of 3NF and 3.5NF was the most common type of
normalization among the DZone community, although 1NF and 2NF outweighed them individually. The question we asked:

Please rank how frequently you have enforced the following approaches to normalizing relations in a database (top = most
frequently, bottom = least frequently).

The scores were the sum of responses weighted by rank, a method we maintained for this year's survey.
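To make the scoring concrete, the sketch below shows one way such a rank-weighted sum can be computed. The point values (rank 1 of n earns n points) are an assumption for illustration only; the report does not specify the exact weights.

```python
# A minimal sketch of rank-weighted scoring; the weighting scheme is assumed.
from collections import defaultdict

rankings = [  # each list is one respondent's order, most frequently enforced form first
    ["3NF", "2NF", "1NF", "Non-normal"],
    ["2NF", "1NF", "3NF", "Non-normal"],
]

scores = defaultdict(int)
for ranking in rankings:
    n = len(ranking)
    for position, form in enumerate(ranking):
        scores[form] += n - position  # rank 1 -> n points, last rank -> 1 point

print(dict(scores))  # {'3NF': 6, '2NF': 7, '1NF': 5, 'Non-normal': 2}
```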

Results (n=758 across all 2021 results, n=503 across all 2022 results):



Table 1

FREQUENCY OF ENFORCING DATABASE NORMALIZATION FORMS

Normalization Form                  | Score 2021 | Score 2022
Second normal form (2NF)            | 3,506      | 2,280
First normal form (1NF)             | 3,184      | 2,130
Third normal form (3NF)             | 3,171      | 1,983
Non-normal                          | 2,252      | 1,516
Boyce-Codd normal form (3.5NF)      | 1,987      | 1,432
Fourth normal form (4NF) or higher  | 1,345      | 1,120

Observations:
1. Of note, 1NF overtook 2NF among our community within the past year, although both forms remained competitive
in terms of popularity. It's likely that this is just a difference in the respondents from last year, but it could indicate
that the industry is putting pressure on developers to under-normalize data. Meanwhile, 3.5NF moved past non-
normalized data.

2. The percentage of 3NF+3.5NF to the total results has remained roughly the same year over year (~33.4% of the total score).
This indicates that the broader 3NF paradigm remains the most widely enforced type. Of course, this isn't surprising,
considering 3NF's reputation as the hallmark of normalized relational schema.

To look more deeply into the motives behind these answers, we also asked:

Please rank how frequently you have regretted enforcing the following approaches to normalizing relations in a database
(top = most frequently, bottom = least frequently).

And:

Please rank how frequently you have regretted NOT enforcing the following approaches to normalizing relations in a
database (top = most frequently, bottom = least frequently).

Results (n=467 across all results):

Table 2

FREQUENCY OF REGRET TOWARD NORMALIZING DATABASE RELATIONS

Regretted Enforcing Regretted NOT Enforcing

3NF + 3.5NF (3,138) 3NF + 3.5NF (3,094)

1NF (1,744) 2NF (1,852)

2NF (1,652) 1NF (1,804)

Non-normal (1,457) Non-normal (1,327)

4NF or higher (1,315) 4NF (1,176)

Observation: Again, because of the popularity of the 3NF+3.5NF paradigm, it is effectively an outlier for this analysis. Last year, we noted that developers tended to regret not enforcing normalization more frequently than they regretted enforcing it, and that has held true this year as well.



Research Target Two: A Look at Modern Paradigms
Motivations:
1. We wanted to see how the development world is adapting to the increasingly hybrid SQL/NoSQL reality settling in.

2. As microservices and their respective design patterns are becoming more prevalent, we wanted to examine how
developers are handling data management for them.

CONSIDERING NOSQL
The continuing rise of Postgres seems to have given high-performance SQL a new lease on life and pushed back the threshold
that makes NoSQL attractive (use case permitting, of course). Despite NoSQL solutions' valiant efforts in creating SQL-like
languages, the comfort of relational thinking is hard to discard. However, the golden rule of databases is that no one database
is right for all use cases, and the various NoSQL paradigms have found their niches (well, very large niches in the case of big
data, IIoT, etc.) and matured in the past decade.

Of course, using a hybrid system seems to be the most common response to this conundrum. Relational databases, no
matter how efficient they get, aren't right for every use case, and the benefits of NoSQL solutions are undeniable. In our 2020
Database Trend Report, we showed that most companies involved in the big data space had already adopted a hybrid model,
and that trend likely will continue.

However, we also wanted to see what kinds of RDBMSs our community is using. So we asked:

What type(s) of DBMS have you worked with?

Respondents could select as many as they liked. Unsurprisingly, RDBMSs maintained their healthy lead over the rest of the
field — more than 91% in popularity — although we wanted to take a closer look at which NoSQL options were gaining or losing
steam as well.

Results (n=568 across all results):

Figure 6

COMPARISON OF NOSQL DBMSs

[Bar chart of respondent experience with: Graph, Key-Value, Document, Column, Time Series, Other (0–80% axis). Values are shown in the original figure.]

Observations:
1. Looking back at our 2021 Data Persistence Trend Report, it seems that every NoSQL DBMS paradigm saw gains in popularity
except one — graphs. Looking at the year-over-year data, key-value systems grew from 64% last year to 73.4% this year.
Document systems grew from 56% last year to 62.2% this year. Columnar-oriented systems grew from 29.1% in 2021 to
37.8% this year — a substantial jump.

We did not report on time series data specifically last year, but we can infer growth in experience because it was not
heavily cited as an "other - write in" option when we asked the question in 2021. For what it's worth, the primary use
cases for time series data were real-time analytics (34.5%), monitoring (24.4%), security (19.6%), and (I)IoT (14.1%). We
suspect it is unlikely that any of these use cases will decline in the near future.



2. The dip in graph database system experience (33.2% in 2021 vs. 27.4% in 2022) could have multiple explanations. Most
likely, it's simply a difference in the respondent makeup last year versus this year. However, it also could be a sign that
graph databases, while an incredible technology, might not be as mature or ready for wide adoption as they theoretically
are (we'll be touching on this in the next section).

Lastly, it could just signify that, with a limited number of hours in the day, respondents and their leaders are choosing
to put their time toward other, more established or use-case-appropriate paradigms/solutions. Undoubtedly, graphs
are capable of helping solve a lot of complex business problems — it may just not be their time to shine yet.

THEORY vs. PRACTICE


As stated in the previous section, we wanted to get a better understanding of how various DBMS paradigms have played out
in the community. Everyone has experienced a bit of buyer's remorse or an otherwise not perfectly informed decision in their
lives. Furthermore, given the growing complexity of modern IT and the constant evolution of software designed to support it,
it's important to touch base on how trials of predominant technologies have panned out.

Thus, we wanted to examine how various DBMS paradigms are viewed both in theory and in practice. The immediate purpose
of this question was to see the gap between "30,000-ft" and "on-the-ground" views of each approach to data persistence.
The broader purpose was to understand how well theorized each approach is, and how well "digested" the theory is at the
granularity required for implementation.

With that in mind, we asked respondents to rate relational, graph, key-value store, document-oriented, column-oriented, and
time series DBMSs in both theory and practice from 0 to 5 stars:

Figure 7

DBMSs IN THEORY vs. IN PRACTICE

[Average star ratings (0–5) in theory and in practice: Relational (n=487 / n=484), Graph (n=452 / n=443), Key-value (n=470 / n=466), Document-oriented (n=460 / n=465), Column-oriented (n=451 / n=437), Time series (n=434 / n=441). Ratings are shown in the original figure.]

Observations:
1. For the most part, it seems that our community had a solid theoretical grasp on the technologies they then proceeded
to work with. Due to the design of the question, any difference in star rating between the two columns is 10 percentage
points and, thus, technically substantial, but being half a star away is the closest result the question has to being right on
target, which is worth bearing in mind.

2. That being said, it's interesting to note that the paradigms that saw the most pronounced jumps in popularity in the
previous section are the ones perceived to be the most consistent. Given RDBMSs' ubiquity, they can be safely set aside
for the purposes of this discussion. However, key-value, document-oriented, and column-oriented systems were all seen as
being roughly equally good in theory and good in practice by our community. Meanwhile, graph databases, which saw a
year-over-year dip in popularity, were seen as being modestly less good under the hood than at first glance.

3. Of course, the complication with the previous hypothesis is that time series DBMSs were also seen as less helpful in
practice than in theory, and we inferred substantial year-over-year growth from our data. As an explanation, time series
databases may simply be so critical to their niches, particularly in IoT, that they're seen as more indispensable for their use
cases than graphs are for theirs.



DATA INTEGRITY AND POLYGLOT PERSISTENCE
We also wanted to briefly touch base and expand upon the experiment we conducted in the 2021 Data Persistence report.
We theorized that more experienced developers wanted at least one "relational checkpoint" when working with systems that
include both SQL and NoSQL DBMSs (henceforth for this section, "polyglot persistence systems"). So once again, we asked:

Agree/disagree: Every polyglot persistence system should include at least one relational "checkpoint" to enforce referential
integrity at some specified level, time interval, or triggering event.

We then filtered the results by experience level and company size.

Results (n=569):

Figure 8

REFERENTIAL CHECKPOINT OPINION BY EXPERIENCE LEVEL

[Grouped bar chart comparing senior and junior respondents across: Strongly disagree, Disagree, Neutral, Agree, Strongly agree. Values are shown in the original figure.]

Figure 9

REFERENTIAL CHECKPOINT OPINION BY COMPANY SIZE

[Grouped bar chart comparing respondents at companies with >1,000 employees and <1,000 employees across: Strongly disagree, Disagree, Neutral, Agree, Strongly agree. Values are shown in the original figure.]

Observations:
1. Last year, we theorized that senior professionals were more likely to want at least one relational checkpoint in polyglot persistence systems because they were a bit more forward-thinking in avoiding unnecessary work and, thus, wanting the DBMS to handle more of the load. The data mostly bore out that hypothesis, with the exception that senior developers seemed more likely both to agree and to disagree with the idea. This year, that trend has continued — in fact, it has grown more pronounced.

The more experienced members of our community held stronger opinions across the range of responses — with the
lone exception of not having an opinion. The implication is, of course, that people working with polyglot persistence
systems will form an opinion on the matter one way or another the longer they work in the field.



2. Looking at company size, it's interesting to note that larger and smaller organizations mirrored senior and junior professionals' opinions, respectively, almost exactly. It seems that the larger the company you work for, the stronger your opinion on having a relational checkpoint in polyglot persistence systems is.

REVISITING THE CAP THEOREM


Of course, everyone is familiar with the CAP theorem — that, when working with distributed datastores, you can only guarantee two of consistency, availability, and partition tolerance at once. Last year, we suggested that developers were beginning to think more about availability now than in the past, especially as the cloud has become more prominently adopted in the past decade or so. We also hypothesized that partition tolerance was the lowest concern, but it was becoming increasingly more thought about over time. The data bore out those suppositions. Therefore, we again asked our community:

Rank how much you've thought about the following guarantees of a distributed datastore over the course of your professional
career (top = thought most about, bottom = thought least about):

We summed the ranks to create the following scores.

Results (n=554 across all responses):

Table 3

ATTITUDES TOWARD DISTRIBUTED DATASTORES

Rank Over Career Rank on Recent Projects

Availability (1,168) Availability (1,170)

Consistency (1,134) Consistency (1,081)

Partition tolerance (744) Partition tolerance (795)


Observations:
1. The results are telling — availability has slid into first place. This is true among junior and senior professionals in our
survey, although availability's lead has become marginally wider among junior respondents. That makes sense, since it
means availability has been more heavily considered for a larger relative percentage of our community members' careers.

2. While availability beat out consistency in recent projects last year, that range has widened considerably. In the 2021 report,
availability comprised about 38.8% of the total score. This year, that figure is 47.8%.

3. Lastly, the results are a modest sign that partition tolerance is becoming more attractive, or at least the technology is
finding use cases (or vice versa). Last year, partition tolerance constituted about 23.6% of the total score in recent projects.
This year, that total is 26.1%. That isn't gargantuan growth but is a sign that technologies like blockchain, which prioritizes
availability and partition tolerance at the expense of immediate consistency, are slowly finding their footing. For many
projects, it seems that eventual consistency is still consistency.
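As a hedged aside, the classic quorum rule is one concrete way to see this consistency/availability trade-off in a replicated datastore: with N replicas, a write quorum W, and a read quorum R, reads are guaranteed to observe the latest committed write only when R + W > N; relaxing that condition buys availability and latency at the cost of immediate consistency. The sketch below simply checks that condition.

```python
# A minimal sketch (not from the report) of the quorum-overlap rule.

def read_sees_latest_write(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """True if every read quorum must overlap every write quorum."""
    return read_quorum + write_quorum > n_replicas

print(read_sees_latest_write(3, 2, 2))  # True  -> strongly consistent reads
print(read_sees_latest_write(3, 1, 1))  # False -> eventually consistent, more available
```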

MANAGING MICROSERVICES
Few innovations in recent years have generated as much discussion as microservices. Of course, every segment of
development has its own buzzwords, but it's rare to see a technology deliver on such promise in so short an amount of time as
microservices have. As always, with new architectures come new ways of handling old problems. So in this final section of our
key research findings, we'll take a look at how our community is adapting to microservices in the context of data.

First, we wanted to see how popular various distributed design patterns have grown. To answer that question, we asked:

What distributed design patterns have you implemented?

Results (n=554 across all responses):



Figure 10

DISTRIBUTED DESIGN PATTERN IMPLEMENTATION

[Bar chart (0–50% axis) of the share of respondents who have implemented each pattern: Pub/sub, Ambassador, Sharding, Bulkhead, Adapter, Circuit breaker, CQRS, Pipes and filters, Leader election, Federated identity, Sequential convoy, Event sourcing, Sidecar, Other (write-in). Values are shown in the original figure.]

Observation: For this Trend Report, we wanted to take a particular focus on CQRS (as an alternative to CRUD) and event
sourcing. We haven't collected much historical data on distributed design patterns, but it's worth noting that nearly 30% of
our community has experience in working with CQRS and just under 25% have worked with event sourcing. Regardless of
the historical data, and at the risk of stating the obvious, it's clear that microservices are going to be an important part of the
ecosystem going forward.
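For readers unfamiliar with the pattern, the sketch below is a minimal, deliberately synchronous illustration of the CQRS idea: commands go through a write model, while queries are answered from a separately maintained read model. The classes and field names are invented for this example; in a real system the read model would usually be updated asynchronously, often from events.

```python
# A minimal CQRS sketch (illustrative, not from the report).

class OrderWriteModel:
    def __init__(self):
        self._orders = {}  # authoritative state

    def place_order(self, order_id: str, amount: float) -> None:  # command
        self._orders[order_id] = {"amount": amount, "status": "placed"}

    def snapshot(self) -> dict:
        return dict(self._orders)


class OrderReadModel:
    def __init__(self):
        self._totals_by_status = {}  # denormalized view optimized for queries

    def rebuild(self, orders: dict) -> None:
        self._totals_by_status = {}
        for order in orders.values():
            self._totals_by_status.setdefault(order["status"], 0.0)
            self._totals_by_status[order["status"]] += order["amount"]

    def total_for(self, status: str) -> float:  # query
        return self._totals_by_status.get(status, 0.0)


writes, reads = OrderWriteModel(), OrderReadModel()
writes.place_order("o-1", 30.0)
writes.place_order("o-2", 12.5)
reads.rebuild(writes.snapshot())
print(reads.total_for("placed"))  # 42.5
```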

Future Research
These key research findings were aimed at a mix of developers and DBMS designers. In addition to these questions, our survey
also included coverage of topics such as isomorphism between object models and physical data models, types of file systems
personally implemented by respondents, how often respondents think about physical characteristics of secondary storage,
how prospective consideration of physical storage details might impact performance, and more. Of course, we'll keep that data
and future questions along the lines we asked in this report in mind for future surveys and Trend Reports.

Please contact [email protected] if you would like to discuss any of our findings or supplementary data.

Mike Gates, Guest Writer & Former Senior Editor at DZone


@Michael_Gates on DZone

Mike has spent more than five years working with DZone contributors and clients alike, with roles ranging
from frontline web editor to editorial consultant to managing editor. These days, he uses his history of
recognizing readership trends to write and edit articles for DZone.com — when he isn’t busy listening to
audiobooks and searching for the perfect oven-baked beef brisket recipe.



ScyllaDB: The Database
for Gamechangers
Gamechangers rely on ScyllaDB, the database for data-intensive
apps that require high performance and low latency.

ScyllaDB’s unique close-to-the-hardware architecture powers


engaging experiences at scale with impressive speed.

5X higher throughput | 2X lower latency | 75% TCO savings

DISCOVER SCYLLADB
PARTNER CASE STUDY

Case Study: Comcast


Cutting P99 by 95 Percent While Reducing 962 Nodes to 78

Comcast is a global media and technology company with three primary


businesses: Comcast Cable (one of the United States' largest video, high-
speed internet, and phone providers to residential customers), NBCUniversal,
and Sky.

COMPANY
Comcast

COMPANY SIZE
189,000 employees

INDUSTRY
Entertainment, Telecommunications

PRODUCTS USED
ScyllaDB Enterprise

PRIMARY OUTCOME
By moving to ScyllaDB, Comcast improved latency by more than 95 percent and dramatically reduced their total database infrastructure from 962 cluster nodes to 78.

"What we saw was pretty phenomenal. We simulated 2.5X our peak load with a 95 percent drop in our response times. That's value that you pay back to the end-user right away."
— Phil Zimich, Senior Director of Engineering, Comcast

Challenge
Comcast's Xfinity service serves 15 million households with over 2 billion API calls (reads/writes) and more than 200 million new objects per day. Over the course of seven years, the project expanded from supporting 30K devices to over 31 million devices. They first began with Oracle, then later moved to Apache Cassandra (via DataStax). When Cassandra's long-tail latencies proved unacceptable at the company's rapidly increasing scale, they began exploring new options. In addition to lowering latency, the team also wanted to reduce complexity.

To mask Cassandra's latency issues from users, they placed 60 cache servers in front of their database. Keeping this cache layer consistent with the database was causing major admin headaches.

Solution
Comcast selected ScyllaDB, the NoSQL database for data-intensive apps that require high performance and low latency. ScyllaDB's close-to-the-metal, shard-per-core architecture delivers greater performance for a fraction of the cost of DynamoDB, Apache Cassandra, MongoDB, and Google Bigtable.

Thanks to ScyllaDB's ability to take full advantage of modern infrastructure — allowing it to scale up as much as scale out — Comcast was able to replace 962 Cassandra nodes with just 78 nodes of ScyllaDB. They improved overall availability and performance while completely eliminating the 60 cache servers. The result: a 10x latency improvement with the ability to handle over twice the requests — at a fraction of the cost.

Results
By moving from Cassandra to ScyllaDB, Comcast:

• Reduced their total database infrastructure from 962 Cassandra nodes to 78
• Decreased P99, P999, and P9999 latencies by 95 percent
• Achieved 60 percent savings over Cassandra operating costs
• Saved $2.5 million annually in infrastructure costs and staff overhead

CREATED IN PARTNERSHIP WITH



CONTRIBUTOR INSIGHTS

Strategies for Governing


Data Quality, Accuracy,
and Consistency
By Ted Gooch, Staff Software Engineer at Stripe

Introduction
In 2006, mathematician and entrepreneur Clive Humby coined the phrase, "Data is the new oil." The primary point of this
comparison is to highlight that, while extremely useful, data must be extracted, processed, and refined before its full value can
be realized. Now over fifteen years later, it is easier than ever to accumulate data, but many businesses still face challenges
ensuring that the data captured is both complete and correct.

THE IMPACT OF DATA QUALITY


At the beginning of a company's data journey, simply loading application data into a database provides valuable information. However, the purpose of data-driven decision making is to reduce uncertainty. If the data is of low quality, it may introduce additional risk and lead to negative outcomes. To put this into concrete numbers, data quality issues cost organizations about $12.9 million annually, according to a 2021 Gartner survey.

Apart from decision making, reactively detecting and remediating data issues takes a significant amount of developer resources. Specifically, a 2022 Wakefield Research survey of 300 data professionals highlighted that business stakeholders are often affected by erroneous data before data teams discover it. Additionally, the survey outlined that data teams spent 793 hours per month fixing data quality related problems. This is negative in two dimensions: first, teams spend significant effort to fix issues; second, stakeholders lose trust in the quality of data produced.

Fortunately, there are numerous and diverse approaches to proactively curating data quality. The remainder of this brief outlines key techniques and methodologies to ensure data is consistently up to standard and matches the expected semantics. Furthermore, these strategies ensure most errors are detected early and that the scope of any late-detected problems is quickly understood.

DATA PIPELINE CONTRACTS


Defining the core requirements for a dataset brings clarity for both the producers and consumers of that dataset. When should
this data arrive? How fresh should it be? Are there any expectations on column bounds? Is consistency required with other
tables in the data warehouse?

These are just a few of the questions that should be explored so that there is alignment in expectations between the teams
that are originating the data and the downstream consumers. Succinctly put, the first step in better data quality practices is
declaring what exactly a dataset should look like for a specific use case.

Importantly, the amount of rigor around a specific set of tables should be anchored to both the cost of audit execution and the
analytic value those subjects provide. For example, data flowing into an expensive, high-performance database for use in regulatory
reporting requires more quality assurances than an exploratory dataset stored in a data lake of distributed object storage.

DATA OBSERVABILITY
The journey to better data quality practices starts with looking at individual jobs. However, similar techniques used to reduce
risk in business decisions can also be leveraged when improving data governance. Instead of looking at single jobs, a systems-
based, ongoing approach is used to offer the necessary context. A comprehensive, cross-organizational strategy will surface
issues that are not obvious when looking at single jobs.



In particular, data observability is the term used to describe a comprehensive overview of a data system's health. A more holistic approach shifts the readiness stance from reactive to proactive. A proactive stance limits the scope of remediation needed when errors are introduced — and it also reduces the time spent by data engineers tracking down the provenance of a particular failure.

What Is Data Quality?


Before laying out methodologies for data quality initiatives, it is helpful to define what exactly is meant when discussing data
quality. This section will clarify terms and provide examples to make the term "data quality" more concrete. Put simply, data
quality prescribes the attributes to which producers and consumers agree a dataset must conform in order to drive accurate,
timely downstream analysis.

DATA QUALITY AS A CONTRACT


The union of requirements described above constitutes a contract between parties in the data pipeline. This may manifest
such that it's explicitly enforced by producers or it may be an implicit set of expectations from users. To shed more light on this
relationship, four types of data standards are outlined below:

• Correctness refers to ensuring that the column values for data conform to the expected domain. It essentially is
answering this question: Does the data produce correct answers when business calculations are applied?

• Completeness checks ensure that the data contains all the content that is expected for a given dataset. Did the
expected amount of data arrive and are all columns populated with the necessary information?

• Consistency may have several different meanings depending on the usage. In this instance, it focuses on whether
the data in a table matches what exists in related tables elsewhere in the ecosystem.

• Timeliness is key. Some analysis requires data to be current within a certain time period from the event that
generated the data.
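To make these standards concrete, the snippet below sketches how they might be declared as an explicit contract for a hypothetical daily_orders dataset. The field names and thresholds are invented for illustration; the point is only that the expectations become explicit and machine-checkable.

```python
# A hedged sketch of a data contract; names and thresholds are hypothetical.
daily_orders_contract = {
    "correctness": {"amount": {"min": 0}, "currency": {"in": ["USD", "EUR"]}},
    "completeness": {"min_row_count": 10_000, "non_null_columns": ["order_id", "amount"]},
    "consistency": {"order_id_must_exist_in": "orders_dimension"},
    "timeliness": {"available_by_utc": "06:00", "max_staleness_hours": 24},
}
```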

SHARED UNDERSTANDING OF DATASET SEMANTICS


A key objective is the alignment of expectations between data sources and downstream dependencies. Data quality is a
codification of these agreements. In order to have a successful implementation, ownership should be pushed to the teams that
have the best understanding and frame of reference.

METRIC DEFINITION OWNERSHIP


Often, product teams do not have a strong intuition about how their data will be used in downstream analysis. For this reason,
they cannot be the sole owners of the data quality metric definition. There must be a negotiation about what is possible for an
application to produce and the properties stakeholders need for their reporting to create actionable results.

PRODUCER CONSTRAINTS AND CONSUMER REQUIREMENTS


Typically, producers define the rules for a complete and correct dataset from their application. Typical questions for producers
are indicated below:

• Is there a reasonable volume of data given the historical row counts for a specific job?
• Are all column values populated as expected? (e.g., no unexpected null values)
• Do column values fall in the anticipated domain?
• Is data being committed at an appropriate cadence?

Figure 1: Producer constraints and consumer requirements



Consumers' needs are tightly coupled to the actual usage of the data. Typical questions are as follows:

• Did the complete data arrive in time to publish reports?


• Do column values have referential integrity between facts and dimensions?
• Do business calculations produce results in the expected range?

Frameworks for Promoting Data Standards


Data quality frameworks provide a systematic approach to data quality. A comprehensive approach guides data engineers toward best practices and ensures that a rigorous methodology is consistently applied across an ecosystem.

AUDITS
Data audits are a central piece of a data standards framework and expose the low-level information about whether a dataset
complies with agreed upon properties. There are several types of audits and methods in which to apply them.

TYPES OF AUDITS
Below is a brief overview of common data quality audits:

• Row counts within range


• Null counts for column within range
• Column value within domain
• Referential integrity check
• Relationship between column values
• Sum produces non-zero value

Furthermore, there are endless possibilities for what can be defined as an audit, but they all reduce to answering this question: does the content match what the producer intends to send and what the consumer expects to receive? When creating audits, data owners must be cognizant of the tradeoff between the cost of collecting the audit and the increase in confidence that passing a check will give.
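As an illustration, the sketch below implements two of the audits listed above (row count within range, null fraction within a bound) against a pandas DataFrame. The column names and thresholds are hypothetical, and equivalent checks could be expressed in SQL or any dataframe engine.

```python
# A minimal audit sketch; column names and bounds are assumptions.
import pandas as pd

def audit_row_count(df: pd.DataFrame, low: int, high: int) -> bool:
    return low <= len(df) <= high

def audit_null_fraction(df: pd.DataFrame, column: str, max_fraction: float) -> bool:
    return df[column].isna().mean() <= max_fraction

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, None, 5.0]})
results = {
    "row_count_in_range": audit_row_count(df, low=1, high=1_000_000),
    "amount_nulls_ok": audit_null_fraction(df, "amount", max_fraction=0.10),
}
print(results)  # {'row_count_in_range': True, 'amount_nulls_ok': False}
```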

Audits may also be blocking or non-blocking. Blocking audits prevent the failed data pipeline from proceeding until a
correction is applied. Conversely, non-blocking audits alert pipeline owners to the failure, and allow the pipeline to proceed.
Ideally, each consumer will determine which audits are blocking/non-blocking for their specific use-case. Exploratory use cases
may even be comfortable executing against data that has not yet been audited with the understanding that there are no
quality guarantees.

WRITE-AUDIT-PUBLISH
Write-Audit-Publish (WAP) is a pattern where all data is written first to a staging location in the database and must pass all
blocking audits before the commit is made visible to readers. Typically, this is enabled by special functionality within a database
or by swapping the table or view.

Figure 2: WAP pattern workflow

Specifically, WAP ensures that there are no race conditions where consumers unintentionally read data that has not yet been
validated. Proactive action must be taken to perform reads of unaudited data, and it must be a conscious decision by the reader.
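The sketch below walks through the WAP flow with SQLite standing in for the warehouse: data lands in a staging table, blocking audits run against it, and only then is it published to the table backing the readers' view. The table names and audits are illustrative; some modern table formats and warehouses support this staging-and-publish flow natively.

```python
# A hedged Write-Audit-Publish sketch; schema and audits are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_staging (id INTEGER, amount REAL);
    CREATE TABLE events_published (id INTEGER, amount REAL);
    CREATE VIEW events AS SELECT * FROM events_published;
""")

# Write: load new data into staging only; readers of `events` cannot see it yet.
conn.executemany("INSERT INTO events_staging VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

# Audit: blocking checks run against staging before anything becomes visible.
row_count = conn.execute("SELECT COUNT(*) FROM events_staging").fetchone()[0]
negatives = conn.execute("SELECT COUNT(*) FROM events_staging WHERE amount < 0").fetchone()[0]

if row_count > 0 and negatives == 0:
    # Publish: make the audited rows visible to readers of the `events` view.
    conn.execute("INSERT INTO events_published SELECT * FROM events_staging")
    conn.execute("DELETE FROM events_staging")
    conn.commit()
else:
    raise RuntimeError("Blocking audit failed; data not published")

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 2
```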



INTEGRATED AUDITS
A variation of WAP that is supported in some database engines is the concept of integrated audits. A job writing data can
specify the expected column values or range of values. The engine will then validate that all rows being written conform to
those expectations, or otherwise, the write will fail.
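A minimal sketch of the idea follows, using a CHECK constraint in SQLite to stand in for an engine-enforced column domain; the table and bounds are invented for illustration.

```python
# An "integrated audit"-style sketch: the expected domain is declared to the
# engine, so out-of-range writes fail instead of landing silently.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, pct REAL CHECK (pct BETWEEN 0 AND 100))")

conn.execute("INSERT INTO readings VALUES (1, 42.0)")       # conforms, succeeds
try:
    conn.execute("INSERT INTO readings VALUES (1, 130.0)")  # violates the declared domain
except sqlite3.IntegrityError as err:
    print("write rejected:", err)
```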

REMEDIATION
What happens when errors are introduced into a system? According to the 2022 Wakefield Research Survey referenced in the
introduction, the majority of respondents stated that it took four or more hours to detect data issues. Additionally, more than
half responded that remediation took an average of nine hours. The goal of a robust data quality strategy is to reduce both of
these metrics.

ISSUE DETECTION
First, comprehensive audits throughout the dependency graph of a data pipeline reduce the amount of time before erroneous data is identified. Detection at the point of introduction guarantees that the scope of corruption is limited.
Additionally, it is more likely that the team that discovers the problem is also able to enact the necessary fixes to remediate
the pipeline. Dependent jobs will have high confidence that the data they receive matches expectations, and they can further
reduce the scope of validation that is needed.

MEASURING IMPACT
Once an issue is detected, a specific audit failure gives the investigating engineer adequate insight to begin debugging the
failure. This is in contrast to the scenario where stakeholders discover errors. In that case, there must be an investigation to
track the flow of jobs backwards up the dependency graph. This investigation will necessarily increase the time to resolution
due to the increase in scope of jobs that must be evaluated before debugging can occur.

DATASET PROVENANCE AND LINEAGE


Lineage refers to the set of source nodes in a dependency graph upstream from a given execution. When debugging data issues, significant time is saved if there is strong tooling around understanding the provenance of an erroneous dataset.
Knowing the places where an error may have been introduced reduces the search space, and consequently allows data
engineers to focus their debugging efforts.

Figure 3: Data provenance

In addition, when an issue is detected post-hoc, lineage tools help assess the set of jobs that are impacted by a data failure.
Without such tooling, it is labor intensive to manually search through dependencies and discover all affected operations.
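A hedged sketch of how lineage narrows the search space follows: given a map of each job's upstream dependencies (the job names here are hypothetical), a simple traversal yields every upstream job that could have introduced the error.

```python
# A minimal provenance walk over a lineage graph; graph contents are assumed.
from collections import deque

upstream = {
    "revenue_report": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "fx_rates": [],
    "orders_raw": [],
}

def provenance(job: str) -> set:
    """All upstream jobs that could have introduced an error seen in `job`."""
    seen, queue = set(), deque([job])
    while queue:
        for parent in upstream.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(provenance("revenue_report"))  # {'orders_clean', 'fx_rates', 'orders_raw'}
```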

TRACKING
Building trust requires a history of delivering on promises. Demonstrating an adherence to commitments over time gives
consumers confidence in the product of a process. Data quality, just like other relevant company key performance indicators,
benefits from mindful collection and review.



MEASURE QUALITY OVER TIME
Tracking metrics on data quality over time allows an organization to guide resources and improve the areas which will have
the most impact. Are there certain use cases that are consistently failing audits? Does data typically arrive past service-level
objectives? If these questions are getting affirmative responses, it is a signal that there must be a deeper discussion between
the responsible team and upstream teams. The data contracts must be re-evaluated if compliance is not possible.

KEY INDICATORS TIED TO IMPACTFUL USE CASES


Audits must have a clear connection to the ground-truth of the organization. This ensures a focus on quality metrics that are
directly tied to real, actionable aspects of an organization's data health. In this case, the value of the audit is directly measurable
and corresponds one-to-one with the value of the business calculation that it supports.

Conclusion
Businesses are increasingly leveraging data to improve their organizational decision making. According to a 2021 NewVantage
survey, a staggering 97 percent of respondents indicated investment in data initiatives. High standards for data quality
establish trust and reduce the uncertainty when using data as an input for decision making.

Data quality frameworks enforce a consistent approach across all processes. Automatically and consistently applied tooling reduces the amount of engineering hours necessary to provide adequate auditing coverage. A high level of coverage results in issues being caught early in the pipeline and improves remediation metrics. A reduction in the amount and scope of impact of errors builds trust with business stakeholders. Finally, a high level of trust between engineering and business teams is a requirement to build a successful data-driven culture.

Ted Gooch, Staff Software Engineer at Stripe


@TGooch44 on DZone | @tedgooch on LinkedIn

Ted is a seasoned data professional with 15+ years of work in the data space, and he has worked with many
organizations to improve infrastructure and best practices around data. While at Netflix, he worked on the
initial implementation of Iceberg and is currently a committer on the Iceberg open-source project. In his
current role, he is helping Stripe build the next generation of data tooling with an emphasis on security, compliance, and
user experience.



CONTRIBUTOR INSIGHTS

Migrate RDBMS Dinosaurs


to the Cloud
To Evolve, You Must Lift and Shift First

By Kellyn Gorman, Principal CSA & SME for Oracle on Azure at Microsoft

Dinosaurs are not extinct. Many of the top businesses of today have either migrated to the cloud or are currently in the process of migrating. As part of their IT organizations, it's common to possess one or more large relational database management systems (RDBMSs) that are at the core of the business. These monstrous dinosaurs hold the most mission-critical of all company data and are in no way extinct, but they can serve as an anchor holding back a full migration to the cloud. No matter the cloud strategy, these monolithic databases are essential to the ecosystem and must be part of the migration strategy for it to be successful.

Figure 1: Cloud migration example

A common mistake is when teams attempt to separate the application or smaller systems connected to the large relational databases, as demonstrated in Figure 1. To be successful, the relational databases and all connected resources — whether they are applications, secondary databases, web servers, etc. — must migrate as one. Furthermore, that success requires a strategy to migrate large amounts of relational data, multiple servers, software installations, jobs, and network configurations as part of the data ecosystem.

On top of all this complexity, the network is the last bottleneck and will be one of the biggest challenges to overcome as part of this herculean effort.

How Large Relational Databases Impact Cloud Migrations


Relational systems historically have a minimum of two tiers — a relational database and application or access tier. In their
more complex designs, they have multiple application server tiers, servers to manage FTP access, ETL/ELT, web servers,
middleware, and corresponding databases that either feed or are fed from the main relational system. Some platforms, such
as Oracle, are architected around schemas, which result in a historically larger database that is more difficult to migrate unless
taken as a whole.



THE DICHOTOMY OF THE RELATIONAL DINOSAUR
The relational database dinosaur's natural life is one of growth, and with an RDBMS based on a schema design vs. a smaller tenancy architecture, each database can possess terabytes and sometimes petabytes of data. Depending on the interconnectivity of the data to other systems, the database size can create its own gravity, pulling systems closer to the source to provide the best user experience. In the cloud, this pull is amplified by the massive real estate covered by an enterprise cloud.

Figure 2

Data gravity will pull applications, connected data estates, and resources to the largest body of data, most often a legacy relational database possessing critical business data.

As more data travels between applications and databases to the larger relational system (via ETL/ELT processing or database links), there is a need for all systems involved to be closely connected to the larger relational body to eliminate latency. This, in essence, is data gravity.

When architecting an RDBMS for the cloud, data gravity must be taken into consideration. Not just for choices in infrastructure, but even for services, a cloud solution must have awareness of application and database connections to deploy them for the most optimal performance. Design begins from the largest of the systems, then radiates out to the smallest components/services, ensuring the most impactful systems receive the focus required for success in the architecture design.

ALL OR NOTHING TO THE CLOUD


As customers migrate to the cloud, they may have dipped their toes in with a few migrated systems, then decided to move
everything to the cloud in earnest. With this in mind, there is a goal to leave nothing on-premises, and this requires an
understanding of archaic relational systems and the requirements for migrating them to the cloud.

One of the most significant weaknesses of the trickle-to-the-cloud strategy is that previous, smaller cloud migration projects may have shifted various workloads across multiple clouds, and if there is data interaction between those systems, the team discovers multi-cloud dependencies late in the process. The network becomes our last bottleneck, which no one has discovered how to overcome. Close data center locations with peered networks and accelerated networking may assist in eliminating some of the latency, but as demonstrated in Figure 3, until new networking technology is developed, this challenge will continue. Multi-cloud solutions can provide some benefits for sharing data between cloud providers, but they will never perform like a single-cloud solution.

Figure 3: Network latency differences between cloud providers can vary between regions and geographies



The first goal in overcoming a cross-cloud latency issue is to identify what data must move between the environments daily, weekly, etc. A second goal should be to examine how developers have performed their work on-premises and optimize it for cloud development, eliminating excess whenever possible. Always choose to minimize any additional IO that could be created when pulling or pushing data across the network.

All cross-cloud data processing should be tested fully to ensure it can meet the demands of the business and is acceptable
even with potential data growth over time.

Infrastructure as a Service vs. Platform or Software as a Service


When investigating cloud migrations, users find that Platform as a Service (PaaS) and Software as a Service (SaaS) are repeatedly marketed as attractive options for all on-premises technology. Users are thrilled to hear they may be able to spend less on supporting infrastructure and platforms, but they forget how much technical debt has already been built into the relational environments they want to move to the cloud.

WHY ARE VERY LARGE RDBMSs SO OFTEN LIMITED TO IAAS?


Once it becomes apparent that PaaS and SaaS will require users to give up many customizations and functionalities, the user is back to considering Infrastructure as a Service (IaaS). This occurs due to a combination of factors, but most of the challenges revolve around years of complexity built into the systems and a lack of features in SaaS/PaaS offerings. When deciding which cloud options fit the data estates moving to the cloud, follow these simple guiding principles:

• SaaS:
– You are working on a greenfield (new) project
– There is no customized code required at the database layer
– The system possesses application-driven development and has simple data storage requirements
– You are working with smaller user bases and simple recovery point objectives (RPOs)/recovery time objectives (RTOs)

• PaaS:
– You are working on a greenfield project
– The resource usage for vCPU, memory, and IO easily fit in limits of PaaS
– There are few IT resources to manage infrastructure, or there is a desire to remove this requirement
– There are less advanced features or customized options implemented to the database tier

• IaaS:
– You are working with large, terabyte-petabyte relational systems
– You require the same or similar architecture as your on-premises application
– You have unique demands on resources — IO, vCPU, and/or memory
– You have very demanding workloads with complex RPOs/RTOs and development demands

If there is a need to go with IaaS, it is important to realize that cloud vendors can provide solutions for an incredible array of
workloads, and relational workloads are unique, requiring the correct IaaS solution to meet the requirements.

HOW TO BUILD OUT AN RDBMS MIGRATION STRATEGY


Migrations are challenging and being prepared is the best course of action to succeed. Relational databases with multi-
tier systems, no matter whether you are working with an archaic client/server architecture or a mainframe solution, require
planning to ensure success. Although each project is unique, there are certain aspects that are universal and, if satisfied as part
of the plan, will help to guarantee a successful migration. The universal list often includes:

• Database size and complexity


• Data loads and connected ecosystems
• Application, job, web, and other servers
• Network latency



WHAT IMPORTANT METRICS MUST BE IDENTIFIED IN RDBMS?
Most relational workloads are resource heavy — in other words, they are more demanding on infrastructure than other
workloads. But as much as we may focus on CPU and memory, relational workloads, especially ones such as Oracle, can require
high IO storage solutions.

Most IO storage benchmarks focus heavily on requests per second (IOPS); however, request sizes can vary, leaving these values skewed for marketing benefit. From my experience, the recommendation is to focus less on IOPS and ensure that the solution chosen, for both virtual machine and storage IO limits, can handle the required megabytes per second (throughput).
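As a back-of-the-envelope check, here is a minimal sketch (with hypothetical numbers, not vendor figures) of why the same IOPS budget can mean very different throughput depending on request size:

def throughput_mb_per_s(iops, request_size_kb):
    """Convert an IOPS figure and an average request size into MB/s."""
    return iops * request_size_kb / 1024  # 1,024 KB per MB

iops_limit = 20_000  # hypothetical VM or disk IOPS cap

# Typical OLTP requests are small; backups and scans push much larger requests.
for size_kb in (8, 64, 256):
    mbps = throughput_mb_per_s(iops_limit, size_kb)
    print(f"{iops_limit} IOPS at {size_kb} KB requests ~ {mbps:,.0f} MB/s")

The same 20,000 IOPS figure ranges from roughly 156 MB/s to 5,000 MB/s in this sketch, which is why the throughput limit, not the headline IOPS number, is usually the binding constraint for large relational workloads.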

CREATING TIERS OF RDBMS COMPLEXITY


As services, high availability, and backups change in the cloud, all decisions around storage and solutions must focus on RPO
and RTO. Any required customer uptime SLAs that may be different from the RPO/RTO should also be considered because
services could be bundled into storage solutions chosen as part of the architecture.

Ensure that all architectural decisions are based on how cloud architecture should be designed for recommended practices
and not just replicating what a customer has built into their on-premises architecture. This is a common mistake seen in the
cloud, creating holes and redundancy.

A good starting point is to lift and shift the relational database workload, which removes any infrastructure debt built into the existing on-premises hardware. If that hardware is left out of consideration and all focus goes into the relational workload itself, a new architecture can be designed around its actual needs.

Rinse and Repeat to Success


Because most data ecosystems require the main database and connected systems to not only be migrated but also duplicated
for non-production copies, there’s significant importance to building out a framework that can be simplified, automated, and
deployed as part of a DevOps practice. Performing all the actions involved sequentially without a framework each time would
be incredibly time consuming and prone to mistakes.

BUILD A FRAMEWORK
Building a cloud migration framework starts with documenting what is required to deploy a relational system to the cloud
from end to end. The beginning outline can look similar to the high-level example shown in Figure 4 and be built out to
complete a migration project plan.

Once this is built out, use tools and scripts to automate as much of it as possible while including enough flexibility to be reused
for numerous systems and architecture going forward.

Figure 4: An example of a high-level framework for a cloud migration

Ensure the scripting language and tools can scale as your cloud migrations do, and verify that they can manage the
infrastructure, relational system, and the data. As issues arise and are resolved, document them and ensure that these aren’t
repeated in the future, allowing for efficiency to develop as part of cloud migration strategies.



Conclusion
Large relational databases are targeted as the first to disappear from the technical landscape, like an asteroid aiming for the dinosaurs, and yet these archaic systems are more often the center point for many cloud migrations. Once moved to the cloud, multiple projects may be proposed to modernize and eliminate these dinosaurs, but more often, their bones become the foundation for new application strategies, with the data residing in the same relational systems as it did on-premises. Limited resources, lacking ROI, or the sheer effort required to modernize often remove the urgency to change the system.

As businesses continue to move to the cloud, recommended practices to move large RDBMS as part of these data centers and
data estates will be necessary due to the role these relational systems still play in the data estate.

Kellyn Gorman, Principal CSA & SME for Oracle on Azure at Microsoft
@dbakevlar on DZone | @kellyngorman on LinkedIn | @DBAKevlar on Twitter | dbakevlar.com

Kellyn Gorman is a talented and accomplished Principal Cloud Solution Architect and Oracle SME on Azure at Microsoft, specializing in multi-platform databases, Azure infrastructure, and DevOps automation. She is a proud member of the Oak Table Network, an Oracle ACE Director Alumnus, and a former Idera ACE with over 20 years' experience in database administration, plus extensive experience in optimization, migrations, automation, and cloud architecture.



CONTRIBUTOR INSIGHTS

Should You Move to the Cloud?
Questions to Consider Before Migrating

By Monica Rathbun, Consultant at Denny Cherry and Associates Consulting

Companies are moving to the cloud at lightning speed nowadays, but many who make the leap are facing challenges and
wish they had asked these essential questions first. Not having discussions around these key issues before moving to the
cloud can lead to costly outcomes. Let’s walk through what questions you should ask before your company embarks on a
cloud migration.

What’s the Best Cloud Service Option to Fit Your Environment?


First and foremost, you need to discover what options are available and best fit your environment — you shouldn’t necessarily
just look to rebuild your on-premises environment in the cloud. Do you know the difference between cloud service models
such as PaaS and IaaS?

PaaS (Platform as a Service) is essentially having the cloud provider host and manage both the infrastructure and database
components. PaaS services vary, but in nearly all cases, backups, high availability, and patching are taken care of by the cloud
vendor. In some cases, your service may auto scale with the workload.

IaaS (Infrastructure as a Service) gives you networking, storage, and VMs. In IaaS, your VMs will be maintained fully by your internal staff; however, there are add-ons that can automate services like backups and even patching in some cases. You can compare this to using a hosting provider who provides the hardware while you install, configure, and maintain anything housed on that VM.

Figure 1: PaaS vs. IaaS

The next question is, have you mapped out your current environment and do you have an understanding of what the
resource requirements are? You need to get a baseline and gather metrics, like how much memory is being consumed versus
what is allocated, and what your storage requirements are. Storage requirements go far beyond just data volume, which is
normally what is thought of when referring to storage, but you need to take it a step further and dig into bandwidth and IOPS
utilized by various workloads (as seen in Figure 2 on the next page). How many CPUs are currently allocated versus needed?



All of these metrics should be considered as you look at what platform options, pricing tiers, and sizing will fit your cloud
environment. You’ll learn more on this later.

Figure 2: Key metrics to consider

Lastly, when looking at the best cloud option to fit your environment, ask who manages your current databases,
infrastructure, and network, and ask how moving to the cloud will change that. When choosing between cloud service
options, the role of your current staff needs to be considered since each option changes the role of your team.

If you are choosing a PaaS option, you no longer rely on an internal infrastructure staff to maintain and build the infrastructure,
which can be a huge productivity savings. Along with that, your DBA’s role slightly changes as well. DBAs are still needed to
tune performance, validate, and manage the data, but this will allow their focus to shift from maintenance to improvement.
This leads us to our next big question.

Does My In-House Team Have the Skills to Support Moving to the Cloud?
Cloud migrations have a lot of planning and moving parts in both design and execution. It’s important to make sure you have
the skills and support staff to take on a migration and ensure its success from start to finish. The first series of questions to ask
are as follows:

• What are the cloud skills available within your current staff?

• Have any of your staff ever performed a migration to the cloud?

• Is it advantageous to get your staff trained and certified before embarking on your migration journey?

– All cloud providers offer training and certification programs and exams to help ready their skill sets.

There is nothing more costly both in time and money than a partial migration that has to switch resources mid-stream due to
errors, missed steps, and bad decisions based on poor planning or missing in-house skills. This cost means it is important to
ask if you should outsource the project. It’s very possible that the best plan is to augment your staff by bringing in consultants
to assist with the project. Or even more, for the success of your project, it could be better to outsource the full migration to a
consultancy that has both the skills and broad experience across a number of migrations.

Is Moving to the Cloud Financially Advantageous?


There is nothing worse than sticker shock after a successful cloud migration. Choosing the cloud option with the right
environment and right sizing is essential to avoid sticker shock, which is why we start with this question first: which cloud
option is our best fit? Moving to the cloud provides scalability, security, and flexibility, but all that comes with a price. If you
don’t choose the proper scaling or tier, costs can multiply exponentially and, in some cases, end up higher in the cloud versus
staying in your current on-premises environments.

The next important question to ask is whether moving to the cloud is financially advantageous. Each cloud provider offers
great tools to allow you to estimate your monthly spend. As part of your migration planning, it’s vital to take the time to analyze
the metrics you have gathered against each pricing tier option that meets those needs. By doing so, you can see what an
estimated monthly cost might be. It is important to monitor those costs to ensure that your budget expectation is in line with
your actual spend.



What can you do to lower costs? Have you evaluated all the cost options available within your chosen cloud provider?
Cloud providers understand that pricing is a huge concern when moving to the cloud, so many providers have introduced
options to help minimize that. By far, the largest savings come from purchasing resource reservations instead of paying month to month; the discounts are based on term agreements. When you make a reservation, you can reduce
costs up to 72 percent — and you still have flexibility to move reservations around different resources. You can also pay either
monthly or upfront, depending on your financial situation.
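As a rough illustration only (the hourly rate below is hypothetical, and the 72 percent figure is the upper bound mentioned above, not a guaranteed discount):

HOURS_PER_MONTH = 730

pay_as_you_go_rate = 1.20        # USD per hour, hypothetical
reservation_discount = 0.72      # up to 72% off with a term reservation

monthly_on_demand = pay_as_you_go_rate * HOURS_PER_MONTH
monthly_reserved = monthly_on_demand * (1 - reservation_discount)

print(f"On demand: ${monthly_on_demand:,.2f}/month")
print(f"Reserved:  ${monthly_reserved:,.2f}/month")
print(f"Savings:   ${monthly_on_demand - monthly_reserved:,.2f}/month")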

Another option is to take advantage of flexible resources. Normally, most PaaS resources are provisioned to a specific compute
tier that provides dedicated resources with fixed costs billed on an hourly rate. This is great when your resource has consistent
usage patterns. But what if your usage fluctuates? What if you could save money when usage is lower?

That’s where flexible resources come in — these are especially good for dev/test workloads where usage may be limited.
Not all cloud resources have flexible offerings, but when it is offered, it is another good way to reduce the costs of your cloud
migration. It can also be a great way to save by taking advantage of the auto scaling feature that comes with it, especially
for a new workload where you may not understand the workload profile. It is important to note that in some cases, the
per hour costs of these flexible resources are more than a fully allocated service — this means you are only saving money if
those resources are idle. Taking the time to question costs and research cost-saving options should be part of the questions
answered prior to migrating.
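A quick break-even check, sketched below with hypothetical rates, shows why: a flexible tier that bills a higher rate only while active wins only when the resource is idle for a large share of the month.

HOURS_PER_MONTH = 730

provisioned_rate = 0.50   # USD/hour, hypothetical, billed for every hour
flexible_rate = 0.80      # USD/hour, hypothetical, billed only while active

provisioned_monthly = provisioned_rate * HOURS_PER_MONTH  # flat monthly cost

for active_hours in (100, 300, 456, 600):
    flexible_monthly = flexible_rate * active_hours
    winner = "flexible" if flexible_monthly < provisioned_monthly else "provisioned"
    print(f"{active_hours:>3} active hours: flexible ${flexible_monthly:,.2f} "
          f"vs. provisioned ${provisioned_monthly:,.2f} -> {winner}")

# Break-even utilization = provisioned_rate / flexible_rate (about 62% of the month here)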

Summary
As you can see, migrating to the cloud is more than just a procedural process; there are many questions to be asked before
taking the first step in the journey. It’s important to consider and question all things that will govern the migration process,
in-house skills, cost, and resource needs. Have you asked and answered all the questions posed? If not, it’s time to take a step
back and get some answers.

Monica Rathbun, Consultant at Denny Cherry and Associates Consulting


@sqlespresso on DZone | @sqlespresso on LinkedIn | @sqlespresso on Twitter | www.sqlespresso.com

Monica Rathbun is a Microsoft MVP for Data Platform, Microsoft Certified Solutions Expert, and VMWare
vExpert. She has 20 years' experience working with a wide variety of database platforms with a focus on
SQL Server and the Microsoft Data Platform. She is a frequent speaker at IT industry conferences on topics including performance tuning and configuration management, the leader of the Hampton Roads SQL Server User Group, and a member of the Microsoft Azure Data Community Board and the Data Saturdays Board. She is passionate about SQL Server and
the SQL Server community, doing anything she can to give back.



CONTRIBUTOR INSIGHTS

(Don’t) Follow the Hype


By Daniel Stori, Software Development Manager at AWS

Daniel Stori, Software Development Manager at Amazon


@Daniel Stori on DZone | @dstori on LinkedIn | @turnoff_us on Twitter | turnoff.us

Passionate about computing since writing my first lines of code in Basic on Apple 2, I share my time
raising my young daughter and working on AWS Cloud Quest, a fun learning experience based on 3D
games. In my (little) spare time, I like to make comics related to programming, operating systems, and
funny situations in the routine of an IT professional.



CONTRIBUTOR INSIGHTS

Data Management Patterns for Microservices
By Abhishek Gupta, Principal Developer Advocate at AWS

One of the key components of microservices is how to manage and access data. The means to do that are different compared
to traditional monolithic or three-tier applications. Some patterns are quite common, but others are specific and need to be
evaluated before being incorporated into a solution. We will briefly go over some of these common database patterns for
microservices before exploring CQRS (including how it differs from CRUD) and, finally, look at how it can be combined with
event sourcing.

Common Database Patterns for Microservices


There are multiple patterns for using databases in the context of microservices. In this section, we will cover a few, starting with
one of the most common patterns.

DATABASE PER MICROSERVICE


Instead of "one size fits all," using a database per microservice helps ensure that each service can use the database per its
requirements based on data storage, modeling, etc. There are situations where a relational database is the perfect fit, while
other use cases will benefit from a key-value or even document-based (JSON) database. The details of the underlying database
are abstracted by a service-specific API, ultimately leading to flexible and loosely coupled architectures.
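A minimal sketch of that abstraction (service names and storage choices below are illustrative, not prescriptive): each service hides its own store behind a service-specific interface, so callers never know whether the data lives in a relational table or a key-value store.

import sqlite3


class OrderService:
    """Owns a relational store: orders benefit from transactions and joins."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")

    def place_order(self, order_id, total):
        with self._db:  # commit on success
            self._db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))

    def get_order(self, order_id):
        return self._db.execute(
            "SELECT id, total FROM orders WHERE id = ?", (order_id,)).fetchone()


class SessionService:
    """Owns a key-value store: sessions only ever need lookups by key."""

    def __init__(self):
        self._store = {}  # in-memory stand-in for Redis or a similar store

    def save(self, session_id, data):
        self._store[session_id] = data

    def load(self, session_id):
        return self._store.get(session_id, {})


orders = OrderService()
orders.place_order("o-1", 42.0)
print(orders.get_order("o-1"))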

API COMPOSITION
Each service has its own database and exposes its API thanks to the database per microservice pattern.

The API composition pattern introduces another level of abstraction, where an API composer component takes on the responsibility of querying individual services' APIs. This scatter-gather approach hides the underlying complexity from clients by providing a unified interface to client applications.
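A minimal composer sketch (service calls and field names are hypothetical, in-process stand-ins for real HTTP calls): the composer fans out to the individual service APIs and returns one merged response, so clients never query each service themselves.

from concurrent.futures import ThreadPoolExecutor


def customer_api(customer_id):   # stand-in for GET /customers/{id}
    return {"id": customer_id, "name": "Ada Lovelace"}


def order_api(customer_id):      # stand-in for GET /orders?customer={id}
    return [{"order_id": "o-1", "total": 42.0}]


def customer_overview(customer_id):
    """The composer: scatter the service calls in parallel, gather one payload."""
    with ThreadPoolExecutor() as pool:
        customer = pool.submit(customer_api, customer_id)
        orders = pool.submit(order_api, customer_id)
        return {"customer": customer.result(), "orders": orders.result()}


print(customer_overview("c-42"))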

SAGA
This is an advanced pattern that helps overcome the constraints introduced by the database per microservice pattern and the distributed nature of microservices architectures in general. Business workflows in these architectures often need to interact with multiple services (and their respective databases), making transactional workflows (with ACID compliance) difficult to implement. The Saga pattern involves orchestrating multiple local (service-specific) transactions and executing compensating transaction(s) to undo them in case of failure.
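A rough sketch of an orchestration-style saga (step names are invented for illustration): every local transaction registers a compensating action, and if a later step fails, the completed compensations run in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callables. Returns True on success."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception as err:
            print(f"Saga failed ({err}); running compensations...")
            for undo in reversed(completed):
                undo()
            return False
    return True


def fail_shipping():
    raise RuntimeError("shipping service unavailable")


run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (fail_shipping,                      lambda: print("cancel shipment")),
])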

COMMAND QUERY RESPONSIBILITY SEGREGATION AND EVENT SOURCING


Like the saga pattern, Command Query Responsibility Segregation (CQRS) and event sourcing are relatively advanced
techniques that involve separating read and write paths. The next section will discuss more about CQRS and how it compares
with another widely used technique — create, read, update, delete (CRUD).

CQRS and CRUD


Let's get a basic understanding of these terminologies.

CRUD
CRUD-based solutions are commonly used to implement simple application logic. Take, for example, a user management
service that handles user registration, listing users, updating user information (enabling/disabling), and removing users. CRUD
is so attractive because it uses a single data store and is well understood, making it easy to embrace common architectural
patterns, such as REST over HTTP (with JSON being a widely used data format).



CQRS
In contrast to CRUD-based solutions, CQRS is about using different data models for read and write operations — within a single database or across multiple databases. This allows for managing the read/write portions of the application independently. For example, you can have separate database tables for read/write operations or leverage read replicas to scale out read-heavy applications.

Figure 1: CQRS (read and write models backed by the same or different data stores)

CRUD AND CQRS: DIFFERENCES


Let's learn more about CQRS and, in the process, understand how it's different from CRUD by examining various characteristics.

TYPE OF DATA STORES


As mentioned earlier, it's possible to implement CQRS with single or multiple data stores. In the case of a single database, you
might use techniques such as separate read/write tables and read replicas. Alternatively, you could leverage different databases
to cater to specific requirements of the read/write paths of your operation — for example, you could use a relational database
with ACID semantics to handle low to medium write workloads and use an in-memory cache (like Redis) to serve read requests.

SYNCHRONOUS OR ASYNCHRONOUS
CRUD operations are mostly executed in a synchronous way, although, depending on the programming language and client
library, you can have asynchronous implementation on the client side. CQRS also can be implemented in a synchronous
way, but that's rare. Because CQRS involves multiple data stores (same or different databases), it benefits from a mechanism wherein the read model is updated asynchronously in response to changes in the write data store. For example, user information inserted (or updated) in an RDBMS table results in a corresponding update to Redis, which serves low-latency reads at high scale/volume. But this approach forces us to think about another important attribute — consistency.
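Before turning to consistency, here is a minimal sketch of that asynchronous flow. The in-memory structures and queue below are stand-ins for an RDBMS, a cache such as Redis, and whatever change-propagation mechanism (events, CDC) is actually used:

import queue
import threading

write_store = {}          # stand-in for the RDBMS (write model)
read_model = {}           # stand-in for a low-latency cache (read model)
changes = queue.Queue()   # stand-in for an event stream or CDC feed


def update_user(user_id, name):
    """Command side: mutate the system of record, then publish the change."""
    write_store[user_id] = {"id": user_id, "name": name}
    changes.put((user_id, name))


def projector():
    """Applies changes to the read model asynchronously (eventually consistent)."""
    while True:
        user_id, name = changes.get()
        read_model[user_id] = name
        changes.task_done()


threading.Thread(target=projector, daemon=True).start()

update_user("u-1", "Ada")
changes.join()            # only for this demo: wait until the projection catches up
print(read_model["u-1"])  # the query side reads the read model, never the write store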

CONSISTENCY
Since CRUD systems are synchronous in nature and benefit from ACID support (in relational databases), they get strong consistency
(almost for free). With CQRS implementations, your architecture needs to embrace eventual consistency due to its asynchronous
nature and make sure your applications can tolerate reading (possibly) stale data, even if only for a short period of time.

SCALABILITY
The fact that CQRS solutions can leverage multiple data stores and can operate asynchronously allows them to be more
scalable. For example, leveraging separate data stores optimizes for high-read volumes if there is a need to do so. Scalability of
CRUD solutions is constrained by the single data store and a synchronous mode of operation.

COMPLEXITY
As obvious as this might sound, it's quite complex to architect applications using the CQRS pattern. As mentioned previously,
you may need to use multiple data stores, implement asynchronous communication between them, and also tackle eventual
consistency and its caveats. CRUD operations are well understood and widely used, so there often is great support, such as
code-generators, object-relational modeling libraries, etc., which significantly eases the development process.

Here is a table summarizing the differences:

Table 1

Characteristic | CQRS | CRUD
Type of data store | Single or multiple data stores | Single data store
Synchronous or asynchronous | Asynchronous (mostly) | Synchronous
Consistency | Eventual consistency | Strong consistency
Scalability | High | Limited
Complexity | High | Relatively low



CQRS and Event Sourcing — Better Together?
With event sourcing, the "C" in CQRS (which stands for command) comes alive. The write part of the system is now responsible
for storing commands in an event store while the read part is all about having a denormalized form of the data — this form is
also referred to as materialized views that support specific queries for a UI or application. An event sourcing- and CQRS-based
solution might have several such views across multiple data stores. Often, these patterns are natively supported by data stores
themselves, such as Postgres, Cassandra, etc.

Events are not ordinary data — these represent actions in your system; hence, the write model leverages an append-only,
immutable data store. Sometimes, event sourcing solutions also leverage streaming platforms, such as Apache Kafka, in the
write path instead of traditional SQL/NoSQL databases.

Figure 2: Event sourcing

Let's look at a simple example. Think of a limited subset of functionality in an application like Twitter — users send tweets,
follow each other, and see tweets from users they follow. A naive solution is one where the follower timelines are updated
synchronously when a user tweets. This is not scalable for a system where users can have millions of followers. A better
approach is to split this into different parts where "tweet sent" (the command) is used as an event that can trigger the
timeline update of followers, which can happen asynchronously. Followers can then see the tweet, which would be handled
by the "query" part of the solution and potentially backed by a different data store — the system has to tolerate eventual
consistency, which is acceptable for this particular use case.
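A stripped-down sketch of that split (data structures and names are invented for illustration): the command side only appends immutable events, and a projection, which would run asynchronously in a real system, builds the per-follower timelines that the query side reads.

from collections import defaultdict

event_store = []                         # append-only, immutable log of events
timelines = defaultdict(list)            # materialized view read by the query side
followers = {"alice": ["bob", "carol"]}  # sample follower graph


def send_tweet(user, text):
    """Command: record the fact that a tweet was sent; nothing else happens here."""
    event_store.append({"type": "tweet_sent", "user": user, "text": text})


def project_timelines():
    """Projection: fan events out to follower timelines (runs out-of-band in practice)."""
    timelines.clear()
    for event in event_store:
        if event["type"] == "tweet_sent":
            for follower in followers.get(event["user"], []):
                timelines[follower].append(f'{event["user"]}: {event["text"]}')


send_tweet("alice", "hello, world")
project_timelines()
print(timelines["bob"])   # the query side reads only the materialized view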

PROS AND CONS


Event sourcing and CQRS do give you a lot of power, but with that comes great responsibility. If you don't have a complicated
data model, stick to a CRUD-based solution to simplify your overall architecture.

Event sourcing involves capturing the commands/events (that have changed the system) in an append-only, immutable manner
— this implies that there is a possibility of replaying "historical" data to rebuild some/entire parts of the system if there is a need to
do so. However, this is not easy in practical scenarios, and such processes need to be planned out well to minimize downtime and
service disruption. If your application does not have such intensive requirements, it's a good idea to avoid event sourcing/CQRS.

Your architecture deals with events, and you can build multiple applications to handle them. For example, a user-created event
can trigger a send email handler while another handler can take care of a different operation (in parallel). At the same time, you
need to factor in error handling, retries, and eventual consistency. Be aware of these requirements, and don't use CQRS/event
sourcing if your application cannot tolerate eventual consistency.

Conclusion
As microservices have become the norm, many techniques and patterns have emerged to tackle their increased complexity.
CRUD-based solutions are widely used but have their limitations. Advanced patterns, such as CQRS and event sourcing, can
help with scalability, but you need to avoid falling into the premature optimization trap. Using the right tool/pattern for the job
goes a long way in long-term application maintainability.

Abhishek Gupta, Principal Developer Advocate at AWS


@abhirockzz on DZone

Over the course of his career, Abhishek has worn multiple hats including engineer, product manager, and
developer advocate. Most of his work has revolved around open-source technologies including distributed
data systems and cloud-native platforms. Abhishek is also an open-source contributor and avid blogger.



CONTRIBUTOR INSIGHTS

Data Management in
Complex Systems
By Oren Eini, CEO at Hibernating Rhinos & Founder of RavenDB

It is somewhat of a cliché to consider the data in your systems as far more valuable than the actual applications that compose
it. Those applications are updated, mutated, retired, and replaced, but the data persists. For many enterprises, that data is
their most important asset. It used to be simple. You had the database for the organization. There was only one place where
everything the organization did and knew was stored, and it was what you went to for all needs. One database to administer,
monitor, optimize, backup, etc. — to rule the entire organization’s data needs.

As the organization grew, there was ever more data and therefore more needs and requirements added to the database. At
some point, you hit the absolute limits of what you can do. You cannot work with a single database anymore; you must break
your systems and your database into discrete components. In this article, I’m going to discuss how you can manage the growth
in your data scope and size.

The Death of the Shared Database: Why We Can't Get a Bigger Machine
While not commonplace, it’s by far not uncommon to work with databases in the terabyte range with billions of records
these days. So what is the problem? The issue isn’t with the technical limitations of a particular database engine. It is the
organizational weight of throwing everything (including at least two kitchen sinks and a birthday cake) into a single database.
At one company that I worked with, the database had a bit over 30,000 tables in it, for example. The number of views and stored procedures was much higher. We won’t talk about the number of triggers.

None of the tooling for working with the database expected to deal with this number of tables. Connecting to the database
through any GUI tool would often cause the tool to freeze for minutes at a time while it read the schema descriptions for
a short eternity. Absolutely no one actually had an idea of what was going on inside that database, but the data and the
processes around it were critical to the success of the organization. The only choice was to either stagnate in place or start
breaking the database apart into manageable chunks.

That was many years ago, and the industry landscape has changed. Today, when thinking about data, we have so many more
concerns to juggle, such as:

• Personal data belonging to a European citizen, which means that any data associated with them must also
physically reside in the EU and is subject to GDPR rules.

• Healthcare information (directly or indirectly), which has a whole new set of rules to follow (e.g., HIPAA, HITECH,
or ENISA rulings).

Concerns such as data privacy and provenance are far more important; being able to audit and analyze who accesses a particular data item and why can be a hard requirement in many fields. The notion of one bucket in which all the information
in the organization resides is no longer viable.

Another important sea change was common architectural patterns. Instead of a single monolithic system to manage
everything in the organization, we now break apart our systems into much smaller components. Those components have
different needs and requirements, are released on different schedules, and use different technologies. The sheer overhead
of trying to coordinate between all of those teams when you need to make a change is a big barrier when you want to make
changes in your system. The cost of coordination across so many teams and components is simply too high.

Instead of using a single shared database, the concept of independent application databases is commonly used. This is an
important piece of a larger architectural concept. You’ll encounter that under the terms microservices and service-oriented
architecture, typically.



The Application Database as an Implementation Decision
One of the most important distinctions between moving from a single shared database to a set of applications databases is
that we aren’t breaking apart the shared database. A proper separation at the database level is key. A set of shared databases
will have the exact same coordination issues, with too many cooks in the kitchen. Application databases, when properly
separated, will benefit us by allowing us to choose the best database engine for each task, localizing change, and reducing the
overhead in communicating changes. The downside of this approach is that we’ll have more systems to support in production.

Let’s talk in more depth about the distinction between shared databases and application databases. It’s easy to make the
mistake, as you can see in Figure 1, for example:

Figure 1: The wrong migration path from single shared database to multiple (still shared) databases

While a shared database is something that you implement because there isn’t another option, an application database is an
internal choice and isn’t meant to be accessible to anyone but the application. In the same sense that we have encapsulation in
object-oriented programming, with private variables that hide our state, the application database is very explicitly a concern of
no one outside of the application. I feel quite strongly in the matter.

When you write code, you know that directly working with the private state of other objects is wrong. You may be violating
invariants; you will complicate future maintenance and development. This has been hammered down so much that most
developers have an almost instinctual reluctance to do so. The exact same occurs when you directly touch another application’s
database, but that is an all-too-common occurrence.

In some cases, I resorted to encrypting all the names of the tables and columns in my database to make it obvious that you
are not supposed to be looking into my database. The application database is the internal concern of the application, no one
else. The idea is simple. If any entity outside of the application needs some data, they need to ask the application for that.
They must not go directly into the application’s database to figure things out. It’s the difference between asking "Who are you
talking to" and going over all their call logs and messages. In theory this is a great approach, but you need to consider that your
application is rarely the application for the system. You have to integrate with the rest of the ecosystem. The question is how
you do that.

If the system described here sounds familiar, it's because you have likely heard about it before. It began as part of DCOM/CORBA systems, then it was called service-oriented architecture, and nowadays it's referred to as microservices.

Let’s assume that the application that deals with shipping in our system needs to access some customer data to complete its
tasks. How would it go about getting that data? When using a shared database, it would query the customers' tables directly. When the
team that owns the customers’ application needs to add a column, or refactor the data, your system will be broken. There is no
encapsulation or separation between them. The path of direct dependencies on another team’s implementation details leads
to ruin, stagnation, and ever-increasing complexity. I’m not a fan, if you can’t tell.

Working With Global Data


Alternatively, the shipping application can (through a published service interface) ask the application that owns the customers’
data to fetch the details it needs. This is usually done through RPC calls from one application to the other. The issue here is
that this creates a strong tie between the two applications. If the customers' application is down for maintenance, the shipping



application will not work. Compound that with a few dozen such applications and their interdependencies, and you have a
recipe for a gridlock. We need to consider a better way to approach this situation.

My recommendation is to go about the whole process in the other direction. Instead of the shipping application querying the
customers’ application for relevant data, we’ll inverse things. As part of the service interface of the customers’ application, we
can decide what kind of information we want to make public to the rest of the organization.

It is important to note that the data we publish is absolutely part of the service contract. We do not provide direct access to our
database. The application should publish its data to the outside world. That can be a daily CSV file uploaded to an FTP site or a
GraphQL endpoint to choose two very different technologies and semantics.

I included CSV over FTP specifically to point out that the way this data share is done is not relevant. What matters is that there
is an established way for the data to be published from the application because a key aspect of this architectural style is that
we don’t query for the data at the time we need it. Instead, we ingest that into our own systems. I hope it is clear why the
shipping application won’t just open an FTP connection to the customers’ daily CSV dump file to find some details. In the same
sense, it shouldn’t be querying the GraphQL endpoint as part of its normal routine.

Instead, we have an established mechanism whereby the customers’ data (that the customers’ application has made public to
the rest of the organization) is published. That is ingested by other applications in the system and when they need to query on
customers’ details, they can do that from their own systems. You can see how this looks in Figure 2.

Figure 2: The customers’ application publishing data for use by the shipping application

In each application, the data may be stored and represented in a different manner. In each case, that would be whatever is optimal for their scenarios. The publishing application can also work with the data in whatever manner they choose. The service boundary between the database and the manner in which the data is published allows the freedom to modify internal details without having to coordinate with external systems.
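A minimal sketch of the publish-and-ingest flow (field names and the JSON file hand-off are illustrative; the transport could just as well be a message bus or a GraphQL endpoint): the customers' application publishes only the fields in its contract, and the shipping application copies them into its own local store.

import json

# --- Customers' application: publish only the contract fields --------------
customers_internal = [
    {"id": "c-1", "name": "Acme Corp", "credit_score": 740},  # credit_score stays private
]
published = [{"id": c["id"], "name": c["name"]} for c in customers_internal]

with open("customers_feed.json", "w") as feed:
    json.dump(published, feed)

# --- Shipping application: ingest the feed into its own store --------------
with open("customers_feed.json") as feed:
    local_customers = {c["id"]: c["name"] for c in json.load(feed)}

# Later lookups are local; no runtime call to the customers' application.
print(local_customers["c-1"])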

Another option is to have a two-stage process, as shown in Figure 3. Instead of the customers’ application sending its
updates to the shipping application, the customers’ application will send it to the organization’s data lake. In this manner, each
application is sending the data they wish to make public to a central location. And other applications can copy the data they
need from the data lake to their own database.

Figure 3: Publishing to a data lake from each application and pulling data to each application



The end result is a system where the data is shared, but without the temporal dependencies between applications and
services. It also ensures the boundary between different teams and systems. As long as the published interfaces remain the
same, there is no need for coordination or complexity.

Applying This Architecture Style in the Real World


Let's dive into some concrete advice on how to apply this architectural approach. You can publish the data globally by emitting
events on a service bus or by publishing daily files. You can publish the data for a specific scenario, such as an ETL process from
the customers’ database to the shipping’s database. The exact how doesn’t matter, as long as we have the proper boundary in
place and we can change how we do it without incurring global coordination costs.

This style of operation only works when we need to reference data or make decisions on data where consistency isn’t relevant.
This approach is not relevant if we need to make or coordinate changes to the data. A great example of a scenario where
consistency doesn’t matter is looking up the customer’s name from their ID. If we have the old name, that isn’t a major issue.
It will fix itself shortly, and we don’t make decisions based on the customer's name. At the same time, we can run all our
computations and work completely locally within the scope of the application, which is a great advantage.

Consistency matters when we need to make a decision or modify the data. For example, in the shipping scenario, if we need
to charge for overweight shipping, we need to ensure that the customer has sufficient funds in their account (and we need to
deduct the shipping amount). In this case, we cannot operate on our own data. We don’t own the funds in the account, after
all. Nor are we able to directly change it. In such a case, we need to go to the customers’ application and ask to deduct those
funds and raise an error if there are insufficient funds. Note that we want the shipping process to fail if the customer cannot
pay for it.
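A small sketch of that contrast (function and field names are invented): the customer's name comes from the locally ingested copy, which may be slightly stale, while deducting funds is a command sent to the owning application, and shipping fails if that command is rejected.

class InsufficientFunds(Exception):
    pass


def deduct_funds(customer_id, amount, accounts):
    """Stand-in for a call to the customers' application's service interface."""
    if accounts.get(customer_id, 0.0) < amount:
        raise InsufficientFunds(customer_id)
    accounts[customer_id] -= amount


def ship_overweight(customer_id, surcharge, local_names, accounts):
    name = local_names.get(customer_id, "unknown")  # local copy: staleness is fine
    deduct_funds(customer_id, surcharge, accounts)  # remote command: must succeed
    return f"shipment released for {name}"


accounts = {"c-1": 10.0}
print(ship_overweight("c-1", 7.5, {"c-1": "Acme Corp"}, accounts))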

Our applications are no longer deployed to a single server or even a single data center. It is common today to have applications
that run on edge systems (such as mobile applications or IoT devices). Pushing all that data to our own systems may cause us
to store a lot of data. This architecture style of data encapsulation and publishing only the details we want to expose to other
parties plays very well in this scenario.

Instead of having to replicate all the information to a central location, we can store the data in the edge and accept just enough
data from the edge devices to be able to make decisions and operate the global state of the system. Among other advantages,
this approach keeps the user in control of all of their data, which I consider a major plus.

Closing Thoughts
There are a few reasons to use application databases and explicit publishing of data in your architecture. First, it means that
the operations are running with local resources and minimal coordination. That, in turn, means that the operations are going
to be faster and more reliable. Second, it reduces the coordination overhead across the board, meaning that we can deploy and
change each application independently as the need arises.

Finally, it means that we can choose the best option for each scenario independently. Instead of catering for the lowest
common denominator, we can select the best of breed for each option. For example, we can use a document database to store
shipping manifests but throw the historical data into a data lake.

Each one of the applications is independent and isolated from one another, and we can make the best technical choice for
each scenario without having to consider any global constraints. The result is a system that is easier to change, composed of
smaller components (hence simpler to understand), and far more agile.

Oren Eini, CEO at Hibernating Rhinos


@ayende on DZone | @ayende on Twitter | @ravendb on LinkedIn | ayende.com/blog

Oren Eini, pseudonym Ayende Rahien, is a frequent blogger at ayende.com and has over 20 years of experience in the development world. Oren has led the RavenDB project since 2009 and has been
recognized as a Microsoft MVP since 2007. He published the books Inside RavenDB and DSLs in Boo:
Domain Specific Languages in .NET by Manning Publications.



CONTRIBUTOR INSIGHTS

Time Series Compression Algorithms and Their Applications
By Rosana de Oliveira Gomes, Senior Data Scientist at HAKOM Time Series

Time series is present in our daily lives in multiple sectors of society, such as finance, healthcare, and energy management.
Some of these domains require high data volume so that insights from analysis or forecasting the behavior of target variables
can be obtained. Transferring and processing high data rates and volume across platforms with several users requires storage
and computer power availability. Compression techniques are a powerful approach to avoid overwhelming systems. In what
follows, time series compression algorithms will be discussed along with their role in real-world applications in different sectors.

What Is Time Series?


Time series is defined as a sequence of values of a quantity obtained at successive times, often with equally spaced intervals.
We experience the use of timestamped data from when we monitor our exercises with a fitness app to when we track our
pizza delivery traveling through the city all the way to our doorstep. Time series is relevant to problems when understanding
the evolution of a variable over time is needed, such as understanding the time profile of a variable or forecasting their values.

Time series most commonly appears in the form of timestamped numerical data in a tabular format. Audio data itself is
already represented as a time series, as it is defined in terms of frequencies. However, although time series are themselves
a data type, they can also be combined with other data types in order to produce more complex entities, which contain an
embedded temporal aspect such as:

• A sequence of images over time defines a video.


• A time series of coordinate pairs from geospatial data defines a tracking path.

Time series data requires specific techniques in order to obtain insights not only from the different patterns in data, but also
among those over time. In a data science pipeline, these techniques are employed during the preprocessing, analysis, and
modeling steps. The most common time series techniques are illustrated in Figure 1 below:

Figure 1: Time series methods

PREPROCESSING
• Outlier detection: anomalous data points at specific times and anomalies among different time series
• Imputation: replacing missing values with the average of lagged time (e.g., average of the last weekend)
• Calendar: merging a dataset to identify patterns of specific dates, such as weekends or holidays

ANALYSIS
• Decomposition: identifying trend, seasonality, cycles, and noise in the time series
• Correlation: a time series' autocorrelation with itself over time and correlation with other variables
• Moving averages: mean of a given set of values over a specified period — used for smoothing a time series

MODELING
• Feature engineering: creating new variables from lagged values, such as "week before value"
• Supervised learning: forecasting the qualitative and quantitative behavior of a time series; classifying which class a time series belongs to
• Unsupervised learning: identifying similar temporal behavior and metadata patterns for different time series



Data Compression
Data compression is the process of transforming data in order to reduce the number of bits necessary to represent it. This
process is done via the alteration of the data through encoding or the rearrangement of its bits structure. Compression is a
valuable technique utilized in scenarios where resources are crucial for storage, processing, and transmission of data. Two
extremes of such scenarios are:

• Limited resources scenario, where the available storage and processing are limited by costs
• Big data scenario, where a high-frequency data influx requires efficient data management

The process of data compression involves encoding the data into a smaller format. In order to perform the reverse process, a
decoder is needed to decompress the data, as illustrated in Figure 2:

Figure 2: Data compression scheme

Compression expresses the same information present in the data in a smaller format with less bits. Since it searches for
patterns in data that can be encoded, it is a computationally expensive procedure that may demand time and memory. State-
of-the-art compression techniques are:

• Lossless compression identifies and removes data redundancy in a way that no information is lost in the process. The
decoded data is restored exactly to its original state.
– Common uses: databases, emails, spreadsheets, documents, and source code
• Lossy compression identifies and permanently removes redundant data bits, not making it possible to recover the
original data after decompression.
– Common uses: audio, images, graphics, videos, and scanned files

The trade-off between accuracy and compression, present in the attempt to preserve the data while still addressing the
storage bound, is an often-encountered challenge in data compression. Lossless data compression can only shrink data to a certain extent, with the Shannon entropy of the data as the theoretical limit. For high-frequency data, lossy compression is needed in order to
perform an effective reduction in size.

Table 1

ADVANTAGES AND DISADVANTAGES OF DATA COMPRESSION

Advantages of compression:
• Reduction of file size and storage usage costs
• Increased data reading/writing speed due to the reduction of memory gaps during disk storage
• Faster file transfer via the internet, requiring less computational resources
• Algorithms can be used to approximate and/or predict the data, as well as identify noise

Disadvantages of compression:
• Time consuming for large data volumes
• Algorithms need intensive processing from the system, which becomes costly for large data volumes
• Quality of decompressed data may depend on the level of compression
• Requires a decoder program in order to decompress the files

Compression Methods for Time Series


The rise of big data and use of smart devices reveal a demand for powerful compression techniques able to fulfill the
processing needs of industries that rely on time series data. In the case of high frequencies (around 10kHz), even databases
that specialize in time series data can get overloaded. Compression algorithms are widely explored due to their high value
returns. The quality of a compression technique is measured by its compression ratio (between compressed and original files),
speed (measured in cycles per byte), and the accuracy of the restored data.



Time series compression algorithms take advantage of specific characteristics in time series produced by sensors — such as
the fact that some time series segments often repeat themselves in the same or other related time series (redundancy), or the
possibility to recover a time series via approximating it by functions or predicting them through neural network models. The
state-of-the-art methods are listed below:

Table 2

TIME SERIES COMPRESSION ALGORITHMS

Dictionary-based
• Description: Represents time series through a series of common segments, using a dictionary to translate the segments into content. The segments' size determines accuracy and compression.
• Common methods: TRISTAN is an algorithm divided into a learning and a compression phase, with a dictionary that may contain typical patterns or that learns them from a training set. CORAD is an extension of the latter that considers autocorrelations to improve compression and accuracy.
• Performance: Effective for datasets with high redundancy; can be lossy or lossless.

Function approximation
• Description: Divides the time series into segments and applies a function to approximate each of them. Each method follows a different family of functions.
• Common methods: Piecewise polynomial approximation (PPA) and Chebyshev polynomial transform (CPT) are two lossy techniques that split a time series into several segments and fit polynomial functions to them. The Discrete Wavelet Transform (DWT) method approximates time series with wavelet functions.
• Performance: Suitable for smooth time series, low compression ratios, and high accuracy.

Sequential algorithms (see the sketch after this table)
• Description: Sequential combinations of several compression techniques. The most common building blocks are Huffman coding, delta encoding, run-length encoding, and Fibonacci binary coding.
• Common methods: Delta encoding, run-length, and Huffman (DRH) is a method that requires low computational power. Sprintz is designed for high decompression speed and low energy consumption. Run-Length Binary Encoding (RLBE) is developed for devices with low memory and computational resources. RAKE is an algorithm with a preprocessing and a compression phase that utilizes sparsity to compress the data.
• Performance: The majority of methods are lossless and computationally efficient; suitable for Internet of Things (IoT) devices that have limited computational resources.

Autoencoders
• Description: Neural network architectures composed of a symmetric pair of encoder and decoder, trained to generate an output that reproduces the input passed to them.
• Common methods: Recurrent Neural Network Autoencoder (RNNA) methods use a time-dependent neural network with lossy compression and a loss threshold parameter.
• Performance: Accuracy and compression ratio depend strongly on the ability of the RNN to find patterns in the training set.
Time Series Applications


The compression algorithm to be chosen for a certain problem depends on the domain of the application and data in question.
Applications of compression can be found in many sectors, with multimedia through the compression of images, video, and
audio data being the most popular. In particular, time series compression is used in crucial industries. Time series use cases in
different sectors and the highlights on compression in such applications are shown in Table 3. In all use cases, the advantages
presented in Table 1 are also applicable.

Table 3

TIME SERIES USE CASES

Medicine
• Use case: Monitoring of multiple life signals of patients, integrated into a warning system to guarantee full-time assistance.
• Compression added value: Faster data processing for performing calculations, such as triggering warnings.

Maintenance
• Use case: Monitoring of industrial equipment and further automated reporting of equipment status, ensuring safety and efficient production.
• Compression added value: Easier and cheaper storage of large data volumes, making it affordable for manufacturing companies to adopt data-driven solutions.

Energy
• Use case: Short-term forecast of energy consumption by smart meters.
• Compression added value: Encoding algorithms help gather insights from the data, like noise and behavior, making more accurate forecasts.

Economics
• Use case: Data collected at high frequencies reports the status of stock market statistics in real time.
• Compression added value: Faster transmission of information through a large network, permitting users to make decisions in real time.



Conclusion
We live in the era of big data, where over 250 exabytes of data are produced every day, from which a large portion is present in
the form of time series in a broad range of industries (note: 250 exabytes = 250 × 10^18 bytes). Time series compression techniques are a
powerful approach to efficiently collect, store, manipulate, and transfer data, which is crucial for database maintenance and the
implementation of robust data management pipelines.

The great adoption of smart devices has also increased the need for compression techniques that are suitable for scenarios of
low computational resources, as in IoT. This article presented both lossy and lossless techniques that are suitable for multiple
time series profiles and applications scenarios and discussed the limitations and strengths of such methods.

Rosana de Oliveira Gomes, PhD, Senior Data Scientist at HAKOM Time Series
@rogomes on DZone | @rosanaogomes on LinkedIn | @rogomes on GitHub

Rosana is a senior data scientist in the energy sector who transitioned to industry after an academic
career in astrophysics research. In her free time, she volunteers in AI for good initiatives, working both with
start-ups and NGOs. She is an advocate of inclusion, mentoring minorities into tech careers, and founded
the award-winning AI Wonder Girls team, which engages in projects on social impact.



CONTRIBUTOR INSIGHTS

Modern Enterprise
Data Architecture
By Dr. Magesh Kasthuri, Distinguished Member of Technical Staff at Wipro Limited

Data plays a vital role in conceptualizing the preliminary design for an architecture. You may want to decide the requirements for security, performance, and infrastructure to handle workload, scalability, and agility in the design. In this case, you need to understand the data models and how to handle architectural decisions, including data privacy and security, compliance requirements, the data volume to be handled, and user-handling requirements.

Figure 1: Data architecture as the foundational pillar for enterprise architecture

This is the reason that data-driven architecture is the driving factor for an enterprise design development. The modern
enterprise architectures that are referred to in this article include microservices, cloud-native applications, event-driven
solutions, and data-intensive solutions. The article intends to share modern enterprise data architecture perspectives, including
solution approaches and architectural models to develop new-age solutions catering to velocity, veracity, volume, and variety of
data handling services.

Polyglot Persistence and Database as a Service


A recent data architecture trend based on various case studies recommends the move toward polyglot persistence, which is the use of multiple types of data storage technologies within a single integrated architecture. An integrated architecture is needed to provide high performance for any type of data processing across different services. Typically used in cloud adoption, this kind of implementation is supported by Database as a Service (DBaaS). DBaaS is adopted for the following benefits:

• Cloud-based database management system (e.g., Amazon Aurora, Azure Cosmos DB, Google Spanner)
• High scalability
• Rapid provisioning
• Enhanced security in cloud architecture
• Suitable for large enterprise design
• Shared infrastructure
• Availability of monitoring and performance tuning tools

This type of polyglot-persistence-based microservices architecture helps to develop resilient, robust, and high-performing
architecture to support variant types of data services for different microservices (see Figure 2 on the next page).



Figure 2: Example architecture using polyglot persistence

In the example above, each microservice handles data at different capacities and performance requirements. Hence, based on these data access requirements, the database can be chosen to cater to performance, scalability, and the optimal data model to be stored in the system. This lets microservices form cost-effective and agile architectures in a polyglot persistence model.

Data Modeling in Modern Data Architecture


In traditional architecture development, data modeling is the simple task of deriving data elements from requirements,
depicting the relation between the entities through entity relationship (ER) diagrams, and defining the parameters (data
types, constraints, validations) around the data elements. This means that data modeling is done as a single-step activity in a
traditional architecture by defining the data definition language (DDL) scripts from requirements.

In modern enterprise data architecture, this is split into a multi-stage activity of conceptual, logical, and physical data modeling, as illustrated in Figure 3. In a data-driven architecture where data intensity is high, data modeling is a fundamental and crucial step (and, of course, time consuming). In such an architecture, data modeling is divided into three different types (a small end-to-end sketch follows this list):

Figure 3: Modern enterprise data modeling stages
1. Conceptual data model (CDM) – Derived from business
requirements to define "what" is handled in the data flow.
Usually, CDM is defined by business stakeholders (e.g.,
consultants, business owners, application analysts) and
data architects.

2. Logical data model (LDM) – Derived from the CDM to


drill down the logical relation between the entities and
detail the data types of the entities. Deals with "how" data
is handled in the data flow and is defined by business
consultants and data architects/engineers.

3. Physical data model (PDM) – Defines the actual


blueprint based on the LDM, which gets translated to data
scripts for execution in the live environment. This is the
crucial stage where the performance of data structure,
the transaction handling mechanism, and the tuning and
optimization of the data model are being carried out and typically handled by database administrators or data engineers.
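
As a minimal sketch of these stages, the hypothetical "Customer places Order" example below walks one model from conceptual to logical to physical form. The entities, attributes, and DDL are assumptions made for illustration rather than a model taken from this article, and SQLite is used only so the physical stage can actually be executed.

```python
# Hypothetical "Customer places Order" model taken through the three stages.
import sqlite3

# Conceptual (CDM): business view only -- "what" flows through the system.
#   Customer --places--> Order
#
# Logical (LDM): entities gain attributes, types, and keys -- "how" data relates.
#   Customer(customer_id PK, name, email UNIQUE)
#   Order(order_id PK, customer_id FK -> Customer, total, created_at)
#
# Physical (PDM): DDL tuned for the target engine (constraints, indexes, defaults).
PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(PHYSICAL_DDL)  # the PDM is what runs in the live environment
    tables = db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print([name for (name,) in tables])
```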

Data Intelligence
Data intelligence and data analytics are techniques used in modern enterprise data architectures, alongside NoSQL databases,
to handle big data as well as data-intensive application architectures. They involve one or more popular technology
solutions: cloud platforms for agility and scalability, AI/ML for advanced algorithms that build intelligence into data
processing, and big data platforms to handle the storage and analysis of the data.



Data intelligence handles data for visualization and analytics, while predictive intelligence focuses that data on the future
(forecasting). For example, how an enterprise's stock has trended so far is a question of data intelligence, while data
analytics uses that history to predict how it will change over the next year. Both use AI/ML and deep learning techniques,
and both read data from various sources, including data feeds, images, video streams, and audio extracts, to interpret and
prepare the results of data processing. Data intelligence helps different stakeholders, including business and technical
personas, interpret data visually.
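
A toy example of the history-to-forecast flow described above is sketched below, assuming invented monthly values and a simple least-squares trend line in place of the AI/ML models a production solution would use.

```python
# Toy history-to-forecast example using a least-squares trend line.
# The monthly values are invented; real solutions would use AI/ML models
# and much richer data sources (feeds, images, video, audio).

def linear_forecast(history, periods_ahead=1):
    """Fit y = a + b*x over the observed history and extrapolate."""
    n = len(history)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# "How has it trended so far?" is the data intelligence view of this series;
# projecting it forward is the analytics view.
monthly_values = [120, 132, 129, 141, 150, 158]
print(round(linear_forecast(monthly_values, periods_ahead=3), 1))
```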

When you develop a data intelligence solution, you need to have a self-management facility in the database system so that the
database is self-sufficient and able to automatically handle crisis situations. This is known as an autonomous database and is
part of the future of new-age data persistence systems.

Autonomous Database
A database acts as the brain for an IT application because it serves as the central store for data being transacted and
referenced in the application. Database administrators (DBAs) handle database tuning, security activities, backup, DR activities,
server/platform updates, health checks, and all other management and monitoring activities of databases.

When you use a cloud platform for application and database development, the aforementioned activities are critical for
better security, performance, and cost efficiency. The important aspect here is to operationalize them by reducing effort and
making them more proactive in nature. Oracle coined the term "autonomous database" for a database that automates many of the
DBA's activities in managing the database platform and reduces human intervention.

Data Mesh
Traditionally, data was monolithic in nature, with all domains kept in a single data store, while effective data partitioning
and data solutioning were done through data warehouse and data lake solutions. Data lakes are more efficient for
data management and modern data analytics, which cater to agile data architectures, but the way data is accessed in a data
lake lacks a federated or autonomous approach.

Figure 4: Data mesh architecture

As shown in Figure 4, modern enterprise data architectures address unified data solutions with a data mesh, which applies the
microservices pattern to data stores. A data mesh replicates a service mesh in terms of features: where a service mesh
creates a proxy to interface between services, a data mesh creates a proxy for data abstraction and interfacing for consuming
applications like data analytics, dashboards, and data querying applications.

Data mesh architectures help to develop multi-dimensional data solutions to handle an operational data plane and an
analytical data plane together in a unified architecture without the need for developing two distinct data solutions.
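
One way to picture the proxy idea is the sketch below, which assumes a hypothetical DataProduct interface: each domain publishes its data behind the same contract, and a consumer such as a dashboard composes several data products without knowing how each domain stores its data.

```python
# Hypothetical data mesh proxy: each domain publishes a "data product" behind
# a shared interface, and consumers compose products without touching the
# underlying domain stores directly. Interface and domain names are assumptions.
from abc import ABC, abstractmethod


class DataProduct(ABC):
    """Abstraction a consuming application (dashboard, analytics job) talks to."""

    @abstractmethod
    def query(self, since):
        """Return rows produced by the owning domain since the given date."""


class OrdersDataProduct(DataProduct):
    def query(self, since):
        # In reality, this would read from the orders domain's operational store.
        return [{"order_id": 1, "total": 99.0, "created_at": since}]


class ClickstreamDataProduct(DataProduct):
    def query(self, since):
        # In reality, this would read from an analytical store or event stream.
        return [{"page": "/pricing", "clicks": 4213, "day": since}]


def build_dashboard(products, since):
    """Consumer-side composition: no knowledge of how each domain stores data."""
    rows = []
    for product in products:
        rows.extend(product.query(since))
    return rows


print(build_dashboard([OrdersDataProduct(), ClickstreamDataProduct()], "2022-01-01"))
```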



Heterogeneous Data Management Using Lakehouse Architecture
For data analytics and intelligent data management, we prefer to use a data lake solution or a data warehouse solution, but
the two organize and manage data in different ways. A data warehouse handles relational data (the raw data feed) and processed
data (after data ingestion), which is organized in a schema structure before it gets stored in a data storage service (data
enrichment). Therefore, data analytics works on cleansed data.

A data warehouse is a costly form of storage, but it's faster at query processing since it handles schema-based structured data
and is suitable for data intelligence, batch processing, and data visualization in real time.

A data lakehouse architecture is a hybrid approach that handles heterogeneous data management using the following:

• Data lake
• Data warehouse
• Purpose-built store for intermediate data handling
• Data governance mechanism for better data handling policies
• Data integrity services

A data lakehouse architecture overcomes the shortcomings of both data lake and data warehouse solutions and, hence, is
increasingly popular in modern enterprise data solutions like lead generation and market analysis using data feeds from
various sources.

A lakehouse architecture can handle inside-out data movement, from data stored in a data lake to a set of extracted data in
a purpose-built store for analytics or querying activities. Outside-in data movement, from a data warehouse to a data lake, can
help run analytics on a complete dataset. With a lakehouse architecture, we can also handle data both for massively parallel
processing (MPP), as in data warehouse applications, and for high-velocity data querying, as in data lake applications.
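
The inside-out movement can be sketched as follows, with JSON strings standing in for raw files in the data lake and SQLite standing in for the purpose-built analytical store; the formats, schema, and aggregation are illustrative assumptions rather than a prescribed lakehouse implementation.

```python
# Toy "inside-out" movement: raw events land in the lake, a curated extract is
# pushed into a purpose-built store for fast querying. JSON strings stand in for
# lake files, and SQLite stands in for the purpose-built/warehouse store.
import json
import sqlite3
from collections import Counter

# Stand-in for raw, semi-structured files sitting in the data lake.
lake_files = [
    '{"user": "a", "event": "view"}',
    '{"user": "a", "event": "buy"}',
    '{"user": "b", "event": "view"}',
]

# Curate: aggregate the raw events into an analytics-friendly shape.
counts = Counter(json.loads(line)["event"] for line in lake_files)

# Load the extract into the purpose-built store for querying.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE event_counts (event TEXT PRIMARY KEY, total INTEGER)")
store.executemany("INSERT INTO event_counts VALUES (?, ?)", counts.items())
store.commit()

print(store.execute("SELECT * FROM event_counts ORDER BY total DESC").fetchall())
```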

Conclusion
Modern enterprise data practices elevate the approach toward building a resilient, agile, and scalable architecture. These
approaches will address performance efficiency, operational excellence, high availability, and security/compliance requirements
for building integrated data solutions by adopting technologies and frameworks such as polyglot persistence, modern data
modeling stages, data lakes, and data meshes.

Dr. Magesh Kasthuri, Distinguished Member of Technical Staff at Wipro Limited


@magesh678 on DZone | @magesh-kasthuri on LinkedIn

Dr. Magesh has published more than 50 technical articles in popular magazines and journals and has
written more than 800 technical blogs on topics like AI/ML, cloud, blockchain, enterprise architecture,
and metaverse on LinkedIn with the tag #shorticle. Dr. Magesh did his PhD in deep learning and genetic
algorithms. The article expresses the opinion of the author and doesn’t express the opinion of his organization.



ADDITIONAL RESOURCES

Diving Deeper Into Database Systems
BOOKS

Practical Time Series Analysis: Prediction With Statistics & Machine Learning
By Aileen Nielsen
With the rise of the digital age, continuous monitoring and data collection have become more common, thus sparking the need
for time series analysis with statistical and ML techniques. Author Aileen Nielsen introduces this concept and gives guidance
so that the reader can discover, monitor, simulate, and manage time series data confidently.

Database Concepts, 9th Edition
By David M. Kroenke, David J. Auer, Scott L. Vandenberg, and Robert C. Yoder
Database Concepts dives into how to create and manage small databases while remaining neutral, meaning that everything you
learn in this book will be concept-based and applicable — no matter the software you decide to use. With three complete sample
databases and three ongoing projects, you will learn to apply the concepts and techniques to real-world situations.

REFCARDS

MySQL Essentials
This Refcard contains all things MySQL. From MySQL's most important applications, popular features, common data types, and
commands to how to get started on Linux, this Refcard is a must-read for all developers, DBAs, and other tech professionals
working in MySQL.

Getting Started With Distributed SQL
NoSQL distributed databases have become common as they are built from the ground up to be distributed, yet they force
difficult design choices to meet the need for scale. This Refcard serves as a reference to the key characteristics of distributed
SQL databases, how functionality compares across database offerings, and the criteria for designing a proof of concept.

TREND REPORTS

Data Pipelines: Ingestion, Warehousing, and Processing
In this Trend Report, we review the key components of a data pipeline, propose solutions to common data pipeline design
challenges, dive into engineered decision intelligence, and offer an assessment of the best way to modernize testing with
data synthesis. Our goal is to provide insights into and recommendations for the best ways to accept, store, and interpret data.

Data Persistence
As data management tools and strategies have matured rapidly in recent years, the complexity of architectural and
implementation choices has also intensified, creating unique challenges and opportunities for those who are designing
data-intensive applications. This Trend Report examines the current state of the industry, with a specific focus on effective
tools and strategies for data storage and persistence.

MULTIMEDIA

SQL Server Radio
Anyone interested in the SQL Server platform should check out this podcast. Covering a wide array of database platforms and
data-related technologies, hosts Guy Glantser and Eitan Blumin will teach you something new or give you a quick topic
refresher while making you laugh.

Data Engineering Podcast
Host Tobias Macey, manager of the Technical Operations team at MIT Open Learning, dives deep into data management every week
by discussing new approaches, edge cases, technical platforms, and lessons learned. From scaling to data lakes, you'll learn
something new with each podcast and your time will always be well spent.

Postgres FM
This weekly podcast gets into the nitty gritty details of PostgreSQL. Hosts Michael Christofides and Nikolay Samokhvalov guide
you through this side of the database world, covering topics like NULLs, queries, and how to become a DBA.

What to Consider When Moving IoT Data Into Cloud Platforms
The trend is clear: enterprises are generating more and more IoT data, and they are choosing to lean on hyperscale cloud
players for analytics, data processing, and storage. In this webinar, the hosts look closely at five criteria to keep in mind
when crafting your IoT data movement strategy.



ADDITIONAL RESOURCES

Solutions Directory
This directory contains databases and database performance tools to help you store, organize, manage,
and query the data you need. It provides free trial data and product category information gathered
from vendor websites and project pages. Solutions are selected for inclusion based on several impartial
criteria, including solution maturity, technical innovativeness, relevance, and data availability.

DZONE'S 2022 DATABASE SYSTEMS SOLUTIONS DIRECTORY

2022 PARTNERS

Company | Product | Product Type | Availability | Website
Cockroach Labs | CockroachDB | Distributed SQL database | Open source | cockroachlabs.com/product
Couchbase | Couchbase Capella | Database as a Service | Trial period | couchbase.com/products/capella
DataStax | Astra DB | Multi-cloud DBaaS | Trial period | datastax.com/products/datastax-astra
DataStax | Astra Streaming | Multi-cloud streaming as a service | Trial period | datastax.com/products/astra-streaming
DataStax | DataStax Enterprise | Open hybrid NoSQL database | Trial period | datastax.com/products/datastax-enterprise
InfluxData | InfluxDB Cloud | Time series data platform | Free tier | influxdata.com/cloud
InfluxData | Telegraf | Data collection agent | Free tier | influxdata.com/telegraf
Percona | Percona Monitoring and Management | Database management | Open source | percona.com/solutions/database-observability-monitoring-and-management
ScyllaDB | ScyllaDB Cloud | NoSQL DBaaS | Trial period | scylladb.com/product/scylla-cloud
ScyllaDB | ScyllaDB Enterprise | Enterprise NoSQL database | Trial period | scylladb.com/product/scylla-enterprise
ScyllaDB | ScyllaDB Open Source | Open source NoSQL database | Open source | scylladb.com/open-source-nosql-database

Company | Product | Product Type | Availability | Website
4D | 4D | RDBMS | Trial period | us.4d.com
Actian | Avalanche Cloud Data Platform | Hybrid cloud data platform | Trial period | actian.com/analytic-database/avalanche
Actian | Actian NoSQL | Object database | Trial period | actian.com/data-management/nosql-object-database
Actian | Actian DataConnect | Low code data integration platform | By request | actian.com/data-integration/dataconnect-integration
Actian | Actian X | Transactional database | Free tier | actian.com/data-management/actian-x-hybrid-rdbms
Aerospike | Aerospike Database 6 | In-memory, distributed key-value NoSQL DBMS | Open source | aerospike.com/products/database
Altibase | Altibase Enterprise DB | RDBMS | Open source | altibase.com
Amazon Web Services | Amazon DynamoDB | NoSQL database service | Free tier | aws.amazon.com/dynamodb
Amazon Web Services | Amazon SimpleDB | NoSQL data store | Free tier | aws.amazon.com/simpledb



Apache Foundation | Apache Cassandra | KV, wide column | Open source | cassandra.apache.org
Apache Foundation | Apache HBase | Non-relational distributed database | Open source | hbase.apache.org
Apache Foundation | Apache Ignite | In-memory, distributed DBMS | Open source | ignite.apache.org
Apache Foundation | Apache OpenJPA | ORM | Open source | openjpa.apache.org
Apple | Core Data | ORM | Free | developer.apple.com/documentation/coredata
ArangoDB | ArangoDB | Managed graph database, document store, search engine | Open source | arangodb.com
ArangoDB | ArangoDB Oasis | Cloud service for ArangoDB | Trial period | cloud.arangodb.com
Atlassian | Active Objects | ORM | Free | developer.atlassian.com/server/framework/atlassian-sdk/active-objects
CakeDC | CakePHP | ORM | Open source | cakedc.com/cakephp
Cambridge Semantics | Anzo Platform | Data integration and analytics | By request | cambridgesemantics.com/anzo-platform
Cloud Foundry | Cloud Foundry | Multi-cloud application PaaS | Open source | cloudfoundry.org
Cloudera | Cloudera Data Platform | Hybrid data platform | By request | cloudera.com/products/cloudera-data-platform.html
CUBRID | CUBRID | SQL-based RDBMS | Open source | cubrid.org
CyberGRX | CyberGRX | Third-party risk management using data analytics | By request | cybergrx.com
DAS42 | DAS42 | End-to-end data consulting and implementation | By request | das42.com
Dassault Systèmes | NuoDB | Distributed SQL database | Free tier | 3ds.com/nuodb-distributed-sql-database
DBmaestro | Database Release Automation Platform | Continuous delivery for DBs | By request | dbmaestro.com/database-release-automation
DbVisualizer | DbVisualizer | Universal database tool | Free tier | dbvis.com
Delphix | Delphix | DevOps test data management | By request | delphix.com
Eclipse Foundation | EclipseLink | ORM | Open source | eclipse.org/eclipselink
Embarcadero | InterBase | SQL database | Free tier | embarcadero.com/products/interbase
EnterpriseDB | BigAnimal | Fully managed PostgreSQL DBaaS in the cloud | Trial period | enterprisedb.com/products/biganimal-cloud-postgresql
EnterpriseDB | EDB Postgres Advanced Server | RDBMS | Trial period | enterprisedb.com/products/edb-postgres-advanced-server-secure-ha-oracle-compatible
FairCom | FairCom DB | Multi-model database | By request | faircom.com/products/faircom-db
Firebird | Firebird | RDBMS | Open source | firebirdsql.org
Fivetran | Fivetran | ELT | Trial period | fivetran.com
Google Cloud | BigTable | Key-value, NoSQL database service | Trial period | cloud.google.com/bigtable
Google Cloud | Cloud Spanner | RDBMS | Trial period | cloud.google.com/spanner
Google Cloud | Firestore | Cloud-native document database | Trial period | cloud.google.com/firestore
Google Cloud | BigQuery | Multi-cloud data warehouse | Trial period | cloud.google.com/bigquery



Graphite | Graphite | Time series data monitoring tool | Open source | graphiteapp.org
GridGain | GridGain Nebula | Cloud-native management and monitoring for Apache Ignite | Trial period | gridgain.com/products/gridgain-nebula
Hazelcast | Hazelcast Platform | Stream processing and in-memory data storage platform | Open source | hazelcast.com/products/hazelcast-platform
Hazelcast | Hazelcast Viridian | In-memory, data grid, cloud-managed service | Free tier | hazelcast.com/products/viridian
Hibernating Rhinos | RavenDB | Non-relational document database | Free tier | ravendb.net
IBM | Db2 | RDBMS | Free tier | ibm.com/db2
IBM | Informix | Embeddable database for IoT data | Free tier | ibm.com/products/informix
InterSystems | InterSystems IRIS | Cloud-first data platform | Free tier | intersystems.com/data-platform
JOOQ | JOOQ | ORM for Java | Free tier | jooq.org
Liquibase | Liquibase | Database version control | Open source | liquibase.com
Mammoth Analytics | Mammoth Analytics | Cloud-based data management platform | By request | mammoth.io
ManageEngine | Site24x7 CloudSpend | Cloud cost optimization | Trial period | site24x7.com/cloudspend
MariaDB | MariaDB Community Server | Relational database | Open source | mariadb.com/products/community-server
MariaDB | MariaDB SkySQL | Fully managed cloud database | Trial period | mariadb.com/products/skysql
MarkLogic | MarkLogic Server | Multi-model database | Free tier | marklogic.com/product/marklogic-database-overview
Micro Focus | Vertica | Hybrid cloud SQL database | Free tier | vertica.com
Microsoft Azure | Azure SQL Database | RDBMS | Trial period | azure.microsoft.com/en-us/products/azure-sql/database
Microsoft | Entity Framework Core | ORM for .NET | Free | docs.microsoft.com/en-us/ef
Microsoft | SQL Server 2019 | RDBMS | Trial period | microsoft.com/en-us/sql-server/sql-server-2019
MongoDB | MongoDB Enterprise Advanced | Self-managed database and services | By request | mongodb.com/products/mongodb-enterprise-advanced
MongoDB | MongoDB Atlas | Cloud database | Trial period | mongodb.com/atlas
Morpheus | Morpheus | Hybrid cloud application orchestration | By request | morpheusdata.com
MyBatis | MyBatis | ORM | Open source | blog.mybatis.org
Neo4j | Neo4j AuraDB | Fully managed graph DBaaS | Free tier | neo4j.com/cloud/platform/aura-graph-database
Nhibernate | Nhibernate | ORM for .NET | Open source | nhibernate.info
OpenText | OpenText Gupta SQLBase | Fully relational, high-performance, and embeddable web database | Trial period | opentext.com/products/gupta-sqlbase
Oracle | Oracle Autonomous Database | Automated database service | Trial period | oracle.com/autonomous-database
Oracle | MySQL HeatWave | RDBMS | Free tier | oracle.com/mysql
Oracle | Toplink | ORM | Free tier | docs.oracle.com/middleware/1213/toplink



OrientDB | OrientDB | DBMS | Open source | orientdb.org
OrmLite | OrmLite | ORM | Open source | ormlite.com
PostgreSQL | PostgreSQL | RDBMS | Open source | postgresql.org
QuestDB | QuestDB | Time series database | Open source | questdb.io
Red Hat | Hibernate | ORM | Open source | hibernate.org
Redgate | Redgate Deploy | Database deployment automation | Trial period | red-gate.com/products/redgate-deploy
Redgate | SQL Toolbelt Essentials | Tools for SQL Server development and deployment | Trial period | red-gate.com/products/sql-development/sql-toolbelt-essentials
Redis Labs | Redis Enterprise Cloud | Fully managed serverless DBaaS | Trial period | redis.com/redis-enterprise-cloud/overview
Redis Labs | Redis Enterprise Software | Self-managed data platform | Trial period | redis.com/redis-enterprise-software/overview
Riak | Riak KV | Distributed NoSQL key-value data store | Open source | riak.com/products/riak-kv
Riak | Riak TS | Distributed NoSQL database for time series data | Open source | riak.com/products/riak-ts
Ruby on Rails | Active Record | ORM | Free | guides.rubyonrails.org/active_record_basics.html
Rudderstack | Rudderstack | Pipelines and APIs for data integration | Free tier | rudderstack.com
SAP | SAP HANA Cloud | DBaaS | Trial period | sap.com/products/technology-platform/hana.html
ScaleOut Software | ScaleOut StateServer | In-memory data grid | Trial period | scaleoutsoftware.com/products/stateserver
ScaleOut Software | ScaleOut In-Memory Database | In-memory database | Trial period | scaleoutsoftware.com/products/in-memory-database
SingleStore | SingleStoreDB Cloud | Distributed SQL database | Trial period | singlestore.com
Software AG | Adabas | DBMS | Trial period | softwareag.com/en_corporate/platform/adabas-natural.html
Solarwinds | Database Performance Monitor | Monitoring and optimization for open-source and cloud-native databases | Trial period | solarwinds.com/database-performance-monitor
Solarwinds | Database Performance Analyzer | Database performance tuning | Trial period | solarwinds.com/database-performance-analyzer
Splice Machine | Splice Machine | SQL RDBMS | Free tier | splicemachine.com/product/data-platform
SQLite | SQLite | SQL database engine | Open source | sqlite.org
Studio 3T | Studio 3T | GUI for MongoDB | Trial period | studio3t.com
SymmetricDS | SymmetricDS | Database replication software | Open source | symmetricds.org
Teradata | Teradata Vantage | Connected multi-cloud data platform | By request | teradata.com/Vantage
Timescale | TimescaleDB | Relational database for time series and analytics | Free tier | timescale.com
VMWare | Tanzu Gemfire | In-memory data and compute grid | Free tier | tanzu.vmware.com/gemfire
VMWare | Tanzu SQL | Relational DBaaS | Open source | tanzu.vmware.com/sql



Volt Active Data | Volt Active Data | Data platform for IoT | By request | voltactivedata.com
Yugabyte | YugabyteDB | Cloud native distributed SQL database | Open source | yugabyte.com/yugabytedb
Yugabyte | YugabyteDB Anywhere | Private DBaaS for enterprises | Trial period | yugabyte.com/anywhere

