Unit II NoSQL Data Management
A database that is "not SQL" is referred to as "Not Only SQL" or "Not SQL". The name NoSQL has become popular, despite the fact that a better term would be "NoREL" (non-relational). Carlo Strozzi coined the term "NoSQL" in 1998. Traditional RDBMSs employ SQL syntax to store and retrieve data for further processing. A NoSQL database system, on the other hand, refers to a family of database systems that may hold structured, semi-structured, unstructured, and polymorphic data.
NoSQL, also referred to as "not only SQL" or "non-SQL", is an approach to database design
that enables the storage and querying of data outside the traditional structures found
in relational databases. While it can still store data found within relational database
management systems (RDBMS), it just stores it differently compared to an RDBMS. The
decision to use a relational database versus a non-relational database is largely contextual,
and it varies depending on the use case.
Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure, such as a JSON document. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets.
NoSQL is also a type of distributed database, which means that information is copied and stored on various servers, which can be remote or local. This ensures the availability and reliability of data: if some of the data goes offline, the rest of the database can continue to run.
Today, companies need to manage large data volumes at high speeds with the ability to
scale up quickly to run modern web applications in nearly every industry. In this era of
growth within cloud, big data, and mobile and web applications, NoSQL databases provide
that speed and scalability, making them a popular choice for their performance and ease of use.
● Flexible Schema: NoSQL databases do not require a fixed schema the way SQL databases do. Documents in the same collection do not need to have the same set of fields and data types.
● High Availability: Unlike relational databases, which use primary and secondary nodes for fetching data, NoSQL databases use a masterless (peer-to-peer) architecture in which any node can serve requests.
● These Aggregate Data Models in NoSQL Database are used for storing the user
session data.
● Key Value-based Data Models are used for maintaining schema-less user profiles.
● Document Data Models are well suited for Blogging and Analytics platforms.
(iii) Column Family Model
Column family is an Aggregate Data Model in NoSQL databases, usually associated with Bigtable-style data models that are referred to as column stores. It is also called a two-level map, as it offers a two-level aggregate structure. In this model, the first level of the column family contains keys that act as row identifiers and are used to select the aggregate data, whereas the second-level values are referred to as columns.
Use Cases:
● Column Family Data Models are used in systems that maintain counters.
● These Aggregate Data Models in NoSQL are used for services that have expiring
usage.
Fig 2.1.2.6 UML Diagram for E-Commerce Site - Aggregate Data Models in NoSQL
The Data Model for customer and order would look like this.
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
In these Aggregate Data Models in NoSQL, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, then you should have separate aggregates for each order. It is very context-specific.
● Retrieving a value (if there is one) stored and associated with a given key
● Deleting the value (if there is one) stored and associated with a given key
● Setting, updating, and replacing the value (if there is one) associated with a given
key
Modern applications will probably require more than the above, but this is the bare
minimum for a key-value store.
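A minimal sketch of such a store in Python is shown below; an in-memory dictionary stands in for a real key-value database such as Redis or Riak, and the key and value used are purely illustrative.

# A minimal sketch of a key-value store offering only get, put, and delete.
# An in-memory dict stands in for a real backend such as Redis or Riak.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Retrieve the value (if there is one) associated with the key.
        return self._data.get(key)

    def put(self, key, value):
        # Set, update, or replace the value associated with the key.
        self._data[key] = value

    def delete(self, key):
        # Delete the value (if there is one) associated with the key.
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "martin", "cart": ["NoSQL Distilled"]})
print(store.get("session:42"))
store.delete("session:42")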
WHEN TO USE A KEY-VALUE DATABASE
1. Handling Large Volume of Small and Continuous Reads and Writes
Key-value databases are particularly suitable when your application requires handling a
large volume of small and continuous reads and writes. These databases are designed for
efficient and fast access to data stored as key-value pairs. Whether the data is volatile or
frequently changing, key-value databases can provide high-performance in-memory access,
making them ideal for use cases that demand quick retrieval and modification of data.
2. Storing Basic Information
Key-value databases are well-suited for storing basic information, such as customer details,
user profiles, or simple configurations. In these scenarios, each piece of information can be
associated with a unique key, allowing for easy retrieval and updates based on the key
value. For example, a key-value database can be used to store webpages with the URL as
the key and the webpage content as the value. Similarly, storing shopping-cart contents,
product categories, or e-commerce product details can be efficiently managed using key-
value databases.
3. Applications with Infrequent Updates and Simple Queries
Key-value databases are a good choice for applications that don’t require frequent updates
or complex queries. If your application primarily focuses on data retrieval and simple
CRUD operations, key-value databases provide an efficient and straightforward solution.
These databases prioritize simplicity and high-performance data access, making them
suitable for applications that require quick lookups and modifications without the need for
complex query capabilities or extensive data manipulation.
4. Key-Value Databases for Volatile Data
When your application needs to handle lots of small continuous reads and writes, that may
be volatile, key-value databases offer fast in-memory access.
USE CASES FOR KEY-VALUE DATABASES
1. Session Management on a Large Scale
Key-value databases are well-suited for managing session data in applications that require
handling a large number of concurrent users. These databases can efficiently store and
retrieve session information, such as user authentication tokens, user preferences, or
temporary data. With their fast in-memory access and ability to handle high volumes of
small reads and writes, key-value databases provide an optimal solution for session
management in applications with a large user base.
2. Using Cache to Accelerate Application Responses
Key-value databases are often employed as cache layers to accelerate application
responses. By caching frequently accessed data in a key-value store, applications can
reduce the need for expensive and time-consuming operations, such as database queries or
complex computations. This caching strategy allows for faster data retrieval, leading to
improved application performance and responsiveness.
3. Storing Personal Data on Specific Users
Key-value databases can efficiently store personal data on specific users. For example, they
can be used to store user profile information, user preferences, or other user-specific data.
With their simple key-value storage model, these databases allow for quick and efficient
access to user data, making them suitable for applications that need to handle a large
volume of user-specific data.
4. Product Recommendations and Personalized Lists
Key-value databases can be used to generate and store product recommendations and
personalized lists. They provide a quick and efficient way to store and retrieve user
preferences and other user-specific data, which can be used to personalize product
recommendations and lists. This can lead to a more engaging and personalized user
experience, improving user satisfaction and potentially driving increased revenue for
businesses.
5. Managing Player Sessions in Massive Multiplayer Online Games
Key-value databases are excellent for managing player sessions in massive multiplayer
online games (MMOGs). These games require real-time management of a large number of
simultaneous player sessions, and key-value databases can provide the necessary
performance and scalability to handle this challenge.
Collections
A collection is a group of documents. Collections typically store documents that have
similar contents.
Not all documents in a collection are required to have the same fields, because document
databases have a flexible schema. Note that some document databases provide schema
validation, so the schema can optionally be locked down when needed.
Continuing with the example above, the document with information about Tom could be
stored in a collection named users. More documents could be added to the users collection
in order to store information about other users. For example, the document below that
stores information about Donna could be added to the users collection.
CRUD operations
Document databases typically have an API or query language that allows developers to
execute the CRUD (create, read, update, and delete) operations.
● Create: Documents can be created in the database. Each document has a unique
identifier.
● Read: Documents can be read from the database. The API or query language allows developers to query for documents using their unique identifiers or field values. Indexes can be added to the database in order to increase read performance.
● Update: Existing documents can be updated, either by replacing the whole document or by modifying individual field values.
● Delete: Documents can be deleted from the database, typically by referencing their unique identifiers.
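A sketch of these operations using MongoDB's PyMongo driver as one concrete document database is shown below; the connection string, collection name, and field names are assumptions for illustration only.

# Sketch of CRUD operations against a document database, assuming MongoDB
# and the PyMongo driver; collection and field names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["mydb"]["users"]          # the "users" collection

# Create: documents in the same collection may have different fields.
users.insert_one({"_id": 1, "name": "Tom", "likes": ["skiing"]})
users.insert_one({"_id": 2, "name": "Donna", "city": "Chicago"})

# Read: query by unique identifier or by field value.
tom = users.find_one({"_id": 1})
chicago_users = list(users.find({"city": "Chicago"}))

# Update: modify individual fields of an existing document.
users.update_one({"_id": 1}, {"$set": {"city": "Helsinki"}})

# Delete: remove a document by its unique identifier.
users.delete_one({"_id": 2})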
Key features of document databases:
● Document model: Data is stored in documents (unlike other databases that store
data in structures like tables or graphs). Documents map to objects in most popular
programming languages, which allows developers to rapidly develop their
applications.
● Flexible schema: Document databases have a flexible schema, meaning that not all
documents in a collection need to have the same fields. Note that some document
databases support schema validation, so the schema can be optionally locked down.
● Distributed and resilient: Document databases are distributed, which allows for
horizontal scaling (typically cheaper than vertical scaling) and data distribution.
Document databases provide resiliency through replication.
1. The intuitiveness of the data model: Documents map to the objects in code, so they are
much more natural to work with. There is no need to decompose data across tables, run
expensive joins, or integrate a separate Object Relational Mapping (ORM) layer. Data that is
accessed together is stored together, so developers have less code to write and end users
get higher performance.
2. The ubiquity of JSON documents: JSON has become an established standard for data
interchange and storage. JSON documents are lightweight, language-independent, and
human-readable. Documents are a superset of all other data models so developers can
structure data in the way their applications need — rich objects, key-value pairs, tables,
geospatial and time-series data, or the nodes and edges of a graph.
A user can like many things (meaning there is a one-to-many relationship between a user
and likes), so we will create a new table named "Likes" to store a user’s likes. The Likes
table will have a foreign key that references the ID column in the Users table.
Similarly, a user can run many businesses, so we will create a new table named
"Businesses" to store business information. The Businesses table will have a foreign key
that references the ID column in the Users table.
● The flexible schema allows for the data model to change as an application's
requirements change.
● Document databases have rich APIs and query languages that allow developers to
easily interact with their data.
While a relational database stores data in rows and reads data row by row, a column store
is organized as a set of columns. This means that when you want to run analytics on a small
number of columns, you can read those columns directly without consuming memory with
the unwanted data. Columns are often of the same type and benefit from more efficient
compression, making reads even faster. Columnar databases can quickly aggregate the
value of a given column (adding up the total sales for the year, for example). Use cases
include analytics.
The keys and the column names of this type of database are not fixed. Columns within the
same column family, or cluster of columns, can have a different number of rows and can
accommodate different types of data and names. These databases are most often utilized
when there is a need for a large data model. They are very useful for data warehouses, or
when there is a need for high performance or handling intensive querying.
Column-oriented databases workflow
Relational databases have a set schema, and they function as tables of rows and columns. Wide-column databases have a similar but different schema: they also have rows and columns, but these are not fixed within a table; instead, the schema is dynamic. Each column is stored separately. If there are similar (related) columns, they are joined into column families, and the column families are then stored separately from other column families.
The row key is the first column in each column family, and it serves as an identifier of a
row. Furthermore, each column after that has a column key (name). It identifies columns
within rows and thus enables the querying of the columns. The value and the timestamp
come after the column key, leaving a trace of when the data was entered or modified.
The number of columns pertaining to each row, or their name, can vary. In other words, not
every column of a column family, and thus a database, has the same number of rows. In
fact, even though they might share their name, each column is contained within one row
and does not run across all rows.
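To make this two-level structure concrete, the sketch below (plain Python, purely illustrative) models a single row key mapping to column families whose columns each hold a value and a timestamp.

# Illustrative sketch of the wide-column structure:
# row key -> column family -> column key -> (value, timestamp).
import time

now = int(time.time())

wide_column_row = {
    "customer:1": {                       # row key
        "profile": {                      # column family
            "name": ("Martin", now),      # column key -> (value, timestamp)
            "city": ("Chicago", now),
        },
        "orders": {                       # another column family
            "order:99": ("NoSQL Distilled", now),
        },
    }
}

# Reading a single column touches only that column family and column key.
value, ts = wide_column_row["customer:1"]["profile"]["city"]
print(value, ts)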
● Super column family. A super column represents an array of columns. Each super
column has a name and a value mapping the super column out to several different
columns. Related super columns are joined under a single row into super column
families. Compared to a relational database, this is like a view of several different
tables within a database. Imagine you had the view of the columns and values
available for a single row -- that is a single identifier across many different tables --
and were able to store them all in one place: That is the super column family.
Advantages of column-oriented databases
● Scalability. This is a major advantage and one of the main reasons this type of
database is used to store big data. With the ability to be spread over hundreds of
different machines depending on the scale of the database, it supports massively
parallel processing. This means it can employ many processors to work on the same
set of computations simultaneously.
● Compression. Not only are they infinitely scalable, but they are also good at
compressing data and thus saving storage.
● Very responsive. The load time is minimal, and queries are performed fast, which
is expected given that they are designed to hold big data and be practical for
analytics.
Disadvantages of column-oriented databases
● Online transactional processing. These databases are not very efficient with
online transactional processing as much as they are for online analytical processing.
This means they are not very good with updating transactions but are designed to
analyze them. This is why they can be found holding data required for business
analysis with a relational database storing data in the back end.
Graph databases
A graph database focuses on the relationship between data elements. Each element is
stored as a node (such as a person in a social media graph). The connections between
elements are called links or relationships. In a graph database, connections are first-class
elements of the database, stored directly. In relational databases, links are implied, using
data to express the relationships.
A graph database is optimized to capture and search the connections between data
elements, overcoming the overhead associated with JOINing multiple tables in SQL. Very
few real-world business systems can survive solely on graph queries. As a result, graph
databases are usually run alongside other more traditional databases. Use cases include
fraud detection, social networks, and knowledge graphs.
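As a toy illustration (plain Python, not a real graph database), relationships can be stored as first-class records and traversed directly instead of being reconstructed through joins; the nodes and edges below are invented.

# Toy sketch: relationships stored directly as first-class records,
# so traversal is a lookup rather than a join. Purely illustrative.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "acct_1": {"type": "account"},
}
edges = [
    ("alice", "OWNS", "acct_1"),
    ("bob", "OWNS", "acct_1"),     # two people sharing one account
]

def neighbours(node, relation):
    # Follow outgoing edges of a given relationship type.
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# A simple fraud-style question: who shares an account with Alice?
shared = {src for src, rel, dst in edges
          if rel == "OWNS" and dst in neighbours("alice", "OWNS") and src != "alice"}
print(shared)   # {'bob'}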
Graph Based Data Model
The semantic graph database is a type of NoSQL graph database that is capable of
integrating heterogeneous data from many sources and making links between datasets.
The semantic graph database, also referred to as an RDF triplestore, focuses on the
relationships between entities and is able to infer new knowledge out of existing
information. It is a powerful tool to use in relationship-centered analytics and knowledge
discovery.
In addition, the capability to handle massive datasets and the schema-less approach
support the NoSQL semantic graph database usage in real-time big data analytics.
● In relational databases, the need to have the schemas defined before adding new
information restricts data integration from new sources because the whole schema
needs to be changed anew.
● Because the schema-less NoSQL semantic graph database does not need its schema changed every time a new data source is added, enterprises can integrate data with less effort and cost.
The semantic graph database stands out from the other types of graph databases with its
ability to additionally support rich semantic data schema, the so-called ontologies.
The semantic NoSQL graph database gets the best of both worlds: on the one hand, data is
flexible because it does not depend on the schema. On the other hand, ontologies give the
semantic graph database the freedom and ability to build logical models any way
organizations find it useful for their applications, without having to change the data.
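The sketch below shows the triple-based model using the rdflib Python library (one possible RDF toolkit); the namespace and the facts themselves are invented for illustration.

# Sketch of RDF triples in a semantic graph, using the rdflib library.
# The namespace and facts below are invented for illustration.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Each fact is a (subject, predicate, object) triple.
g.add((EX.Martin, EX.worksFor, EX.Acme))
g.add((EX.Acme, EX.locatedIn, EX.Chicago))

# SPARQL query: who works for a company located in Chicago?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE {
        ?person ex:worksFor ?company .
        ?company ex:locatedIn ex:Chicago .
    }
""")
for row in results:
    print(row.person)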
Disadvantages of Graph Data Model:
● No standard query language: Since the query language depends on the platform being used, there is no single standard query language.
● Small user base: The user base is small, which makes it very difficult to get support when running into a problem.
Applications of Graph Data Model:
● Graph data models are widely used in fraud detection, which is itself a very useful and important application.
With the schemaless MongoDB database, there is some additional structure — the system
namespace contains an explicit list of collections and indexes. Collections may be implicitly
or explicitly created — indexes must be explicitly declared.
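A small illustration with MongoDB's PyMongo driver follows; the connection string, collection names, and field names are assumptions.

# Sketch: a MongoDB collection can be created implicitly on first insert,
# while indexes must be declared explicitly. Names are illustrative.
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Implicit collection creation: the first insert creates "events".
db["events"].insert_one({"type": "login", "user": "tom"})

# Explicit collection creation and explicit index declaration.
db.create_collection("audit_log")
db["events"].create_index([("user", ASCENDING)])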
What are the benefits of using a schemaless database?
The lack of schema means that your NoSQL database can accept any data type — including
those that you do not yet use. This future-proofs your database, allowing it to grow and
change as your data-driven operations change and mature.
● No data truncation
A schemaless database makes almost no changes to your data; each item is saved in its own
document with a partial schema, leaving the raw information untouched. This means that
every detail is always available and nothing is stripped to match the current schema. This is
particularly valuable if your analytics needs to change at some point in the future.
With the ability to process unstructured data, applications built on NoSQL databases are
better able to process real-time data, such as readings and measurements from IoT sensors.
Schemaless databases are also ideal for use with machine learning and artificial intelligence
operations, helping to accelerate automated actions in your business.
With NoSQL, you can use whichever data model is best suited to the job. Graph databases
allow you to view relationships between data points, or you can use traditional wide table
views with an exceptionally large number of columns. You can query, report, and model
information however you choose. And as your requirements grow, you can keep adding
nodes to increase capacity and power.
When a record is saved to a relational database, anything (particularly metadata) that does
not match the schema is truncated or removed. Deleted at write, these details cannot be
recovered at a later point in time.
A materialized view is useful when the view is accessed frequently, as it saves computation time because the result is stored in the database beforehand. A materialized view can also be helpful in cases where the relation on which the view is defined is very large and the resulting relation of the view is very small. Materialized views have storage costs and update overheads associated with them.
Materialized View Examples
For example, let's say you have a database with two tables: one containing your business's employees and the other containing its departments.
Using a materialized view, you could query the database to retrieve all the employees who are associated with a particular department.
Or, say you have a database with two tables: one for the total number of sales you’ve made
and one for the total amount of revenue you’re generating. You could use a materialized
view to see how much revenue each sale brings with it in real-time.
An example of a command for PostgreSQL that could be used to create a materialized view
is as follows:
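(One possible sketch, reusing the employees/departments example above; the table and column names are illustrative, not a prescribed schema.)

CREATE MATERIALIZED VIEW employees_by_department AS
SELECT d.department_name,
       COUNT(e.employee_id) AS employee_count
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name;

-- Refresh the stored results after the underlying tables change:
REFRESH MATERIALIZED VIEW employees_by_department;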
▪ For materialized view columns that are references to a table column, the following
naming convention is used:
<MaterializedViewColumnName>:
<SourceTableName>.<SourceColumnName>
▪ For materialized view columns that are user-defined expressions, the following
naming convention is used:
<MaterializedViewColumnName>: <expression>
After you draw a view relationship between a table and a materialized view, you must use
on-diagram editing or the Views editor to migrate columns in the table to the materialized
view. By default, the <MaterializedViewColumnName> is the same as the
<SourceColumnName> until you edit the materialized view column name.
You can edit materialized views directly in the diagram window using on-diagram editing.
When you drag a column from a table into a materialized view, the materialized view
column and the relationship are created.
If you delete a table column that is referenced by a materialized view, the corresponding
materialized view column is deleted. If you delete a table that is referenced by a
materialized view, the corresponding materialized view columns are deleted.
Update Consistency
We’ll begin by considering updating a telephone number. Coincidentally, Martin and
Pramod are looking at the company website and notice that the phone number is out of
date. Implausibly, they both have update access, so they both go in at the same time to
update the number. To make the example interesting, we’ll assume they update it slightly
differently, because each uses a slightly different format. This issue is called a write-write
conflict: two people updating the same data item at the same time.
When the writes reach the server, the server will serialize them—decide to apply one, then
the other. Let’s assume it uses alphabetical order and picks Martin’s update first, then
Pramod’s. Without any concurrency control, Martin’s update would be applied and
immediately overwritten by Pramod’s. In this case Martin’s is a lost update. Here the lost
update is not a big problem, but often it is. We see this as a failure of consistency because
Pramod’s update was based on the state before Martin’s update, yet was applied after it.
Approaches for maintaining consistency in the face of concurrency are often described as
pessimistic or optimistic. A pessimistic approach works by preventing conflicts from
occurring; an optimistic approach lets conflicts occur, but detects them and takes action to
sort them out. For update conflicts, the most common pessimistic approach is to have write
locks, so that in order to change a value you need to acquire a lock, and the system ensures
that only one client can get a lock at a time.
So Martin and Pramod would both attempt to acquire the write lock, but only Martin (the
first one) would succeed. Pramod would then see the result of Martin’s write before
deciding whether to make his own update.
A common optimistic approach is a conditional update where any client that does an
update tests the value just before updating it to see if it’s changed since his last read. In this
case, Martin’s update would succeed but Pramod’s would fail. The error would let Pramod
know that he should look at the value again and decide whether to attempt a further
update.
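The sketch below (plain Python, illustrative only) captures the idea: each value carries a version, and an update succeeds only if the version the client read is still current.

# Sketch of an optimistic, conditional update (compare-and-set).
# The store keeps a version number alongside each value; a write succeeds
# only if the caller's expected version still matches. Illustrative only.
store = {"phone": {"value": "555-1234", "version": 1}}

def conditional_update(key, new_value, expected_version):
    record = store[key]
    if record["version"] != expected_version:
        return False                      # someone else updated it first
    record["value"] = new_value
    record["version"] += 1
    return True

# Martin and Pramod both read version 1, then both try to write.
print(conditional_update("phone", "555-9999", expected_version=1))  # True
print(conditional_update("phone", "555-0000", expected_version=1))  # False: must re-read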
Both the pessimistic and optimistic approaches rely on a consistent serialization of the
updates. With a single server, this is obvious—it has to choose one, then the other. But if
there’s more than one server, such as with peer-to-peer replication, then two nodes might
apply the updates in a different order, resulting in a different value for the telephone
number on each peer.
Read Consistency
Read consistency in NoSQL databases refers to the guarantees a database gives about how recent and up-to-date the data returned by a read operation will be. In distributed
systems, such as NoSQL databases, achieving strong consistency (as in traditional ACID-
compliant databases) across all nodes can be challenging due to factors like network
latency, node failures, and the need for horizontal scalability. As a result, NoSQL databases
often adopt a more relaxed form of consistency known as eventual consistency.
Eventual Consistency: In an eventually consistent system, after a write operation is
performed, the data may not be immediately propagated to all replicas or nodes in the
cluster. Instead, the system allows for a delay, during which the data will eventually be
propagated and reconciled across all nodes. This means that if a read operation occurs
shortly after a write operation, it might return a slightly outdated value until the data
reaches all replicas and becomes fully consistent.
In practical terms, eventual consistency implies that, given enough time and assuming no
further writes, all replicas will eventually converge to the same value. The time it takes to
achieve consistency depends on various factors, such as network conditions, system load,
and the specific NoSQL database's implementation.
Use Cases for Eventual Consistency: Eventual consistency is appropriate for certain types
of applications where the immediate consistency of data across all replicas is not critical,
and some level of temporary inconsistency is acceptable. For example:
1. Social media platforms, where the order of likes, comments, and updates might not
be immediately reflected on all users' devices but eventually converge to the correct
state.
2. Analytical applications that perform big data processing, where slight delays in
propagating data are acceptable for handling large volumes of data efficiently.
3. Collaborative applications where users can edit the same document concurrently,
and the system resolves conflicts in an eventual manner.
Tunable Consistency Levels: Many NoSQL databases allow developers to choose their
preferred consistency levels based on the specific use case. This means that developers can
select stronger consistency models (e.g., strong or causal consistency) when needed or opt
for weaker consistency (e.g., eventual consistency) for better availability and performance
in other scenarios. This flexibility allows developers to tailor the database's behavior to the
application's requirements.
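For instance, the consistency level can be chosen per statement with Apache Cassandra's Python driver, as in the sketch below; the cluster address, keyspace, and table are assumptions for illustration.

# Sketch: choosing a per-query consistency level with the DataStax
# Python driver for Cassandra. Address and table name are illustrative.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(["127.0.0.1"]).connect("hotel")

# Stronger read: a majority of replicas must respond.
strong_read = SimpleStatement(
    "SELECT * FROM hotels_by_poi WHERE poi_name = %s",
    consistency_level=ConsistencyLevel.QUORUM)

# Relaxed read: a single replica is enough (faster, eventually consistent).
fast_read = SimpleStatement(
    "SELECT * FROM hotels_by_poi WHERE poi_name = %s",
    consistency_level=ConsistencyLevel.ONE)

rows = session.execute(strong_read, ("Central Park",))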
It's essential for developers to understand the consistency guarantees provided by their
chosen NoSQL database and design their applications to handle potential inconsistencies
gracefully when working with distributed data systems.
Having a data store that maintains update consistency is one thing, but it doesn’t guarantee
that readers of that data store will always get consistent responses to their requests.
Fig. With two breaks in the communication lines, the network partitions into two groups.
A single-server system is the obvious example of a CA system—a system that has
Consistency and Availability but not Partition tolerance. A single machine can’t partition, so
it does not have to worry about partition tolerance. There’s only one node—so if it’s up, it’s
available. Being up and keeping consistency is reasonable.
Relaxing Durability
As it turns out, there are cases where you may want to trade off some durability for higher
performance. If a database can run mostly in memory, apply updates to its in-memory
representation, and periodically flush changes to disk, then it may be able to provide
substantially higher responsiveness to requests. The cost is that, should the server crash,
any updates since the last flush will be lost.
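As one concrete illustration, MongoDB's write concern can be relaxed per collection through the PyMongo driver; the collection and field names below are assumptions.

# Sketch: trading durability for speed with MongoDB's write concern.
# w=1, j=False acknowledges the write before it is journaled to disk,
# so a crash may lose the most recent updates. Names are illustrative.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

db = MongoClient("mongodb://localhost:27017")["mydb"]
fast_events = db.get_collection(
    "sensor_readings",
    write_concern=WriteConcern(w=1, j=False))

fast_events.insert_one({"sensor": "t-17", "reading": 21.4})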
Quorums
In the context of NoSQL databases, "quorums" refer to the minimum number of nodes or
replicas that must participate in read and write operations to achieve a specific level of
data consistency and availability. Quorums are used in distributed databases to ensure that
a sufficient number of replicas acknowledge and agree on an operation to guarantee
certain consistency guarantees while tolerating node failures or network partitions.
Quorums play a crucial role in maintaining data consistency and fault tolerance in
distributed systems, especially in databases that use replication for high availability and
data redundancy. Different NoSQL databases use different quorum strategies based on
their underlying architecture and consistency models.
There are two primary types of quorums used in NoSQL databases:
1. Read Quorum: In a read quorum, a certain number of replicas must participate in a
read operation before a response is considered valid. The read quorum size
determines how consistent the data will be for read operations. There are typically
two types of read quorums:
a. Strong Read Quorum: Requires all replicas to participate in the read operation. This
ensures strong consistency because the read operation will return the most recent data
available in the system. However, this approach might lead to higher latency, especially in
the presence of network partitions or node failures.
b. Eventual Read Quorum: Requires only a subset of replicas to participate in the read
operation. This allows for eventual consistency, where the read might return slightly
outdated data until all replicas converge. Eventual read quorums provide better read
availability and lower latency.
2. Write Quorum: In a write quorum, a certain number of replicas must participate in a
write operation before it is considered successful. The write quorum size
determines how many replicas need to acknowledge a write before it is considered
durable. There are also two main types of write quorums:
a. Strict or Synchronous Write Quorum: Requires all replicas to acknowledge the write
operation before it is considered successful. This ensures that the write is committed to all
replicas before acknowledging the client, providing strong consistency but potentially
increasing write latency.
b. Sloppy or Asynchronous Write Quorum: Requires only a subset of replicas to
acknowledge the write operation before it is considered successful. This allows for higher
write availability and lower write latency at the cost of eventual consistency.
The choice of read and write quorum sizes depends on the consistency model desired, the
desired level of fault tolerance, and the trade-offs between consistency, availability, and
performance for the specific application.
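As a back-of-the-envelope illustration (a common rule of thumb in Dynamo-style systems, not a property of every NoSQL database), with N replicas a read quorum R and a write quorum W are guaranteed to overlap whenever R + W > N:

# Sketch: checking whether chosen read/write quorum sizes overlap.
# In Dynamo-style systems, R + W > N guarantees that every read quorum
# intersects every write quorum. The numbers below are illustrative.
def quorums_overlap(n_replicas, read_quorum, write_quorum):
    return read_quorum + write_quorum > n_replicas

print(quorums_overlap(3, 2, 2))   # True: reads always see the latest write
print(quorums_overlap(3, 1, 1))   # False: reads may return stale data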
It's important to note that not all NoSQL databases use quorums, as they are primarily
associated with distributed databases that use replication strategies to ensure data
availability and consistency.
CASSANDRA
Cassandra is a popular NoSQL distributed database known for its scalability, fault
tolerance, and high availability. Its data model is designed to handle large amounts of data
across multiple nodes while providing a flexible schema and excellent read and write
performance. Cassandra follows a "wide-column" data model, which is also often referred
to as a "distributed multi-dimensional map."
Cassandra is an open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is written in Java and developed by the Apache Software Foundation.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes of the cluster.
All the nodes of Cassandra in a cluster play the same role. Each node is independent and, at the same time, interconnected to the other nodes. Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster. When a node goes down, read/write requests can be served from other nodes in the network.
Key features of the Cassandra data model include:
1. Distributed Architecture: Cassandra is designed to operate in a distributed manner
across a cluster of nodes. Each node can hold a subset of the data, and the data is
partitioned across nodes using a hash function on the primary key. This allows
Cassandra to scale horizontally by adding more nodes to the cluster.
2. Column Families (Tables): In Cassandra, data is organized into "column families,"
which are analogous to tables in a relational database. Each column family can have
different columns, and rows are identified by a primary key. Unlike traditional
relational databases, Cassandra does not enforce a fixed schema for each row,
allowing great flexibility in the data structure.
3. Composite Keys: The primary key in Cassandra can be a composite key consisting of
multiple columns. This allows for efficient querying on multiple dimensions and
offers flexibility in data modeling.
4. Columns and Rows: Each row in a column family consists of multiple columns.
Columns can have different names and data types, even within the same column
family. Rows are identified by their unique primary keys and are distributed across
the cluster based on the hash of the primary key.
5. Wide Rows: Cassandra allows rows to have an unlimited number of columns, which
means they can store a large amount of data. This design is particularly useful for
applications that require high-speed read and write access to large datasets.
6. Consistency Levels: Cassandra provides tunable consistency levels, allowing
developers to choose the desired level of data consistency for read and write
operations. This allows developers to strike a balance between data consistency and
system performance based on the application's requirements.
7. No Single Point of Failure: Cassandra is designed to be fault-tolerant. Data is
replicated across multiple nodes, ensuring that the system can withstand node
failures without losing data or compromising availability.
8. Secondary Indexes: Cassandra supports secondary indexes, allowing queries on
non-primary key columns for efficient data retrieval based on different criteria.
The data model in Cassandra makes it well-suited for use cases that require massive scale,
high availability, and low-latency data access. It is commonly used in applications that
handle large volumes of time-series data, real-time analytics, and other scenarios where
traditional relational databases may not be able to meet the performance and scalability
requirements.
Apache Cassandra is used to manage very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure. Listed below are some points about Apache Cassandra:
● It is a column-oriented database.
● Its distributed design is based on Amazon's Dynamo, and its data model on Google's Bigtable.
1. Node:
Node is the basic component in Apache Cassandra. It is the place where the data is actually stored. For example, as shown in the diagram, the node with IP address 10.0.0.7 contains data (a keyspace which contains one or more tables).
Fig Node
2. Data Centre:
Data Centre is a collection of nodes.
For example:
DC – N1 + N2 + N3 ….
DC: Data Centre
N1: Node 1
N2: Node 2
N3: Node 3
3. Cluster:
It is the collection of many data centers.
For example:
C = DC1 + DC2 + DC3….
C: Cluster
DC1: Data Center 1
DC2: Data Center 2
DC3: Data Center 3
Data modeling is the process of identifying entities and their relationships. In relational
databases, data is placed in normalized tables with foreign keys used to reference related
data in other tables. Queries that the application will make are driven by the structure of
the tables, and related data are queried as table joins. In Cassandra, data modeling is query-driven. The data access patterns and application queries determine the structure and organization of the data, which are then used to design the database tables.
The data model is a conceptual model that must be analyzed and optimized based on
storage, capacity, redundancy and consistency. A data model may need to be modified as a
result of the analysis. Considerations or limitations that are used in data model analysis
include:
✔ Partition Size
✔ Data Redundancy
✔ Disk space
✔ Lightweight Transactions (LWT)
Consider a simple domain model that is easy to understand in the relational world, and then see how you might map it from a relational to a distributed hash table model in Cassandra.
For example, let’s use a domain that is easily understood and that everyone can relate to:
making hotel reservations.
The conceptual domain includes hotels, guests that stay in the hotels, a collection of rooms
for each hotel, the rates and availability of those rooms, and a record of reservations
booked for guests. Hotels typically also maintain a collection of “points of interest,” which
are parks, museums, shopping galleries, monuments, or other places near the hotel that
guests might want to visit during their stay. Both hotels and points of interest need to
maintain geolocation data so that they can be found on maps for mashups, and to calculate
distances.
RDBMS Design
When you set out to build a new data-driven application that will use a relational database,
you might start by modeling the domain as a set of properly normalized tables and use
foreign keys to reference related data in other tables.
The figure below shows how you might represent the data storage for your application
using a relational database model. The relational model includes a couple of “join” tables in
order to realize the many-to-many relationships from the conceptual model of hotels-to-
points of interest, rooms-to-amenities, rooms-to-availability, and guests-to-rooms (via a
reservation).
● Q2. Find information about a given hotel, such as its name and location.
All of the queries are shown in the context of the workflow of the application in the figure
below. Each box on the diagram represents a step in the application workflow, with arrows
indicating the flows between steps and the associated query. If you’ve modelled the
application well, each step of the workflow accomplishes a task that “unlocks” subsequent
steps. For example, the “View hotels near POI” task helps the application learn about
several hotels, including their unique keys. The key for a selected hotel may be used as part
of Q2, in order to obtain detailed description of the hotel. The act of booking a room creates
a reservation record that may be accessed by the guest and hotel staff at a later time
through various additional queries.
Logical Data Modeling
Create a logical model containing a table for each query, capturing entities and
relationships from the conceptual model.
Step 1: To name each table, you'll identify the primary entity type for which you are querying and use that to start the entity name. If you are querying by attributes of other related entities, append those to the table name, separated with _by_. For example, hotels_by_poi.
Step 2: Identify the primary key for the table, adding partition key columns based on the required
query attributes, and clustering columns in order to guarantee uniqueness and support desired sort
ordering.
The design of the primary key is extremely important, as it will determine how much data
will be stored in each partition and how that data is organized on disk, which in turn will
affect how quickly Cassandra processes reads.
Step 3: Complete each table by adding any additional attributes identified by the query. If any of these additional attributes are the same for every instance of the partition key, mark the column as static.
Each table is shown with its title and a list of columns. Primary key columns are identified
via symbols such as K for partition key columns and C↑ or C↓ to represent clustering
columns. Lines are shown entering tables or between tables to indicate the queries that
each table is designed to support.
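As a rough illustration of what such a table might look like when expressed in CQL (created here via the DataStax Python driver), the sketch below follows the naming and key-design steps above; the exact columns, types, and keyspace are assumptions.

# Sketch: creating one of the logical-model tables with the DataStax Python
# driver. Column names and types follow the hotel example but are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("hotel")

session.execute("""
    CREATE TABLE IF NOT EXISTS hotels_by_poi (
        poi_name text,        -- partition key (K): the attribute queried by
        hotel_id text,        -- clustering column (C) for uniqueness and ordering
        name text,
        phone text,
        PRIMARY KEY ((poi_name), hotel_id)
    ) WITH CLUSTERING ORDER BY (hotel_id ASC)
""")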
Physical Data Modeling
Work through each of the logical model tables, assigning types to each item. Use any valid CQL data type, including the basic types, collections, and user-defined types. Identify additional user-defined types that can be created to simplify your design. After assigning data types, analyze the model by performing size calculations and testing out how the model works. Make adjustments based on the findings.
The figure includes a designation of the keyspace containing each table and visual cues for
columns represented using collections and user-defined types. Note the designation of
static columns and secondary index columns. There is no restriction on assigning these as
part of a logical model, but they are typically more of a physical data modeling concern.
Evaluating and Refining Data Models
Evaluate and refine table designs to help ensure optimal performance.
Calculating Partition Size
Partition size is measured by the number of cells (values) that are stored in the partition.
In order to calculate the size of partitions, use the following formula:
Nv=Nr(Nc−Npk−Ns)+Ns
The number of values (or cells) in the partition (Nv) is equal to the number of static columns (Ns) plus the product of the number of rows (Nr) and the number of values per row. The number of values per row is defined as the number of columns (Nc) minus the number of primary key columns (Npk) and static columns (Ns).
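As a quick worked example (plain Python), the calculation below uses the available_rooms_by_hotel_date table from the disk-size discussion that follows; the column counts are read off that example and should be treated as illustrative.

# Sketch: applying Nv = Nr(Nc - Npk - Ns) + Ns to the
# available_rooms_by_hotel_date example used below. Counts are illustrative.
Nr  = 73_000   # rows in the partition (as used in the size-on-disk example)
Nc  = 4        # hotel_id, date, room_number, is_available
Npk = 3        # hotel_id (partition key) + date, room_number (clustering)
Ns  = 0        # no static columns

Nv = Nr * (Nc - Npk - Ns) + Ns
print(Nv)      # 73,000 values (cells) in the partition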
Calculating Size on Disk
In addition to calculating the size of a partition, it is also an excellent idea to estimate the
amount of disk space that will be required for each table you plan to store in the cluster. In order to determine the size, use the following formula to determine the size St of a partition:

St = Σ sizeOf(cpk) + Σ sizeOf(cs) + Nr × (Σ sizeOf(ccl) + Σ sizeOf(cr)) + Nv × sizeOf(tavg)

Here, cpk refers to the partition key columns, cs to the static columns, ccl to the clustering columns, cr to the regular columns, and tavg to the average amount of metadata stored per cell.
This is a bit more complex than the previous formula, but let’s break it down a bit at a time. Let’s
take a look at the notation first:
● You’ll recognize the number of rows Nr and number of values Nv from previous
calculations.
● The sizeOf() function refers to the size in bytes of the CQL data type of each
referenced column.
The first term asks you to sum the size of the partition key columns.
The second term asks you to sum the size of the static columns. This table has no static
columns, so the size is 0 bytes.
The third term is the most involved, and for good reason—it is calculating the size of the
cells in the partition. Sum the size of the clustering columns and regular columns. The two
clustering columns are the date, which is 4 bytes, and the room_number, which is a 2-byte
short integer, giving a sum of 6 bytes. There is only a single regular column, the
boolean is_available, which is 1 byte in size. To finish up the term, multiply this value by the number of rows (73,000), giving a result of 511,000 bytes (0.51 MB).
The fourth term is simply counting the metadata that Cassandra stores for each cell. In
the storage format used by Cassandra 3.0 and later, the amount of metadata for a given cell
varies based on the type of data being stored, and whether or not custom timestamp or TTL
values are specified for individual cells. For this table, reuse the number of values from the
previous calculation (73,000) and multiply by 8, which gives 0.58 MB. Adding these terms together, you get a final estimate of roughly 1.1 MB for the partition (0.51 MB + 0.58 MB, plus the small contribution of the partition key columns).
This formula is an approximation of the actual size of a partition on disk, but is accurate
enough to be quite useful. Remembering that the partition must be able to fit on a single
node, it looks like the table design will not put a lot of strain on disk storage.
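The same estimate can be scripted. The row count, clustering and regular column sizes, and 8-byte metadata figure come from the walkthrough above; the partition key size is an assumed value for illustration.

# Sketch: size-on-disk estimate for the partition, following the terms
# described above. The partition key size is an assumed illustrative value.
Nr, Nv = 73_000, 73_000

partition_key_bytes = 16        # assumed size of the hotel_id text value
static_bytes        = 0         # no static columns
row_bytes           = 6 + 1     # clustering (date 4 + room_number 2) + is_available 1
metadata_per_cell   = 8         # average per-cell overhead used above

St = (partition_key_bytes
      + static_bytes
      + Nr * row_bytes
      + Nv * metadata_per_cell)

print(f"{St / 1_000_000:.2f} MB")   # roughly 1.1 MB for this partition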
Breaking up Large Partitions
The goal is to design tables that can provide the data you need with queries that touch a
single partition, or failing that, the minimum possible number of partitions. However, as
shown in the examples, it is quite possible to design wide partition-style tables that
approach Cassandra’s built-in limits. Performing sizing analysis on tables may reveal
partitions that are potentially too large, either in number of values, size on disk, or both.
The technique for splitting a large partition is straightforward: add an additional column to
the partition key. In most cases, moving one of the existing columns into the partition key
will be sufficient. Another option is to introduce an additional column to the table to act as
a sharding key, but this requires additional application logic.
Continuing to examine the available rooms example, if you add the date column to the
partition key for the available_rooms_by_hotel_date table, each partition would then
represent the availability of rooms at a specific hotel on a specific date. This will certainly
yield partitions that are significantly smaller, perhaps too small, as the data for consecutive
days will likely be on separate nodes.
Another technique known as bucketing is often used to break the data into moderate-size
partitions. For example, you could bucketize the available_rooms_by_hotel_date table by
adding a month column to the partition key, perhaps represented as an integer. The
comparison with the original design is shown in the figure below. While the month column
is partially duplicative of the date, it provides a nice way of grouping related data in a
partition that will not get too large.
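A sketch of how the bucketed table's primary key might be declared follows (via the DataStax Python driver); the exact column set is inferred from the discussion above and is illustrative.

# Sketch: bucketing by month so each partition holds one hotel-month of
# availability data. The key structure is inferred from the discussion above.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("hotel")

session.execute("""
    CREATE TABLE IF NOT EXISTS available_rooms_by_hotel_date (
        hotel_id text,
        month int,            -- bucketing column added to the partition key
        date date,
        room_number smallint,
        is_available boolean,
        PRIMARY KEY ((hotel_id, month), date, room_number)
    )
""")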
● Hackolade is a data modeling tool that supports schema design for Cassandra and
many other NoSQL databases. Hackolade supports the unique concepts of CQL such
as partition keys and clustering columns, as well as data types including collections
and UDTs. It also provides the ability to create Chebotko diagrams.
● Kashlev Data Modeler is a Cassandra data modeling tool that automates the data
modeling methodology described in this documentation, including identifying
access patterns, conceptual, logical, and physical data modeling, and schema
generation. It also includes model patterns that you can optionally leverage as a
starting point for your designs.
● DataStax DevCenter is a tool for managing schema, executing queries and viewing
results. While the tool is no longer actively supported, DevCenter features syntax
highlighting for CQL commands, types, and name literals. DevCenter provides
command completion as you type out CQL commands and interprets the commands
you type, highlighting any errors you make. The tool provides panes for managing
multiple CQL scripts and connections to multiple clusters. The connections are used
to run CQL commands against live clusters and view the results. The tool also has a
query trace feature that is useful for gaining insight into the performance of your
queries.
● IDE Plugins - There are CQL plugins available for several Integrated Development
Environments (IDEs), such as IntelliJ IDEA and Apache NetBeans. These plugins
typically provide features such as schema management and query execution.
Some IDEs and tools that claim to support Cassandra do not actually support CQL natively,
but instead access Cassandra using a JDBC/ODBC driver and interact with Cassandra as if it
were a relational database with SQL support. When selecting tools for working with
Cassandra make sure they support CQL and reinforce Cassandra best practices for data
modeling.
CASSANDRA EXAMPLES
Refer to the Apache Cassandra data modeling tutorial:
https://cassandra.apache.org/doc/latest/cassandra/data_modeling/intro.html
CASSANDRA CLIENTS
https://cassandra.apache.org/doc/latest/cassandra/getting_started/drivers.html