Big Data Storage and Processing
The main goal of a DBMS is to provide a solution for storing and retrieving
information efficiently.
An RDBMS stores the relationships between tables in columns (i.e., primary
keys and foreign keys) that serve as references to other tables.
Data in a table is stored in rows and columns, and the file size grows as
new records are added, increasing the size of the database.
These files are shared across nodes by several users through database
servers.
PRIMARY KEYS IN RDBMS
What is a Primary Key?
A primary key is used to ensure that data in a specific column is unique.
PRIMARY KEYS
CustomerID is the primary key in the Customer table, and CityID is the
primary key in the City table.
IN THE CUSTOMER TABLE
In the Customer table, CityID is a foreign key that links each customer to
the City table.
WHAT IS PRIMARY AND FOREIGN KEY?
Example: STUD_NO and STUD_PHONE are both candidate keys for the relation
STUDENT, but STUD_NO can be chosen as the primary key (only one of many
candidate keys).
Similar to a warehouse for storing physical goods, a data warehouse is a large
facility whose primary function is to store and process data at the enterprise level.
It is an important tool for big data analytics. These large data warehouses support
reporting, business intelligence (BI), analytics, data mining, research, cyber
monitoring, and other related activities.
These warehouses are usually optimised to retain and process large amounts of data at
all times while feeding them in and out through online servers where users can access
their data without delay.
With cloud storage, data and information are stored electronically online
where it can be accessed from anywhere, negating the need for direct
attached access to a hard drive or computer. With this approach, you can
store a virtually boundless amount of data online and access it from anywhere.
A class of novel data storage systems able to cope with Big Data is
subsumed under the term NoSQL databases.
DATA STORING AND RETRIEVING IN
NOSQL
KEY-VALUE STORES
Figure 1 illustrates how user account data and settings might be stored in a
key-value store.
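A key-value store like the one in Figure 1 can be sketched as a minimal in-memory dictionary. The key names and settings values below are illustrative, not taken from the figure:

```python
# Minimal in-memory key-value store: values are opaque blobs addressed by key.
store = {}

def put(key, value):
    store[key] = value  # the store does not interpret the value

def get(key):
    return store.get(key)  # lookup by exact key only; no queries over values

# User account data and settings stored under user-derived keys.
put("user:1001:settings", {"theme": "dark", "language": "en"})
put("user:1001:account", {"name": "Alice", "plan": "premium"})

print(get("user:1001:settings"))  # {'theme': 'dark', 'language': 'en'}
```

The simplicity of this model is the point: the store only supports `put` and `get` by key, which is what lets real key-value stores scale out so easily.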
ACID PROPERTIES
Atomicity:
Either all the changes made by the transaction are applied to the
database, or none of them are.
Example: money is deducted from the source account, and if any anomaly
occurs, the changes are discarded and the transaction fails.
Consistency:
Consistency guarantees that changes made within a transaction are
consistent with database constraints.
If the data gets into an illegal state, the whole transaction fails.
Isolation:
For example, let’s say that our account balance is $200. Two transactions
for a $100 withdrawal start at the same time. The transactions run in
isolation, which guarantees that when they both complete, we’ll have a
balance of $0 instead of $100.
Durability:
Durability guarantees that once the transaction completes and changes
are written to the database, they are persisted.
This ensures that data within the system will persist even in the case of
system failures like crashes or power outages.
BASE PROPERTY
The BASE property is a set of principles that is often used in the context of
distributed and NoSQL databases.
Unlike the ACID properties, which provide strong guarantees for data
consistency and reliability but may impose performance and scalability
limitations, BASE provides a more relaxed set of principles suitable for
distributed and large-scale systems.
Basically Available: This means that the system remains operational and
available for reads and writes, even in the presence of failures or
network partitions.
In other words, the system doesn't guarantee 100% uptime, but it strives
to be available most of the time.
Soft State: The data doesn't have to be in a fully consistent state at all
times, as long as it converges towards consistency eventually.
Eventually Consistent: The system allows for some degree of inconsistency
but ensures that it will be resolved without human intervention.
DIFFERENCE BETWEEN BASE PROPERTIES AND
ACID PROPERTIES
CAP PROPERTIES IN A DISTRIBUTED
DATABASE SYSTEM
Consistency (C): Reads and writes are always executed atomically and are
strictly consistent
Put differently, all clients have the same view on the data at all times.
This condition states that all nodes see the same data at the same time:
a read operation returns the value of the most recent write, so all nodes
return the same data.
Availability (A): Every non-failing node in the system can always
accept read and write requests by clients and will eventually return with a
meaningful response, i.e. not with an error message.
This condition states that every request gets a response on
success/failure.
Partition Tolerance (P): The system continues to operate even when the
network is partitioned, i.e. when messages between nodes are lost or delayed.
Only two of these three properties can be guaranteed at the same time.
Systems that choose consistency provide consistent data at all times, but
are also inherently difficult to scale.
The idea is to distribute data that cannot fit on a single node onto a
cluster of database nodes.
VERTICAL AND HORIZONTAL
PARTITIONING
THREE BASIC DISTRIBUTION
TECHNIQUES
There are three basic distribution techniques: range-sharding,
hash-sharding and entity-group sharding.
RANGE SHARDING:
Range sharding splits the key space into contiguous ranges, each assigned
to one tablet (shard), so that neighbouring keys are stored together.
This should be sufficient in practice even for very large data sets or cluster
sizes.
As an example, for a table with sixteen tablets the overall key space
[0x0000, 0xFFFF) is divided into sixteen subranges, one for each tablet:
[0x0000, 0x1000), [0x1000, 0x2000), … , [0xF000, 0xFFFF). Read and write
operations are routed to a tablet based on the primary key.
HASH-SHARDING
Hash-sharding partitions data over several machines by assigning every
data item to a shard server according to a hash value computed from the
primary key.
This approach does not require a coordinator and also guarantees the
data to be evenly distributed across the shards, as long as the used hash
function produces an even distribution.
The obvious disadvantage is that it only allows point lookups and makes
range scans infeasible.
Read and write operations are processed by converting the primary key
into an internal key and its hash value, and determining to which tablet
the operation should be routed.
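A minimal sketch of hash-sharding, assuming a fixed shard count and illustrative key names. A stable hash (rather than Python's per-process randomised `hash()`) keeps routing consistent across machines:

```python
import hashlib

NUM_SHARDS = 4  # illustrative; real systems use many more shards

def shard_for(primary_key: str) -> int:
    # Stable hash of the primary key, reduced modulo the shard count.
    digest = hashlib.sha256(primary_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Adjacent keys scatter across shards, which balances load but means a
# range scan would have to query every shard.
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", shard_for(key))
```

Because the placement is fully determined by the hash function, any client can route a request without consulting a coordinator, which is the advantage the text describes.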
CONSISTENT HASHING
Elastic systems use consistent hashing, where only a fraction of the data
has to be reassigned when nodes are added to or removed from the cluster.
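A minimal hash ring illustrating this; the node names and keys are made up. Each key belongs to the first node clockwise from its position on the ring, so adding a node only takes over the segment just before it:

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Stable 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        positions = [p for p, _ in self.ring]
        # First node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(positions, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = ["k1", "k2", "k3", "k4"]
owner_before = {k: ring.node_for(k) for k in keys}

# Adding a node moves only the keys in the ring segment it takes over.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if ring.node_for(k) != owner_before[k]]
print("keys reassigned:", moved)  # typically only a fraction of the keys
```

Contrast this with plain modulo hashing, where changing the shard count from 3 to 4 would remap almost every key.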
ENTITY-GROUP SHARDING
Entity-group sharding is a data partitioning scheme with the goal of
enabling single-partition transactions on co-located data.
The partitions are called entity-groups and are either explicitly declared
by the application or derived from the transactions’ access patterns.
If a transaction accesses data that spans more than one group, data
ownership can be transferred between entity-groups, or the transaction
manager has to fall back to more expensive multi-node transaction
protocols.
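The idea can be sketched by routing rows on an explicitly declared group key rather than on the row's own key. The customer-based grouping, partition count, and row names below are all illustrative:

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(entity_group: str) -> int:
    # Route on the declared group key, not on the individual row key.
    digest = hashlib.sha256(entity_group.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# (entity_group, row) pairs: customer 42's orders and invoices share a group.
rows = [("customer:42", "order:1"),
        ("customer:42", "invoice:7"),
        ("customer:99", "order:2")]

# All rows of one entity-group land on the same partition, so a transaction
# touching only customer 42's data is a single-partition transaction.
partitions_42 = {partition_for(group) for group, _ in rows
                 if group == "customer:42"}
print(partitions_42)  # exactly one partition id
```

A transaction spanning both customer:42 and customer:99 may hit two partitions, which is precisely the case where the text says a multi-node protocol becomes necessary.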