

BIG DATA ANALYTICS

Big Data Storage and Processing


BIG DATA STORAGE
 A simple DBMS stores data in the form of schemas or tables comprising rows and columns.

 The main goal of a DBMS is to provide a solution for storing and retrieving information efficiently.

 SQL is used to fetch the data stored in these tables.

 An RDBMS also stores the relations between tables in columns (i.e., primary keys and foreign keys) that serve as references to other tables.

 Data in a table is stored in rows and columns, and the file grows as new records are added, increasing the size of the database.

 These files are shared across nodes by several users through database
servers.
PRIMARY KEYS IN RDBMS
 What is a Primary Key?
 A primary key is used to ensure that the data in a specific column is unique.

 A primary key column cannot contain NULL values. It is either an existing table column or a column that the database generates according to a defined sequence.


PRIMARY KEYS

Customer ID is the primary key in the Customer table.


FOREIGN KEY
 A foreign key is a column or group of columns in a relational database table that provides a link between data in two tables.

 It is a column (or columns) that references a column (most often the primary key) of another table.
Customer Table and City Table

Customer ID is the primary key in the Customer table and City ID is the primary key in the City table.
IN THE CUSTOMER TABLE

City ID is a foreign key in the Customer table and links it to the City table.
WHAT IS PRIMARY AND FOREIGN KEY?
 Example: STUD_NO and STUD_PHONE are both candidate keys for the relation STUDENT, but STUD_NO can be chosen as the primary key (only one of many candidate keys).

 Example: STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in the STUDENT relation, as in the sketch below.
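A minimal sketch of the STUDENT example, assuming SQLite via Python's built-in sqlite3 module; the STUD_NAME column, phone value, and sample rows are illustrative, not from the source:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked

conn.execute("""
    CREATE TABLE STUDENT (
        STUD_NO    INTEGER PRIMARY KEY,    -- candidate key chosen as primary key
        STUD_PHONE TEXT UNIQUE,            -- another candidate key
        STUD_NAME  TEXT                    -- illustrative extra column
    )""")
conn.execute("""
    CREATE TABLE STUDENT_COURSE (
        STUD_NO   INTEGER REFERENCES STUDENT (STUD_NO),  -- foreign key
        COURSE_NO INTEGER
    )""")

conn.execute("INSERT INTO STUDENT VALUES (1, '555-0100', 'Asha')")
conn.execute("INSERT INTO STUDENT_COURSE VALUES (1, 101)")       # OK: student 1 exists
try:
    conn.execute("INSERT INTO STUDENT_COURSE VALUES (99, 101)")  # no such student
except sqlite3.IntegrityError as err:
    print("Rejected by the foreign key:", err)
```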
WAREHOUSE STORAGE
 In addition to data files, a data warehouse is also used to store large amounts of data.

 Similar to a warehouse for storing physical goods, a data warehouse is a large facility whose primary function is to store and process data at an enterprise level.

 It is an important tool for big data analytics. These large data warehouses support reporting, business intelligence (BI), analytics, data mining, research, cyber monitoring, and other related activities.

 These warehouses are usually optimised to retain and process large amounts of data at all times while feeding them in and out through online servers, so users can access their data without delay.

 The greatest benefit of data warehouses is the ability to translate raw data into information and insight. Data warehouses offer an effective way to support queries, analytics, and reporting, as well as providing forecasts and trends based on the collected data.
CLOUD STORAGE
 Cloud Storage – The other method of storing massive amounts of data is cloud storage, which is something more people are familiar with. If you have ever used iCloud or Google Drive, you were using cloud storage for your documents and files.

 With cloud storage, data and information are stored electronically online, where they can be accessed from anywhere, negating the need for direct-attached access to a hard drive or computer. With this approach, you can store a virtually boundless amount of data online and access it anywhere.

 Cloud storage is also significantly cheaper than the physical storage of data. Data warehouses consume large amounts of power, space, and resources and come with more risk; with cloud storage, a substantial amount of that cost is saved.
NOSQL DATABASE SYSTEMS
 Traditional relational database management systems (RDBMSs) provide powerful mechanisms to store and query structured data under strong consistency and transaction guarantees, and have reached an unmatched level of reliability, stability, and support through decades of development.

 However, data is now produced at a volume and velocity these systems were not designed for. User-generated content in social networks and data retrieved from large sensor networks are only two examples of this phenomenon, commonly referred to as Big Data.

 A class of novel data storage systems able to cope with Big Data is subsumed under the term NoSQL databases.
DATA STORING AND RETRIEVING IN NOSQL
KEY-VALUE STORES
 As an example, user account data and settings might be stored in a key-value store.

 A key-value store consists of a set of key-value pairs with unique keys. The values are opaque to the database, and no fixed structure is imposed on them.

 Key-value stores are therefore often referred to as schemaless; a sketch follows below.

BIG DATA AND RDBMS
 All data transactions done in relational databases need to adhere to the ACID standard.

 ACID Standard

 The ACID standard, often used to describe the properties of database transactions, stands for Atomicity, Consistency, Isolation, and Durability.

 These properties ensure that database transactions are reliable and maintain data integrity, even in the face of system failures or concurrent access by multiple users or processes.
ACID BACKGROUND
 Imagine you were building a function to transfer money from one
account to another where each account is its own record. If you
successfully take money from the source account, but never credit it to
the destination, you have a serious accounting problem. You’d have just
as big a problem (if not bigger) if you instead credited the destination, but
never took money out of the source to cover it.
ACID
 Atomicity: This property ensures that a transaction is treated as a single, indivisible unit.

 Either all the changes made by the transaction are applied to the database, or none of them are.

 In the case of a failure or error, the transaction is rolled back and the database returns to its previous consistent state.

 Example: money is deducted from the source; if any anomaly occurs, the changes are discarded and the transaction fails, as in the sketch below.
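A minimal sketch of the transfer scenario, assuming SQLite via Python's sqlite3; the account names, starting balances, and the CHECK constraint (which also previews the consistency example that follows) are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        name    TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)   -- consistency rule: no overdrafts
    )""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("source", 200), ("destination", 0)])

def transfer(amount):
    try:
        with conn:  # one atomic transaction: commit on success, rollback on any error
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'source'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'destination'", (amount,))
    except sqlite3.IntegrityError:
        print("Transfer of", amount, "rolled back; balances unchanged")

transfer(100)   # succeeds: source 100, destination 100
transfer(500)   # violates the CHECK; the whole transaction is rolled back
print(list(conn.execute("SELECT * FROM accounts")))
```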
 Consistency:
 Consistency guarantees that changes made within a transaction are consistent with database constraints.

 This includes all rules, constraints, and triggers.

 If the data gets into an illegal state, the whole transaction fails.

 Example: let's say there is a constraint that the balance must be a non-negative integer. If we try to overdraw money, the balance won't meet the constraint, the consistency guarantee of the ACID transaction will be violated, and the transaction will fail.
 Isolation
 Isolation ensures that all transactions run in an isolated environment.
That enables running transactions concurrently because transactions
don’t interfere with each other.

 For example, let's say that our account balance is $200 and two transactions for a $100 withdrawal start at the same time. The transactions run in isolation, which guarantees that when they both complete, we'll have a balance of $0 instead of $100 (see the sketch below).
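A minimal sketch of the lost-update hazard, in plain Python, with a lock standing in for the database's isolation mechanism (not any particular database's implementation):

```python
import threading

balance = 200
lock = threading.Lock()   # stands in for the database's isolation mechanism

def withdraw_isolated(amount):
    global balance
    with lock:                 # the whole read-modify-write runs in isolation
        current = balance
        balance = current - amount

threads = [threading.Thread(target=withdraw_isolated, args=(100,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 0, as isolation guarantees; without the lock, the two reads could
# interleave and one update would be lost, leaving 100 behind.
print(balance)
```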
 Durability
 Durability guarantees that once the transaction completes and changes
are written to the database, they are persisted.

 This ensures that data within the system will persist even in the case of
system failures like crashes or power outages.
BASE PROPERTY
 The BASE property is a set of principles that is often used in the context of
distributed and NoSQL databases.

 BASE stands for "Basically Available, Soft state, Eventually consistent."

 Unlike the ACID properties, which provide strong guarantees for data
consistency and reliability but may impose performance and scalability
limitations, BASE provides a more relaxed set of principles suitable for
distributed and large-scale systems.
 Basically Available: This means that the system remains operational and available for reads and writes, even in the presence of failures or network partitions.

 In other words, the system doesn't guarantee 100% uptime, but it strives to be available most of the time.

 During failures or under certain conditions, it may provide reduced functionality or performance.
 Soft State:
 Soft state implies that the data stored in the system may be in an intermediate or transitional state.

 The data doesn't have to be in a fully consistent state at all times, as long as it converges towards consistency eventually.

 This allows for flexibility and scalability by not enforcing strict consistency at all times.
 Eventually Consistent:

 Eventually consistent means that over time, assuming no further updates, all replicas of the data will converge to the same consistent state.

 This doesn't guarantee immediate consistency, and there might be a delay in achieving it.

 The system allows for some degree of inconsistency but ensures that it will be resolved without human intervention.
DIFFERENCE BETWEEN BASE PROPERTIES AND ACID PROPERTIES

 ACID: strong, immediate consistency; transactions are atomic and isolated; typical of relational databases; may limit performance and scalability.

 BASE: availability comes first; data may pass through soft, transitional states; replicas become consistent eventually; typical of distributed and NoSQL databases.
CAP PROPERTIES IN A DISTRIBUTED DATABASE SYSTEM
 Consistency (C): Reads and writes are always executed atomically and are strictly consistent.

 Put differently, all clients have the same view of the data at all times.

 This condition states that all nodes see the same data at the same time. Simply put, a read operation will return the value of the most recent write operation, causing all nodes to return the same data.
 Availability (A): Every non-failing node in the system can always accept read and write requests from clients and will eventually return a meaningful response, i.e. not an error message.

 This condition states that every request gets a response on success/failure.

 Achieving availability in a distributed system requires that the system remains operational 100% of the time.

 Every client gets a response, regardless of the state of any individual node in the system.
 Partition-tolerance (P): The system upholds the consistency and availability guarantees described above even in the presence of message loss between the nodes or partial system failure.

 This condition states that the system continues to run despite messages being delayed or lost by the network between nodes.

 A system that is partition-tolerant can sustain any amount of network failure that doesn't result in a failure of the entire network.

 Data records are sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

 When dealing with modern distributed systems, partition tolerance is not an option. It's a necessity.
WHAT IS THE CAP THEOREM?
 The CAP theorem states that it is not possible for a distributed database system to provide all three conditions (Consistency, Availability, and Partition tolerance) at the same time.
SHARDING
 Several distributed relational database systems such as Oracle RAC or IBM DB2 pureScale rely on a shared-disk architecture where all database nodes access the same central data repository (e.g. a NAS or SAN).

 Thus, these systems provide consistent data at all times, but are also inherently difficult to scale.

 In contrast, (NoSQL) database systems are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network.

 Thus, high scalability in throughput and data volume is achieved by sharding (partitioning) data across the different nodes (shards) in the system.
SHARDING
 Sharding is the process of breaking up large tables into smaller chunks called shards that are spread across multiple servers.

 Sharding is also referred to as horizontal partitioning; a shard is essentially a horizontal data partition that contains a subset of the total data set, and hence is responsible for serving a portion of the overall workload.

 The idea is to distribute data that cannot fit on a single node onto a cluster of database nodes.
VERTICAL AND HORIZONTAL PARTITIONING

 Vertical partitioning splits a table by columns, while horizontal partitioning (sharding) splits it by rows, each partition holding a subset of the records.
THREE BASIC DISTRIBUTION TECHNIQUES
 There are three basic distribution techniques: range sharding, hash sharding, and entity-group sharding.
RANGE SHARDING
 The data can be partitioned into ordered and contiguous value ranges by range sharding.

 Range sharding involves splitting the rows of a table into contiguous ranges that respect the sort order of the table, based on the primary key column values.

 However, this approach requires some coordination through a master that manages assignments.

 To ensure elasticity, the system has to be able to detect and resolve hotspots automatically by further splitting an overburdened shard.

 Range sharding is supported by wide-column stores like BigTable, HBase, or Hypertable.
EXAMPLE
 Suppose the table's key space is the 2-byte range from 0x0000 to 0xFFFF.

 Such a table may therefore have at most 64K tablets.

 This should be sufficient in practice even for very large data sets or cluster sizes.

 For a table with sixteen tablets, the overall space [0x0000, 0xFFFF] is divided into sixteen subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), …, [0xF000, 0xFFFF]. Read and write operations are routed to a tablet by the primary key, as in the sketch below.
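A minimal sketch of that routing, assuming equal-width tablets as in the example above; the function name is illustrative:

```python
# Sixteen tablets over the 2-byte key space [0x0000, 0xFFFF],
# each covering a contiguous 0x1000-wide subrange.
TABLET_WIDTH = 0x1000

def tablet_for(key: int) -> int:
    """Route a primary key to its tablet by range."""
    assert 0x0000 <= key <= 0xFFFF
    return key // TABLET_WIDTH     # tablet 0 holds [0x0000, 0x1000), and so on

print(tablet_for(0x0ABC))   # 0
print(tablet_for(0x1000))   # 1
print(tablet_for(0xFFFF))   # 15
```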
HASH-SHARDING
 Another way of partitioning data over several machines is hash sharding, where every data item is assigned to a shard server according to some hash value built from the primary key.

 This approach does not require a coordinator and also guarantees the data to be evenly distributed across the shards, as long as the hash function used produces an even distribution.

 The obvious disadvantage is that it only allows key lookups and makes range scans unfeasible.

 Hash sharding is used in key-value stores and is also available in some wide-column stores like Cassandra [34] or Azure Tables.
 The shard server responsible for a record can be determined as

serverid = hash(id) mod servers

 However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because the assignment changes with the number of shard servers (servers); the sketch below illustrates this.

 It is therefore not used in elastic systems like Dynamo, Riak, or Cassandra, which allow additional resources to be added on demand and removed again when dispensable.
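A minimal sketch of that reassignment problem, in plain Python; the record and server counts are illustrative:

```python
# Naive hash sharding: serverid = hash(id) mod servers.
# Shows how many records move when one server is added.
def server_for(record_id: int, servers: int) -> int:
    return hash(record_id) % servers

ids = range(10_000)
before = {i: server_for(i, 4) for i in ids}   # 4 shard servers
after  = {i: server_for(i, 5) for i in ids}   # one server joins

moved = sum(before[i] != after[i] for i in ids)
print(f"{moved / len(ids):.0%} of records reassigned")   # roughly 80% here
```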
EXAMPLE

Read and write operations are processed by converting the primary key into an internal key and its hash value, and determining to which tablet the operation should be routed.
CONSISTENT HASHING
 Elastic systems instead use consistent hashing, where only a fraction of the data has to be reassigned upon such system changes, as sketched below.
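A minimal sketch of a consistent-hashing ring, in plain Python with MD5 as the ring hash (not any particular system's implementation; the class and key names are illustrative). Servers sit at fixed positions on a ring; each key belongs to the first server at or after its own position, so adding a server only takes over the keys between it and its predecessor:

```python
import bisect
import hashlib

def ring_pos(name: str) -> int:
    # Stable hash onto a ring of 2**32 positions.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_pos(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        i = bisect.bisect(self.ring, (ring_pos(key),))  # first server at/after key
        return self.ring[i % len(self.ring)][1]         # wrap around the ring

    def add(self, server: str):
        bisect.insort(self.ring, (ring_pos(server), server))

ring = ConsistentHashRing(["s1", "s2", "s3", "s4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.server_for(k) for k in keys}
ring.add("s5")                                  # one server joins
moved = sum(before[k] != ring.server_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys reassigned")
# Only about 1/5 of the keys move on average, versus ~80% with the
# mod scheme; real systems place many virtual nodes per server to
# smooth out the variance of a single ring position.
```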
ENTITY-GROUP SHARDING
 Entity-group sharding is a data partitioning scheme with the goal of enabling single-partition transactions on co-located data.

 The partitions are called entity-groups and are either explicitly declared by the application or derived from the transactions' access patterns.

 If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups, or the transaction manager has to fall back to more expensive multi-node transaction protocols.
