

BIG DATA ANALYTICS

Big Data Storage and Processing


BIG DATA STORAGE
 A simple DBMS stores data in the form of schemas or tables comprising rows and columns.

 The main goal of a DBMS is to provide a solution for storing and retrieving information efficiently.

 SQL is used to fetch the data stored in these tables.

 An RDBMS also stores the relations between tables in columns (i.e., primary keys and foreign keys) that serve as references to other tables.

 Data in a table is stored in rows and columns, and the file grows as new records are added, increasing the size of the database.

 These files are shared across nodes by several users through database
servers.
PRIMARY KEYS IN RDBMS
 What is a Primary Key?
 A primary key is used to ensure that the data in a specific column is unique.

 A primary key column cannot contain NULL values. It is either an existing table column or a column that the database generates according to a defined sequence.


PRIMARY KEYS

Customer ID is the primary key in the Customer table.


FOREIGN KEY
 A foreign key is a column or group of columns in a relational database table that provides a link between data in two tables.

 It is a column (or columns) that references a column (most often the primary key) of another table.
Customer Table and City Table

Customer ID is the primary key in the Customer table and City ID is the primary key in the City table.
IN THE CUSTOMER TABLE

City ID is a foreign key in the Customer table and links it to the City table.
WHAT IS PRIMARY AND FOREIGN KEY?
 Example: STUD_NO and STUD_PHONE are both candidate keys for the relation STUDENT, but STUD_NO can be chosen as the primary key (only one of many candidate keys).

 Example: STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in the STUDENT relation, as in the sketch below.
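A minimal sketch of the STUDENT example, assuming SQLite via Python's built-in sqlite3 module; the STUD_NAME column, phone value, and sample rows are illustrative, not from the source:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked

conn.execute("""
    CREATE TABLE STUDENT (
        STUD_NO    INTEGER PRIMARY KEY,    -- candidate key chosen as primary key
        STUD_PHONE TEXT UNIQUE,            -- another candidate key
        STUD_NAME  TEXT                    -- illustrative extra column
    )""")
conn.execute("""
    CREATE TABLE STUDENT_COURSE (
        STUD_NO   INTEGER REFERENCES STUDENT (STUD_NO),  -- foreign key
        COURSE_NO INTEGER
    )""")

conn.execute("INSERT INTO STUDENT VALUES (1, '555-0100', 'Asha')")
conn.execute("INSERT INTO STUDENT_COURSE VALUES (1, 101)")       # OK: student 1 exists
try:
    conn.execute("INSERT INTO STUDENT_COURSE VALUES (99, 101)")  # no such student
except sqlite3.IntegrityError as err:
    print("Rejected by the foreign key:", err)
```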
WAREHOUSE STORAGE
 In addition to data files, a data warehouse is also used to store large amounts of data.

 Similar to a warehouse for storing physical goods, a data warehouse is a large facility whose primary function is to store and process data at an enterprise level.

 It is an important tool for big data analytics. These large data warehouses support reporting, business intelligence (BI), analytics, data mining, research, cyber monitoring, and other related activities.

 These warehouses are usually optimised to retain and process large amounts of data at all times while feeding them in and out through online servers, so users can access their data without delay.

 The greatest benefit of data warehouses is the ability to translate raw data into information and insight. Data warehouses offer an effective way to support queries, analytics, and reporting, as well as providing forecasts and trends based on the collected data.
CLOUD STORAGE
 Cloud Storage – The other method of storing massive amounts of data is cloud storage, which is something more people are familiar with. If you have ever used iCloud or Google Drive, you were using cloud storage for your documents and files.

 With cloud storage, data and information are stored electronically online, where they can be accessed from anywhere, negating the need for direct-attached access to a hard drive or computer. With this approach, you can store a virtually boundless amount of data online and access it anywhere.

 Cloud storage is also significantly cheaper than the physical storage of data. Data warehouses consume large amounts of power, space, and resources and come with more risk; with cloud storage, a substantial amount of that cost is saved.
NOSQL DATABASE SYSTEMS
 Traditional relational database management systems (RDBMSs) provide powerful mechanisms to store and query structured data under strong consistency and transaction guarantees, and have reached an unmatched level of reliability, stability, and support through decades of development.

 However, data is now produced at a volume and velocity these systems were not designed for. User-generated content in social networks and data retrieved from large sensor networks are only two examples of this phenomenon, commonly referred to as Big Data.

 A class of novel data storage systems able to cope with Big Data is subsumed under the term NoSQL databases.
DATA STORING AND RETRIEVING IN NOSQL
KEY-VALUE STORES
 As an example, user account data and settings might be stored in a key-value store.

 A key-value store consists of a set of key-value pairs with unique keys. The values are opaque to the database, and no fixed structure is imposed on them.

 Key-value stores are therefore often referred to as schemaless; a sketch follows below.

BIG DATA AND RDBMS
 All data transactions done in relational databases need to adhere to the ACID standard.

 ACID Standard

 The ACID standard, often used to describe the properties of database transactions, stands for Atomicity, Consistency, Isolation, and Durability.

 These properties ensure that database transactions are reliable and maintain data integrity, even in the face of system failures or concurrent access by multiple users or processes.
ACID BACKGROUND
 Imagine you were building a function to transfer money from one
account to another where each account is its own record. If you
successfully take money from the source account, but never credit it to
the destination, you have a serious accounting problem. You’d have just
as big a problem (if not bigger) if you instead credited the destination, but
never took money out of the source to cover it.
ACID
 Atomicity: This property ensures that a transaction is treated as a single, indivisible unit.

 Either all the changes made by the transaction are applied to the database, or none of them are.

 In the case of a failure or error, the transaction is rolled back and the database returns to its previous consistent state.

 Example: money is deducted from the source; if any anomaly occurs, the changes are discarded and the transaction fails, as in the sketch below.
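A minimal sketch of the transfer scenario, assuming SQLite via Python's sqlite3; the account names, starting balances, and the CHECK constraint (which also previews the consistency example that follows) are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        name    TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)   -- consistency rule: no overdrafts
    )""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("source", 200), ("destination", 0)])

def transfer(amount):
    try:
        with conn:  # one atomic transaction: commit on success, rollback on any error
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'source'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'destination'", (amount,))
    except sqlite3.IntegrityError:
        print("Transfer of", amount, "rolled back; balances unchanged")

transfer(100)   # succeeds: source 100, destination 100
transfer(500)   # violates the CHECK; the whole transaction is rolled back
print(list(conn.execute("SELECT * FROM accounts")))
```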
 Consistency:
 Consistency guarantees that changes made within a transaction are consistent with database constraints.

 This includes all rules, constraints, and triggers.

 If the data gets into an illegal state, the whole transaction fails.

 Example: let's say there is a constraint that the balance must be a non-negative integer. If we try to overdraw money, the balance won't meet the constraint, the consistency guarantee of the ACID transaction will be violated, and the transaction will fail.
 Isolation
 Isolation ensures that all transactions run in an isolated environment.
That enables running transactions concurrently because transactions
don’t interfere with each other.

 For example, let's say that our account balance is $200 and two transactions for a $100 withdrawal start at the same time. The transactions run in isolation, which guarantees that when they both complete, we'll have a balance of $0 instead of $100 (see the sketch below).
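A minimal sketch of the lost-update hazard, in plain Python, with a lock standing in for the database's isolation mechanism (not any particular database's implementation):

```python
import threading

balance = 200
lock = threading.Lock()   # stands in for the database's isolation mechanism

def withdraw_isolated(amount):
    global balance
    with lock:                 # the whole read-modify-write runs in isolation
        current = balance
        balance = current - amount

threads = [threading.Thread(target=withdraw_isolated, args=(100,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 0, as isolation guarantees; without the lock, the two reads could
# interleave and one update would be lost, leaving 100 behind.
print(balance)
```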
 Durability
 Durability guarantees that once the transaction completes and changes
are written to the database, they are persisted.

 This ensures that data within the system will persist even in the case of
system failures like crashes or power outages.
BASE PROPERTY
 The BASE property is a set of principles that is often used in the context of
distributed and NoSQL databases.

 BASE stands for "Basically Available, Soft state, Eventually consistent."

 Unlike the ACID properties, which provide strong guarantees for data
consistency and reliability but may impose performance and scalability
limitations, BASE provides a more relaxed set of principles suitable for
distributed and large-scale systems.
 Basically Available: This means that the system remains operational and available for reads and writes, even in the presence of failures or network partitions.

 In other words, the system doesn't guarantee 100% uptime, but it strives to be available most of the time.

 During failures or under certain conditions, it may provide reduced functionality or performance.
 Soft State:
 Soft state implies that the data stored in the system may be in an intermediate or transitional state.

 The data doesn't have to be in a fully consistent state at all times, as long as it converges towards consistency eventually.

 This allows for flexibility and scalability by not enforcing strict consistency at all times.
 Eventually Consistent:

 Eventually consistent means that over time, assuming no further updates, all replicas of the data will converge to the same consistent state.

 This doesn't guarantee immediate consistency, and there might be a delay in achieving it.

 The system allows for some degree of inconsistency but ensures that it will be resolved without human intervention.
DIFFERENCE BETWEEN BASE PROPERTIES AND ACID PROPERTIES

 ACID: strong, immediate consistency; transactions are atomic and isolated; typical of relational databases; may limit performance and scalability.

 BASE: availability comes first; data may pass through soft, transitional states; replicas become consistent eventually; typical of distributed and NoSQL databases.
CAP PROPERTIES IN A DISTRIBUTED DATABASE SYSTEM
 Consistency (C): Reads and writes are always executed atomically and are strictly consistent.

 Put differently, all clients have the same view of the data at all times.

 This condition states that all nodes see the same data at the same time. Simply put, a read operation will return the value of the most recent write operation, causing all nodes to return the same data.
 Availability (A): Every non-failing node in the system can always accept read and write requests from clients and will eventually return a meaningful response, i.e. not an error message.

 This condition states that every request gets a response on success/failure.

 Achieving availability in a distributed system requires that the system remains operational 100% of the time.

 Every client gets a response, regardless of the state of any individual node in the system.
 Partition-tolerance (P): The system upholds the consistency and availability guarantees described above even in the presence of message loss between the nodes or partial system failure.

 This condition states that the system continues to run despite messages being delayed or lost by the network between nodes.

 A system that is partition-tolerant can sustain any amount of network failure that doesn't result in a failure of the entire network.

 Data records are sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

 When dealing with modern distributed systems, partition tolerance is not an option. It's a necessity.
WHAT IS THE CAP THEOREM?
 The CAP theorem states that it is not possible for a distributed database system to provide all three conditions (Consistency, Availability, and Partition tolerance) at the same time.
SHARDING
 Several distributed relational database systems such as Oracle RAC or IBM DB2 pureScale rely on a shared-disk architecture where all database nodes access the same central data repository (e.g. a NAS or SAN).

 Thus, these systems provide consistent data at all times, but are also inherently difficult to scale.

 In contrast, (NoSQL) database systems are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network.

 Thus, high scalability in throughput and data volume is achieved by sharding (partitioning) data across the different nodes (shards) in the system.
SHARDING
 Sharding is the process of breaking up large tables into smaller chunks called shards that are spread across multiple servers.

 Sharding is also referred to as horizontal partitioning; a shard is essentially a horizontal data partition that contains a subset of the total data set, and hence is responsible for serving a portion of the overall workload.

 The idea is to distribute data that cannot fit on a single node onto a cluster of database nodes.
VERTICAL AND HORIZONTAL PARTITIONING

 Vertical partitioning splits a table by columns, while horizontal partitioning (sharding) splits it by rows, each partition holding a subset of the records.
THREE BASIC DISTRIBUTION TECHNIQUES
 There are three basic distribution techniques: range sharding, hash sharding, and entity-group sharding.
RANGE SHARDING
 The data can be partitioned into ordered and contiguous value ranges by range sharding.

 Range sharding involves splitting the rows of a table into contiguous ranges that respect the sort order of the table, based on the primary key column values.

 However, this approach requires some coordination through a master that manages assignments.

 To ensure elasticity, the system has to be able to detect and resolve hotspots automatically by further splitting an overburdened shard.

 Range sharding is supported by wide-column stores like BigTable, HBase, or Hypertable.
EXAMPLE
 Suppose the table's key space is the 2-byte range from 0x0000 to 0xFFFF.

 Such a table may therefore have at most 64K tablets.

 This should be sufficient in practice even for very large data sets or cluster sizes.

 For a table with sixteen tablets, the overall space [0x0000, 0xFFFF] is divided into sixteen subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), …, [0xF000, 0xFFFF]. Read and write operations are routed to a tablet by the primary key, as in the sketch below.
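A minimal sketch of that routing, assuming equal-width tablets as in the example above; the function name is illustrative:

```python
# Sixteen tablets over the 2-byte key space [0x0000, 0xFFFF],
# each covering a contiguous 0x1000-wide subrange.
TABLET_WIDTH = 0x1000

def tablet_for(key: int) -> int:
    """Route a primary key to its tablet by range."""
    assert 0x0000 <= key <= 0xFFFF
    return key // TABLET_WIDTH     # tablet 0 holds [0x0000, 0x1000), and so on

print(tablet_for(0x0ABC))   # 0
print(tablet_for(0x1000))   # 1
print(tablet_for(0xFFFF))   # 15
```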
HASH-SHARDING
 Another way of partitioning data over several machines is hash sharding, where every data item is assigned to a shard server according to some hash value built from the primary key.

 This approach does not require a coordinator and also guarantees the data to be evenly distributed across the shards, as long as the hash function used produces an even distribution.

 The obvious disadvantage is that it only allows key lookups and makes range scans unfeasible.

 Hash sharding is used in key-value stores and is also available in some wide-column stores like Cassandra [34] or Azure Tables.
 The shard server responsible for a record can be determined as

serverid = hash(id) mod servers

 However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because the assignment changes with the number of shard servers (servers); the sketch below illustrates this.

 It is therefore not used in elastic systems like Dynamo, Riak, or Cassandra, which allow additional resources to be added on demand and removed again when dispensable.
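A minimal sketch of that reassignment problem, in plain Python; the record and server counts are illustrative:

```python
# Naive hash sharding: serverid = hash(id) mod servers.
# Shows how many records move when one server is added.
def server_for(record_id: int, servers: int) -> int:
    return hash(record_id) % servers

ids = range(10_000)
before = {i: server_for(i, 4) for i in ids}   # 4 shard servers
after  = {i: server_for(i, 5) for i in ids}   # one server joins

moved = sum(before[i] != after[i] for i in ids)
print(f"{moved / len(ids):.0%} of records reassigned")   # roughly 80% here
```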
EXAMPLE

Read and write operations are processed by converting the primary key into an internal key and its hash value, and determining to which tablet the operation should be routed.
CONSISTENT HASHING
 Elastic systems instead use consistent hashing, where only a fraction of the data has to be reassigned upon such system changes, as sketched below.
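A minimal sketch of a consistent-hashing ring, in plain Python with MD5 as the ring hash (not any particular system's implementation; the class and key names are illustrative). Servers sit at fixed positions on a ring; each key belongs to the first server at or after its own position, so adding a server only takes over the keys between it and its predecessor:

```python
import bisect
import hashlib

def ring_pos(name: str) -> int:
    # Stable hash onto a ring of 2**32 positions.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_pos(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        i = bisect.bisect(self.ring, (ring_pos(key),))  # first server at/after key
        return self.ring[i % len(self.ring)][1]         # wrap around the ring

    def add(self, server: str):
        bisect.insort(self.ring, (ring_pos(server), server))

ring = ConsistentHashRing(["s1", "s2", "s3", "s4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.server_for(k) for k in keys}
ring.add("s5")                                  # one server joins
moved = sum(before[k] != ring.server_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys reassigned")
# Only about 1/5 of the keys move on average, versus ~80% with the
# mod scheme; real systems place many virtual nodes per server to
# smooth out the variance of a single ring position.
```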
ENTITY-GROUP SHARDING
 Entity-group sharding is a data partitioning scheme with the goal of enabling single-partition transactions on co-located data.

 The partitions are called entity-groups and are either explicitly declared by the application or derived from the transactions' access patterns.

 If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups, or the transaction manager has to fall back to more expensive multi-node transaction protocols.
