
Lecture 5 Distributed Storage Systems

The document discusses distributed storage systems in cloud computing, highlighting their importance for reliable, scalable, and high-performance data storage. It covers various cloud storage services, distributed file systems, NoSQL databases, data consistency models, cloud-based data warehousing, and real-time processing solutions. Additionally, it addresses backup, disaster recovery, and security measures essential for maintaining data integrity and availability.

Uploaded by

5699silver

Distributed Storage Systems

Cloud Computing
Spring 2025
Introduction
• In cloud computing, storage is not confined to a single server or
location.
• Distributed storage systems enable reliable, scalable, and high-
performance data storage across a network of machines.
• These systems underpin many cloud services and are fundamental to
supporting modern applications that require access to large-scale,
highly available data.
• This chapter explores the various facets of distributed storage in the
cloud, from fundamental storage services to advanced architectures
for real-time processing and disaster recovery.
Cloud Storage Services
Cloud providers offer highly scalable and durable storage solutions for
unstructured data. Key services include:

• Amazon S3

• Google Cloud Storage

• Azure Blob Storage


Amazon S3
• Amazon Simple Storage Service (S3) is an object storage service that
offers industry-leading scalability, availability, and security, and is
designed for 99.999999999% (eleven nines) data durability.
• S3 organizes data into buckets and allows users to store and retrieve
any amount of data at any time.
• Key features include versioning, lifecycle policies, cross-region
replication, encryption, and fine-grained access control.
• Integrates with AWS analytics and compute services.
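To make the durability figure concrete, the sketch below converts it into an expected number of lost objects per year. It assumes the figure is an annual per-object probability and that losses are independent; the function name and the object count are illustrative only.

```python
# Annual per-object loss probability implied by eleven nines of durability.
ANNUAL_LOSS_PROBABILITY = 1e-11


def expected_annual_losses(num_objects: int,
                           loss_probability: float = ANNUAL_LOSS_PROBABILITY) -> float:
    """Expected number of objects lost per year, treating losses as independent."""
    return num_objects * loss_probability


losses = expected_annual_losses(10_000_000)
years_per_loss = 1.0 / losses
print(f"{losses:.6f} objects per year, about one loss every {years_per_loss:,.0f} years")
```

Under these assumptions, storing ten million objects corresponds to roughly one lost object every 10,000 years.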
Google Cloud Storage
• Google Cloud Storage offers unified object storage for developers and
enterprises.
• It provides multiple storage classes (Standard, Nearline, Coldline,
Archive) designed for different access frequencies.
• Features include strong consistency for object reads and writes,
geo-redundancy for dual-region and multi-region buckets, and
integration with other Google Cloud services such as BigQuery and
AI/ML tools.
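The choice between storage classes can be sketched as a simple rule on access frequency. The thresholds below are a rule of thumb that mirrors the minimum storage durations of Nearline (30 days), Coldline (90 days), and Archive (365 days); the function name is hypothetical, and a real decision would also weigh retrieval and storage prices.

```python
def pick_storage_class(days_between_accesses: float) -> str:
    """Suggest a storage class from the expected gap between accesses (in days)."""
    if days_between_accesses >= 365:
        return "ARCHIVE"     # touched less than once a year
    if days_between_accesses >= 90:
        return "COLDLINE"    # touched about once a quarter or less
    if days_between_accesses >= 30:
        return "NEARLINE"    # touched about once a month or less
    return "STANDARD"        # frequently accessed ("hot") data


for gap in (1, 45, 120, 400):
    print(gap, pick_storage_class(gap))
```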
Azure Blob Storage
• Azure Blob Storage is Microsoft’s object storage solution for the
cloud.
• It is optimized for storing massive amounts of unstructured data such
as text and binary data.
• Blob Storage offers hot, cool, cold, and archive access tiers,
enabling cost-effective storage based on usage patterns.
• Supports block blobs, append blobs, and page blobs.
• Integrates with Azure Data Lake Storage for analytics.
Distributed File Systems
Distributed file systems enable large-scale data storage across clusters.
Key systems include:

• Hadoop Distributed File System (HDFS)

• Ceph

• Lustre
Hadoop Distributed File System (HDFS)
• HDFS is a scalable, fault-tolerant distributed file system designed to
run on commodity hardware.
• It divides large files into blocks (128 MB by default) and distributes
them across nodes in a cluster.
• Each block is replicated across nodes (three copies by default) to
ensure data durability and availability.
• Designed for batch processing with MapReduce.
• Optimized for large, sequential reads.
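Block splitting and replication can be sketched as follows. This is a simplified illustration, not HDFS's actual placement policy (which is rack-aware): it assumes a 128 MB block size, a replication factor of 3, and round-robin placement over hypothetical node names.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes
REPLICATION = 3                  # assumed replication factor


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)          # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, holders in placement.items():
    print(block, blocks[block], holders)
```

Because each block lives on three different nodes, losing any single node leaves at least two copies of every block.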
Ceph
• Ceph is a unified, distributed storage system designed for excellent
performance, reliability, and scalability.
• It provides object, block, and file system storage in a single platform.
• Ceph uses the CRUSH (Controlled Replication Under Scalable Hashing)
algorithm to compute data placement, eliminating the need for a
central lookup table of object locations.
• Highly scalable with self-healing capabilities.
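The idea of computing placement instead of looking it up can be illustrated with rendezvous (highest-random-weight) hashing. This is a stand-in technique, not the CRUSH algorithm itself, which additionally accounts for device weights and failure domains; the device names are hypothetical.

```python
import hashlib
from typing import List


def _score(object_name: str, device: str) -> int:
    """Deterministic pseudo-random score for an (object, device) pair."""
    digest = hashlib.sha256(f"{object_name}:{device}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def locate(object_name: str, devices: List[str], replicas: int = 3) -> List[str]:
    """Pick `replicas` devices for an object by ranking all devices by score."""
    ranked = sorted(devices, key=lambda d: _score(object_name, d), reverse=True)
    return ranked[:replicas]


devices = [f"osd.{i}" for i in range(5)]
print(locate("volume-42/chunk-7", devices))
```

Any client that knows the device list computes the same answer, so no central directory is consulted. Removing a device that does not hold an object leaves that object's placement unchanged.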
Lustre
• Lustre is a high-performance distributed file system commonly used
in large-scale cluster computing.
• POSIX (Portable Operating System Interface) compliant, so existing
applications can use it like a local file system.
• It is widely deployed in supercomputing environments where
performance and throughput are critical.
• Used in scientific computing and financial modeling.
NoSQL Databases in the Cloud
NoSQL databases provide flexible schemas and horizontal scalability for
cloud applications.

• Amazon DynamoDB

• Apache Cassandra

• MongoDB Atlas
Amazon DynamoDB
• DynamoDB is a fully managed NoSQL database service that supports
key-value and document data models.
• Delivers single-digit millisecond latency and scales automatically
with on-demand or auto-scaled capacity.
• Supports ACID (atomicity, consistency, isolation, and durability)
transactions.
• Designed for low-latency, high-throughput applications, with features
such as DAX (DynamoDB Accelerator) for in-memory caching and global
tables for multi-region replication.
Apache Cassandra
• Cassandra is a highly scalable NoSQL database designed for handling
large amounts of data across multiple commodity servers with no
single point of failure.
• It uses a decentralized, peer-to-peer architecture and a wide-column
data model.
• Consistency is tunable per operation, from eventual consistency to
strong consistency with quorum reads and writes.
• Linear scalability across multiple data centers.
• Used by Netflix, Apple, and other operators of large-scale applications.
MongoDB Atlas
• MongoDB Atlas is the fully managed cloud service for MongoDB, a
document-oriented NoSQL database.
• Documents are JSON-like (stored as BSON) with a flexible schema.
• Atlas supports multi-region deployments, automated backups, and
integrated monitoring tools.
• Supports sharding for horizontal scaling.
Data Consistency Models and Replication
Strategies
Distributed storage systems face trade-offs between consistency,
availability, and partition tolerance. The CAP theorem states that
during a network partition a system must give up either consistency or
availability. Various consistency models are used to balance these
trade-offs:
• Strong Consistency: Every read returns the most recent committed
write, so all users see the same data at the same time.
• Eventual Consistency: Updates will eventually propagate through the
system, but immediate consistency is not guaranteed.
• Causal Consistency: Ensures that causally related updates are seen by
all nodes in the same order.
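One common way to decide whether two updates are causally related is to stamp them with vector clocks. The sketch below is a minimal comparison routine under that assumption; the slides do not prescribe a mechanism, and the node names are hypothetical.

```python
from typing import Dict

Clock = Dict[str, int]   # node name -> number of events seen from that node


def compare(a: Clock, b: Clock) -> str:
    """Classify clock `a` relative to `b`: before, after, equal, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a causally precedes b
    if b_le_a:
        return "after"        # b causally precedes a
    return "concurrent"       # no causal relation; any order is acceptable


print(compare({"A": 1}, {"A": 2, "B": 1}))
print(compare({"A": 2}, {"B": 1}))
```

A causally consistent store must apply "before" updates first on every node, while "concurrent" updates may be applied in different orders on different nodes.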
Replication strategies include:
• Master-slave (primary-replica) replication: One node handles writes;
the others replicate its data.
• Multi-master replication: Multiple nodes can handle writes, requiring
conflict resolution.
• Quorum-based replication: A read must reach R replicas and a write W
replicas out of N; choosing R + W > N guarantees that every read
overlaps the latest write. Balances consistency and availability (e.g.,
Dynamo-style systems).
• Synchronous replication: A write is acknowledged only after replicas
confirm it, which ensures consistency but increases latency.
• Asynchronous replication: A write is acknowledged before replicas
confirm it, which lowers latency but risks losing recent writes.
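The quorum condition can be checked with a small simulation. With N replicas, a write acknowledged by W of them, and a read that contacts R of them, any read set overlaps the write set whenever R + W > N. The version numbers and values below are made up for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Replica = Tuple[int, str]   # (version, value) held by one replica


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return r + w > n


def quorum_read(responses: List[Replica]) -> str:
    """Return the value carrying the highest version among the responses."""
    return max(responses, key=lambda rep: rep[0])[1]


N, R, W = 3, 2, 2
# Version 2 reached W = 2 replicas; the third replica is still stale.
replicas: List[Replica] = [(2, "new"), (2, "new"), (1, "old")]

always_fresh = all(
    quorum_read(list(group)) == "new" for group in combinations(replicas, R)
)
print(quorums_overlap(N, R, W), always_fresh)
```

With R = 1 and W = 1 the condition fails, and a read that happens to hit the stale replica would return the old value.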
Cloud-Based Data Warehousing
Modern data warehouses enable large-scale analytics, often with
serverless or elastically scaled architectures.

• Google BigQuery

• Snowflake

• Amazon Redshift
Google BigQuery
• BigQuery is a serverless, highly scalable data warehouse that allows
users to run SQL queries on large datasets.

• It supports real-time analytics on streamed data and integrates with
various data ingestion tools.

• Machine learning models can be trained and queried in SQL with
BigQuery ML.
Snowflake
• Snowflake offers a cloud-native data warehouse with separate
compute and storage, enabling elastic scalability and concurrent
workloads.

• Its architecture supports structured and semi-structured data.


Amazon Redshift
• Redshift is a fully managed data warehouse that uses columnar
storage and parallel processing to deliver high performance for
analytical queries.

• It integrates with S3 for data lakes and supports Redshift Spectrum for
querying data directly from S3.
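Why columnar storage helps analytical queries can be seen in a toy example: an aggregate over one column only needs to read that column's values rather than every field of every row. The table and its column names are hypothetical, and real warehouses add compression and parallel scans on top of this layout.

```python
from typing import Dict, List

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount_cents": 12000},
    {"order_id": 2, "region": "US", "amount_cents": 8000},
    {"order_id": 3, "region": "EU", "amount_cents": 20000},
]

# Column-oriented layout: each column's values are stored contiguously.
columns: Dict[str, List] = {name: [row[name] for row in rows] for name in rows[0]}

# SUM(amount_cents) touches a single column instead of every full row.
total_cents = sum(columns["amount_cents"])
print(total_cents)
```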
Data Streaming and Real-Time Processing
Real-time data processing is crucial for applications such as fraud
detection, log analysis, and recommendation systems.
Streaming platforms and managed cloud services include:
• Apache Kafka: An open-source distributed event streaming platform
that enables real-time data feeds.
• Amazon Kinesis / Azure Event Hubs: Managed streaming services for
real-time data ingestion, processing, and analytics. They support
ingestion from IoT devices, logs, and transactions.
• Google Cloud Dataflow: A serverless data processing service for
stream and batch data using the Apache Beam SDK.
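A typical building block of stream processing is windowed aggregation. The sketch below counts events per fixed (tumbling) window in plain Python; it does not use the Kafka, Kinesis, or Beam APIs, and the event tuples are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[float, str]   # (timestamp in seconds, event type)


def tumbling_window_counts(events: Iterable[Event],
                           window_seconds: int = 60) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, event type) in fixed-size windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, kind in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, kind)] += 1
    return dict(counts)


events = [(0.5, "login"), (10.0, "login"), (59.9, "purchase"),
          (60.0, "login"), (125.0, "login")]
for key, count in sorted(tumbling_window_counts(events).items()):
    print(key, count)
```

A production pipeline would also have to handle late and out-of-order events, which this sketch ignores.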
Backup, Disaster Recovery, and Storage
Security

• Backup and Disaster Recovery

• Storage Security
Backup and Disaster Recovery
Cloud providers offer automated backup services with options for
versioning and point-in-time recovery.
Disaster recovery strategies include:
• Cold standby: Delayed recovery using periodically updated backups.
• Warm standby: Partially active infrastructure that can be quickly
scaled.
• Hot standby: Fully active and redundant systems across regions.
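Point-in-time recovery is commonly built from a periodic snapshot plus a log of later changes that is replayed up to the requested moment. The sketch below assumes that design; the data layout and function name are hypothetical rather than any provider's API.

```python
from typing import Dict, List, Optional, Tuple

# (timestamp, key, new value); a value of None records a deletion.
Change = Tuple[int, str, Optional[str]]


def restore_to(snapshot: Dict[str, str], snapshot_time: int,
               log: List[Change], target_time: int) -> Dict[str, str]:
    """Rebuild the state at `target_time` from a snapshot and a change log."""
    if target_time < snapshot_time:
        raise ValueError("target time predates the snapshot")
    state = dict(snapshot)
    for timestamp, key, value in sorted(log, key=lambda change: change[0]):
        if snapshot_time < timestamp <= target_time:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state


snapshot = {"a": "1"}
log = [(110, "a", "2"), (120, "b", "x"), (130, "a", None)]
print(restore_to(snapshot, 100, log, 125))
```

Restoring to a time just before an accidental deletion recovers the deleted key without losing the changes that came earlier.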
Storage Security
Security in cloud storage involves:

• Encryption: Both in transit (TLS) and at rest (AES-256).

• Access Control: Fine-grained IAM policies and access logs.

• Immutable storage: Write-once-read-many (WORM) retention that
protects stored data against ransomware.

• Compliance: Adherence to standards like GDPR, HIPAA, and SOC 2.
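The effect of immutable storage can be sketched as a store that rejects overwrites and deletes until a retention period has passed. This is an in-memory illustration with hypothetical class and method names, not a cloud provider's object-lock API.

```python
import time
from typing import Callable, Dict, Tuple


class RetentionError(Exception):
    """Raised when a locked object would be changed or removed."""


class ImmutableStore:
    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._retention = retention_seconds
        self._clock = clock
        self._objects: Dict[str, Tuple[bytes, float]] = {}   # key -> (data, unlock time)

    def _locked(self, key: str) -> bool:
        return key in self._objects and self._clock() < self._objects[key][1]

    def put(self, key: str, data: bytes) -> None:
        if self._locked(key):
            raise RetentionError(f"{key} is under retention")
        self._objects[key] = (data, self._clock() + self._retention)

    def delete(self, key: str) -> None:
        if self._locked(key):
            raise RetentionError(f"{key} is under retention")
        self._objects.pop(key, None)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]


store = ImmutableStore(retention_seconds=3600)
store.put("backup-2025-03-01", b"payload")
try:
    store.put("backup-2025-03-01", b"encrypted by attacker")
except RetentionError as error:
    print("rejected:", error)
```

An attacker who gains write access still cannot replace or remove the backup while the retention period is running.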


Conclusion
• Distributed storage systems are foundational to the reliability,
performance, and scalability of cloud-based solutions.
• From object storage services and distributed file systems to NoSQL
databases and real-time processing platforms, understanding these
systems is essential for architects and developers building cloud-
native applications.
• Moreover, robust replication strategies, consistency models, and
security mechanisms ensure the integrity and availability of data in a
distributed environment.
