Lecture 5 Distributed Storage Systems
Cloud Computing
Spring 2025
Introduction
• In cloud computing, storage is not confined to a single server or
location.
• Distributed storage systems enable reliable, scalable, and high-
performance data storage across a network of machines.
• These systems underpin many cloud services and are fundamental to
supporting modern applications that require access to large-scale,
highly available data.
• This chapter explores the various facets of distributed storage in the
cloud, from fundamental storage services to advanced architectures
for real-time processing and disaster recovery.
Cloud Storage Services
Cloud providers and open-source projects offer highly scalable and durable
storage solutions for unstructured data. Key systems include:
• Amazon S3
• Ceph
• Lustre
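To make the object-storage model concrete, the sketch below uses the boto3 SDK to write and read a single object in Amazon S3. It assumes AWS credentials are configured and that the bucket name and key, which are illustrative placeholders, already exist.

```python
# Minimal sketch of object storage access with boto3. The bucket name and
# object key are illustrative placeholders, not part of the lecture.
import boto3

s3 = boto3.client("s3")

# Upload an object (bucket + key -> bytes).
s3.put_object(Bucket="example-lecture-bucket",
              Key="notes/lecture5.txt",
              Body=b"Distributed storage systems lecture notes")

# Download it again.
response = s3.get_object(Bucket="example-lecture-bucket",
                         Key="notes/lecture5.txt")
print(response["Body"].read().decode())
```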
Hadoop Distributed File System (HDFS)
• HDFS is a scalable, fault-tolerant distributed file system designed to
run on commodity hardware.
• Designed for batch processing with MapReduce
• Replicates data across nodes for fault tolerance
• Optimized for large, sequential reads
• It divides large files into blocks and distributes them across nodes in a
cluster.
• Each block is replicated to ensure data durability and availability.
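The following is a simplified, illustrative Python sketch (not HDFS code) of the two ideas above: a large file is split into fixed-size blocks, and each block is assigned to several DataNodes. The 128 MB block size and replication factor of 3 are the common HDFS defaults; the node names and round-robin placement are teaching simplifications.

```python
# Illustrative sketch of HDFS-style block splitting and replica placement.
# This is a teaching model, not the actual HDFS implementation.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the common HDFS default
REPLICATION = 3                  # default replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(file_size_bytes):
    """Return the number of blocks a file of the given size occupies."""
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(block_id):
    """Pick REPLICATION distinct DataNodes for one block
    (round-robin here; real HDFS placement is rack-aware)."""
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(REPLICATION)]

file_size = 500 * 1024 * 1024          # a 500 MB file -> 4 blocks
for block_id in range(split_into_blocks(file_size)):
    print(block_id, place_replicas(block_id))
```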
Ceph
• Ceph is a unified, distributed storage system designed for excellent
performance, reliability, and scalability.
• It provides object, block, and file system storage in a single platform.
• Ceph uses the CRUSH (Controlled Replication Under Scalable Hashing)
algorithm for data placement, eliminating the need for a central
metadata server.
• Highly scalable with self-healing capabilities
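The sketch below illustrates the key idea behind CRUSH: placement is computed by a deterministic hash function, so any client can locate data without consulting a central metadata server. It is a heavily simplified stand-in for illustration, not the real CRUSH algorithm, and the OSD names and placement-group count are assumed values.

```python
# Simplified illustration of hash-based data placement (the CRUSH idea):
# every client computes the same placement from the object name alone,
# so no central metadata server is needed. Not the real CRUSH algorithm.
import hashlib

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
NUM_PGS = 64          # number of placement groups (assumed for the example)
REPLICAS = 3

def object_to_pg(obj_name):
    """Hash the object name into a placement group."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % NUM_PGS

def pg_to_osds(pg):
    """Deterministically map a placement group to a set of OSDs."""
    return [osds[(pg + i) % len(osds)] for i in range(REPLICAS)]

print(pg_to_osds(object_to_pg("photo-0001.jpg")))
```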
Lustre
• Lustre is a high-performance distributed file system commonly used
in large-scale cluster computing.
• Supports POSIX (Portable Operating System Interface) compliance for compatibility
• It is widely deployed in supercomputing environments where
performance and throughput are critical.
• Used in scientific computing and financial modeling.
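Because Lustre is POSIX-compliant, applications use ordinary file I/O without modification. The sketch below shows standard Python file operations; the mount point and path are assumed examples of a Lustre file system.

```python
# Ordinary POSIX file I/O works unchanged on a POSIX-compliant file system
# such as Lustre. "/mnt/lustre" is an assumed mount point for illustration.
path = "/mnt/lustre/experiment/results.csv"

with open(path, "w") as f:          # standard open/write/close calls
    f.write("step,value\n0,1.0\n")

with open(path) as f:
    print(f.read())
```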
NoSQL Databases in the Cloud
NoSQL databases provide flexible schemas and horizontal scalability for
cloud applications.
• Amazon DynamoDB
• Apache Cassandra
• MongoDB Atlas
Amazon DynamoDB
• DynamoDB is a fully managed NoSQL database service that supports
key-value and document data models.
• Single-digit millisecond latency with auto-scaling
• Supports ACID transactions (atomicity, consistency, isolation, and
durability) and global tables
• It is designed for low-latency, high-throughput applications and offers
features such as on-demand scaling and DAX (DynamoDB Accelerator), a
fully managed in-memory cache.
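A minimal key-value write and read with boto3 is sketched below. The table name "Users" and its partition key "user_id" are illustrative assumptions; the table is assumed to already exist.

```python
# Minimal DynamoDB key-value access with boto3. The table "Users" and its
# partition key "user_id" are assumed to exist for illustration.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

# Write one item (key-value / document style).
table.put_item(Item={"user_id": "u123", "name": "Alice", "plan": "premium"})

# Read it back by primary key.
response = table.get_item(Key={"user_id": "u123"})
print(response.get("Item"))
```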
Apache Cassandra
• Cassandra is a highly scalable NoSQL database designed for handling
large amounts of data across multiple commodity servers with no
single point of failure.
• It uses a peer-to-peer architecture and supports eventual consistency.
• Decentralized, wide-column store with tunable consistency
• Linear scalability across multiple data centers
• Used by Netflix, Apple, and other large-scale applications
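The sketch below uses the DataStax Python driver to write and read with a tunable consistency level (QUORUM). The contact point, keyspace, and table are illustrative assumptions.

```python
# Tunable consistency with the DataStax Cassandra driver. The contact point,
# keyspace, and table are assumed for illustration.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])          # any node can act as coordinator
session = cluster.connect("demo_keyspace")

# Write at QUORUM: a majority of replicas must acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, ("u123", "Alice"))

# Read at QUORUM as well, so read and write quorums overlap.
select = SimpleStatement(
    "SELECT name FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(select, ("u123",)).one())
```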
MongoDB Atlas
• MongoDB Atlas is a fully managed cloud version of MongoDB, a
document-based NoSQL database.
• Atlas supports multi-region deployments, automated backups, and
integrated monitoring tools.
• Document-oriented database with JSON-like schema.
• Supports sharding for horizontal scaling.
• Available as a managed service.
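A minimal document insert and query with pymongo is sketched below; the Atlas connection string, database, and collection names are placeholders.

```python
# Minimal document store access with pymongo. The Atlas connection string,
# database, and collection names are placeholders for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net/")
db = client["shop"]
orders = db["orders"]

# Insert a JSON-like document (flexible schema: fields can vary per document).
orders.insert_one({"order_id": 1001, "customer": "Alice",
                   "items": [{"sku": "A1", "qty": 2}]})

# Query by field.
print(orders.find_one({"customer": "Alice"}))
```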
Data Consistency Models and Replication
Strategies
Distributed storage systems often face trade-offs between consistency,
availability, and partition tolerance (CAP theorem). Various consistency
models are used to balance these trade-offs:
• Strong Consistency: Guarantees that all users see the same data at
the same time.
• Eventual Consistency: Updates will eventually propagate through the
system, but immediate consistency is not guaranteed (see the sketch after this list).
• Causal Consistency: Ensures that causally related updates are seen by
all nodes in the same order.
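The toy sketch below shows why a reader may observe stale data under eventual consistency: a write is acknowledged once it reaches one replica, and the remaining replicas catch up later. It is an illustrative model, not a real system.

```python
# Toy model of eventual consistency: a write is acknowledged once one replica
# has it, and the remaining replicas are updated asynchronously later.
replicas = [{"x": 0}, {"x": 0}, {"x": 0}]

def write_eventual(key, value):
    replicas[0][key] = value        # acknowledge after updating one replica
    # replication to replicas[1] and replicas[2] happens "later"

def propagate():
    for r in replicas[1:]:
        r.update(replicas[0])       # background (anti-entropy) sync

write_eventual("x", 42)
print(replicas[1]["x"])   # 0  -> stale read: update not yet propagated
propagate()
print(replicas[1]["x"])   # 42 -> replicas eventually converge
```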
Data Consistency Models and Replication
Strategies
Replication strategies include:
• Master-slave replication: One node handles writes, others replicate
data.
• Multi-master replication: Multiple nodes can handle writes, requiring
conflict resolution.
• Quorum-based replication: Read and write operations require a
quorum of nodes to agree, balancing consistency and availability (e.g.,
Dynamo-style systems; see the sketch after this list).
• Synchronous replication: Ensures data consistency but increases
latency
• Asynchronous replication: Lower latency but risk of data loss
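In Dynamo-style quorum replication with N replicas, a write quorum W, and a read quorum R, choosing R + W > N guarantees that every read quorum overlaps the most recent write quorum. The sketch below checks that condition and simulates a quorum read over toy replicas; the values and versions are illustrative.

```python
# Quorum-based replication: with N replicas, write quorum W and read quorum R,
# the condition R + W > N makes every read overlap the most recent write.
N, W, R = 3, 2, 2
assert R + W > N, "quorums do not overlap: stale reads become possible"

# Toy replicas storing (value, version) pairs; the last replica is lagging
# because the latest write (version 2) only reached W = 2 replicas.
replicas = [("v2", 2), ("v2", 2), ("v1", 1)]

def quorum_read(replica_set, r):
    """Read from r replicas and return the value with the highest version."""
    responses = replica_set[-r:]     # includes the lagging replica
    return max(responses, key=lambda pair: pair[1])[0]

print(quorum_read(replicas, R))      # "v2": overlap guarantees the newest value
```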
Cloud-Based Data Warehousing
Modern data warehouses enable large-scale analytics with serverless
architectures.
• Google BigQuery
• Snowflake
• Amazon Redshift
Google BigQuery
• BigQuery is a serverless, highly scalable data warehouse that allows
users to run SQL-like queries on large datasets.
• It integrates with Google Cloud Storage for data lakes and supports
external (federated) queries over data stored there.
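A minimal serverless query with the google-cloud-bigquery client library is sketched below. It assumes Google Cloud credentials are configured; the referenced public dataset (`bigquery-public-data.usa_names.usa_1910_2013`) is a commonly used example.

```python
# Minimal serverless SQL query with the BigQuery client library.
# Assumes Google Cloud credentials; the public dataset is an example.
from google.cloud import bigquery

client = bigquery.Client()          # uses the project from your credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# BigQuery allocates and scales the execution resources itself.
for row in client.query(query).result():
    print(row.name, row.total)
```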
Data Streaming and Real-Time Processing
Real-time data processing is crucial for applications such as fraud
detection, log analysis, and recommendation systems.
Cloud-based streaming services include:
• Apache Kafka: A distributed event streaming platform that enables
real-time data feeds (see the producer/consumer sketch after this list).
• Amazon Kinesis / Azure Event Hubs: Managed streaming services for
real-time data ingestion and analytics; they support ingestion from IoT
devices, logs, and transactions.
• Google Cloud Dataflow: A serverless data processing service for
stream and batch data using Apache Beam SDK.
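As a concrete illustration of the streaming model, the sketch below uses the kafka-python client to publish and consume events; the broker address and topic name are illustrative assumptions.

```python
# Minimal event streaming sketch with kafka-python. The broker address and
# topic name are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish events (e.g., transactions) to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("transactions", {"user": "u123", "amount": 42.0})
producer.flush()

# Consumer: read the event stream for real-time processing (e.g., fraud checks).
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    print(message.value)
    break
```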
Backup, Disaster Recovery, and Storage
Security
• Backup and Disaster Recovery
• Storage Security
Backup and Disaster Recovery
Cloud providers offer automated backup services with options for
versioning and point-in-time recovery.
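As one concrete example, object versioning can be enabled on an S3 bucket so earlier versions of overwritten or deleted objects remain recoverable. The bucket name and object key below are placeholders.

```python
# Enable object versioning on an S3 bucket with boto3 so previous versions of
# objects remain recoverable. The bucket name and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="example-backup-bucket",
    VersioningConfiguration={"Status": "Enabled"})

# List the stored versions of a single object key.
versions = s3.list_object_versions(Bucket="example-backup-bucket",
                                   Prefix="db-dumps/orders.sql")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"])
```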
Disaster recovery strategies include:
• Cold standby: Delayed recovery using periodically updated backups.
• Warm standby: Partially active infrastructure that can be quickly
scaled.
• Hot standby: Fully active and redundant systems across regions.
Storage Security
Security in cloud storage involves:
• Access Control: Fine-grained IAM policies and access logs (see the policy sketch below).
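A minimal sketch of fine-grained access control: attaching a bucket policy that allows read access only to a specific IAM role. The bucket name, account ID, and role name are placeholders.

```python
# Attach a bucket policy that restricts read access to one IAM role.
# Bucket name, account ID, and role name are placeholders for illustration.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-secure-bucket/*"
    }]
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="example-secure-bucket", Policy=json.dumps(policy))
```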