Coursera: Cloud Computing Concepts, Part 1
Atsushi Takayama
Mar 16 · 28 min read
Cloud Computing Concepts, Part 1 | Coursera
Cloud computing systems today, whether open-source or used inside companies, are built using a common set of core…
[Link]
# 2019/03/16 Week 1 Lesson 1: Introduction to Clouds
## 1.1: Why Clouds?
Lower operation cost; saves development time.
## 1.2: What is a Cloud?
Cloud = Lots of storage + compute cycles nearby
## 1.3: Introduction to Clouds: History
## 1.4: Introduction to Clouds: What’s New in Today’s Clouds
4 distinctive features of today’s clouds:
• Massive scale
• On-demand access
• Data-intensive Nature
• New Cloud Programming Paradigm
## 1.5: Introduction to Clouds: New Aspects of Clouds
On-demand: renting a cab vs renting a car or buying one
HaaS (HW), IaaS, PaaS, SaaS
Data-intensive computing: the focus shifts from computation to the data. CPU utilization is no longer the most important resource metric; instead, I/O (disk and/or network) is.
## 1.6: Introduction to Clouds: Economics of Clouds
Outsource or own?: If the service will run for more than a year or so, owning is cheaper (in terms of the hardware price only).
Break-even point for storage is much smaller than the overall cost.
As a result: Cloud providers benefit monetarily most from storage
# 2019/03/16 Week 1 Lesson 2: Clouds are Distributed Systems
## 2.1: A cloud IS a distributed system
• Servers communicate amongst one another
• Clients communicate with servers
• Clients may also communicate with each other (eg. P2P)
Nicknames for “distributed system” include peer-to-peer systems, grids, clusters, timeshared computers (Data Processing Industry; 60’s–70’s), and then clouds.
Core concepts are the same!
## 2.2: What is a distributed system?
What is an Operating System?
• User interface to hardware (device drivers)
• Provides abstractions (processes, file system)
• Resource manager (scheduler)
• Means of communication (networking)
• etc.
A working definition of “Distributed System”: A distributed system is a collection of
entities, each of which is autonomous, programmable, asynchronous and
failure-prone, and which communicate through an unreliable communication
medium.
Many interesting problems:
• P2P systems (Gnutella, Kazaa, BitTorrent)
• Cloud Infrastructures (AWS, Azure, Google Cloud)
• Cloud Storage (KVS, NoSQL, Cassandra)
• Cloud Programming (MapReduce, Storm, Pregel)
• Coordination (Paxos, Leader Election, Snapshots)
• Managing Many Clients and Servers Concurrently (Concurrency Control,
Replication Control)
Challenges:
• Failures: no longer the exception, but rather the norm
• Scalability: 1000s of machines, Terabytes of data
• Asynchrony: clock skew and clock drift
• Concurrency: 1000s of machine interacting with each other accessing the same data
• …
# 2019/03/16 Week 1 Lesson 3: MapReduce
## 3.1: MapReduce Paradigm
• Map: <file name, file content> → [<key, value>]
• Reduce: <key, [value]> → <key, value>
Many ways to assign reduce servers to keys for load balancing.
eg. Hash partitioning: hash(key) % (number of reduce servers)
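A tiny sketch of that hash-partitioning rule (the hash function and `num_reducers` parameter are my own choices, just for illustration):

```python
import hashlib

def reducer_for(key: str, num_reducers: int) -> int:
    """Assign a key to one of num_reducers reduce servers by hashing."""
    # Use a stable hash (MD5 here) so every node agrees on the assignment.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return h % num_reducers

# Example: the same key always lands on the same reducer index.
print(reducer_for("apple", 4), reducer_for("banana", 4))
```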
## 3.2: MapReduce Examples
Distributed Grep:
• Input: large set of files
• Output: lines that match pattern
• Map: emits a line if it matches the supplied pattern
• Reduce: copies the intermediate data to output
Reverse Web-Link Graph:
• Input: web graph: tuple (a, b) where (page a → page b)
• Output: for each page, list of pages that link to it
• Map: <source, target> → <target, source>
• Reduce: emits <target, list(source)>
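To make the Reverse Web-Link Graph example above concrete, here is a minimal single-machine sketch in plain Python (no Hadoop; the shuffle/group-by step that the framework normally performs is simulated with a dictionary):

```python
from collections import defaultdict

def map_fn(source: str, target: str):
    # Map: <source, target> -> <target, source>
    yield target, source

def reduce_fn(target: str, sources):
    # Reduce: emits <target, list(source)>
    return target, sorted(sources)

# Tiny web graph: (page a -> page b) tuples.
edges = [("a", "b"), ("c", "b"), ("a", "d")]

# Shuffle: group intermediate pairs by key (done by the framework in real MapReduce).
groups = defaultdict(list)
for src, tgt in edges:
    for k, v in map_fn(src, tgt):
        groups[k].append(v)

print([reduce_fn(k, vs) for k, vs in groups.items()])
# [('b', ['a', 'c']), ('d', ['a'])]
```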
Count of URL access frequency:
• Input: logs of accessed URLs
• Output: for each URL, % of total accesses for that URL
• Map: log files → <URL, 1>
• Multiple Reduces: <URL, count>
• then chain
• Map: <URL, URL count> → <1, (<URL, URL count>)>
• Single Reducer: Sums up URL counts to calculate overall count, then emit multiple
<URL, URL count/overall count>
Sort: (well, it’s easy because MapReduce does sort on Map’s output and Reduce’s input)
• Input: series of <key, value> pairs
• Output: sorted <value>s
• Map: <key, value> → <value, _> (identity)
• Reduce: <key, value> → <key, value> (identity)
• Partition function: partition keys across reducers based on ranges (not hash
partitioning)
## 3.3: MapReduce Scheduling
Make sure no Reduce tasks start before all Map and Shuffle tasks finish; this is the barrier between the Map phase and the Reduce phase.
Without the barrier;
• A MapReduce run may be incorrect since some of the key-value pairs generated by
Maps may never be processed by a Reduce if the reduce function is called
exactly once per key.
• All MapReduce runs could be correct if Reduces maintain partial results for
keys, and update these as new key-value pairs come in.
Intermediate files between Map and Reduce need not be visible, so files are stored close to the nodes for performance;
• Map input: from distributed file system
• Map output: to local file system
• Reduce input: from (multiple) remote disks
• Reduce output: to distributed file system
YARN (Yet Another Resource Negotiator) in Hadoop 2.x+:
• Global Resource Manager (RM): scheduling
• Per-server Node Manager (NM): daemon and server-specific functions
• Per-application (job) Application Master (AM): container negotiation with RM
and NMs & detecting task failures of the job
containers = (some CPU + some memory) packages in a server
## 3.4: MapReduce Fault-Tolerance
The slowest machine slows the entire job down → speculative execution: perform
backup execution of straggler task
Locality
• run the Map task on a machine that contains a replica of the corresponding input data
• failing that, on a machine in the same rack as one holding the data
• failing that, anywhere
# 2019/03/23 Week 2 Lesson 1: Gossip
## 1.1: Multicast Problem
Spread a piece of information from a node to a group of nodes.
NOT broadcast, where message is sent to the entire network.
Requirements (as far as cloud computing is concerned):
• Fault tolerance: nodes may crash, packets may be dropped
• Scalability: 1000’s of nodes
Problem with the simplest approach (a centralized sender unicasting to every receiver): scalability, O(N) where N = size of the network
Tree-based multicast: time to reach the leaves of the tree is O(logN).
If a node close to the root fails, the subtrees below it don’t receive messages.
Use either ACKs (acknowledgements) or NAKs (negative acknowledgements) to repair.
• SRM (Scalable Reliable Multicast): use NAKs, but adds random delays and uses
exponential backoff to avoid NAK storms
• RMTP (Reliable Multicast Transport Protocol): use ACKs, but ACKs only sent to
designated receivers, which then re-transmit missing multicasts
Still O(N) ACK/NAK overhead
## 1.2: The Gossip Protocol
“Gossip” or “Epidemic” Multicast:
Periodically (like every 5 secs), a node randomly picks b nodes from the network and
sends gossip messages.
• UDP can be used, because the protocol as a whole is quite reliable even if individual packets are lost
• Typically b = 2 (or very small number)
Other nodes do the same after receiving multicast.
Duplicate happens.
Nodes that have received a piece of information are called infected, others are called
uninfected.
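A minimal push-gossip simulation along the lines described above (assumptions mine: synchronous rounds, no packet loss, a node may occasionally pick itself, b = 2):

```python
import random

def push_gossip_rounds(n: int, b: int = 2, seed: int = 0) -> int:
    """Simulate push gossip; return the number of rounds until all n nodes are infected."""
    random.seed(seed)
    infected = {0}            # node 0 has the multicast initially
    rounds = 0
    while len(infected) < n:
        rounds += 1
        new = set()
        for node in infected:
            # Each infected node picks b random targets and sends the gossip.
            for target in random.sample(range(n), b):
                new.add(target)
        infected |= new
    return rounds

print(push_gossip_rounds(1000))   # typically on the order of log(n) rounds
```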
## 1.3: Gossip Analysis
There are xy ways to make infected & uninfected pairs, so the rate of infection is βxy.
dx/dt = -βxy = -βx(n+1-x)
The solution shows that y (the number of infected nodes) goes to n+1 very quickly.
(algebra for this approximation appears at the end of the lecture)
In the gossip protocol, the probability that a particular infected node picks a particular
uninfected node is β = b/n.
Setting t = c log(n) measures time as a small constant multiple of log(n) rounds.
Even if c and b are both small numbers, 1/(n^(cb-2)) goes to zero very quickly.
• Low latency: within c log(n) rounds,
• Reliability: all but 1/(n^(cb-2)) number of nodes receive the message
• Lightweight: each node has transmitted no more than cb log(n) gossip messages
Fault-tolerance
• Packet loss: with packet loss rate p, analyze with b replaced by b * (1 - p) (eg. 50% loss → b/2)
• Node failure: with a fraction f of nodes failed, analyze with n replaced by n * (1 - f) and b by b * (1 - f)
Pull gossip
In all forms of gossip, it takes O(logN) rounds before about N/2 nodes get the gossip (think of a spanning tree).
Pull gossip is faster than push gossip in the second half: the second half is super-exponential, taking only O(log(logN)) rounds.
Topology-aware gossip: with uniformly random targets, the core router faces O(N) load. By having nodes pick gossip targets mostly within their own subnet:
• Router load = O(1)
• Dissemination time = O(logN)
## 1.4: Gossip Implementations
# 2019/03/23 Week 2 Lesson 2: Membership
## 2.1: What is a Group Membership List?
Process group based systems
• Clouds/Datacenters
• Replicated servers
• Distributed databases
Crash-stop/Fail-stop process failures: once a member of the group (called a process) fails, it stops executing any instructions.
→ different from other failure models, such as the crash-recovery failure model, where processes can recover
Membership list: list of non-faulty processes
Membership protocol (group membership protocol): protocol to maintain membership
list
A major challenge is that the membership protocol has to communicate over an
unreliable communication medium.
• Complete list at all time: strongly consistent, eg. virtual synchrony
• Almost-complete list: weakly consistent, eg. gossip-style, SWIM, etc.
• Partial-random list: eg. SCAMP, T-MAN, Cyclon, etc.
A membership protocol must have;
• Failure detector: find out that one of the members failed
• Dissemination: inform about joins, leaves, and failures of processes
## 2.2: Failure Detectors
Failure is the norm rather than the exception in a large-scale data center.
• Completeness: each failure is detected (by a non-faulty process)
• Accuracy: there is no mistaken detection
• Speed: time to first detection
• Scale: equal load on each member & network message load
Completeness & Accuracy cannot be met together 100% in lossy networks.
Preferably: completeness is guaranteed, and accuracy is probabilistically guaranteed
## 2.3: Gossip-Style Membership
Gossip-style failure detection is a variant of all-to-all heartbeating (more robust).
Each member maintains a table of;
• Address: process ID
• Heartbeat counter: the latest heartbeat counter heard for that process (each process periodically increments its own counter)
• Time: local time at which the heartbeat counter was last updated
Periodically a process sends out the entire table to a few of its neighbors at random.
When a node receives the membership list, it updates the local list in such a way that;
• by looking at the table row by row,
• if the heartbeat counter in the received table is greater than that of the local table,
• then update the counter and set the time to the current time in the local table.
If a particular row was last updated more than a threshold time ago, mark the process as failed.
• If the heartbeat has not increased for more than T_fail seconds, the member is
considered failed
• And after T_cleanup seconds, it will delete the member from the list
Typically T_cleanup is the same as T_fail.
Why wait before deleting? Without T_cleanup, a deleted entry could be re-added (with a fresh local time) by a gossip message from a node that has not yet deleted it from its own list.
Bandwidth allowed: how many rows can be sent in one gossip.
If the bandwidth allowed per node is only O(1), then only some of the rows are sent in each gossip, so it takes O(N log(N)) time to propagate the full list.
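A minimal sketch of the heartbeat-table merge rule described above (data layout and names are my own):

```python
import time

# Local membership table: process id -> (heartbeat counter, local time of last update)
local = {"p1": (10, 100.0), "p2": (7, 98.0)}

def merge(local_table, received_table):
    """Merge a gossiped table into the local one, row by row."""
    now = time.time()
    for pid, (hb, _their_time) in received_table.items():
        local_hb, _ = local_table.get(pid, (-1, 0.0))
        # Only adopt a row if its heartbeat counter is strictly greater than ours,
        # and stamp it with *our* local time (never the sender's time).
        if hb > local_hb:
            local_table[pid] = (hb, now)

merge(local, {"p1": (12, 55.0), "p2": (6, 57.0)})
print(local)   # p1 is updated (12 > 10); p2 keeps its local row (6 <= 7)
```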
## 2.4: Which is the best failure detector?
PM(T): Probability of mistake in time T
L: Load per member
N*L: Network Message Load
Load of all-to-all heartbeating;
Load of gossip based approach;
What’s the optimal load?
In theory, the worst case load L* (per member) as a function of T, PM(T), N and p_ml is
at least;
p_ml: Independent Message Loss probability (applied to all messages independently of
other messages)
Notice that L* is independent of N. (scale free)
→So both all-to-all and gossip-based are sub-optimal.
Why? Because both approaches do not distinguish failure detection and dissemination.
Key:
• Separate the two components
• Use a non heartbeat-based Failure Detection Component
## 2.5: Another Probabilistic Failure Detector
SWIM: Scalable Weakly-consistent Infection-style Membership protocol
• pi sends a ping to pj
• if pi receives an ack back from pj, it does nothing further; otherwise,
• pi pings pj indirectly through K randomly selected processes
• if pi receives at least one ack (direct or indirect) from pj by the end of the protocol period, it does nothing
• if pi receives neither a direct ack nor any indirect ack from pj, pj is marked as failed
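A minimal sketch of one SWIM protocol period at pi, assuming the ping transports are provided elsewhere (function names and the default K are my own):

```python
import random

def swim_protocol_period(pi, members, ping, indirect_ping, K=3):
    """One SWIM protocol period at process pi. Returns the id of a failed member, or None."""
    candidates = [m for m in members if m != pi]
    pj = random.choice(candidates)           # pick a random member to probe
    if ping(pi, pj):                         # direct ping acked: pj is alive
        return None
    # No direct ack: ask K other random members to ping pj on our behalf.
    helpers = random.sample([m for m in candidates if m != pj],
                            min(K, len(candidates) - 1))
    if any(indirect_ping(helper, pj) for helper in helpers):
        return None                          # at least one indirect ack: pj is alive
    return pj                                # no direct or indirect ack: mark pj as failed

# Example with fake transports where "p3" never answers.
alive = lambda src, dst: dst != "p3"
print(swim_protocol_period("p1", ["p1", "p2", "p3", "p4"], alive, alive))
# prints "p3" when p3 happens to be probed this period, otherwise None
```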
SWIM: both first detection time and process load are constant for a fixed false positive rate and message loss rate.
Heartbeating: at least one of first detection time and process load increases with O(N) for a fixed false positive rate and message loss rate.
False Positive Rate (Accuracy);
• PM(T) is exponential in -K, also depends on p_ml (probability of message loss) and
p_f (probability of failure)
Load;
see paper
Detection time;
Completeness;
• Eventually
• can be made worst case O(N) protocol periods by using round-robin pinging (after
each traversal permute the round-robin list)
## 2.6: Dissemination and suspicion
How to disseminate failures detected by SWIM?
• Multicast (hardware/IP): unreliable, multiple simultaneous multicasts
• Point-to-point (TCP/UDP): expensive
• Infection-style dissemination (the “I” of SWIM): piggyback on Failure
Detector messages (zero extra messages!)
Piggyback some of the recently detected updates on top of ping, ack and indirect ping
messages.
Suspicion mechanism
• false detections due to perturbed processes or packet losses (eg. from congestion)
• indirect pinging may not solve the problem (eg. correlated message losses near
pinged host)
Key: Suspect a process before declaring it as failed in the group.
# 2019/03/23 Week 2 Lesson 3: Grids
Running highly parallel computation without a supercomputer, with scheduling, across
multiple sites.
## 3.2: Grid Infrastructure
HTCondor: High throughput computing system
• run on workstations
• when a workstation is free, ask the central server (or Globus) for tasks
• once the workstation is user-interrupted, stop the task
The Globus toolkit is open-source software developed by the Globus Alliance, which involves universities, national US research labs, and some companies.
# 2019/03/30 Week 3 P2P Systems
## 1: P2P System Introduction
Why study peer to peer systems?
P2P systems were the first distributed systems that seriously focused on scalability with respect to the number of nodes.
## 2: Napster
Servers store only the metadata: tuple of <filename, ip_address, portnum>
Search:
• send keyword to a server
• all servers search their lists (ternary tree algorithm; each server has up to 3
children)
• return a list of hosts <ip_address, portnum>
• client pings each host in the list to find transfer rates
• client fetches file from best host
Joining a P2P system:
• send an http request to well-known url for that P2P service
• message routed to introducer, a well known server that keeps track of some recently
joined nodes in p2p system
• introducer initializes new peers’ neighbor table
Problems
• centralized server overloaded with search queries (SPoF)
• no security
## 3: Gnutella
completely P2P
every peer has “peer pointers” to some neighbors
5 main message types
• Query: search
• QueryHits: response to query
• Ping: to probe network for other peers
• Pong: reply to ping, contains address of another peer
• Push: used to initiate file transfer
Decrement TTL each time message is passed, so that the message doesn’t keep
circulating around forever.
How to search?
• Search local fles
• Send message to its immediate neighboring peers
• When a neighboring peer receives the query, it sends out the message to all its
neighboring peers except for the one that has sent the message
• Each peer forwards a given message only once: it remembers recently forwarded queries and, if it sees the same query again, does not forward it
QueryHit messages are reverse-routed to the searcher.
First, the requesting peer tries to set up an HTTP connection to the responding peer. If this fails, it assumes the responder is behind a firewall and routes a Push message through the reverse QueryHit path. The responder then initiates an HTTP connection back to the requester.
Problems of Gnutella:
• Ping/Pong constituted 50% of traffic: multiplex ping/pongs
• Repeated searches with the same keywords: cache Query & QueryHit messages
• Modem-connected hosts do not have enough bandwidth for passing Gnutella traffic: use a central server to act as a proxy
• Large number of freeloaders (70% only download, never upload)
• Flooding causes excessive traffic
## 4: FastTrack and BitTorrent
FastTrack is a hybrid between Gnutella and Napster. It takes advantage of “healthier” participants in the system (supernodes).
Proprietary protocol (Kazaa, KazaaLite, Grokster) but some details available.
Peers are selected to be supernodes based on their contributions in the past. (reputation
scheme; based on the number of uploads, etc.)
Instead of flooding a Query, one peer can send it to a nearby supernode.
Incentive to be a supernode: a lot of searches become local searches, so it’s fast.
BitTorrent has special incentive design.
## 5: Chord
DHT: Distributed Hash Table
• Requirements: Insert, lookup and delete objects with keys
• Performance concerns: load balancing, fault-tolerance, efficiency of lookups and inserts, locality
Intelligent choice of neighbors to reduce latency and message cost of routing.
Consistent Hashing:
• SHA-1 (ip_address, port) → 160 bit string
• Truncated to m bits
• Called peer id (number between 0 and 2^m-1)
• Not unique, but id conflicts are very unlikely
• Can then map peers to one of 2^m logical points on a circle
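A small sketch of the peer-id mapping just described (m = 7 is an arbitrary choice):

```python
import hashlib

def peer_id(ip: str, port: int, m: int = 7) -> int:
    """Map (ip, port) to a point on the ring [0, 2^m - 1] via SHA-1 truncated to m bits."""
    digest = hashlib.sha1(f"{ip}:{port}".encode()).hexdigest()
    return int(digest, 16) % (2 ** m)        # keep only the low-order m bits of the 160-bit hash

print(peer_id("10.0.0.1", 7000))             # some point between 0 and 127
```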
Two kinds of peer pointers
Finger table: a list of peers a node knows about; the i-th finger of node n points to the first peer at or after (n + 2^i) on the ring.
Filenames are also mapped using the same consistent hash function, ie. a file is stored at the first peer whose id is at or after its hash key (mod 2^m).
With K keys and N peers, each peer stores O(K/N) keys.
Search algorithm:
• Suppose N80 wants to search for an entry for the key “[Link]/[Link]”, whose hash is K42.
• It wants to find the first node with id at or after 42 (mod 2^m).
• The largest finger-table entry of N80 that is still below 42 is 16 (see the two slides above), so it sends the search query to N16.
• N16 doesn’t have K42, so it forwards the query to the largest entry in its own finger table that is still before K42, which is N32 (16+2⁵=32 and 16+2⁶=48, so N16 doesn’t know about the existence of N45).
• N32 still doesn’t have K42, so it forwards the query to N45 according to the second line of the algorithm (“if none exist, send query to successor(n)”).
• N45 matches and has the K42 entry, so it returns the response directly to N80.
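A minimal sketch of the routing rule used in this walkthrough (simplified data structures; the finger values for N80 are taken from the example ring):

```python
M = 7                                    # identifiers live on a ring of size 2^M = 128

def cw_dist(a, b):
    """Clockwise distance from a to b on the identifier ring."""
    return (b - a) % 2**M

def next_hop(node, key, fingers, successor):
    """One Chord routing step: among finger entries that lie clockwise between
    `node` and `key`, pick the one closest to the key; otherwise use the successor."""
    candidates = [f for f in fingers if 0 < cw_dist(node, f) < cw_dist(node, key)]
    return max(candidates, key=lambda f: cw_dist(node, f)) if candidates else successor

# Walkthrough from the text: N80 looks up K42.
# On the example ring (N16, N32, N45, N80, N96, N112), N80's fingers are {N96, N112, N16}.
print(next_hop(80, 42, fingers=[96, 112, 16], successor=96))
# -> 16, i.e. the query is forwarded to N16, matching the walkthrough
```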
## 6: Failures in Chord
What happens when N32 has failed in the above case?
N16 doesn’t even know the existence of N45, so the query never reaches N45.
Solution: each node maintains a list of r successors instead of just one
(w.h.p = with high probability)
What if the peer that’s storing the file failed?
Store copies of the file at the successor and the predecessor of the node.
It’s also good for load balancing.
N40 may need to copy some files/keys (with ids between 32 and 40) from N45, ie. file transfer occurs between the newly joined node and its successor.
A new peer affects O(log(N)) other finger entries in the system, on average, because of the symmetry of the system: each node points to O(log(N)) finger-table entries and is pointed to by O(log(N)) peers. So the number of messages per peer join = O(log(N)*log(N)).
Stabilization Protocol:
Concurrent peer joins, leaves, failures might cause loopiness of pointers and failure of
lookups. Chord peers periodically run a stabilization algorithm that checks and updates
pointers and keys.
Churn:
When nodes are constantly joining, leaving, and failing, unnecessary file/key copying may occur. The stabilization algorithm may consume more bandwidth to keep up.
Virtual Nodes:
Treat each node as multiple virtual nodes → a more load-balanced set of segments.
## 7: Pastry
Pastry is another P2P system, similar to Chord.
Routing is based on prefix matching, using a routing table.
For each prefix, say 011*, among all potential neighbors with a matching prefix, the neighbor with the shortest round-trip time is selected.
## 8: Kelips
A node selects the closest peer from each foreign affinity group as its contact.
Files are stored at their original nodes (the uploaders).
All peers in an affinity group store file pointers for filenames whose hash (mod k) maps to that group, ie. the same information is replicated on every node in an affinity group.
# 2019/04/06 Week 4 Lesson 1: Key-value stores
## 1.1: Why Key-Value/NOSQL?
Today’s workloads
• Data: Large and unstructured
• Lots of random reads and writes
• Sometimes write-heavy
• Foreign keys rarely needed
• Joins infrequent
Needs
• Speed
• Avoid SPoF
• Low TCO (total cost of operation)
• Fewer system administrators
• Incremental Scalability
• Scale out, not up
NoSQL = Not only SQL
KVS
• Necessary API: get(key) and put(key, value)
• Tables: Like RDBMS tables, but columns may be missing, or unstructured
• May not have joins or foreign keys
• Can have index tables, just like RDBMSs
Column-oriented storage
• NoSQL systems typically store a column (or a group of columns) together, indexed by the key
• Range searches within a column are fast (no need to fetch the entire database)
## 1.2: Cassandra
Ring-based Distributed Hash Table
• One ring per DC
• No finger (routing) table; instead, when a coordinator receives a query, it forwards it to the appropriate replicas using the “Partitioner”
Two replication strategies
SimpleStrategy uses the Partitioner, of which there are two kinds
• RandomPartitioner: Chord-like hash partitioning. Values are stored close to the
hash key node.
• ByteOrderedPartitioner: Assigns ranges of keys to servers, which makes range queries easier. Instead of the hash of the key, the key itself (eg. a timestamp) is used.
NetworkTopologyStrategy: for multi-DC deployments
• Two or three replicas per DC
• First replica placed according to Partitioner
• Then go clockwise around the ring until you hit a different rack (for rack tolerance)
Snitches: mechanism to map IPs to racks and DCs. Configured in [Link]
• SimpleSnitch: Unaware of Topology (Rack-unaware), for running on VMs
• RackInferring: assumes the topology of the network from the octets of the server’s IP address: [Link] = x.<DC octet>.<rack octet>.<node octet> (the first octet is ignored)
• PropertyFileSnitch: uses a config file
• EC2Snitch: region = DC, availability zone = rack
Writes:
• Writes need to be lock-free and fast for heavy workloads; preferably no reads or disk seeks in the critical write path.
• Client sends a write request to a coordinator node in Cassandra cluster. Coordinator
uses Partitioner to send query to all replica nodes responsible for key.
• When X replicas respond, coordinator returns an acknowledgement to the client.
(will talk about the X later)
• Always writable: Hinted Handoff mechanism (if any or all replicas are down, the coordinator writes to the available replicas and keeps the write locally until the down replicas come back up)
• One ring per datacenter. A per-DC coordinator is elected (via Zookeeper) to coordinate with other DCs.
When a replica receives a write request:
1. Appends the write to a commit log (for failure recovery)
2. Updates the memtable (an in-memory key-value store)
3. Flushes to disk at a later time (when the memtable is full or old) as an SSTable (Sorted String Table): a data file for the key/value pairs plus an index file mapping keys to positions.
A Bloom filter is used (per SSTable) to quickly check whether a key might be present.
When a memtable is flushed, a new SSTable is created. The same key may also be present in older SSTables, so SSTables are merged (compacted) periodically.
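A highly simplified, hypothetical sketch of this write path (memtable flushed into immutable SSTables); it ignores the commit log, Bloom filters, compaction, and replication:

```python
class ToyWriteStore:
    """Toy model of the memtable/SSTable write path (not real Cassandra)."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}                 # in-memory key-value store
        self.sstables = []                 # list of immutable "on-disk" tables (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value         # writes never touch the SSTables directly
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable as a new sorted SSTable, start a fresh memtable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable first
            if key in table:
                return table[key]
        return None

store = ToyWriteStore()
for i in range(5):
    store.put(f"k{i}", i)
print(store.get("k1"), store.get("k4"))    # k1 comes from an SSTable, k4 from the memtable
```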
Deletes:
• Add a tombstone to the log and eventually delete at compaction
Reads:
• Coordinator sends read requests to X replicas. (will talk about the X later)
• When X replicas respond, coordinator returns the latest-timestamped value from
among those X.
• The coordinator also fetches the value from the other replicas, and initiates a read repair if any two values are different.
• A row may be split across multiple SSTables. So reads are slower than writes (but
still fast)
Membership:
• Any server in the cluster could be the coordinator, so every server maintains a list of all servers.
Suspicion Mechanism:
• Set timeout based on underlying network and failure behavior, to improve accuracy.
• Accrual detector: Failure Detector outputs a value (PHI; Φ) representing
suspicion.
• Apps set an appropriate threshold.
If the inter-arrival times of gossip messages from a particular server have been long in the past, the detector waits slightly longer for the next heartbeat before marking that server as failed.
CDF: Cumulative Distribution Function
Performance comparison with MySQL (50 GB data):
• MySQL: Writes 300 ms / Reads 350 ms avg
• Cassandra: Writes 0.12 ms / Reads 15 ms avg
## 1.3: The Mystery of X — The CAP Theorem
CAP Theorem: In a distributed system, you can satisfy at most 2 out of the 3 guarantees:
1. Consistency: all nodes see same data at any time, or reads return latest written
value by any client
2. Availability: the system allows operations all the time, and operations return
quickly
3. Partition-tolerance: the system continues to work in spite of network partitions
(can happen across DCs when the Internet gets disconnected, or within a DC on a
rack switch outage)
In today’s cloud computing system, P is essential. So you have to choose between C and
A.
• Cassandra: Availability and Eventual (weak) consistency
• Traditional RDBMSs: Strong consistency over availability under a partition
Eventual Consistency
• If all writes stop, then all replicas converge
• May return stale value
• Keeps converging (catching up) as long as there are writes
RDBMS: ACID (Atomicity, Consistency, Isolation and Durability)
KVS like Cassandra: BASE (Basically Available Soft-state Eventual Consistency)
Mystery of X (in Cassandra): Client may choose consistency level for each
operation (read/write)
• ANY: any server (may not be replica), fastest
• ALL: all replicas, ensures strong consistency but slowest
• ONE: at least one replica, faster than ALL but cannot tolerate a failure
• QUORUM: quorum across all replicas in all DCs
• LOCAL_QUORUM: quorum in coordinator’s DC
• EACH_QUORUM: quorum in every DC
Reads in Quorum:
• Client specifies read consistency level R (≤ N = total number of replicas of that key)
• Coordinator waits for R replicas to respond before sending results to client
• In background, coordinator checks for consistency of remaining (N-R) replicas and
initiates read repair if needed
Writes in Quorum:
• Client specifies write consistency level W (≤ N)
• Coordinator writes the new value to W replicas and returns.
Two flavors: 1. Wait until the quorum is reached, 2. Just write and return (asynchronous)
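As a rule of thumb for choosing R and W (not spelled out in the notes above): reads are guaranteed to see the latest acknowledged write whenever every read quorum must overlap every write quorum, i.e. R + W > N. A tiny check:

```python
def read_sees_latest_write(N: int, R: int, W: int) -> bool:
    """True if any read quorum of size R must intersect any write quorum of size W
    among N replicas, i.e. reads are guaranteed to see the latest acknowledged write."""
    return R + W > N

# With N = 3 replicas:
print(read_sees_latest_write(3, R=2, W=2))   # True  (QUORUM reads + QUORUM writes)
print(read_sees_latest_write(3, R=1, W=1))   # False (ONE + ONE may miss the latest write)
```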
## 1.4: The Consistency Spectrum
Eventual consistency started from Amazon’s Dynamo, then LinkedIn’s Voldemort and
Riak.
In recent years, requirements have shifted more towards strong consistency.
Commutative operation: one that can be applied in any order, eg. an increment (+1)
Two major strong consistency models:
## 1.5: HBase
Based on Google’s BigTable paper.
API: Get/Put (row), Scan (row range, filter), MultiPut
Prefers consistency over availability
HBase maintains strong consistency via the HLog (write-ahead log)
Log replay:
• After recovery from a failure, or upon bootup, edits from stale logs are added back to the MemStore
• A timestamp is used to find out where the database is w.r.t. the logs
Replication:
• If HBase is operated across multiple DCs, one cluster is the “Master” and the others are “Slaves”
• The master cluster synchronously sends HLogs over to the slave clusters
• Coordination among clusters is via Zookeeper
• Zookeeper can be used like a fle system to store control information
# 2019/04/07 Week 4 Lesson 2: Time and Ordering
## 2.1: Introduction and Basics
Synchronizing clocks is challenging because servers follow an asynchronous system
model. (message delays, processing delays)
Definitions:
• Clock Skew: relative difference in the clock values of two processes
• Clock Drift: relative difference in the clock frequencies (rates) of two processes
How often to synchronize?
• (Absolute) Maximum Drift Rate (MDR) of a clock is drift relative to UTC
• max drift rate between two clocks with similar MDR is 2 * MDR
• M: maximum acceptable skew
• synchronize every M / (2 * MDR) time units (since time = distance / speed)
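A tiny worked example of the synchronization-period rule (the MDR and M values are made up):

```python
MDR = 1e-5        # assumed maximum drift rate of each clock: 10 ppm
M = 1e-3          # assumed maximum acceptable skew: 1 millisecond

sync_period = M / (2 * MDR)
print(sync_period)   # 50.0 seconds between synchronizations
```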
## 2.2: Cristian’s Algorithm
Suppose we know;
• Minimum P→S latency: min1
• Minimum S→P latency: min2
The actual time at P when it receives response is between [t+min2, t+RTT-min1].
• Lower bound (t+min2): when the message from S to P took exactly min2 time
• Upper bound (t+RTT-min1): when the message from P to S took exactly min1 time
P sets its time to halfway through this interval: t + (RTT+min2-min1)/2
Then the error is bounded: at most (RTT-min1-min2)/2, half the width of the interval.
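A small sketch of the computation above (variable names follow the text; the example numbers are made up):

```python
def cristian_set_time(t, rtt, min1, min2):
    """Return (new local time, maximum error) for Cristian's algorithm.
    t: server time in the response, rtt: measured round-trip time,
    min1: minimum P->S latency, min2: minimum S->P latency."""
    new_time = t + (rtt + min2 - min1) / 2       # midpoint of [t+min2, t+rtt-min1]
    max_error = (rtt - min1 - min2) / 2          # half the width of that interval
    return new_time, max_error

print(cristian_set_time(t=1000.0, rtt=0.020, min1=0.002, min2=0.003))
# (1000.0105, 0.0075)
```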
## 2.3: NTP
Set offset o = (tr1-tr2+ts2-ts1)/2
What’s the error?
Suppose the real offset is o_real
• Child is ahead of parent by o_real
• Parent is ahead of child by -o_real
One-way latency
• L1: latency of Message 1
• L2: latency of Message 2
Then
• tr1 = ts1 + L1 + o_real
• tr2 = ts2 + L2 - o_real
Subtracting the second equation from the first:
o_real = (tr1-tr2+ts2-ts1)/2 + (L2-L1)/2 = o + (L2-L1)/2
|o_real-o| = |L2-L1|/2 ≤ (L2+L1)/2
Thus, the error is bounded by half the round-trip time, (L1+L2)/2.
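The same offset computation as a tiny function (the clock each timestamp belongs to follows the derivation above; the example numbers are made up):

```python
def ntp_offset(ts1, tr1, ts2, tr2):
    """Estimated offset of the child's clock ahead of the parent's clock.
    ts1: parent send, tr1: child receive, ts2: child send, tr2: parent receive."""
    return (tr1 - tr2 + ts2 - ts1) / 2

# Child is actually 5 s ahead; one-way latencies are 0.01 s each way.
print(ntp_offset(ts1=100.00, tr1=105.01, ts2=105.02, tr2=100.03))   # 5.0
```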
## 2.4: Lamport Timestamps
Almost all cloud computing systems use some form of logical (Lamport) ordering of
events.
Causality:
Goal: Assign logical timestamp to each event
• Each process uses a local counter (clock) which is an integer (initially zero)
• A process increments its counter when an event (a send or an instruction)
happens at it. The counter value is the timestamp of the event.
• A send event carries its timestamp
• Timestamp of a receive event is max(local counter, timestamp of send)+1,
which then updates the clock of receiver
Concurrent events: a set of events whose causal order cannot be determined
Lamport timestamps are not guaranteed to be ordered or unequal for concurrent
events.
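A minimal Lamport clock sketch of the rules above:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0                       # local logical clock, initially zero

    def local_event(self):
        self.counter += 1                      # instruction or send: just increment
        return self.counter

    def send(self):
        return self.local_event()              # the message carries this timestamp

    def receive(self, msg_timestamp):
        # Receive rule: max(local counter, timestamp of send) + 1
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

p1, p2 = LamportClock(), LamportClock()
ts = p1.send()                                 # p1's clock: 1
print(p2.receive(ts))                          # p2's clock: max(0, 1) + 1 = 2
```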
## 2.5: Vector Clocks
Vector clocks (vector timestamps): each process maintains N clocks, where N is the number of processes it communicates with (including itself), and the whole vector is the timestamp.
Each message carries the sender’s vector timestamp, and the receiver updates each of its clocks accordingly (element-wise maximum, then increments its own entry).
Causality: event A causally precedes event B if and only if every element of A’s timestamp is less than or equal to the corresponding element of B’s timestamp, and at least one element is strictly less.
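A minimal vector clock sketch of these rules, plus the happened-before test (N = 3 processes in the example):

```python
class VectorClock:
    def __init__(self, n, my_index):
        self.v = [0] * n
        self.i = my_index

    def local_event(self):
        self.v[self.i] += 1

    def send(self):
        self.local_event()
        return list(self.v)                    # the message carries the sender's vector

    def receive(self, msg_v):
        # Element-wise max with the sender's vector, then tick our own entry.
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.i] += 1

def happened_before(v1, v2):
    """v1 -> v2 iff every element of v1 <= v2 and at least one is strictly less."""
    return all(a <= b for a, b in zip(v1, v2)) and any(a < b for a, b in zip(v1, v2))

p0, p1 = VectorClock(3, 0), VectorClock(3, 1)
m = p0.send()                                  # p0: [1, 0, 0]
p1.receive(m)                                  # p1: [1, 1, 0]
print(happened_before(m, p1.v))                # True: the send causally precedes the receive
```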
# 2019/04/13 Week 5 Lesson 1: Snapshots
## 1.1: What is a Global Snapshot?
Use case of Global Snapshot: Checkpointing, Garbage collection, Deadlock detection
A global snapshot = the instantaneous state of each process + the instantaneous state of each communication channel.
Time synchronization always has some error, so instead of relying on synchronized clocks, it is enough for the snapshot to obey causality.
## 1.2: Global Snapshot Algorithm
Problem: Record a global snapshot (state for each process and for each channel)
• N processes in the system
• Two uni-directional channels between each pair
• Communication channels are FIFO-ordered
• No failure
• All messages arrive intact, and are not duplicated
Requirements:
• Snapshot should not interfere with normal application actions
• Each process is able to record its own state
• Global state is collected in a distributed manner
• Any process may initiate the snapshot
Chandy-Lamport Global Snapshot Algorithm:
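The algorithm itself was shown on slides that are not reproduced in these notes; as a reminder, here is a minimal sketch of the standard marker-based rules, with the message transport left abstract:

```python
class SnapshotProcess:
    """Chandy-Lamport marker rules for one process (message transport not shown)."""
    def __init__(self, pid, in_channels):
        self.pid = pid
        self.recorded_state = None              # our saved process state
        self.channel_state = {}                 # channel -> messages recorded for it
        self.open = set()                       # channels we are still recording
        self.in_channels = set(in_channels)

    def _record_state_and_send_markers(self, state, send_marker, skip=None):
        self.recorded_state = state
        self.open = set(self.in_channels) - ({skip} if skip is not None else set())
        self.channel_state = {ch: [] for ch in self.in_channels}
        send_marker(self.pid)                   # marker goes out on every outgoing channel

    def start_snapshot(self, state, send_marker):        # initiator only
        self._record_state_and_send_markers(state, send_marker)

    def on_marker(self, ch, state, send_marker):
        if self.recorded_state is None:
            # First marker seen: record our state; the state of channel `ch` is empty.
            self._record_state_and_send_markers(state, send_marker, skip=ch)
        else:
            self.open.discard(ch)               # stop recording this channel

    def on_message(self, ch, msg):
        if ch in self.open:                     # in-flight message belongs to ch's state
            self.channel_state[ch].append(msg)
```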
## 1.3: Consistent Cuts
Cut = time frontier at each process and at each channel
• In the cut: events happening before the cut
• Out of the cut: events happening after the cut
The Chandy-Lamport snapshot algorithm ensures a consistent cut (a cut that obeys causality: if a receive event is in the cut, the corresponding send event is also in the cut).
## 1.4: Safety and Liveness
Liveness = guarantee that something good will happen
Safety = guarantee that something bad will never happen
# Week 5 Lesson 2: Multicast
## 2.1: Multicast Ordering
• Multicast: message sent to a group of processes
• Broadcast: message sent to all processes (anywhere)
• Unicast: message sent from one sender to one receiver
FIFO Ordering: multicasts from each sender are received in the order they are sent.
Causal Ordering: all causally related multicasts must be received in a causality-obeying order. Causal ordering implies FIFO ordering.
Think of a distributed social network (SNS): if X posts a message and Y posts a response to it, causal ordering must be satisfied for everyone to see the response after the original message (FIFO ordering is not enough, since X and Y are different senders).
Total Ordering (Atomic Broadcast): All receivers receive all multicasts in the same
order. May need to delay delivery of some messages, even a sender’s own message.
## 2.2: Implementing Multicast Ordering 1
FIFO ordering multicast algorithm:
• In a system with 3 processes, P1, P2, and P3.
• Initially P1’s sequence vector = [0,0,0]
• When P1 sends a message, it becomes [1,0,0] and the message carries the sequence
number 1.
• When P2 receives it, P2’s sequence vector goes from [0,0,0] to [1,0,0].
• If P2 receives the second multicast of P1 before the first, the second message is buffered until the first message is received. A receiver’s sequence numbers are always incremented one by one: [0,0,0]→[1,0,0]→[2,0,0]
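A minimal sketch of the receive-side buffering described above (the deliver callback and data layout are my own):

```python
class FifoReceiver:
    """FIFO multicast delivery at one receiver, for a group of n senders."""
    def __init__(self, n):
        self.expected = [0] * n                 # last delivered sequence number, per sender
        self.buffer = {}                        # (sender, seq) -> message

    def on_multicast(self, sender, seq, msg, deliver):
        self.buffer[(sender, seq)] = msg
        # Deliver in order: keep delivering while the next expected message is buffered.
        while (sender, self.expected[sender] + 1) in self.buffer:
            nxt = self.expected[sender] + 1
            deliver(self.buffer.pop((sender, nxt)))
            self.expected[sender] = nxt

r = FifoReceiver(3)
r.on_multicast(0, 2, "second from P1", print)   # buffered, nothing delivered yet
r.on_multicast(0, 1, "first from P1", print)    # delivers "first" then "second"
```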
Total ordering algorithm: Sequencer-based Approach
## 2.3: Implementing Multicast Ordering 2
Causal ordering multicast algorithm:
Same data structure as FIFO ordering.
## 2.4: Reliable Multicast
What happens to multicast when processes fail?
Reliable Multicast: all correct (non-faulty) processes need to receive the same set of multicasts as all other correct processes.
Reliable multicast is orthogonal to ordering.
Sending one by one does not satisfy reliability, because the sender might fail during the
loop.
How to fix?
Receivers also forward the sender’s message to all other processes in the group.
Not the most efficient multicast protocol, but reliable.
## 2.5: Virtual Synchrony
Attempts to preserve multicast ordering and reliability in spite of failures.
Combines a membership protocol with a multicast protocol.
Virtual synchrony (VSync) multicast ensures that:
1. The set of multicasts delivered in a given view is the same set at all correct processes
that were in the view. What happens in a view stays in that view.
2. The sender of the multicast message also belongs to that view.
3. If a process Pi does not deliver a multicast M in view V while other processes in the
view V delivered M in V, then Pi will be forcibly removed from the next view
delivered after V at the other processes.
# Week 5 Lesson 3: Paxos
## 3.1: The Consensus Problem
Reliable Multicast, Membership/Failure Detection, Leader Election, and Mutual Exclusion are all related to the Consensus problem.
The formal problem statement:
• N processes
Each process p has
• input variable xp: initially either 0 or 1
• output variable yp: initially b (undecided), which can be changed only once
Consensus problem: design a protocol so that at the end either
1. All processes set their output variables to 0, or
2. All processes set their output variables to 1
There might be other constraints
• Validity = if everyone proposes same value, then that’s what’s decided
• Integrity = decided value must have been proposed by some process
• Non-triviality = there is at least one initial system state that leads to each of all-0’s
or all-1’s outcomes
Many problems in distributed systems are equivalent to consensus!
So consensus is a very important problem, and solving it would be really useful.
Synchronous & Asynchronous distributed system models
Synchronous:
• Each message is received within bounded time
• Drift of each process’ local clock has a known bound
• Each step in a process takes lb < time < ub (each process has a min and max speeds
at which it executes instructions)
Asynchronous:
• No bounds on process execution
• The drift rate of a clock is arbitrary
• No bounds on message transmission delays
In the synchronous system model, consensus is solvable. On the other hand, in the asynchronous system model, consensus is impossible to solve with guaranteed termination, even with a single faulty process (the FLP impossibility result).
Consequently, safe or probabilistic solutions have become quite popular.
## 3.2: Consensus in Synchronous systems
• initially pi’s value set is empty, then pi adds its own contribution at r=1
• at each round, pi multicasts the values it has newly added since the previous round
• and adds each received value to its value set
• at the end (after f+1 rounds, where f is the maximum number of crash failures tolerated), the decision (output) variable is the minimum value in the set
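A failure-free simulation of these rounds (running for f+1 rounds, the standard bound for tolerating up to f crash failures, is an assumption added here; crashes themselves are not modeled):

```python
def sync_consensus(initial_values, f):
    """Crash-free simulation of the synchronous consensus rounds: every process
    gossips newly learned values for f+1 rounds, then decides the minimum."""
    n = len(initial_values)
    value_sets = [{v} for v in initial_values]       # round 1: own contribution
    already_sent = [set() for _ in range(n)]

    for _ in range(f + 1):                           # f+1 rounds tolerate up to f crashes
        outgoing = [value_sets[p] - already_sent[p] for p in range(n)]
        for p in range(n):
            already_sent[p] |= outgoing[p]
        for p in range(n):                           # everyone receives everyone's multicast
            for q in range(n):
                value_sets[p] |= outgoing[q]

    return [min(s) for s in value_sets]              # decision: minimum of the value set

print(sync_consensus([1, 0, 1, 1], f=1))             # all processes decide 0
```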
## 3.3: Paxos, Simply
Consensus is impossible to solve in asynchronous system (FLP proof), because it is
impossible to distinguish a failed process from one that is just very very (very) slow.
The Paxos algorithm:
• Most popular “consensus-solving” algorithm
• Does not solve consensus problem
• But provides safety and eventual liveness
• Safety: Consensus is not violated
• Eventual Liveness: If things go well sometimes in the future (messages, failures,
etc.), there is a good chance consensus will be reached. But not guaranteed.