System Design - ML Design 1 PDF

The document discusses system design objectives for large-scale distributed systems at Facebook. It provides examples of systems to design, like a key-value store or video distribution system. For evaluations, designers should consider constraints, visualize the problem and solution, and make tradeoffs around consistency, availability, and performance. A good design clearly defines the problem, proposes a scalable and detailed solution, identifies bottlenecks, and considers how to adapt if requirements change. Key concepts in distributed systems like the CAP theorem, which requires tradeoffs between consistency, availability, and partition tolerance, are also covered.

Uploaded by

Abhishek Bhowmick


System Design

Objectives/Goals
● Large-scale systems end to end. A strong performance: replicable to many systems at
Facebook.
● Sample Questions:
○ Design a key-value store
○ Design Google search
○ Architect a world-wide video distribution system
○ Build Facebook chat
○ Google Search vs Twitter Search vs FB Search: Google’s index building layer
has many more components for document understanding. It would need
components for extracting deep links, contact information, referrals (for page
rank). On the other hand, Twitter’s index building should be simpler because tweets
are small and carry only some rich media information for the attached media. Twitter’s
search is head heavy, so the bulk of the engineering effort in designing their search
should go to rapidly indexing new tweets and making them searchable.
● Expectations:
○ What we’re looking for:
■ Can you arrive at an answer in the face of unusual constraints?
■ Can you visualize the entire problem and solution space?
■ Can you make trade-offs among consistency, availability, partitioning, and
performance?
■ Can you give ballpark numbers for the QPS supported and the # of machines needed,
assuming modern computers?
■ How much have you thought about Facebook and some of the unique
problems we face?

● A good design shows that you:

○ Clearly understand the problem
○ Propose a design for a system that breaks the problem down into components,
that can be built independently, and you can drill into any piece of the design and
talk about it in detail
○ Identify the bottlenecks as the system scales and understand the limitations in
your design
○ Understand how to adapt the solution when requirements change
○ Draw diagrams that clearly describe the relationship between the different
components in the system
○ Calculate (back-of-the-envelope) the physical resources necessary to make this
system work
Key Characteristics of Distributed Systems
CAP Theorem:
Simply put, the CAP theorem states that a distributed system cannot guarantee C, A, and P
simultaneously; rather, trade-offs must be made at a given point in time to achieve the level of
performance and availability required for a specific task.

● Consistency: Every read receives the most recent write or an error

● Availability: Every request receives a (non-error) response, without the guarantee that it
contains the most recent write
● Partition tolerance: The system continues to operate despite an arbitrary number of
messages being dropped (or delayed) by the network between nodes. When a network
partition happens, we must decide whether to cancel the operation, which decreases
availability but ensures consistency, or to proceed with the operation, which provides
availability but risks inconsistency.

The CAP theorem implies that in the presence of a network partition, one has to choose between
consistency and availability.

Order of Priority to keep in mind while designing a Distributed DB:

● Durability: Changes are permanent in the DB
● Availability
● Performance + Consistency: Make the system as consistent as we can (given the constraint of
Availability for a distributed system) and highly performant.

Scalability

● The capability of a system to grow and manage increased demand.

● A system that can continuously evolve to support the growing amount of work is scalable.
● Horizontal scaling: by adding more servers into the pool of resources.
● Vertical scaling: by adding more resources (CPU, RAM, storage, etc) to an existing
server. This approach comes with downtime and an upper limit.

Reliability

● Reliability is the probability that a system will not fail in a given period.

● A distributed system is reliable if it keeps delivering its service even when one or multiple
components fail.
● Reliability is achieved through redundancy of components and data (remove every single
point of failure).
Availability

● Availability is the time a system remains operational to perform its required function in a
specific period.
● Measured by the percentage of time that a system remains operational under normal
conditions.
● A reliable system is available.
● An available system is not necessarily reliable.
○ A system with a security hole is available when there is no security attack.

Efficiency

● Latency: response time, the delay to obtain the first piece of data.
● Bandwidth: throughput, amount of data delivered in a given time.

Serviceability / Manageability

● Ease of operating and maintaining the system.

● Simplicity and speed with which a system can be repaired or maintained.

Concepts and Handy Terminologies

- Traffic for 1B active users:
- Approximate QPS and number of users using the Pareto principle (80-20)
- 1 billion monthly active users => 80% are active daily => 800 M daily active
users
- Peak traffic: assume 80% of these are active in the peak hour => 640 M users
- About 1/3rd of the population across the globe is asleep (8 of 24 hours), so 2/3rd
are active => 640 M * ⅔ ≈ 427 M, taken here as ~420 M users
- Per second = 420 M / 3600 ≈ 116k, rounded up to ~150k active users per second
- QPS ≈ 150k/s for 1 B monthly active users (assuming one request per active user)
- A typical server for JS, Android might serve 1000 requests per second
- # Servers required ≈ 150k / 1000 = 150; accounting for duplication and fault
tolerance (3x), ~450 servers
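
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function names and the `redundancy_factor` parameter are hypothetical, the ratios are the ones used in these notes, and one request per active user is assumed.

```python
import math


def estimate_peak_qps(monthly_active_users, daily_active_ratio=0.8,
                      peak_hour_ratio=0.8, awake_ratio=2 / 3):
    """Rough peak requests per second, assuming one request per active user."""
    daily_active = monthly_active_users * daily_active_ratio
    peak_hour_users = daily_active * peak_hour_ratio
    awake_users = peak_hour_users * awake_ratio
    # Spread the peak hour evenly over its 3600 seconds.
    return awake_users / 3600


def servers_needed(qps, requests_per_server=1000, redundancy_factor=3):
    """Return (base servers, servers after replication for fault tolerance)."""
    base = math.ceil(qps / requests_per_server)
    return base, base * redundancy_factor


if __name__ == "__main__":
    qps = estimate_peak_qps(1_000_000_000)
    print(f"unrounded peak QPS: {qps:,.0f}")
    # The notes round the QPS up to 150k before sizing the fleet.
    print("servers (base, with redundancy):", servers_needed(150_000))
```

Without rounding, the same ratios give a little under 120k requests per second; the notes round this up to 150k to leave headroom before dividing by the per-server capacity.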

- CAP Theorem:
As mentioned above

- RDBMS vs NoSQL:
- https://github.com/chagri/CP/blob/master/system_design/System_Design_Datab
ases/All_DBs_Cloud_And_Deployment_Systems.md
- Microservices: An architecture in which various components/services are built and
deployed separately and work together seamlessly. Example: Instagram microservice
architecture (diagram not reproduced here).

AWS Microservices: https://aws.amazon.com/microservices/

- Microservice vs Monolithic vs Serverless:

https://www.youtube.com/watch?v=qYhRvH9tJKw

A monolithic application is simply deployed on a set of identical servers behind a
load balancer. In contrast, a microservice application typically consists of a large
number of services, and each service has multiple runtime instances.
Compared to the microservices architecture, monolithic applications are much
easier to debug and test: since a monolithic app is a single indivisible unit, you
can run end-to-end testing much faster. Another advantage associated with the
simplicity of monolithic apps is easier deployment.

https://www.n-ix.com/microservices-vs-monolith-which-architecture-best-choice-y
our-business/

Serverless architecture is a way to build and run applications and services
without having to manage infrastructure. Serverless computing allows you to run
any function without worrying about the infrastructure. This means that servers,
software, tools, backup, and scaling are parts of the platform. Serverless does
not mean that servers are no longer involved, but developers no longer have to
worry about managing them. Your application still runs on servers, but all the
server management is done by a cloud provider such as AWS.

- Load Balancer:
AWS Elastic Load Balancing automatically distributes incoming application traffic
across multiple targets, such as Amazon EC2 instances, containers, IP
addresses, and Lambda functions. It can handle the varying load of your
application traffic in a single Availability Zone or across multiple Availability
Zones. Elastic Load Balancing offers three types of load balancers that all feature
the high availability, automatic scaling, and robust security necessary to make
your applications fault-tolerant. Generally speaking, load balancers fall into three
categories:
● DNS Round Robin (rarely used): clients get a randomly-ordered list of IP
addresses.
pros: easy to implement and free
cons: hard to control and not responsive, since DNS cache needs time to
expire
● L3/L4 Load Balancer: traffic is routed by IP address and port. L3 is the network
layer (IP). L4 is the transport layer (TCP).
pros: better granularity, simple, responsive
● L7 Load Balancer: traffic is routed by what is inside the HTTP protocol. L7 is
the application layer (HTTP).
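
To make the L4 versus L7 distinction concrete, here is a small sketch. It assumes hypothetical backend names and URL prefixes; `RoundRobinBalancer` ignores the request content (as an L3/L4 balancer would), while `PathBasedBalancer` looks at the HTTP path (as an L7 balancer can). Real load balancers also handle health checks, weights, and connection state, none of which is modeled here.

```python
import itertools


class RoundRobinBalancer:
    """Picks backends in rotation without looking at the request."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class PathBasedBalancer:
    """Routes by URL path prefix, then rotates within the matching pool."""

    def __init__(self, routes, default_pool):
        # routes maps a path prefix to a list of backend names (hypothetical).
        self._pools = {prefix: RoundRobinBalancer(pool)
                       for prefix, pool in routes.items()}
        self._default = RoundRobinBalancer(default_pool)

    def pick(self, path):
        # Check longer prefixes first so the most specific route wins.
        for prefix in sorted(self._pools, key=len, reverse=True):
            if path.startswith(prefix):
                return self._pools[prefix].pick()
        return self._default.pick()


if __name__ == "__main__":
    l7 = PathBasedBalancer({"/api": ["api-1", "api-2"], "/video": ["video-1"]},
                           default_pool=["web-1"])
    for path in ["/api/users", "/api/posts", "/video/42", "/"]:
        print(path, "->", l7.pick(path))
```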

- Memcached vs Redis/Distributed Cache:

Best for storing User meta-data/cache for faster access/availability.

Redis is an in-memory data structure store, used as a database, cache, and message
broker. Memcached is a volatile in-memory key/value store; that is all Memcached is, but
it is only the tip of the Redis iceberg. Redis can act like one (and do that job as well
as Memcached), but it is a data structure server.

Redis is sometimes described as “Memcached on steroids,” which is hardly surprising
considering that parts of Redis were built in response to lessons learned from using
Memcached. Redis has more features than Memcached and is, thus, more powerful and
flexible.

Redis can handle up to 2^32 keys and was tested in practice to handle at least 250
million keys per instance. Every hash, list, set, and sorted set can hold 2^32
elements (~4B). Redis Strings are binary safe; this means that a Redis string can
contain any kind of data, for instance, a JPEG image or a serialized Ruby object.
A String value can be at most 512 megabytes in length.

https://www.infoworld.com/article/3063161/why-redis-beats-memcached-for-caching.html

- Streaming DBs: They help distribute data between several producers and many
consumers (e.g. Mongo, MySQL, Redshift, Dynamo) easily. Here Apache Kafka serves
as a "data" integration message bus.
Potential options: Apache Kafka, AWS Kinesis

- Kubernetes: For CI/CD, Scaling and managing containers

Microservice Architecture with Azure Kubernetes services:

- HTTP 1/1.1 vs HTTP 2 vs WebSocket vs BOSH vs HTTP Long Polling:
The key feature of HTTP/2 is that it uses multiplexed streams. A single HTTP/2 TCP
connection can support many bidirectional streams. These streams can be interleaved
(no queuing), and multiple requests can be sent at the same time without a need to
establish new TCP connections for each one. In addition, servers can push resources to
clients via the established connection (HTTP/2 server push).

WebSocket is a separate protocol, set up through an HTTP upgrade handshake, that
provides full-duplex (bidirectional) communication channels over a single TCP
connection. Plain HTTP, in contrast, is request-response: the server cannot send
anything to a client that has not asked for it. This is why plain HTTP cannot serve a
chat message app on its own: User A sends a message to the server, and the server then
needs to send it to User B, and that second step does not happen with plain HTTP.
WebSocket lets the server push information to the client directly. HTTP/2 server push is
designed for pushing resources related to a request rather than arbitrary messages.

Long polling and BOSH are other ways to get bidirectional behavior over HTTP.

- REST vs RPC/GRPC:
- REST messages typically contain JSON. gRPC, on the other hand, accepts and
returns Protobuf messages
- gRPC uses HTTP/2, while REST is typically served over HTTP/1.1. gRPC is therefore
faster: there is no need to create a TCP connection every time, and the same
connection can be used for multiple requests, which is useful for FB/IG kinds of apps
with multi-service support. Other advantages of HTTP/2:
- Handles the growth of page size and number of objects per request
- Latency
- Messages vs. Resources and Verbs: gRPC comes with clear interfaces and
structured messages for requests and responses. This model translates directly
from programming concepts like interfaces, functions, methods, and data
structures. It also allows gRPC to automatically generate client libraries for you.
- Streaming vs. Request-Response: REST supports request-response only; gRPC
supports streaming as well.
- gRPC is strongly typed, i.e. more verbose but with fewer bugs, especially compared
to JSON, where correctness depends entirely on the developer.
- More info:
https://code.tutsplus.com/tutorials/rest-vs-grpc-battle-of-the-apis--cms-30711

- Hashing Algorithms:
- Base62: An encoding with base 62 (26 uppercase letters, 26 lowercase letters, 10
digits). A 7-character base-62 string gives 62^7 ~ 3.5 trillion combinations.
- MD5 Hash
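
As an illustration of the Base62 item above, the sketch below encodes an integer ID (for example a database row ID in a URL shortener) into a short string and back. The alphabet ordering and function names are assumptions made for this example; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols (ordering is a choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(number):
    """Encode a non-negative integer as a base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(text):
    """Decode a base-62 string back into an integer."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


if __name__ == "__main__":
    print(base62_encode(123456789), BASE ** 7)
```

Every ID below 62^7 fits in at most 7 characters, which is where the ~3.5 trillion figure comes from.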

- Consistent Hashing and Load Balancing across the servers

- https://www.youtube.com/watch?v=viaNG1zyx1g
- https://www.youtube.com/watch?v=zaRkONvyGr8&t=556s
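
A minimal sketch of consistent hashing follows, since the notes only link to videos. It assumes MD5 as the ring hash and 100 virtual nodes per server; both are illustrative choices, and the class and method names are hypothetical. Each key is served by the first virtual node at or after its hash on the ring, so removing a server only moves the keys that server owned.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Several virtual nodes per server smooth out the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's hash, wrapping around.
        index = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    print(ring.get_node("user:42"))
```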

- Message Queue:
A message queue is a form of asynchronous service-to-service communication used in
serverless and microservices architectures. Messages are stored on the queue until they
are processed and deleted. Each message is processed only once, by a single
consumer. Message queues can be used to decouple heavyweight processing, to buffer
or batch work, and to smooth spiky workloads. E.g.: Amazon Simple Queue Service
(SQS).
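
The decoupling described above can be sketched in-process with the standard library. This is only an analogy for a hosted queue such as SQS, not its API: `run_workers` is a hypothetical helper, the producer enqueues messages without waiting, and each message is taken off the queue by exactly one worker thread.

```python
import queue
import threading


def run_workers(messages, num_workers=3):
    """Enqueue messages and let worker threads process each one once."""
    work_queue = queue.Queue()
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            message = work_queue.get()
            if message is None:  # sentinel: no more work for this worker
                work_queue.task_done()
                break
            with lock:
                processed.append(message)  # stand-in for heavyweight processing
            work_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for message in messages:  # the producer does not wait for processing
        work_queue.put(message)
    for _ in threads:
        work_queue.put(None)
    work_queue.join()
    for thread in threads:
        thread.join()
    return processed


if __name__ == "__main__":
    print(sorted(run_workers(list(range(10)))))
```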

- Horizontal vs Vertical Scaling:

- Horizontal scaling: by adding more servers into the pool of resources.
- Vertical scaling: by adding more resources (CPU, RAM, storage, etc) to an
existing server. This approach comes with downtime and an upper limit.

- Availability vs Reliability:
- Reliability:
- Reliability is the probability that a system will not fail in a given period.
- A distributed system is reliable if it keeps delivering its service even
when one or multiple components fail.
- Reliability is achieved through redundancy of components and data
(remove every single point of failure).
- Availability:
- Availability is the time a system remains operational to perform its
required function in a specific period.
- Measured by the percentage of time that a system remains
operational under normal conditions.
- A reliable system is available.
- An available system is not necessarily reliable.
- A system with a security hole is available when there is no
security attack.
- Efficiency:
- Latency: response time, the delay to obtain the first piece of data.
- Bandwidth (QPS): throughput, amount of data delivered in a given time

- Zookeeper:
- ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services. All of these kinds of services are used in some form or
another by distributed applications.

- Paxos: A consensus algorithm that lets distributed hosts agree on a value. ZooKeeper
relies on a related consensus protocol (Zab).

- On-device Scalability Inference, Federated Learning:
- Why: privacy, GDPR?
- Model save format for cross-platform inference:
- ONNX vs PMML: ONNX, the Open Neural Network Exchange
format, is an open format that supports the storing and porting of
predictive models across libraries and languages. ... PMML, or
Predictive Model Markup Language, is another interchange format
for predictive models. ONNX is preferred and supported for neural
networks (Torch, TF, etc.). PMML is more traditional, for scikit-learn and
classical ML kinds of models
- Tools:
- TF lite
- PyTorch Mobile: https://pytorch.org/mobile/home/
- Core ML: For iOS only
- API: Amazon Sagemaker

- TCP vs UDP: TCP is a connection-oriented, reliable protocol, while UDP is
connectionless and unreliable but fast, and therefore great for video streaming. TCP is
great for sending documents.

- Caching at ISPs (Internet Service Providers) via Open Connect for different
regions. E.g. YouTube/Netflix cache content per region (e.g. India), so requests from
that region do not hit netflix.com every time; they go to a regional cache (netflix.in
or something similar), which fetches the data much faster. Popular content, such as
Bollywood movies/videos, is stored in these caches, which serve 90% of their traffic.

- Synchronous vs Asynchronous (sort of real-time vs near real-time for
action/feed post-user-action): E.g. on IG, a user uploads pictures (all actions
related to this activity are sync), but pulling the followers, showing them the new
album, and showing new content to the user are async; the system/IG
performs them based on the previous activity (upload) of the user.

In a synchronous system, operations (instructions, calculations, logic, etc.) are
coordinated by one or more centralized clock signals. An asynchronous digital
system, in contrast, has no global clock. Asynchronous systems do not depend
on strict arrival times of signals or messages for reliable operation.

The major difference between them lies in their transmission methods, i.e.
Synchronous transmissions are synchronized by an external clock; whereas
Asynchronous transmissions are synchronized by special signals along the
transmission medium.

- CDN and Edge: Content Delivery Network: dedicated servers in a region that
serve data locally (mostly for videos, as with Netflix). Edge is similar to a CDN but
even more local, with a dedicated line between the edge and the consumer that avoids
transfer over the busy internet, and therefore gives faster data transfer than a CDN.

- HTTPS = HTTP + TLS (Transport Layer Security): More secure with TLS
protocols

- Bloom Filters, Count-min Sketch: Space-efficient probabilistic data structures.
A Bloom filter identifies whether an item is part of a set or not; it can have false
positives but not false negatives, and is extremely space-efficient. A Count-min
sketch estimates how often an item occurs in a stream.
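
A small Bloom filter sketch shows why false negatives cannot happen: adding an item sets a few bits, and a membership test only says "yes" if all of those bits are set. The bit-array size, the number of hash functions, and the use of salted SHA-256 are assumptions for illustration, not tuned values.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice@example.com")
    print("alice@example.com" in seen)
```

An unrelated item may still test positive if other items happened to set the same bits; that is the false-positive case.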

- Pub-Sub and Queue: Note that customer-facing requests through app/UI should
not be directly exposed to Pub-Sub.
- LRU Cache: LRU stands for least recently used and the idea is to remove the
least recently used data to free up space for the new data.
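
One common way to sketch an LRU cache in Python is with `collections.OrderedDict`, which keeps keys in insertion order and can move a key to the end in constant time. The class below is a minimal illustration (the name `LRUCache` and its methods are chosen for this example); the least recently used key sits at the front and is evicted first.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")      # "a" is now the most recently used
    cache.put("c", 3)   # evicts "b"
    print(cache.get("b"), cache.get("a"), cache.get("c"))
```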

- Multi-threading, multi-processing, locks, synchronizations

- Kafka vs Kinesis: Git doc

- Cassandra vs Mongo vs Redis: Git doc

- Solr and Elasticsearch are built on top of Lucene: Highly available, scalable. They
allow full-text search.

Approach

1. Questions to ask:
a. Input:
i. App type
ii. List possible actions (such as Instagram: upload, like, share, comment,
etc.)
iii. Data: kind of data, GDPR/Privacy
iv. Users: kind of users, demographics
b. Optimizing for:
i. Consistency, Availability, Performance, Partitioning?
ii. ACID (Atomicity, Consistency, Isolation, Durability) vs BASE (Basically Available,
Soft state, Eventual consistency)
c. Traffic:
i. # Users
ii. # active users
iii. QPS, Per day, per year
iv. Lifecycle?

2. Data Model and Capacity Model:

a. (MUST) Schema: Define all the tables and their columns for users: the columns to
store per row/sample, plus default columns such as the entry date/ts and expiry
date.
b. The best way to define tables and corresponding Data models is by features
supported by the app.
c. Storage size @ per row, per day, per year, 5 years
d. DB type:
i. RDBMS vs NoSQL (ACID, BASE)
ii. Caching (Redis)
iii. Streaming (Kinesis/Kafka)

3. Service:
a. Continuous integration and deployment: Kubernetes, Docker
b. Microservices/Monolithic
c. Synchronous, Asynchronous
d. API: REST, gRPC
e. ZooKeeper/Paxos: Managing distributed coordination

4. Scalability and Latency:

a. Service side:
i. Microservices/Monolithic
ii. Load Balancer (service level, back-end or feature level)
iii. Kubernetes + Docker for continuous integration and deployment
iv. ZooKeeper: Managing distributed coordination
b. Data Side:
i. DB: NoSQL (Cassandra for fast reads and writes through wide-column
scalability; Mongo for documents, spread across multiple servers with no
single point of failure, like Cassandra)
ii. S3, Hadoop/Spark
c. Caching:
i. MemCache/Redis
d. Streaming:
i. Kinesis/Kafka

5. Model:
a. Deployment:
i. On Device: Look at concepts
ii. API: gRPC/Protobuf/REST
b. Active Learning

6. ML Design:
a. Questions:
i. Usecase:
ii. Data: Annotation: Size
iii. Metrics
iv. Active learning
v. Scalable: training vs inference
b. Model:
i. Regression, Classification, Supervised/Unsupervised
c. Training:
i. Distributed Training: PyTorch data parallelism, Distributed TF

Relevant Resources
- System Design Basics:
- https://github.com/chagri/grokking-system-design/tree/master/basics
- Cracking the Coding Interview
- Glossary of terms:
- https://www.youtube.com/watch?v=UzLMhqg3_Wc&list=PL73KFetZlkJSZ9vTDSJ1swZh
e6CIYkqTL&index=3&t=0s
- Grokking the system design interview:
- Must Watch: https://github.com/lei-hsia/grokking-system-design

- Awesome Youtube Playlist:

- https://www.youtube.com/playlist?list=PL73KFetZlkJSZ9vTDSJ1swZhe6CIYkqTL

- LC:
- Design Youtube:
https://leetcode.com/discuss/interview-question/system-design/496042/Design-vi
deo-sharing-platform-like-Youtube
- CP Git:
- All DBs summary:
https://github.com/chagri/CP/blob/master/2020_practice/System_Design_Databa
ses/All_DBs_Cloud_And_Deployment_Systems.md
- https://github.com/chagri/CP/tree/master/system_design
- Kafka Theory and basics brush up:
- https://learning.oreilly.com/videos/apache-kafka-series/9781789342604/9781789
342604-video2_1
- Distributed system basics:
- https://learning.oreilly.com/videos/distributed-systems-in/9781491924914
- Cassandra:
- https://learning.oreilly.com/videos/mastering-cassandra-essentials/97814919941
22

Examples
1. URL Shortening:
a. CTCI
b. https://www.youtube.com/watch?v=JQDHz72OA3c&list=PL73KFetZlkJSZ
9vTDSJ1swZhe6CIYkqTL&index=26&t=180s
c. https://blog.codinghorror.com/url-shortening-hashes-in-practice/

2. Video Streaming:
a. YouTube:
i. https://leetcode.com/discuss/interview-question/system-design/496042/D
esign-video-sharing-platform-like-Youtube
b. Netflix:
https://www.youtube.com/watch?v=x9Hrn0oNmJM

3. Design Distributed DB or Key-Value Store:

https://www.youtube.com/watch?v=rnZmdmlR-2M

4. Image Video Sharing (Instagram/TikTok):

https://www.youtube.com/watch?v=QmX2NPkJTKg

5. Messaging: WhatsApp, FB Messenger:

https://www.youtube.com/watch?v=zKPNUMkwOJE&list=PL73KFetZlkJSZ9vTD
SJ1swZhe6CIYkqTL&index=9&t=156s

6. Recommendation System: Amazon

7. News Feed: FB feed, Twitter

8. FB example for News Feed:

For each user-post pair, obtain the predictions (the original figure is not reproduced
here), calculate a relevancy score by multiplying the values with the probabilities, and
then showcase the posts.
9. Search autocomplete:
https://www.youtube.com/watch?v=us0qySiUsGU&list=PL73KFetZlkJSZ9vTDSJ
1swZhe6CIYkqTL&index=15

10. Logistics: Uber/Instacart

11. Logging Service for 1B users:

Glossary
https://www.youtube.com/watch?v=UzLMhqg3_Wc&list=PL73KFetZlkJSZ9vTDSJ1swZhe6CIYkqTL&inde
x=3&t=0s
Things to consider
● Features
● API
● Availability
● Latency
● Scalability
● Durability
● Class Diagram
● Security and Privacy
● Cost-effective

Concepts to know
● Vertical vs horizontal scaling
● CAP theorem
● ACID vs BASE
● Partitioning/Sharding
● Consistent Hashing
● Optimistic vs pessimistic locking
● Strong vs eventual consistency
● RelationalDB vs NoSQL
● Types of NoSQL
○ Key value
○ Wide column
○ Document-based
○ Graph-based
● Caching
● Data center/racks/hosts
● CPU/memory/Hard drives/Network bandwidth
● Random vs sequential read/writes to disk
● HTTP vs HTTP/2 vs WebSocket
● TCP/IP model
● IPv4 vs IPv6
● TCP vs UDP
● DNS lookup
● HTTP & TLS
● Public key infrastructure and certificate authority(CA)
● Symmetric vs asymmetric encryption
● Load Balancer
● CDNs & Edges
● Bloom filters and Count-Min sketch
● Paxos
● Leader election
● Design patterns and Object-oriented design
● Virtual machines and containers
● Pub-sub architecture
● MapReduce
● Multithreading, locks, synchronization, CAS(compare and set)

Tools
● Cassandra
● MongoDB/Couchbase
● MySQL
● Memcached
● Redis
● Zookeeper
● Kafka
● NGINX
● HAProxy
● Solr, Elastic search
● Amazon S3
● Docker, Kubernetes, Mesos
● Hadoop/Spark and HDFS

ML Design
Objective/Goals

● A good design will touch on the following different components:

○ Problem formulation
○ Optimization function
○ Supervision signal
○ Feature engineering
■ - Data source
■ - Representation
■ - For example, the number of likes is a good idea but a better feature
would involve normalization, smoothing, and bucketing.
○ Model architecture
○ Evaluation metrics
○ Deployment (A/B testing)
○ End to end:
■ What will you do after you train the model and the model does not
perform well?
■ How do you go about debugging an ML model?
■ How do you evaluate and continuously deploy an ML model?
■ Weights and Biases tool or any other tool?
■ TensorBoard
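
To illustrate the feature-engineering bullet about the number of likes, the sketch below shows one possible form of each transformation. The bucket edges, the prior rate, and the prior strength are made-up values for illustration; the notes do not specify them.

```python
import bisect
import math

BUCKET_EDGES = [1, 10, 100, 1000]  # hypothetical bucket boundaries


def log_normalize(count):
    """Compress heavy-tailed counts; log1p keeps 0 mapped to 0."""
    return math.log1p(count)


def smoothed_rate(likes, impressions, prior_rate=0.05, prior_strength=20):
    """Like rate with additive smoothing toward an assumed prior rate."""
    return (likes + prior_rate * prior_strength) / (impressions + prior_strength)


def bucketize(count, edges=BUCKET_EDGES):
    """Map a raw count to a bucket index (0 for counts below the first edge)."""
    return bisect.bisect_right(edges, count)


if __name__ == "__main__":
    for likes, impressions in [(0, 0), (1, 1), (120, 4000)]:
        print(likes, round(log_normalize(likes), 3),
              round(smoothed_rate(likes, impressions), 4), bucketize(likes))
```

Smoothing keeps a post with 1 like out of 1 impression from looking better than a post with 100 likes out of 100 impressions, which the raw ratio cannot distinguish.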

Concepts
● Ranking:
FB example for News Feed:
For each user-post pair, obtain the predictions (the original figure is not reproduced
here), calculate a relevancy score by multiplying the values with the probabilities, and
then showcase the posts.
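
The relevancy score described above can be read as a weighted sum: each predicted engagement probability is multiplied by the value assigned to that event, and posts are sorted by the total. The event names and values below are hypothetical placeholders, not Facebook's actual weights, and the probabilities would come from a trained model.

```python
# Hypothetical value of each engagement event (illustrative only).
EVENT_VALUES = {"like": 1.0, "comment": 4.0, "share": 8.0}


def relevance_score(event_probs, event_values=EVENT_VALUES):
    """Sum of value * predicted probability over the engagement events."""
    return sum(event_values[event] * prob for event, prob in event_probs.items())


def rank_posts(candidates):
    """Return post IDs sorted from most to least relevant for one user."""
    return sorted(candidates,
                  key=lambda post_id: relevance_score(candidates[post_id]),
                  reverse=True)


if __name__ == "__main__":
    predictions = {
        "post-1": {"like": 0.5, "comment": 0.0, "share": 0.0},
        "post-2": {"like": 0.1, "comment": 0.2, "share": 0.05},
    }
    print(rank_posts(predictions))
```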
Approach
https://research.fb.com/blog/2018/05/the-facebook-field-guide-to-machine-learning-video-series/

1. Problem formulation (Data, Target, Success Metrics, Type):

a. Get data(input, output), application
b. Get Target
c. Success Metrics: NDCG, Precision, Recall (ranking/classification); RMSE (regression)
d. Type: Ranking, Supervised (Classification, Regression), Semi-supervised, Unsupervised

2. Feature Engineering:
a. Data:

b. Types of features: Categorical, Continuous, Derived

c. Normalization, smoothing, bucketing, Scaling
d. Edge cases, sparsity, sampling, seasonality
e. Representation
f. Special featurization techniques: Embeddings, categorization, etc.

3. Training:
a. Model architecture: Interpretable?
b. Cross-Validation
c. Baseline Model:
i. Get a simple baseline such as random prediction (normalized entropy), or the
general likelihood of click/not-click, i.e. the average click rate.
ii. Or train on first-time clickers, i.e. with fewer context/history features and
therefore simpler inputs.

4. Optimization function:
a. Based on the type and metrics: Ranking loss, RMSE, Cross-Entropy
b. Define based on performance wrt baseline model (RMSE in test data wrt avg
CTR, etc.)

5. Evaluation and Deployment:

a. Offline:
i. Cross-Validation
ii. Progressive Evaluation: Train on the older batch, evaluate the latest batch
of data.
iii. Calibration (sanity check for overfitting): compute this on both the train and
test sets, i.e. compare the average prediction with the average observed label
(generalization):

https://medium.com/analytics-vidhya/calibration-in-machine-learning-e797
2ac93555

iv. Convert Binary to multi-class and evaluate how and where the
performance is coming from, also to debug.

b. Online: A/B Test:

One way to test whether an A/B test is configured properly is to compare control with
control and check that the two show the same performance. If that’s not the case, then
the experiment is not set up fairly.

c. Evaluate on devices, demographics, days/events, and different experiences

d. Statistical Tests
e. Active Learning, Real-time training (evaluate in latest data)
i. Logging the data/activity during deployment.
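
The calibration check in step 5.a.iii can be sketched as follows. The notes do not spell out the formula; a common definition, assumed here, is the ratio of the summed predictions to the summed observed labels, where a value near 1.0 means the model is neither over- nor under-predicting on average. The function name is chosen for this example.

```python
def calibration(predictions, labels):
    """Ratio of average predicted probability to average observed label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    total_labels = sum(labels)
    if total_labels == 0:
        raise ValueError("calibration is undefined without positive labels")
    return sum(predictions) / total_labels


if __name__ == "__main__":
    train = calibration([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1])
    test = calibration([0.9, 0.9, 0.7, 0.7], [1, 0, 1, 0])
    # A train value near 1.0 with a test value far from 1.0 hints at overfitting.
    print(round(train, 3), round(test, 3))
```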

6. Hybrid:
a. Model + Rules (such as cache and latest seen items/interests)
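
The control-versus-control check from step 5.b can be made concrete with a significance test. The notes only say the two control groups should perform the same; the two-proportion z-test below is one standard way to check that, chosen for this sketch rather than taken from the source. The conversion counts in the usage example are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Control vs control: a large p-value is what a fair setup should show.
    print(two_proportion_z_test(500, 10_000, 505, 10_000))
    # A real difference between groups gives a small p-value.
    print(two_proportion_z_test(500, 10_000, 650, 10_000))
```

If the control-versus-control comparison is significant far more often than the chosen threshold allows, the bucketing or logging is likely biased.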

Examples
1. Scalable Collaborative Filtering for FB Ads:
https://engineering.fb.com/core-data/recommending-items-to-more-than-a-billion-people/
2. Pinterest Recommendation System using CNNs:
Efficient convolutional network for recommender systems
Patent: Look at images

Harsh Kumar Ramchandani
No ratings yet
Core Java Interview Questions You'll Most Likely Be Asked: Job Interview Questions Series
From Everand
Core Java Interview Questions You'll Most Likely Be Asked: Job Interview Questions Series
Vibrant Publishers
4/5 (14)
The Coder Habits: The #39# Habits of the Professional Programmer
From Everand
The Coder Habits: The #39# Habits of the Professional Programmer
Rafael Gómez Blanes
5/5 (1)
2024 Guide to Searching the Internet
No ratings yet
2024 Guide to Searching the Internet
24 pages
Ev - Control - Unit Mitsubishi
No ratings yet
Ev - Control - Unit Mitsubishi
3 pages
ISE414 Britz3e PPT 03
No ratings yet
ISE414 Britz3e PPT 03
35 pages
Unit - 1 Introduction To Mechatronics
100% (1)
Unit - 1 Introduction To Mechatronics
106 pages
ReleaseNote_En_V4.80_20250318
No ratings yet
ReleaseNote_En_V4.80_20250318
256 pages
Active Directory Exploitation Cheat Sheet by S1Ckb0Y1337
100% (1)
Active Directory Exploitation Cheat Sheet by S1Ckb0Y1337
28 pages
Z Adru Adj PRL
No ratings yet
Z Adru Adj PRL
5 pages
VIISemesterTeachingScheme2016 17 PDF
No ratings yet
VIISemesterTeachingScheme2016 17 PDF
13 pages
(PME) Power Management Systems - Licensing Overview
No ratings yet
(PME) Power Management Systems - Licensing Overview
57 pages
PDF Download Decameron Tarot English and Spanish Edition Textbook 210211144001
No ratings yet
PDF Download Decameron Tarot English and Spanish Edition Textbook 210211144001
34 pages
Load Promo Codes (Smart, TNT, TM Globe)
0% (1)
Load Promo Codes (Smart, TNT, TM Globe)
10 pages
Resume Venu (1) Sfdaf (1) Latest
No ratings yet
Resume Venu (1) Sfdaf (1) Latest
2 pages
CyberSudo's OSINT Toolkit
No ratings yet
CyberSudo's OSINT Toolkit
38 pages
Kanika Sharma CV
No ratings yet
Kanika Sharma CV
9 pages
NPM Package Manager Q A
No ratings yet
NPM Package Manager Q A
2 pages
Lesson2Module3Lec DataSecurityAwareness
No ratings yet
Lesson2Module3Lec DataSecurityAwareness
29 pages
Kusto dashboard reports via email
No ratings yet
Kusto dashboard reports via email
2 pages
HIGH SCHOOL ICT Malware
No ratings yet
HIGH SCHOOL ICT Malware
20 pages
ReleaseNote SP MDT Tool W1712.00.00
No ratings yet
ReleaseNote SP MDT Tool W1712.00.00
16 pages
10341d1308260677 KM 3650w Light Image Only Center Area KM 3650wentb1r4
No ratings yet
10341d1308260677 KM 3650w Light Image Only Center Area KM 3650wentb1r4
17 pages
Creators - Hok-Creation
No ratings yet
Creators - Hok-Creation
1 page
Saurabh Mittal - CV
No ratings yet
Saurabh Mittal - CV
5 pages
System Design Interview An Insider's Guide by Alex Xu
No ratings yet
System Design Interview An Insider's Guide by Alex Xu
1 page
[Ebooks PDF] download Coding Club Level 1 Python Basics Roffey C. full chapters
100% (5)
[Ebooks PDF] download Coding Club Level 1 Python Basics Roffey C. full chapters
72 pages
Iptv For Isps: DVB-S/S2 DVB-T/T2 DVB-C/C2 Isdb-T
No ratings yet
Iptv For Isps: DVB-S/S2 DVB-T/T2 DVB-C/C2 Isdb-T
4 pages
BIM + CROSS RAILS Use Case Study
No ratings yet
BIM + CROSS RAILS Use Case Study
18 pages
TAD Is Not Getting Activated BAD LABELS IN GL
No ratings yet
TAD Is Not Getting Activated BAD LABELS IN GL
4 pages
Immediate Opening For Non
No ratings yet
Immediate Opening For Non
2 pages
Ansys All You Need To Know About Hardware For Simulation
No ratings yet
Ansys All You Need To Know About Hardware For Simulation
36 pages
Realtek Semicon RTL8761BUV CG - C2687140
No ratings yet
Realtek Semicon RTL8761BUV CG - C2687140
23 pages

System Design - ML Design 1 PDF

Uploaded by

System Design - ML Design 1 PDF

Uploaded by

System Design

● A good design shows that you:

● Consistency: Every read receives the most recent write or an error

Order of Priority to keep in mind while designing a Distributed DB:

● The capability of a system to grow and manage increased demand.

● Reliability is the probability that a system will operate without failure over a given period.

● Ease of operating and maintaining the system.

Concepts and Handy Terminologies

AWS Microservices: https://aws.amazon.com/microservices/

- Microservice vs Monolithic vs Serverless:

A monolithic application is simply deployed on a set of identical servers behind a

- Memcached vs Redis/Distributed Cache:

Redis is sometimes described as “Memcached on steroids,” which is hardly surprising

- Kubernetes: For deploying, scaling, and managing containers (commonly the deployment target of CI/CD pipelines)

Microservice Architecture with Azure Kubernetes Service (AKS):

WebSocket (a separate protocol, opened through an HTTP Upgrade handshake) is bidirectional, i.e. a protocol providing full-duplex

- Consistent Hashing and Load Balancing across the servers
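
To make the consistent hashing idea concrete, here is a minimal sketch of a hash ring with virtual nodes. It is an illustration under stated assumptions, not a production design: the class name `ConsistentHashRing`, the `vnodes` parameter, the server names, and the choice of MD5 as the placement hash are all hypothetical and do not come from the original notes.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Illustrative hash ring; each server is placed at several virtual points."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per server (assumed default)
        self._hashes = []      # sorted ring positions
        self._owner = {}       # ring position -> server name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # First 8 bytes of MD5 as a 64-bit ring position (placement only, not security).
        return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            # Position collisions are ignored here; they are very unlikely at 64 bits.
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove_node(self, node):
        kept = [h for h in self._hashes if self._owner[h] != node]
        for h in self._hashes:
            if self._owner.get(h) == node:
                del self._owner[h]
        self._hashes = kept

    def get_node(self, key):
        if not self._hashes:
            raise LookupError("ring is empty")
        # Walk clockwise to the first position after the key, wrapping at the end.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

The property that matters for load balancing is that removing a server only reassigns the keys that server owned; keys on the remaining servers stay where they were, which a plain `hash(key) % num_servers` scheme does not guarantee.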

- Horizontal vs Vertical Scaling:

- TCP vs UDP: TCP is a connection-oriented, reliable protocol, while UDP is

- Synchronous vs Asynchronous (Sort of real-time vs near real-time for

In a synchronous system, operations (instructions, calculations, logic, etc.) are

- Bloom Filters, Count-min Sketch: Space-Efficient Probabilistic Data Structures
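
As a rough sketch of what "space-efficient probabilistic" means for a Bloom filter, the example below sizes a bit array from an expected item count and a target false-positive rate, using the standard sizing formulas. The class name `BloomFilter`, its parameters, and the use of SHA-256 with double hashing to derive the bit positions are assumptions made for illustration, not something the original notes specify.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: may report false positives, never false negatives."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A membership test such as `"user:42" in bloom` answers "definitely not present" or "probably present", so a filter like this is typically placed in front of a more expensive lookup (disk, remote cache) to skip most misses.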

- Multi-threading, multi-processing, locks, synchronizations

- Kafka vs Kinesis: Git doc

- Cassandra vs Mongo vs Redis: Git doc

2. Data Model and Capacity Model:

4. Scalability and Latency:

- Awesome Youtube Playlist:

3. Design Distributed DB or Key-Value Store:

4. Image Video Sharing (Instagram/TikTok):

5. Messaging: WhatsApp, FB Messenger:

6. Recommendation System: Amazon

8. FB example for News Feed:

10. Logistics: Uber/Instacart

11. Logging Service for 1B users:

● A good design will touch on the following different components:

1. Problem formulation (Data, Target, Success Metrics, Type):

b. Types of features: Categorical, Continuous, Derived

5. Evaluation and Deployment:

b. Online: A/B Test:

c. Evaluate on devices, demographics, days/events, and different experiences
