Week7 Slides

Uploaded by Karan Kumar

Backend Systems

Memory
Hierarchy
Types of storage elements
● On-chip registers: 10s-100s of bytes
● SRAM (cache): 0.1 - 1 MB
● DRAM: 0.1 - 10 GB
● Solid-state disk (SSD) - Flash: 1-100 GB
● Magnetic disk (HDD, hard disk drive): 0.1 - 10 TB
● Optical, magnetic, holographic, ...
Storage Parameters
● Latency: time to read first value from a storage location (lower is better)
○ Register < SRAM < DRAM < SSD < HDD
● Throughput: number of bytes/second that can be read (higher is better)
○ DRAM > SSD > HDD (regs, SRAM limited capacity)
● Density: number of bits stored per unit area / cost (higher is better)
○ Volume manufacture important
○ HDD > SSD > DRAM > SRAM > Regs
Computer Organization
● CPU has as many registers as possible
● Backed by L1, L2, L3 cache (SRAM)
● Backed by several GB of DRAM working memory
● Backed by SSD for high throughput
● Backed by HDD for high capacity
● Backed by long-term storage, backup
Cold storage
● Backups and archives:
○ Huge amounts of data
○ Not read very often
○ Can tolerate high read latency
● Amazon Glacier, Google, Azure Cold/Archive storage classes
● High latency of retrieval: up to 48 hours
● Very high durability
● Very low cost
Impact on application development
● Plan the storage needs based on application growth
● Speed of app determined by types of data stored, how stored
● Some data stores are more efficient for some types of read/write operations

Developer must be aware of choices and what kind of database to choose for a
given application
Data Search
O() notation
● Used in study of algorithmic complexity: beyond scope of this course
● Rough approximation: “order of magnitude”, “approximately” etc.
● Main concepts here:
○ O(1) - constant time independent of input size - excellent!
○ O(log N) - logarithmic in input size - grows slowly with input - very good
○ O(N) - linear in input size - often the baseline - would like to do better
○ O(N^k) - polynomial (quadratic, cubic etc.) - not good as input size grows
○ O(k^N) - exponential - VERY bad: won’t work even for reasonably small inputs
Searching for element in memory
Unsorted data in a linked list

● Start from beginning


● Proceed stepwise, comparing each element
● Stop if found and return LOCATION
● If end-of-list, return NOTFOUND

O(N)
Searching for element in memory
Unsorted data in array

● Start from beginning


● Proceed stepwise, comparing each element
● Stop if found and return LOCATION
● If end-of-list, return NOTFOUND

O(N)
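The linear scan described above can be sketched in Python; the LOCATION / NOTFOUND return values follow the slide, with -1 standing in for NOTFOUND:

```python
def linear_search(arr, target):
    """Scan elements one by one: O(N) comparisons in the worst case."""
    for i, value in enumerate(arr):
        if value == target:
            return i   # LOCATION of the match
    return -1          # NOTFOUND sentinel


# Works on unsorted data; every element may need to be examined.
```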
Searching for element in memory
Sorted data in array

● Start from beginning


● Proceed stepwise, comparing each element
● Stop if found and return LOCATION
● If end-of-list, return NOTFOUND

O(N) but...
Searching for element in memory
Sorted data in array

● Look at middle element in array:


○ greater than target - search in lower half
○ lesser than target - search in upper half
● Switch focus to new array: half the size of original
○ Repeat

O(log N)
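The halving procedure above can be sketched as follows (a minimal illustration, not a production implementation):

```python
def binary_search(arr, target):
    """Repeatedly halve the search range: O(log N) on a sorted array."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid        # LOCATION
        elif arr[mid] > target:
            hi = mid - 1      # middle greater than target: search lower half
        else:
            lo = mid + 1      # middle less than target: search upper half
    return -1                 # NOTFOUND
```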
Problems with arrays
● Size must be fixed ahead of time
● Adding new entries requires resizing - can try oversize, but eventually ...
● Maintaining sorted order O(N):
○ find location to insert
○ move all further elements by 1 to create a gap
○ insert
● Deleting
○ find location, delete
○ move all entries down by 1 step
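The O(N) cost of a sorted insert is visible in a short Python sketch: the standard-library bisect module finds the slot in O(log N), but list.insert still shifts every later element:

```python
import bisect


def sorted_insert(arr, value):
    """Keep a list sorted: O(log N) to find the slot, O(N) to shift."""
    pos = bisect.bisect_left(arr, value)  # find location to insert
    arr.insert(pos, value)                # moves later elements to create a gap
    return pos


nums = [10, 20, 40]
sorted_insert(nums, 30)  # nums becomes [10, 20, 30, 40]
```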
Alternatives
● Binary search tree
○ Maintaining sorted order is easier: growth of tree
● Self-Balancing
○ BST can easily tilt to one side and grow downwards
○ Red-black, AVL, B-tree… more complex, but still reasonable
● Hash tables
○ Compute an index for an element: O(1)
○ Hope the index for each element is unique!
■ Difficult but doable in many cases
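A minimal chaining hash table illustrates the compute-an-index idea and what happens when the index is not unique (a collision); class and method names here are made up for illustration:

```python
class HashTable:
    """Toy hash table: compute a bucket index in O(1), chain on collisions."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)  # the O(1) index computation

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite existing key
                return
        bucket.append((key, value))           # collision: chain in same bucket

    def get(self, key):
        bucket = self.buckets[self._index(key)]
        for k, v in bucket:
            if k == key:
                return v
        return None
```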
Database Search
Databases (tabular)
● Tables with many columns
● Want to search quickly on some columns
● Maintain “INDEX” of columns to search on
○ Store a sorted version of column
○ Needs column to be “comparable”: integer, short string, date/time etc.
■ Long text fields are not good for index
■ Binary data not good
Example: MySQL
https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
Index-friendly query
SELECT * FROM tbl_name WHERE key_col LIKE 'Patrick%';

SELECT * FROM tbl_name WHERE key_col LIKE 'Pat%_ck%';


Index-unfriendly query
SELECT * FROM tbl_name WHERE key_col LIKE '%Patrick%';

SELECT * FROM tbl_name WHERE key_col LIKE other_col;


Multi-column index
● (index_1, index_2, index_3): compound index on 3 columns:
○ first sorted on index_1, then on index_2, then on index_3
○ all values with same index_1 will be sorted on index_2,
○ all values with same index_1 and index_2 will be sorted on index_3
○ etc.
● eg. (date-of-birth, city-of-birth, name)
○ can query for all people born on same date in same city with same name easily
○ but...
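The leftmost-prefix behaviour above can be demonstrated with Python's built-in sqlite3 (table and index names here are hypothetical; MySQL's planner behaves analogously for B-tree indexes, per the docs linked earlier):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (dob TEXT, city TEXT, name TEXT)")
# Compound index: rows ordered by dob, then city, then name.
con.execute("CREATE INDEX idx_dcn ON people (dob, city, name)")

# Leftmost-prefix query (dob, then city): the planner can seek via the index.
good = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM people "
    "WHERE dob = '2000-01-01' AND city = 'Chennai'").fetchall()

# Skipping the leftmost column: the index cannot be used to seek,
# so the planner falls back to scanning.
bad = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM people "
    "WHERE city = 'Chennai'").fetchall()
```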
Multi-index friendly
... WHERE index_part1=1 AND index_part2=2 AND other_column=3

/* index = 1 OR index = 2 */
... WHERE index=1 OR A=10 AND index=2

/* optimized like "index_part1='hello'" */
... WHERE index_part1='hello' AND index_part3=5

/* use index on index1 but not on index2 or index3 */
... WHERE index1=1 AND index2=2 OR index1=3 AND index3=3;
Multi-index unfriendly
/* index_part1 is not used */
... WHERE index_part2=1 AND index_part3=2

/* Index not used in both parts of the WHERE clause */
... WHERE index=1 OR A=10

/* No index spans all rows */
... WHERE index_part1=1 OR index_part2=10
Hash-index
● Only used in in-memory tables
● Only for equality comparisons - cannot handle “range”
● Does not help with “ORDER BY”
● Partial key prefix cannot be used
● But VERY fast where applicable...
Query Optimization
Database specific

● https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
● https://www.sqlite.org/optoverview.html
● Postgres:
Summary
● Setting up queries properly impacts application performance
● Building proper indexes crucial to good search
● Compound indexes, multiple indexes etc. possible
○ Too many can be waste of space
● Make use of structure in data to organize it properly
SQL vs NoSQL
SQL
● Structured Query Language
○ Used to query databases that have structure
○ Could also be used for CSV files, spreadsheets etc.
● Closely tied to RDBMS - relational databases
○ Columns / Fields
○ Tables of data hold relationships
○ All entries in a table must have same set of columns
● Tabular databases
○ Efficient indexing possible - use specified columns
○ Storage efficiency: prior knowledge of data size
Problem with tabular databases
● Structure (good? bad?)
● All rows in table must have same set of columns

Example

● Student in a hostel => needs a mess assignment
● Day-scholar student => needs a gate pass for a vehicle
● One table? A column for mess and a column for gate pass, each empty for half the rows?
Alternate ways to store: Document databases
● Free-form (unstructured) documents
○ Typically JSON encoded
○ Still structured, but each document has its own structure
● Examples:
○ MongoDB
○ Amazon DocumentDB
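A tiny illustration, reusing the hostel/day-scholar example from earlier (field names are made up): two documents in the same collection can carry different fields, and a "query" simply inspects whichever fields each document happens to have.

```python
import json

# Each JSON document carries its own structure: no shared column set.
hostel_student = {"name": "Asha", "type": "hostel", "mess": "North Mess"}
day_scholar = {"name": "Ravi", "type": "day-scholar",
               "gate_pass": {"vehicle": "scooter", "plate": "TN-01-1234"}}

collection = [json.dumps(doc) for doc in (hostel_student, day_scholar)]

# Find students that have a mess assignment at all.
with_mess = [json.loads(d)["name"] for d in collection
             if "mess" in json.loads(d)]
```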
Alternate ways to store: Key-Value
● Python dictionary, C++ std::map / std::unordered_map etc.: dictionary/hash table
● Map a key to a value
● Store using search trees or hash tables
● Very efficient key lookup, not good for range type queries
● Examples:
○ Redis
○ BerkeleyDB
○ memcached …
● Often used alongside other databases for “in-memory” fast queries
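The "in-memory cache in front of a slower store" pattern can be sketched with a plain dictionary (a stand-in for Redis or memcached, not their actual APIs; the keys and data are made up):

```python
# Pretend this lives on disk or across the network (the slow store).
backing_store = {"user:1": "Asha", "user:2": "Ravi"}

cache = {}  # in-memory key-value cache: O(1) hash lookups


def get(key):
    if key in cache:                 # fast path: already cached
        return cache[key]
    value = backing_store.get(key)   # slow path: hit the backing store
    if value is not None:
        cache[key] = value           # populate cache for next time
    return value
```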
Alternate ways to store: Column stores
● Traditional relational DBs store all values of a row together on disk
○ Retrieving all entries of a given row very fast
● Instead store all entries in a column together
○ Retrieve all values of a given attribute (age, place of birth, …) very fast
● Examples:
○ Cassandra
○ HBase...
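A toy comparison of the two layouts (sample data is made up): the row layout keeps each record's values together, while the column layout makes a whole-attribute aggregate a single contiguous read.

```python
# Row store: each row's values kept together (good for "fetch one record").
rows = [("Asha", 21, "Chennai"), ("Ravi", 23, "Mumbai"), ("Meera", 22, "Delhi")]

# Column store: all values of one attribute kept together
# (good for "aggregate one attribute over all records").
columns = {
    "name": ["Asha", "Ravi", "Meera"],
    "age":  [21, 23, 22],
    "city": ["Chennai", "Mumbai", "Delhi"],
}

# Whole-column aggregate touches one list instead of every row.
avg_age = sum(columns["age"]) / len(columns["age"])
```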
Alternate ways to store: Graphs
● Friend-of-a-friend, social networks, maps: graph oriented relationships
● Different degrees (number of outgoing edges), weights of edges, nodes etc.
● Path-finding more important than just search
○ Connections, knowledge discovery
● Examples:
○ Neo4J
○ Amazon Neptune
Alternative ways to store: Time Series Databases
● Very application specific: store some metric or values as function of time
● Used for log analysis, performance analysis, monitoring
● Queries:
○ How many hits between T1 and T2?
○ Average number of requests per second?
○ Country from where maximum requests came in past 7 days?
● Typical RDBMS completely unsuitable - same for most alternatives
● Examples:
○ RRDTool
○ InfluxDB
○ Prometheus
● Related tooling: Elasticsearch (search), Grafana (visualization), ...
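The first and third example queries can be sketched over an in-memory log; this is a toy stand-in for a real TSDB, with made-up timestamps and country codes:

```python
# Hypothetical request log: (unix_timestamp, country) per hit.
hits = [(100, "IN"), (150, "US"), (210, "IN"), (300, "IN"), (390, "US")]


def hits_between(t1, t2):
    """How many hits arrived in the window [t1, t2]?"""
    return sum(1 for t, _ in hits if t1 <= t <= t2)


def top_country():
    """Country from which the most requests came."""
    counts = {}
    for _, country in hits:
        counts[country] = counts.get(country, 0) + 1
    return max(counts, key=counts.get)
```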
NoSQL?
● Started out as “alternative” to SQL
● But SQL is just a query language - can be adapted for any kind of query,
including from a document store or graph!
● “Not-only-SQL”
● Additional query patterns for other types of data stores
A word on ACID
● Transaction: core principle of database
● ACID:
○ Atomic
○ Consistent
○ Isolated
○ Durable
● Many NoSQL databases sacrifice some part of ACID (example: eventual
consistency instead of consistency) for performance
● But there can be ACID compliant NoSQL databases as well...
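Atomicity, the "A" above, can be demonstrated with Python's sqlite3, which is ACID on a single node (the transfer scenario and simulated crash are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
con.commit()

# Atomic transfer: both updates commit together, or neither does.
try:
    con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("crash mid-transaction")  # simulated failure
    con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
    con.commit()
except RuntimeError:
    con.rollback()  # atomicity: the half-done debit is undone

balances = dict(con.execute("SELECT name, balance FROM accounts"))
```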
Why not ACID?
● Consistency hard to meet: especially when scaling / distributing
● Eventual consistency easier to meet
● Example:
○ A (located in India) and B (located in the US) both add C as a friend on Facebook
○ Order of adding does not matter!
○ Temporarily seeing C in A’s list but not B, or B’s list but not A - not a catastrophe (?)
● Financial transactions absolutely require ACID
○ Consistency is paramount - even a split second of inconsistent data can cause problems
A word on storage
● In-memory:
○ Fast
○ Doesn’t scale across machines
● Disk
○ Different data structures, organization needed
Scaling
Replication and Redundancy
● Redundancy:
○ Multiple copies of same data
○ Often used in connection with backups - even if one fails, others survive
○ One copy is still the master
● Replication:
○ Usually in context of performance
○ May not be for purpose of backup
○ Multiple sources of same data - less chance of server overload
○ Live replication requires careful design
BASE vs ACID
● “Basically Available”, “Soft state”, “Eventually consistent”
○ Winner of worst acronym award
● Eventual consistency instead of Consistency
○ Replicas can take time to reach consistent state
● Stress on high availability of data
Replication in traditional DBs
● RDBMS replication possible
● Usually server cluster in same data center
○ Load balancing
● Geographically distributed replication harder
○ Latency constraints for consistency
Scale-up vs Scale-out
● Scale-up: traditional approach
○ Larger machine
○ More RAM
○ Faster network, processor
○ requires machine restart with each scale change
● Scale-out:
○ Multiple servers
○ Harder to enforce consistency etc. - better suited to NoSQL / non-ACID
○ Better suited to cloud model: Google, AWS etc provide automatic scale-out, cannot do
auto-scale-up
Application Specific
● Financial transactions:
○ cannot afford even slightest inconsistency
○ Only scale-up possible
● Typical web-application
○ Social networks, media: eventual consistency OK
○ e-commerce: only the financial part needs to go to ACID DB
Security
SQL in context of an application
● Non-MVC app: can have direct SQL queries anywhere
● MVC: only in controller, but any controller can trigger a DB query

So what’s dangerous about queries?


Typical HTML form
<form>
Username:
<input type="text" name="name">
<br />
Password:
<input type="password" name="password">
<br />
</form>
Code
name = form.request["name"]

pswd = form.request["password"]

sql = 'SELECT * FROM Users WHERE Name ="' + name + '" AND Pass ="' + pswd + '"'
Example input vs SQL

SELECT * FROM Users WHERE Name ="abcd" AND Pass = "pass"


Example input vs SQL

SELECT * FROM Users WHERE Name ="" or "" AND Pass = "" or ""

Result???
Example input vs SQL

sql = "SELECT * FROM Users WHERE Name = " + name

Input:

a; DROP TABLE Users;

Query:

SELECT * FROM Users WHERE Name = a; DROP TABLE Users;


Problem
● Parameters from HTML taken without validation
● Validation:
○ Are they valid text data (no special characters, other symbols)
○ No punctuation or other invalid input
○ Are they the right kind of input (text, numbers, email, date)?
● Validation MUST be done just before the database query - even if you have
validation in the HTML or Javascript - not good enough
○ Direct HTTP requests can be made with junk data
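Alongside validation, the standard defense is parameterized queries: the driver treats inputs purely as data, never as SQL, so the injection strings from the earlier slides fail to match anything. A sketch with Python's sqlite3 (the Users table mirrors the earlier example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Name TEXT, Pass TEXT)")
con.execute("INSERT INTO Users VALUES ('abcd', 'pass')")


def login(name, pswd):
    # Placeholders (?) bind inputs as values, never as SQL text.
    cur = con.execute(
        "SELECT * FROM Users WHERE Name = ? AND Pass = ?", (name, pswd))
    return cur.fetchone() is not None


ok = login("abcd", "pass")
attack = login('" or ""', '" or ""')  # injection attempt: just a weird string
```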
Web Application Security
● SQL injection
○ Use known frameworks, best practices, validation
● Buffer overflows, input overflows
○ Length of inputs, queries
● Server level issues - protocol implementation?
○ Use known servers with good track record of security
○ Update all patches
● Possible outcomes:
○ loss of data - deletion
○ exposure of data - sensitive information leak
○ manipulation of data - change
HTTPS ?
● Secure sockets: secure communication between client and server
● Server certificate:
○ based on DNS: has been verified by some trusted third party
○ difficult to spoof
○ based on mathematical properties - ensures a very low probability of a forged match
● However:
○ Only secures link for data transfer - does not perform validation or safety checks
○ Negative impact on “caching” of resources like static files
○ Some overhead on performance
Summary
● Internet and Web security are complex: enough for a course in themselves
● Generally recommended to use known frameworks with trusted track records
● Code audits
● Patch updates on OS, server, network stack etc. essential

App developers should be very careful of their code, but also aware of problems at
other levels of the stack
