Flipkart Website Architecture

      Mistakes & Learnings

          Siddhartha Reddy
          Architect, Flipkart
June 2007
November 2007
December 2012
www.flipkart.com
• Started in 2007
• Current Architecture from mid 2010
• Evolution of the architecture presented as…

       Issue[1] -> RCA[2] -> Actions -> Learnings

•   [1] Issue: Website is “slow”
•   [2] RCA = Root Cause Analysis
Surviving & reacting to the environment

INFANCY (2007 – MID-2010)
Website is “slow”!
RCA
• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…
Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db
[Diagram: before, a single MySQL server takes both reads and writes. After, writes go to the MySQL master and reads go to a MySQL slave kept in sync by replication.]
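A minimal sketch of the routing rule above, in Python for brevity (the site itself ran on PHP). The class name `QueryRouter`, the `critical` flag, and the string handles are hypothetical, introduced only for illustration.

```python
class QueryRouter:
    """Pick a database for a query: writes and critical reads go to the master."""

    def __init__(self, master, slave):
        self.master = master  # handle for master_db
        self.slave = slave    # handle for slave_db (replicated copy)

    def route(self, sql, critical=False):
        first_word = sql.lstrip().split(None, 1)[0].upper()
        is_write = first_word in {"INSERT", "UPDATE", "DELETE", "REPLACE"}
        # A slave can lag behind the master, so reads that must see the
        # latest write (critical reads) are also sent to the master.
        if is_write or critical:
            return self.master
        return self.slave


router = QueryRouter(master="master_db", slave="slave_db")
```

Ordinary page reads then land on slave_db, while `router.route(sql, critical=True)` keeps a read on master_db.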
Learning from it
• Scale out database reads by distributing load
  across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
Website is “slow”!
    (Again)
RCA
• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other
    backend jobs
• Urm…
Fixing it
• Analytics / reporting DB (archival_db)
    – Use MyISAM — optimized for reads
    – Additional indexes for quicker reporting
[Diagram: before, website writes go to the MySQL master, and a single MySQL slave serves both website reads and analytics reads. After, the master replicates to two slaves: Slave 1 serves website reads, Slave 2 serves analytics reads.]
Learning from it
• Isolate the databases being used for serving
  website traffic from those being used for
  analytics / reporting
• Isolate systems being used by production
  website from those being used for background
  processing
Learning the basics

BABY (2010 – 2011)
Website is “slow”!
RCA
• Why?
• How?
  – Instrumentation
RCA - 1
• Why?
     – Logging a lot
     – PHP processes blocking on writing logs
[Diagram: Request1 -> Process1, Request2 -> Process2 and Request3 -> Process3 all write to the same log file. While one process is writing, the others are waiting.]
RCA - 2
• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
     • Creating fresh connection for each call
     • All the calls are made in serial order


   Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response
RCA - 3
• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving
    each request
     Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response
RCA – 1,2,3
• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!
Fixing it
• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”
fk-w3-agent: LoggingHandler

[Diagram: before, Process1, Process2 and Process3 each write directly to the log file. After, they hand their log lines to fk-w3-agent, which writes to the log file (async / buffered).]
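A minimal sketch of async / buffered logging, in Python rather than the Java of fk-w3-agent. The class `BufferedLogger` and its methods are hypothetical and do not reflect the actual handler: callers only enqueue, and a single background thread does the slow write.

```python
import io
import queue
import threading


class BufferedLogger:
    """Callers enqueue log lines and return immediately; one background
    thread drains the queue and performs the actual write."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # any object with write(), e.g. an open file
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, line):
        self._queue.put(line)  # does not wait for the write to finish

    def _drain(self):
        while True:
            line = self._queue.get()
            if line is None:   # sentinel: stop the worker
                break
            self._sink.write(line + "\n")

    def close(self):
        self._queue.put(None)
        self._worker.join()    # returns once everything queued is written


sink = io.StringIO()           # stand-in for the log file
logger = BufferedLogger(sink)
logger.log("request 1 served")
logger.log("request 2 served")
logger.close()
```

Because only one thread touches the sink, request handlers never wait on each other for the log file.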
fk-w3-agent: ServiceHandler(s)
Before: Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response

After: Receive request -> Call fk-w3-agent -> Send response

[Diagram: fk-w3-agent talks to Service1 and Service2 on behalf of the PHP process.]
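A sketch of what a single call to the agent could do, assuming it fans out to the services concurrently from a long-lived worker pool. The slides do not spell out the handler's internals, so the concurrency, `call_service1`, `call_service2` and `handle` are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for remote calls made over reused connections.
def call_service1(request):
    return {"service1": request}


def call_service2(request):
    return {"service2": request}


# Long-lived pool shared across requests, so nothing is set up per call.
_pool = ThreadPoolExecutor(max_workers=8)


def handle(request, services=(call_service1, call_service2)):
    """Call every service concurrently and merge the replies."""
    futures = [_pool.submit(service, request) for service in services]
    merged = {}
    for future in futures:
        merged.update(future.result())
    return merged


result = handle("product-123")
```

With calls in flight at the same time, the request waits roughly for the slowest service rather than for the sum of all of them.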
fk-w3-agent: ConfigHandler
Before: Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response (each fetch goes to the database)

After: Receive request -> Fetch all config from fk-w3-agent -> Send response

[Diagram: fk-w3-agent polls the database and caches the config.]
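A minimal sketch of "poll and cache", again in Python. `ConfigCache`, `load_from_database` and the 60-second interval are hypothetical: the point is that requests read an in-memory snapshot while a background thread refreshes it.

```python
import threading


class ConfigCache:
    """Holds a snapshot of all config; a background thread refreshes it."""

    def __init__(self, load_all, interval_seconds=30.0):
        self._load_all = load_all          # callable returning a dict
        self._interval = interval_seconds
        self._snapshot = load_all()        # initial load
        self._stop = threading.Event()
        threading.Thread(target=self._poll, daemon=True).start()

    def _poll(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._snapshot = self._load_all()  # swap in a fresh dict

    def get_all(self):
        return self._snapshot  # served from memory, no database round trip

    def stop(self):
        self._stop.set()


def load_from_database():      # hypothetical stand-in for the real query
    return {"Config1": "a", "Config2": "b"}


cache = ConfigCache(load_from_database, interval_seconds=60.0)
config = cache.get_all()
cache.stop()
```

The trade-off is staleness: a config change is visible only after the next poll.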
Learning from it
• PHP — good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
     • Hurdle for high performance
• Java — stability and performance
• Horses for courses
Website is “slow”!
    (Again)
RCA
• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU
Fixing it
• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity
Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
     • Example: Logged-in user’s name
• Impossible to do effective bucket testing
     • Or at least makes it prohibitively complex
Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching:
              getProductInfo() -> Fetch from CMS

• With caching, cache hit:
              getProductInfo() -> Fetch from Cache

• With caching, cache miss:
              getProductInfo() -> Fetch from Cache -> Fetch from CMS -> Set in Cache
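The three paths above can be sketched as follows. The dict `cache` stands in for Memcache, and `fetch_from_cms` and `cms_fetches` are hypothetical helpers used only to show when the CMS is hit.

```python
cache = {}          # stand-in for Memcache
cms_fetches = []    # records CMS calls, for illustration only


def fetch_from_cms(product_id):   # hypothetical stand-in for the CMS call
    cms_fetches.append(product_id)
    return {"id": product_id, "title": "Product " + product_id}


def get_product_info(product_id):
    product = cache.get(product_id)
    if product is not None:                # cache hit: one fetch
        return product
    product = fetch_from_cms(product_id)   # cache miss: fetch from CMS...
    cache[product_id] = product            # ...then set in cache
    return product


first = get_product_info("p1")    # miss: three steps
second = get_product_info("p1")   # hit: one step
```

A miss costs more than having no cache at all, which is part of the added complexity.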
Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
     • Notification Server — pushes notifications raised by
       CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
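A sketch of pro-active repopulation, assuming each notification carries a product id. The dicts `cms` and `cache` and the handler `on_product_updated` are hypothetical stand-ins, not the Notification Server's actual interface.

```python
cms = {"p1": {"price": 100}}   # stand-in for the CMS (source of truth)
cache = {}                     # stand-in for the persistent cache, no TTL


def on_product_updated(product_id):
    """Handle one product-update notification."""
    # Overwrite the entry instead of deleting it, so readers do not
    # fall through to the CMS while the product is being refreshed.
    cache[product_id] = dict(cms[product_id])


on_product_updated("p1")       # initial population
cms["p1"]["price"] = 90        # product changes in the CMS
on_product_updated("p1")       # notification triggers a refresh
```

With no TTL, a lost notification leaves a stale entry in place until the next update, so delivery has to be reliable.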
Learning from it
• Caching is a powerful tool for performance
  optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable
  performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable
    performance”
Growing up

KID (2012)
Website is “slow”!
RCA
• Why?
  – Search-service is slow (or Reviews-service is slow
    or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking
    processing threads
• Eh?!
Let’s do some math
• Let’s say
   – Mean (or median) response time: 100 ms
   – 8-core server
   – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
   – 95th Percentile response time: 1000 ms
       • Call them “bad requests”
• 4 bad requests in a second
   – Throughput down to 44 rps
• 8 bad requests in a second?
   – Throughput down to 8 rps
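The arithmetic above, worked out under the slide's assumptions: each core completes either ten 100 ms requests or one 1000 ms request per second. The function name `throughput` is ours.

```python
CORES = 8
NORMAL_MS = 100    # mean response time
BAD_MS = 1000      # 95th percentile response time


def throughput(bad_requests):
    """Requests completed in one second when `bad_requests` cores are each
    tied up for the whole second by a 1000 ms request."""
    busy = min(bad_requests, CORES)
    free = CORES - busy
    return busy * (1000 // BAD_MS) + free * (1000 // NORMAL_MS)


rps_no_bad = throughput(0)      # 8 cores x 10 requests
rps_four_bad = throughput(4)    # 4 bad + 4 cores x 10 requests
rps_eight_bad = throughput(8)   # every core is stuck on a bad request
```

Four slow requests, 5% of the normal 80, nearly halve the capacity of the server.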
Fixing it
• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
     • only to pages that depend on it
• Very aggressive timeouts for non-critical
  services
  – Example: Recommendations
     • On a Product page, Search results page etc.
     • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
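A minimal sketch of a timeout with a fallback for a non-critical service. `call_with_timeout`, the stand-in services and the timeout values are hypothetical; a real client would also set a timeout on the socket so that the abandoned call does not keep a worker busy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds, fallback):
    """Return fn()'s result, or `fallback` if it takes too long."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return fallback   # the page renders without this section


def slow_recommendations():   # stand-in for a slow, non-critical service
    time.sleep(0.5)
    return ["r1", "r2"]


def fast_search():            # stand-in for a healthy service
    return ["s1"]


recs = call_with_timeout(slow_recommendations, 0.05, fallback=[])
hits = call_with_timeout(fast_search, 1.0, fallback=[])
```

The product page gives up on recommendations after 50 ms and still renders; only a page that depends on the slow service is affected.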
Learning from it
• Isolate the impact of poorly performing
  services / systems
• Isolate the required from the good-to-have
Website is “slow”!
    (Again)
RCA
• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
     • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product
    information updates
Fixing it
• Separate cluster for receiving product info
  update notifications from the cluster that
  serves users
• Admission control: Don’t let a system receive
  more requests than it can handle
  – Throttling
• Batch the notifications
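A sketch of throttling and batching. The token bucket is one common way to do admission control, not necessarily the one used here, and `TokenBucket`, `batch`, the rates and the sizes are hypothetical. The clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at burst.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over capacity: the caller should reject the request


def batch(items, size):
    """Group notifications so each request carries up to `size` of them."""
    return [items[i:i + size] for i in range(0, len(items), size)]


bucket = TokenBucket(rate=2, burst=2)
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
batches = batch(["p1", "p2", "p3", "p4", "p5"], size=2)
```

The third request at t = 0 is rejected because the bucket is empty; one second later it has refilled. Batching turns five notifications into three requests.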
Learning from it
• Isolate the systems serving internal requests
  from those serving production traffic
• Admission control to ensure that a system is
  isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
Increasing complexity

TEENAGER
THANK YOU
Mistake?
• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect

Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy


Editor's Notes

  • #6 “This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”
  • #46 Website Architecture diagram goes here
  • #48 No