Scaling up to 30M users
Scaling Software, Scaling Data & Scaling People
The Wix Experience

Devcon TLV Feb 2013

 Aviran Mordo
 Server Group Manager
 Wix
 @aviranm
About Wix
Wix in Numbers




•   Wix was founded in 2006
•   30M registered users from most countries
•   Over 1,000,000 new users every month
•   ~1,000,000 new websites every month
•   Over 150 TByte of users’ media files
     – More than 1 billion media files
     – More than 1.5 TByte of files uploaded daily
• Over 300 Servers in 2+1 datacenters + Google + Amazon
Wix Initial Architecture

   • Tomcat, Hibernate, Custom web framework
       –   Everything generated from HBM files
       –   Built for fast development
       –   Stateful login (Tomcat session), EHCache, file uploads
       –   Not considering performance, scalability, fast feature rollout, evaluate
       –   It reflected the fact that we didn’t really know what our business was
       –   We knew that we would need to replace it when we grew
       –   However, we failed to understand how difficult that can be!

   [Diagram: Wix (Tomcat) and MySQL DB]




[Timeline figure, 2006 to 2013: Flash, HTML 5]
Wix Initial Architecture


After two years, we found out that
• Our initial architecture allowed us to progress very fast
• However, as we progressed, we slowed down
• So, we learned that
    –   Don’t worry about ‘building it right from the start’ – you won’t
    –   You are going to replace stuff you are building in the initial stages
    –   Be ready to do it
    –   Get it up to customers as fast as you can. Get feedback. Evolve.
    –   Our mistake was not planning for a gradual rewrite
    –   Build for a gradual rewrite as you learn the problems and find the right
        solutions
Distributed Cache

   Next we added EHCache as a Hibernate 2nd-level cache
   • Why?
       – Because it was in the design
   • How was it?
       –   Black-box cache
       –   How do we know the state of our system?
       –   How do we invalidate the cache?
       –   When do we invalidate it?
       –   How does “operations” manage the cache?
   • Did we really need it? No!
   • We eventually dropped it

Editor & Public Segments

   • The Challenge – Updates to our Server imposed downtime on our
     customers’ websites
       – Any Server or Database update had the potential to bring down all Wix sites
       – This was a symptom of a larger issue
   • The Server served two different concerns
       – Wix Users editing websites
       – Viewing Wix Sites, the sites created with the Wix editor
   • The two concerns require different SLAs
       – Wix Sites should never ever have downtime!
       – Wix Sites should work as fast as possible, always!
       – However, an editing system does not require this level of SLA.


Editor & Public Segments

• The two concerns evolve independently
    – Releases of Editing features should have no impact on the operation of
      existing Wix sites!
• Our Solution
    – Split the Server into two Segments – Public and Editor
• The Public segment serves websites for Wix Users
    – Has a mostly read-only usage pattern – only updated
      when a site is published
    – Simple publishing system
    – Simple and read-only means it is easier to have higher SLA and DRP
    – MySQL used as NoSQL – single large table with XML text fields
• The Editor segment
    – Exposes the Wix Editing APIs, as well as user account and galleries
      management APIs
    – Has a different release schedule from the Public segment

[Diagram: Public (Tomcat) with Public DB; Editor (Tomcat) with Editor DB]
Editor & Public Segments

What we have learned
• MySQL is a damn good NoSQL engine
    –   Our public DB was (mainly) one huge table
    –   Queries & Updates are by primary key
    –   Instead of relations, we use text/xml or text/json columns
    –   No updates for Blobs – immutable data
    –   No Transactions
• Use an indirection table pointing into the blob table
    – Insert a new blob value, update the pointer to the new blob, async delete
• MySQL auto-generated keys cause problems
    – Locks on key generation
    – Require a single instance to generate keys
• We use GUID keys
    – Can be generated by any client
    – No locks in key value generation
    – Enabler for Master-Master replication
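
The indirection-table and GUID-key bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table names (blobs, site_pointer) and functions (open_store, publish, read_site) are hypothetical, not taken from the Wix codebase.

```python
import sqlite3
import uuid


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the public DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blobs (blob_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE site_pointer (site_id TEXT PRIMARY KEY, blob_id TEXT NOT NULL)"
    )
    return conn


def publish(conn: sqlite3.Connection, site_id: str, body: str):
    """Insert a new immutable blob, repoint the site, return the stale blob id."""
    new_id = str(uuid.uuid4())  # GUID made by the client, no DB key generation
    row = conn.execute(
        "SELECT blob_id FROM site_pointer WHERE site_id = ?", (site_id,)
    ).fetchone()
    old_id = row[0] if row else None
    # Blobs are never updated in place: every publish adds a new row.
    conn.execute("INSERT INTO blobs (blob_id, body) VALUES (?, ?)", (new_id, body))
    conn.execute(
        "INSERT OR REPLACE INTO site_pointer (site_id, blob_id) VALUES (?, ?)",
        (site_id, new_id),
    )
    conn.commit()
    return old_id  # the caller can delete this blob later, asynchronously


def read_site(conn: sqlite3.Connection, site_id: str):
    """Follow the pointer to the current blob; both lookups are by primary key."""
    row = conn.execute(
        "SELECT b.body FROM site_pointer p JOIN blobs b ON b.blob_id = p.blob_id "
        "WHERE p.site_id = ?",
        (site_id,),
    ).fetchone()
    return row[0] if row else None
```

Because readers only ever follow the pointer, a reader sees either the old blob or the new one, and the old blob can be removed by a background job once nothing points at it.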
Wix on Managed Hosting




   Co-Location                              | Managed Hosting                        | Cloud
   Own and maintain your own hardware       | Lease both hardware and maintenance    | Instantly lease hardware
   Provisioning == buy and deliver your     | Overnight provisioning                 | Instant provisioning;
     new server                             |                                        |   unlimited resources
   Reliable software on reliable hardware   | Reliable software on reliable hardware | Reliable software on unreliable hardware


Wix Media Segment

   • The Challenge – Our static storage reached over 500 GByte of small files
       – The “upload to app server, post-process files, copy to lighttpd server, serve by
         lighttpd” pattern proved inefficient, slow and error-prone
       – Disk IO became slow and inefficient as the number of files increased
       – We needed a solution that could grow with the
            • number of HTTP connections
            • number of files
       – We needed control over caching and HTTP headers
   • We needed dynamic image manipulations
       –   Rebuilding a few million media files is not simple




Prospero – Wix Media Storage

• Our Solution
   – Lighttpd based
   – Sharded on the file name
   – Two copies of each file
   get 37D815B5.jpg  →  go to the servers for the range containing 37  →  fallback if not found

   Range    Servers shown under the range (diagram arrows labelled HTTP)
   00-1f    0.static, 1.static
   20-3f    2.static, 3.static
   40-5f    4.static, 5.static
   60-7f    6.static, 7.static
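
A minimal sketch of the file-name sharding shown above. It assumes that the first two hex characters of the file name select the range, that each range is 0x20 prefixes wide, and that the two servers drawn under a range hold the two copies of each file. The function name shard_hosts and the continuation of the numbering beyond 7.static are assumptions; the slide only shows the first four ranges.

```python
RANGE_WIDTH = 0x20  # each range on the slide spans 32 prefixes (00-1f, 20-3f, ...)


def shard_hosts(file_name: str) -> tuple[str, str]:
    """Return the (first, fallback) host names for a media file name."""
    prefix = int(file_name[:2], 16)        # "37D815B5.jpg" -> 0x37
    range_index = prefix // RANGE_WIDTH    # 0x37 falls in range 1 (20-3f)
    first = f"{2 * range_index}.static"
    fallback = f"{2 * range_index + 1}.static"
    return first, fallback
```

For the slide's example, shard_hosts("37D815B5.jpg") returns ("2.static", "3.static"): try the first host, and fall back to the second copy if the file is not found.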
Prospero – Wix Media Storage

• Dynamic Image processing
    – Picture Pyramid
    – Picture resize, crop and sharpen “on the fly”
    – Thumbnail generation
• Eventual Consistency solutions scale
    – But you have to build for when eventual consistency is not consistent
• Media files caching headers are critical
    – max-age, ETag, If-Modified-Since, etc.
    – Think about how to tune those parameters for media files, as per your specific needs
• We tried Amazon S3 and Google for secondary storage
    – However, Amazon proved unreliable (connections, availability)
• We found that using a CDN in front of Prospero is very effective
• Initially, files were stored on the filesystem
• (T) We added a Tokyo Tyrant backend for small files
• (M) We added a Memcached (Redis) layer for “in transit” files
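
To make the caching-headers bullet above concrete, here is a small sketch of how a server could answer a conditional request for a media file. It is not Prospero's implementation: the function media_response, the one-year max-age default and the SHA-1 based ETag are all assumptions chosen for illustration.

```python
import hashlib


def media_response(body: bytes, if_none_match, max_age: int = 31536000):
    """Return (status, headers, payload) for a media file request."""
    # Derive the ETag from the content, so identical bytes get the same tag.
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    headers = {"ETag": etag, "Cache-Control": f"public, max-age={max_age}"}
    if if_none_match == etag:
        return 304, headers, b""   # client copy is still valid, send no body
    return 200, headers, body
```

A long max-age keeps browsers and the CDN from asking at all, while the ETag turns the requests that do arrive into cheap 304 responses. Which values suit a given media type is exactly the tuning question the slide raises.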
Prospero – Wix Media Storage

• Our current architecture



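   Request path, as drawn in the diagram (M and T mark the Memcached and Tokyo Tyrant layers):

   get 37D815B5.jpg
     → CDN
     → if not in CDN: Austin (servers labelled x36, x36, x32, each with M and T)
     → first fallback: Chicago (servers labelled x36, x36, x32, each with M and T)
     → second fallback: Google Cloud Storage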
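
The lookup order in this architecture can be sketched as a simple fallback chain. The tier names come from the slide; the function fetch_with_fallback and the in-memory dictionaries standing in for each tier are hypothetical.

```python
def fetch_with_fallback(file_name, tiers):
    """Try each (name, store) tier in order; return (tier_name, data) or None."""
    for tier_name, store in tiers:
        data = store.get(file_name)
        if data is not None:
            return tier_name, data
    return None


# Hypothetical in-memory stand-ins for the tiers named on the slide.
tiers = [
    ("CDN", {}),
    ("Austin", {}),
    ("Chicago", {"37D815B5.jpg": b"jpeg-bytes"}),
    ("Google Cloud Storage", {"37D815B5.jpg": b"jpeg-bytes"}),
]
```

With these stand-ins, a request for 37D815B5.jpg misses the CDN and Austin and is answered by Chicago; Google Cloud Storage is consulted only if every earlier tier misses.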
CDN

• Use a CDN!
• CDN acts as a great connection manager
    – We have CDN hit ratios of over 99.9%
• Use the “Cache Killer” pattern
    –   http://static.wix.com/client/css/viewer.css?v=327
    –   http://static.wix.com/client/1.3.2/css/viewer.css
    –   Makes flushing files from the CDN redundant
    –   Enabler for longer caching periods
• There are many vendors
    – We started with 1 CDN vendor
    – We are now working with two CDN vendors
    – Different CDN vendors have advantages in different geographies
• Tune HTTP Headers per CDN Vendor
    – CDN Vendors interpret HTTP headers differently
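
The two “Cache Killer” URL shapes listed above can be produced by a small helper. The helper name versioned_url and its parameters are hypothetical; only the two URL shapes come from the slide.

```python
def versioned_url(base: str, path: str, version: str, in_path: bool = True) -> str:
    """Build a cache-killer URL: the version goes in the path or in a ?v= query."""
    base = base.rstrip("/")
    path = path.lstrip("/")
    if in_path:
        return f"{base}/{version}/{path}"
    return f"{base}/{path}?v={version}"
```

Because every release gets a new URL, the old file can stay cached for as long as the headers allow and never needs to be flushed from the CDN.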
Development Velocity

   • The Challenge – Our codebase became large and entangled
       – Feature rollout became harder over time, requiring longer and longer manual
         regression
       – The longer the regression was, the harder it became to make “a good release”
       – Strange full-table-scan queries generated by Hibernate, which we still have no
         idea what code is responsible for…
   • The solution
       –   Mid 2010 – Wix Framework – modern base libraries
       –   Beginning 2011 – CI / CD / TDD techniques + DevOps culture
       –   Mid 2011 – Scala
       –   SOA Architecture (not WSDL)

[Timeline figure, 2006 to 2013: Flash, HTML 5, Framework, CI / CD / TDD + DevOps, Scala]
People are the key

• Train the people you already have
    – We sent our entire QA department to learn Java
    – Developers learn TDD and CI/CD methodologies.
• Hiring the right people is key to success
    –   Hire only the best developers (only seniors)
    –   Don’t rely on the interview alone; test actual coding
    –   Anyone who interviews can drop a candidate
    –   Hire people who will challenge you (no “yes men”)
    –   Get people you can trust with “root” access to production
• Never stop hiring
    – If we find an excellent person, we will create a position for them even if we do
      not have one open.
• Wix is doubling its size every year
    – Yes, we are currently hiring.
    – We’re considering hiring and training junior developers.
Wix’s CI / CD / TDD + DevOps model

• Abandon the “VERSION” paradigm – move to a feature-centric life cycle
• Make small and frequent releases as soon as possible
    – Today we release about 10 times a day, gaining velocity
• Empower the developer
    –   The developer is responsible from product idea to 100,000 active users
    –   Remove every obstacle in the developer’s path
    –   Big cultural change from waterfall – affects the whole company
    –   The developer is responsible for operating their own app
• Automate everything – CI/CD/TDD
    – CI – Continuous Integration
    – CD – Continuous Delivery / Deployment
    – TDD – Automated unit-tests, integration tests, GUI tests
• Measure Everything (The lean startup way)
    – A/B test every new feature
    – Monitor real KPIs (business, not CPU)
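
One common way to A/B test a new feature is deterministic bucketing by user id. The sketch below is an assumption about how such a split could be done, not a description of Wix's system; the function in_experiment and its parameters are hypothetical.

```python
import hashlib


def in_experiment(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in the test group for a feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket 0..99
    return bucket < percent
```

Hashing the feature name together with the user id keeps each user in the same group across requests, while different features split the user base independently. Business KPIs can then be compared between the two groups.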
CI / CD @ Wix – Release Process

• Make an RC
   – Runs build, unit-tests, integration tests
• Deploy as GA
   – Using Chef, Noah, Artifactory
   – Runs Self-Tests
• Monitor
   – Deployment, NewRelic, App-Info, Recent Events
• Rollback
Products we’ve built (partial list)
   • Wix Mobile
   • Wix HTML5
       – Full HTML 5 support – total rewrite of our Flash product
   • Third Party Applications (TPAs)
       – With over 200,000 installations in the first 3 months
   • Answers
       – Wix’s unique support system
   • Wix Billing System (PCI Compliant)
       – Supports complex business models for TPAs
       – Supports diverse geographies
   • eCommerce
       – Based on Magento
   • BI

[Timeline figure, 2006 to 2013: Flash, HTML 5, Mobile, Answers, HTML 5, App Builder, eCommerce, TPA, Billing]
Wix Hackathon

• http://www.wix.com/publicevents/hackathon2013

Editor's Notes

  • #15 Managed Hosting costs: $0.13 / GByte storage (counting two copies), which includes 100 TByte traffic per host (effectively free traffic). Cloud costs: S3, $0.06 / GByte (Standard Storage) + $0.05 / GByte
  • #21 Akamai (Cotendo) and Level3
  • #27 Key performance indicators