A Big(Query) frog
in a small pond
- Data Science Club #1
Summer 2017/2018
Speaker: Jakub Motýľ
Last updated: 19.2.2018
Join the discussion at slido.com with code #Q730
Today’s agenda
1. Introduction
2. How the tool came to be
   1. Game analytics in a nutshell
   2. Our in-house solution
3. Architecture overview
4. Collecting the data
   1. Gameplay events
   2. Marketing campaign attribution
   3. Ad-related data
5. Storing and processing
   1. Short intro to BigQuery
   2. Raw data
   3. Flattened views
   4. Materialized views
   5. CBDW - Our Swiss Army knife
6. Usable outputs
   1. Visualization for end users
   2. Iterative prototyping of new metrics
7. Machine learning in big data game analytics
   1. Business value
   2. From the issue to the solution
8. A real world ML example
   1. Understanding the business issue
   2. Preparing our data
   3. Feature engineering and selection
   4. Training and tuning
   5. Evaluation and validation
9. Q&A section
1. Introduction
Cellense & BuffPanel
BuffPanel
● Steam marketing analytics
● Big data attribution platform and PC+console gamedev consulting
● Product of Cellense
Cellense
● “Making mobile games grow”
● Games sector BI & analytics
● Formerly Infinario, branch of Exponea focused on gaming
● At 15 employees and growing
The founder & the speaker
Ivan Trančík
● Serial big data entrepreneur
● Co-founder of Infinario (now Exponea)
● Founder & CEO at Cellense & BuffPanel
Jakub Motýľ (that’s me)
● Former programmer at Vacuumlabs
● 2 years as game analyst at Cellense
● Now director of BuffPanel
Our partners & clients
2. How the tool came to be
Game analytics in a nutshell
● Data-driven, informed decisions instead of instinct-driven design
● Black-box solutions offer basic performance monitoring (KPIs)
● What about in-depth analytics? Onboarding, game economy, …
● AAA studios all have in-house analysts & keep their secrets
● What should a fast-growing mid-size studio do?
Our in-house solution should support
● Basic BI & in-depth analytics
● Custom metrics, reports and dashboards
● Scalable architecture for a reasonable price
● Secure storage & access to raw event data
● Integrations, integrations, integrations
● Usable by all stakeholders
So the platform was born...
Cellense Big Data Warehouse
Badland Brawl
● A brand new game by Frogmind, part of Supercell, which is the largest mobile game publisher by revenue
● Currently in soft-launch
Hill Climb Racing 2
● #1 iOS and Android app of Dec 2016
● 110 million players total
● Hundreds of millions of data points daily
● Daily operational costs under $50
3. Architecture overview
Architecture overview
[Diagram: three stages, from "Collecting the data" through "Storing and processing" to "Usable outputs"]
● Data sources: Game client SDK (Firebase SDK), Attribution service (Appsflyer), Ad providers
● Firebase backend: daily export of raw events to BigQuery
● Client’s backend: daily export of processed attribution and ad data to BigQuery
● Cellense Big Data Warehouse: daily ETL jobs that process and combine raw data into materialized views
● Periscope Data and the BigQuery console: work on the processed materialized views
4. Collecting the data
Gameplay events
● Event vehicle_purchase
● Event race_finished
What do you mean by gameplay events?
● Gameplay events fire whenever a player does a specific action in-game
● Examples include: app start, in-app purchase, level completion, …
● Tracing the whole player journey enables in-depth analysis
How do we track them?
● Using Google’s Firebase SDK for Android & iOS (integrated in app)
● Events are sent straight to Firebase Analytics
How many events are we talking about?
● 100+ event types, totalling 300+ million events (~200 GB) daily
Marketing campaign attribution
What do you mean by marketing campaign attribution?
● When players install the game, we know whether they clicked an ad for it beforehand
● This enables the marketers to evaluate the quality of acquisition sources
How do we track them?
● Using Appsflyer SDK dedicated to mobile attribution (integrated in app)
● Data is sent to the client’s backend for internal marketing evaluation
How many events are we talking about?
● Depending on the game, 5-70% of players come from ads, resulting in several thousand events daily
Ad-related data
What do you mean by ad-related data?
● Ads are a very common revenue stream in games, in addition to IAPs
● Ads vary in nature, ranging from game ads to TV oracle ads
How do we track them?
● SDKs from different ad providers track in-app ad views and report them to their own servers
● Individual ad providers report the daily sums to the client’s backend via HTTP postbacks
How many events are we talking about?
● Very few events, usually aggregated daily revenue from different sources
5. Storing and processing
Short intro to BigQuery
What is BigQuery?
● Serverless database solution offered by Google Cloud Platform
● Big data database compatible with SQL language syntax
● Loading, copying & exporting data is free
● Very low storage costs: 2¢ per GB per month
● Very cheap querying: $5 per TB of data scanned
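To put these prices in perspective, here is a rough back-of-the-envelope sketch. It uses the two prices quoted on the slide and the ~200 GB daily raw volume mentioned earlier; the monthly billing period for storage and the 1 TB = 1024 GB conversion are assumptions, and real bills depend on the current pricing model.

```python
# Prices as quoted on the slide; "per month" for storage is an assumption.
STORAGE_USD_PER_GB_MONTH = 0.02
QUERY_USD_PER_TB = 5.00
GB_PER_TB = 1024  # assumption: binary terabytes


def monthly_storage_cost(gb_stored: float) -> float:
    """Cost of keeping gb_stored gigabytes for one month."""
    return gb_stored * STORAGE_USD_PER_GB_MONTH


def query_cost(gb_scanned: float) -> float:
    """Cost of a query that scans gb_scanned gigabytes."""
    return gb_scanned / GB_PER_TB * QUERY_USD_PER_TB


daily_raw_gb = 200  # approximate daily raw event volume from the talk
print(f"Storing 30 days of raw data: ${monthly_storage_cost(daily_raw_gb * 30):.2f} per month")
print(f"Scanning one full day of raw data once: ${query_cost(daily_raw_gb):.2f}")
```

The takeaway is that a single scan is cheap, but dashboards that rescan weeks of raw data many times a day add up quickly.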
Raw data
Why do we store raw data?
● Human and machine errors can lead to data discrepancies
● Many metrics and reports rely on preceding events and their attributes
● After fixing bugs, reprocessing historical data is often necessary
How do we store the raw data?
● As it comes: if an event body is malformed, the events can sometimes be recovered with some ETL magic and then reprocessed
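A minimal sketch of the "store as it comes" idea: parse what can be parsed and quarantine the rest verbatim, so nothing is lost and a later fix can reprocess it. The JSON encoding, the `name` field, and both function names are assumptions for illustration; the talk does not describe the actual event format.

```python
import json


def parse_event(raw: str):
    """Return (event, None) on success, or (None, raw) so the raw body is kept."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None, raw
    # Hypothetical minimal schema check: an event is an object with a name.
    if not isinstance(event, dict) or "name" not in event:
        return None, raw
    return event, None


def split_batch(raw_lines):
    """Separate parseable events from malformed bodies without dropping anything."""
    good, quarantined = [], []
    for line in raw_lines:
        event, bad = parse_event(line)
        if event is not None:
            good.append(event)
        else:
            quarantined.append(bad)
    return good, quarantined


batch = [
    '{"name": "race_finished", "user": "u1"}',
    '{"name": "vehicle_purchase", "user": ',  # truncated body
    '[1, 2]',                                  # valid JSON, but not an event
]
good, quarantined = split_batch(batch)
print(len(good), "parsed,", len(quarantined), "kept for later recovery")
```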
Flattened views
What are flattened views?
● Raw data from Firebase is stored in a compressed format
● A transformation is needed before BigQuery can work with it effectively
[Figure: before (raw) vs. after (flattened)]
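The transformation can be pictured with a small sketch. It assumes a simplified nested record (one row per user, holding a list of events whose parameters are key/value pairs) and turns it into one flat row per event. The field names are hypothetical and do not reproduce the real Firebase export schema.

```python
def flatten(raw_row: dict) -> list[dict]:
    """Turn one nested per-user record into one flat row per event."""
    rows = []
    for event in raw_row.get("events", []):
        flat = {
            "user_id": raw_row["user_id"],
            "event_name": event["name"],
            "timestamp": event["timestamp"],
        }
        # Each key/value parameter becomes its own column.
        for param in event.get("params", []):
            flat[f"param_{param['key']}"] = param["value"]
        rows.append(flat)
    return rows


raw_row = {
    "user_id": "u1",
    "events": [
        {
            "name": "race_finished",
            "timestamp": 1518998400,
            "params": [{"key": "track", "value": "forest"}, {"key": "position", "value": 2}],
        },
        {
            "name": "vehicle_purchase",
            "timestamp": 1518998460,
            "params": [{"key": "vehicle", "value": "jeep"}],
        },
    ],
}
flat_rows = flatten(raw_row)
for row in flat_rows:
    print(row)
```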
Materialized views
Why materialized views?
● Querying raw data would be very costly (often terabytes of data)
● Minimized tables (MVs) allow for fast and cost-effective querying
How is it cost-effective?
● We process raw data daily and build MVs for different use cases
● We batch the data by day and calculate all metrics for these batches
● In case of a defect, we only reprocess the affected days of the materialized views
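A toy sketch of daily batching with selective reprocessing. The materialized view here is just a dictionary of event counts keyed by day and event name; the function names and event fields are hypothetical, and the real jobs run as queries in BigQuery rather than in Python.

```python
from collections import defaultdict


def build_daily_mv(events):
    """Aggregate raw events into one count per (day, event name)."""
    counts = defaultdict(int)
    for event in events:
        counts[(event["day"], event["name"])] += 1
    return dict(counts)


def reprocess_days(mv, events, affected_days):
    """Rebuild only the affected days and keep every other day untouched."""
    kept = {key: value for key, value in mv.items() if key[0] not in affected_days}
    rebuilt = build_daily_mv(e for e in events if e["day"] in affected_days)
    return {**kept, **rebuilt}


# Hypothetical raw events; field names are illustrative only.
raw_events = [
    {"day": "2018-02-17", "name": "race_finished"},
    {"day": "2018-02-18", "name": "race_finished"},
    {"day": "2018-02-18", "name": "vehicle_purchase"},
]
mv = build_daily_mv(raw_events)

# Suppose a late fix recovers a missing event for 2018-02-18 only.
raw_events.append({"day": "2018-02-18", "name": "race_finished"})
mv = reprocess_days(mv, raw_events, affected_days={"2018-02-18"})
print(mv)
```

Only the rows for the affected day are recomputed, which is what keeps the cost of fixing a defect proportional to the damage rather than to the full history.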
[Diagram: MV processing. Inputs: Raw data, Flattened view, Cheater tables, Daily Appsflyer data, Lifetime attribution view. Output: a set of materialized views (user state, purchases, races, vehicles bought, ...)]
What if I have cheaters in my raw data?
● First, we build cheater tables (MV) which allow us to filter the events
● After that we process the filtered raw data
So we need some kind of a dependency system?
● Yes, besides cheaters we also enrich tables with marketing data, ad data, ...
● To ensure the MVs are processed in the correct order, another component comes into play
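Ordering jobs by their dependencies is a topological sort. The sketch below uses Python's standard `graphlib` module; the job names echo the diagram labels, but the exact edges between them are assumptions made for illustration, not the real job graph.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can run.
# The edges are hypothetical.
dependencies = {
    "cheater_tables": {"raw_data"},
    "flattened_view": {"raw_data"},
    "lifetime_attribution": {"daily_appsflyer_data"},
    "purchases": {"flattened_view", "cheater_tables", "lifetime_attribution"},
    "races": {"flattened_view", "cheater_tables"},
}

# static_order() yields every job after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the jobs depend on each other in a loop, which is a useful sanity check before scheduling anything.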
CBDW - Our Swiss Army knife
Intro - Cellense Big Data Warehouse
● Full-stack TypeScript web application (Angular 5 + NestJS)
Why is there a need for another component?
● Monitoring BigQuery jobs’ status
● Dependency management and periodic job scheduling
Is that all?
● Monitoring event inflow
● Running ad-hoc jobs in case reprocessing is needed
● Query cost estimation and syntax validation
● User access management
6. Usable outputs
Visualization for end users
What is Periscope Data?
● A visualization tool tightly integrated with Google BigQuery
● It enables us to create a wide range of visual reports, graphs and charts on top of data queried directly from MVs in BigQuery
Periscope example (sample data)
Iterative prototyping of new metrics
Prototyping anything new must be a real hassle, right?
● No, not really
● By combining the trusty BigQuery console with our processed MVs, prototyping and testing are quite simple
7. Machine learning in big data game analytics
Business value
What value can be gained by using ML in game analytics?
● Certain ML models can help us understand the importance and influence of specific game design elements in the context of a given problem
How is this method different from standard data analysis?
● When correctly applied, ML methods are much less prone to cognitive biases
● Operating on a wider range of inputs, ML methods often reveal unexpected causality chains that wouldn’t even occur to a human analyst
From the issue to the solution
So how should we proceed?
1. Understanding the business issue
2. Preparing your data
3. Feature engineering and selection
4. Training and tuning
5. Evaluation and validation
6. Presentation and visualization
Source: Udacity.com (2016), Cross-industry standard process for data mining
8. A real world ML example
Understanding the business issue
So what is our problem?
● If the players don’t come back to the game, they can’t generate revenue
● This is called the player churn problem, the #1 issue in f2p games
What do churn and retention mean?
● Churn tells us the percentage of players leaving over time
● Retention is the percentage of players retained in the game over time
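As a concrete illustration, the sketch below computes day-N retention, one common way to measure "players retained over time": the share of an install cohort that is active exactly N days after installing. Treating churn as the complement of retention is a simplification, and the function and data are hypothetical rather than the metric definitions used in the talk.

```python
from datetime import date


def day_n_retention(install_dates, active_dates, n):
    """Share of players who were active exactly n days after their install."""
    cohort = list(install_dates)
    if not cohort:
        return 0.0
    retained = 0
    for player in cohort:
        installed = install_dates[player]
        if any((day - installed).days == n for day in active_dates.get(player, ())):
            retained += 1
    return retained / len(cohort)


# Hypothetical sample: four installs, three players with later activity.
installs = {
    "a": date(2018, 2, 1),
    "b": date(2018, 2, 1),
    "c": date(2018, 2, 2),
    "d": date(2018, 2, 2),
}
activity = {
    "a": {date(2018, 2, 2), date(2018, 2, 8)},
    "b": {date(2018, 2, 2)},
    "c": {date(2018, 2, 3)},
}

d1 = day_n_retention(installs, activity, 1)
d7 = day_n_retention(installs, activity, 7)
print(f"D1 retention {d1:.0%}, D1 churn {1 - d1:.0%}")
print(f"D7 retention {d7:.0%}, D7 churn {1 - d7:.0%}")
```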
What are the specific questions we want answered?
● What are the main features separating a retained from a churned player?
● How do these particular features affect a player’s retention rate?
● How does each feature’s importance change over time spent in-game?
What do YOU think is the most important factor for retaining a player?
A. Bring the player back every single successive day
B. Make the player watch at most 5 ads a day
C. Reward the player for every significant achievement in-game
Preparing our data
How do we ensure that the model is based on sound data we can trust?
● Exploration and quality control of data
● Additional pre-processing and filtering
What to look at in exploration and quality control of data?
● Negative values
○ e.g. how can a player’s total playtime be negative?
● Distributions
○ expect a lot of log-normal or exponential distributions
● Outlier treatment
○ e.g. players with 1 million coins on hand, while 99% have < 1k and the mean is 300
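These checks can be sketched in a few lines. The example flags negative values and values above the 99th percentile, and compares mean and median to show how a single outlier skews the mean. The nearest-rank percentile and the 99% cut-off are illustrative assumptions, not the thresholds used in the talk.

```python
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    rank = max(1, -(-pct * len(sorted_values) // 100))  # integer ceiling
    return sorted_values[rank - 1]


def quality_report(values, pct=99):
    """Split values into negatives and upper-tail outliers, with basic stats."""
    negatives = [v for v in values if v < 0]
    valid = sorted(v for v in values if v >= 0)
    threshold = percentile(valid, pct)
    outliers = [v for v in valid if v > threshold]
    return {
        "negatives": negatives,
        "outliers": outliers,
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
    }


# Hypothetical coin balances: 99 typical players, one extreme, one broken value.
coins = [300] * 99 + [1_000_000, -5]
report = quality_report(coins)
print(report)
```

The gap between the mean and the median in the output is a quick hint that the distribution has a heavy tail worth inspecting.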
What does the additional pre-processing and filtering entail?
● Selection
○ remove players who don’t finish the tutorial & have positive play time in other game modes
● Profiling
○ understand quality of the data
○ e.g. relation between sessions and matches
● Segmentation
○ find new relations between groups of players with similar characteristics
● Cleansing
○ remove samples with empty/negative values, fill in partially with mean values
● Validation
○ central tendencies - mean, median, ...
○ variability - standard deviation, ranges, ...
● Target balancing
○ to avoid neglecting important segments, we upsample/downsample groups
○ in our case we would like a split of around 50/50
● Transformation
○ some methods require normal instead of log-normal distributions to avoid bias
○ not needed in our case
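Target balancing can be sketched as random downsampling of the larger class until both classes are the same size. This is only one of several balancing strategies; the `churned` label, the function name, and the fixed seed are assumptions for illustration.

```python
import random


def downsample_to_balance(samples, label_key="churned", seed=0):
    """Randomly drop samples from the larger class to reach a 50/50 split."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    positives = [s for s in samples if s[label_key]]
    negatives = [s for s in samples if not s[label_key]]
    minority, majority = sorted((positives, negatives), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 90 churned players and 10 retained ones.
players = [{"id": i, "churned": i < 90} for i in range(100)]
balanced = downsample_to_balance(players)
churned_count = sum(1 for p in balanced if p["churned"])
print(len(balanced), "samples,", churned_count, "churned")
```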
Feature engineering and selection
What are our features and how do we find them?
● Features are selected metrics representing the underlying problem
● We pick the metrics to use based on common sense
And what features are those for our problem?
● Gameplay statistics - matches played, number of sessions, …
● Total statistics - achieved rank, total playtime, …
● Resources used - coins, gems
● Rewards earned - resources, chests, skins, …
● Social stats, Ad stats, Geo data, ...
Sounds like an awful lot of features, no?
● Yes, to be exact we use around 120 features for this problem
● Many already have MVs as we use them in standard analytics
Why so many? Wouldn’t, say 20, suffice?
● We use as many relevant features as we can to avoid scoping bias
● The best features can be extracted using appropriate methods
Training and tuning
So how do we extract the best features?
● We use the random forest method to construct randomized decision trees
● The trees map the various options the players can take during gameplay
● We average the results to find the best option
That has to be a whole rainforest, with 120 features, right?
● About 100 random decision trees should be enough
● Even a higher number of random trees could be biased
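To make the building block concrete: each decision tree repeatedly picks the feature threshold that best separates retained from churned players. The sketch below shows that single step for one feature using Gini impurity, a common split criterion; it is not a random forest and not necessarily the criterion used in the talk. A forest would repeat this over random subsets of rows and features and combine many such trees. The feature and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = churned, 1 = retained)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Find the split 'value <= threshold' with the lowest weighted impurity."""
    best = (None, gini(labels))  # no split at all is the baseline
    n = len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature: distinct days played in the first week.
days_played = [1, 1, 2, 3, 5, 6, 7, 7]
retained = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(days_played, retained)
print(f"split at days_played <= {threshold}, weighted Gini {impurity:.2f}")
```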
How do we know that our results are not biased, then?
● We use a gradient-boosted tree model to cross-reference the results
● After training it on the same dataset, we get a comparable result
Why do we only use methods working with decision trees?
● For our purposes, we need to interpret the result and quantify the contribution of different game features to retention and churn
● Although other black-box models may be more successful, they can’t be interpreted in this manner
Evaluation and validation
That seems complicated, how do we interpret it?
● We don’t, it’s too complicated and also probably overfitted
● We proceed with pruning the tree
But how do we prune the tree without losing information?
● We prune sections which provide little power to classify players
Does the pruning have any other effects?
● We avoid overfitting the model and improve the generalization of knowledge
● It also makes the visualization much more readable
Are we now 100% convinced that the model reflects reality?
● Let’s do a 10-fold cross-validation, just to be sure
Okay, I fold. Anything else?
● That’s it! Now we just have to take a look at the result
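For readers unfamiliar with the term, k-fold cross-validation splits the data into k parts, trains on k-1 of them and tests on the remaining one, rotating until every part has served as the test set once. The sketch below only generates the index splits; the function name is hypothetical, and in practice the samples would usually be shuffled first and a model trained and scored inside the loop.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: train on {len(train)} samples, test on {test}")
```

Averaging the score over all folds gives a more stable estimate than a single train/test split, because every sample is used for testing exactly once.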
[Figure: visualization of the result; legend: Churned, Retained]
So what did we learn?
● The most important thing is to bring the player back every successive day
● After just a single missed day, player churn increases greatly
● ...and also that “intuitive” doesn’t always mean true
What should we do, now that we know the culprit for churn in our game?
● Design an A/B test and verify the hypothesis
● Make better use of existing assets to achieve higher day-to-day retention
● Find the next issue!
9. Questions and Answers
Pssst, we’re hiring. If you’re interested in games
& data, or know someone, drop us a message
at jobs@cellense.com or jobs@buffpanel.com.
Also check out job positions online here & here.

More Related Content

PDF
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
PDF
Why Successful Games Need Analytics
PPTX
Petabytes to Personalization - Data Analytics with Qubit and Looker
PDF
Shift AI 2020: Business benefits of privacy-preserving synthetic data | Sebas...
PDF
Shift AI 2020: Building AI-first Products - Ehsan Yousefzadeh (AIG Investments)
PDF
[Webinar] Interacting with BigQuery and Working with Advanced Queries
PDF
PykQuery.js
PDF
Introduction to Pykih's Services
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Why Successful Games Need Analytics
Petabytes to Personalization - Data Analytics with Qubit and Looker
Shift AI 2020: Business benefits of privacy-preserving synthetic data | Sebas...
Shift AI 2020: Building AI-first Products - Ehsan Yousefzadeh (AIG Investments)
[Webinar] Interacting with BigQuery and Working with Advanced Queries
PykQuery.js
Introduction to Pykih's Services

Similar to A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel (20)

PDF
danmcclary-pspresentation-katieboyle-171030115522.pdf
PDF
Big data paris 2011 is cool florian douetteau
PPTX
Introduction to big data
PPTX
Big Data, Big Investment
PDF
Big databigideasit4bc
PDF
big-datagroup6-150317090053-conversion-gate01.pdf
PDF
What Is Big Data How Big Data Works.pdf
PDF
Business with Big data
PPTX
PPTX
What is Big Data?
PPTX
What is Big Data?
PPTX
PPT
IT FUTURE- Big data
PDF
How to build and run a big data platform in the 21st century
PPTX
Online Games Analytics - Data Science for Fun
PPTX
Infographics and big data
PPTX
PDF
What Is Big Data How Big Data Works.pdf
PPTX
Big data technologies with Case Study Finance and Healthcare
danmcclary-pspresentation-katieboyle-171030115522.pdf
Big data paris 2011 is cool florian douetteau
Introduction to big data
Big Data, Big Investment
Big databigideasit4bc
big-datagroup6-150317090053-conversion-gate01.pdf
What Is Big Data How Big Data Works.pdf
Business with Big data
What is Big Data?
What is Big Data?
IT FUTURE- Big data
How to build and run a big data platform in the 21st century
Online Games Analytics - Data Science for Fun
Infographics and big data
What Is Big Data How Big Data Works.pdf
Big data technologies with Case Study Finance and Healthcare
Ad

Recently uploaded (20)

PPTX
Amdahl’s law is explained in the above power point presentations
PPTX
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
PPTX
Feature types and data preprocessing steps
PPTX
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
PDF
Introduction to Power System StabilityPS
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
PPTX
wireless networks, mobile computing.pptx
PDF
Unit1 - AIML Chapter 1 concept and ethics
PPTX
PRASUNET_20240614003_231416_0000[1].pptx
DOC
T Pandian CV Madurai pandi kokkaf illaya
PPTX
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PPTX
ai_satellite_crop_management_20250815030350.pptx
PDF
Design of Material Handling Equipment Lecture Note
PDF
Soil Improvement Techniques Note - Rabbi
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PDF
MLpara ingenieira CIVIL, meca Y AMBIENTAL
PPTX
Management Information system : MIS-e-Business Systems.pptx
PPTX
Measurement Uncertainty and Measurement System analysis
PPTX
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
Amdahl’s law is explained in the above power point presentations
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
Feature types and data preprocessing steps
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
Introduction to Power System StabilityPS
Exploratory_Data_Analysis_Fundamentals.pdf
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
wireless networks, mobile computing.pptx
Unit1 - AIML Chapter 1 concept and ethics
PRASUNET_20240614003_231416_0000[1].pptx
T Pandian CV Madurai pandi kokkaf illaya
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
ai_satellite_crop_management_20250815030350.pptx
Design of Material Handling Equipment Lecture Note
Soil Improvement Techniques Note - Rabbi
distributed database system" (DDBS) is often used to refer to both the distri...
MLpara ingenieira CIVIL, meca Y AMBIENTAL
Management Information system : MIS-e-Business Systems.pptx
Measurement Uncertainty and Measurement System analysis
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
Ad

A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel

  • 1. A Big(Query) frog in a small pond - Data Science Club #1 Summer 2017/2018 Speaker: Jakub Motýľ Last updated: 19.2.2018
  • 2. Join the discussion at slido.com with code #Q730 Today’s agenda 1. Introduction 2. How the tool came to be 1. Game analytics in a nutshell 2. Our in-house solution 3. Architecture overview 4. Collecting the data 1. Gameplay events 2. Marketing campaign attribution 3. Ad-related data 5. Storing and processing 1. Short intro to BigQuery 2. Raw data 3. Flattened views 4. Materialized views 5. CBDW - Our swiss knife 6. Usable outputs 1. Visualization for end users 2. Iterative prototyping of new metrics 7. Machine learning in big data game analytics 1. Business value 2. From the issue to the solution 8. A real world ML example 1. Understanding the business issue 2. Preparing our data 3. Feature engineering and selection 4. Training and tuning 5. Evaluation and validation 9. Q&A section 2
  • 3. Join the discussion at slido.com with code #Q730 1. Introduction 3
  • 4. Join the discussion at slido.com with code #Q730 Cellense & BuffPanel 1. Introduction 4 ● Steam marketing analytics ● Big data attribution platform and PC+console gamedev consulting ● Product of Cellense ● “Making mobile games grow” ● Games sector BI & analytics ● Formerly Infinario, branch of Exponea focused on gaming ● At 15 employees and growing
  • 5. Join the discussion at slido.com with code #Q730 The founder & the speaker 1. Introduction Ivan Trančík ● Serial big data entrepreneur ● Co-founder of Infinario (now Exponea) ● Founder & CEO at Cellense & BuffPanel Jakub Motýľ (that’s me) ● Former programmer at Vacuumlabs ● 2 years as game analyst at Cellense ● Now director of BuffPanel 5
  • 6. Join the discussion at slido.com with code #Q730 Our partners & clients 1. Introduction 6
  • 7. Join the discussion at slido.com with code #Q730 2. How the tool came to be 7
  • 8. Join the discussion at slido.com with code #Q730 Game analytics in a nutshell 2. How the tool came to be ● Data-driven informed decisions instead of instinct driven design ● Blackbox solutions offer basic performance monitoring (KPIs) ● What about in-depth analytics? Onboarding, game economy, … ● AAA studios all have in-house analysts & keep their secrets ● What to do with a fast growing mid-size studio? 8
  • 9. Join the discussion at slido.com with code #Q730 Our in-house solution should support 2. How the tool came to be ● Basic BI & in-depth analytics ● Custom metrics, reports and dashboards ● Scalable architecture for a reasonable price ● Secure storage & access to raw event data ● Integrations, integrations, integrations ● Usable by all stakeholders So the platform was born... 9
  • 10. 10 Cellense Big Data Warehouse
  • 11. Badland Brawl ● A brand new game by Frogmind, part of Supercell, who is the largest mobile game publisher by revenue ● Currently in soft-launch 11 Hill Climb Racing 2 ● #1 iOS and Android app of Dec 2016 ● 110 million players total ● Hundreds of millions of data points daily ● Daily operational costs under $50
  • 12. Join the discussion at slido.com with code #Q730 12 3. Architecture overview
  • 13. Join the discussion at slido.com with code #Q730 Usable outputsStoring and processingCollecting the data Architecture overview BigQuery Client’s backend Periscope Data Cellense Big Data Warehouse processed materialized views Firebase backend daily export of raw events BigQuery console daily ETL jobs that process and combine raw data into materialized views daily export of processed attribution and ad data Game client SDK (Firebase SDK) Attribution service (Appsflyer) Ad providers 13
  • 14. Join the discussion at slido.com with code #Q730 Usable outputsStoring and processingCollecting the data Architecture overview BigQuery Client’s backend Periscope Data Cellense Big Data Warehouse processed materialized views Firebase backend daily export of raw events BigQuery console daily ETL jobs that process and combine raw data into materialized views daily export of processed attribution and ad data Game client SDK (Firebase SDK) Attribution service (Appsflyer) Ad providers 14
  • 15. Join the discussion at slido.com with code #Q730 Usable outputsStoring and processingCollecting the data Architecture overview BigQuery Client’s backend Periscope Data Cellense Big Data Warehouse processed materialized views Firebase backend daily export of raw events BigQuery console daily ETL jobs that process and combine raw data into materialized views daily export of processed attribution and ad data Game client SDK (Firebase SDK) Attribution service (Appsflyer) Ad providers 15
  • 16. Join the discussion at slido.com with code #Q730 Usable outputsStoring and processingCollecting the data Architecture overview BigQuery Client’s backend Periscope Data Cellense Big Data Warehouse processed materialized views Firebase backend daily export of raw events BigQuery console daily ETL jobs that process and combine raw data into materialized views daily export of processed attribution and ad data Game client SDK (Firebase SDK) Attribution service (Appsflyer) Ad providers 16
  • 17. Join the discussion at slido.com with code #Q730 17 4. Collecting the data
  • 18. Join the discussion at slido.com with code #Q730 Usable outputsStoring and processingCollecting the data Collecting the data BigQuery Client’s backend Periscope Data Cellense Big Data Warehouse processed materialized views Firebase backend daily export of raw events BigQuery console daily ETL jobs that process and combine raw data into materialized views daily export of processed attribution and ad data Game client SDK (Firebase SDK) Attribution service (Appsflyer) Ad providers 18
  • 19. Join the discussion at slido.com with code #Q730 Gameplay events 4. Collecting the data 19 ● Event vehicle_purchase ● Event race_finished
  • 20. Join the discussion at slido.com with code #Q730 Gameplay events 4. Collecting the data What do you mean by gameplay events? ● Gameplay events fire whenever a player does a specific action in-game ● Examples include: app start, in-app purchase, level completion, … ● Tracing the whole player journey enables in-depth analysis How do we track them? ● Using Google’s Firebase SDK for Android & iOS (integrated in app) ● Events are sent straight to Firebase Analytics How many events are we talking about? ● 100+ various event types totalling in 300+ million events (~200GB) daily 20
  • 21. Join the discussion at slido.com with code #Q730 Marketing campaign attribution 4. Collecting the data 21
  • 22. Join the discussion at slido.com with code #Q730 Marketing campaign attribution 4. Collecting the data 22 What do you mean by marketing campaign attribution? ● When players install the game, we know if they clicked an ad for it before ● This enables the marketers to evaluate quality of acquisition sources How do we track them? ● Using Appsflyer SDK dedicated to mobile attribution (integrated in app) ● Data is sent to the client’s backend for internal marketing evaluation How many events are we talking about? ● Depending on the game between 5-70% of players come from ads, resulting in several thousands of events daily
  • 23. Ad-related data (4. Collecting the data)
  • 24. Ad-related data (4. Collecting the data)
    What do you mean by ad-related data?
    ● Ads are a very common revenue stream in games, in addition to IAPs
    ● Ads vary widely in nature, ranging from game ads to TV oracle ads
    How do we track them?
    ● SDKs from different ad providers report in-app ad views to their own servers
    ● Individual ad providers report the daily sums to the client’s backend via HTTP postbacks
    How many events are we talking about?
    ● Very few events, usually aggregated daily revenue from different sources
  • 25. 5. Storing and processing
  • 26. Storing and processing (architecture diagram, same as on slide 18)
  • 27. Short intro to BigQuery (5. Storing and processing)
    What is BigQuery?
    ● Serverless database solution offered by Google Cloud Platform
    ● Big data database compatible with SQL syntax
    ● Loading, copying & exporting data is free
    ● Very low storage costs: 2¢ per GB per month
    ● Very cheap querying: $5 per TB scanned
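To relate these prices to the event volume quoted earlier (~200GB of raw events daily), here is a rough back-of-the-envelope sketch. It assumes the storage price is billed per GB per month, that 1 TB = 1000 GB, and that a query is billed for the data it scans; the function names are illustrative and are not part of any BigQuery API.

```python
# Prices taken from the slide; the billing model around them is an assumption.
STORAGE_USD_PER_GB_MONTH = 0.02  # "2 cents per GB", assumed to be per month
QUERY_USD_PER_TB = 5.0           # "$5 per TB", assumed to mean per TB scanned


def monthly_storage_cost(gb_per_day: float, days_retained: int) -> float:
    """Cost of storing `days_retained` days of daily exports for one month."""
    return gb_per_day * days_retained * STORAGE_USD_PER_GB_MONTH


def query_cost(gb_scanned: float) -> float:
    """Cost of a single query that scans `gb_scanned` GB (1 TB = 1000 GB)."""
    return gb_scanned / 1000 * QUERY_USD_PER_TB


# 30 days of raw events at ~200GB per day.
print(monthly_storage_cost(200, 30))
# One query that scans those 30 days of raw data in full.
print(query_cost(200 * 30))
```

Under these assumptions storage is cheap, but every full scan of a month of raw data has a noticeable price, which is the motivation for querying smaller, pre-aggregated tables instead.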
  • 28. Raw data (5. Storing and processing)
    Why do we store raw data?
    ● Human and machine errors can lead to data discrepancies
    ● Many metrics and reports rely on preceding events and their attributes
    ● After fixing bugs, reprocessing historical data is often necessary
    How do we store the raw data?
    ● As it comes. If an event body is malformed, the events can sometimes be recovered with some ETL magic and then reprocessed
  • 29. Raw data (5. Storing and processing)
  • 30. Flattened views (5. Storing and processing)
    What are flattened views?
    ● Raw data from Firebase is stored in a compressed, nested format
    ● For BigQuery to work with it effectively, a transformation is needed
    Figure: before (raw) vs. after (flattened)
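As an illustration of what flattening means, the sketch below turns one nested record into one flat row per event. The field names (`user_id`, `events`, `params`) are hypothetical and do not reproduce the real Firebase export schema; only the event names come from the slides.

```python
from typing import Any, Dict, Iterator


def flatten_events(raw_row: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per nested event, copying the user id onto each row."""
    for event in raw_row.get("events", []):
        flat = {"user_id": raw_row["user_id"], "event_name": event["name"]}
        # Promote each nested parameter to its own top-level column.
        for key, value in event.get("params", {}).items():
            flat[f"param_{key}"] = value
        yield flat


# A made-up nested record holding two events for one player.
raw = {
    "user_id": "u1",
    "events": [
        {"name": "vehicle_purchase", "params": {"vehicle": "truck", "price": 500}},
        {"name": "race_finished", "params": {"position": 2}},
    ],
}

for row in flatten_events(raw):
    print(row)
```

Each flat row can then be filtered and aggregated with plain column references, without unpacking nested structures in every query.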
  • 31. Materialized views (5. Storing and processing)
    Why materialized views?
    ● Querying raw data would be very costly (often terabytes of data)
    ● Minimized tables (MVs) allow for fast and cost-effective querying
    How is it cost-effective?
    ● We process raw data daily and build MVs for different use cases
    ● We batch the data daily and calculate all metrics for these batches
    ● In case of a defect, we only reprocess the affected days of the materialized views
  • 32. Materialized views (5. Storing and processing)
    Diagram labels: Raw data; Flattened view; Daily Appsflyer data; Lifetime attribution view; Cheater tables; MV processing; Set of materialized views (user state, purchases, races, vehicles bought, ...)
  • 33. Materialized views (5. Storing and processing)
    What if I have cheaters in my raw data?
    ● First, we build cheater tables (MVs), which allow us to filter the events
    ● After that, we process the filtered raw data
    So we need some kind of dependency system?
    ● Yes, besides cheaters we also enrich tables with marketing data, ad data, ...
    ● To ensure the MVs are processed in the correct order, another component is involved
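The ordering requirement described above is a topological sort over job dependencies. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the job names and the dependency edges are hypothetical, loosely based on the labels of the diagram on slide 32, and do not describe how the actual scheduler is implemented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: each job lists the jobs it must wait for.
DEPENDENCIES: Dict[str, Set[str]] = {
    "flattened_view": {"raw_data"},
    "cheater_tables": {"flattened_view"},
    "lifetime_attribution": {"daily_appsflyer_data"},
    "purchases": {"flattened_view", "cheater_tables", "lifetime_attribution"},
    "races": {"flattened_view", "cheater_tables"},
}


def processing_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the jobs in an order where every job runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


print(processing_order(DEPENDENCIES))
```

`TopologicalSorter` raises `graphlib.CycleError` if two jobs depend on each other, which is a useful check to run before scheduling anything.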
  • 34. CBDW - Our Swiss knife (5. Storing and processing)
    Intro - Cellense Big Data Warehouse
    ● Full-stack TypeScript web application (Angular 5 + NestJS)
    Why is there a need for another component?
    ● Monitoring the status of BigQuery jobs
    ● Dependency management and periodic job scheduling
  • 35. CBDW - Our Swiss knife (5. Storing and processing)
    Is that all?
    ● Monitoring event inflow
    ● Running ad-hoc jobs in case reprocessing is needed
    ● Query cost estimation and syntax validation
    ● User access management
  • 36. CBDW - Our Swiss knife (5. Storing and processing)
  • 37. 6. Usable outputs
  • 38. Usable outputs (architecture diagram, same as on slide 18)
  • 39. Visualization for end users (6. Usable outputs)
    What is Periscope Data?
    ● A visualization tool tightly integrated with Google BigQuery
    ● It enables us to create a wide range of visual reports, graphs and charts on top of data queried directly from MVs in BigQuery
    Figure: Periscope example (sample data)
  • 40. Iterative prototyping of new metrics (6. Usable outputs)
    Prototyping anything new must be a real hassle, right?
    ● No, not really
    ● With the combined use of the old faithful BigQuery console and our processed MVs, prototyping and testing are quite simple
  • 41. 7. Machine learning in big data game analytics
  • 42. Business value (7. Machine learning in big data game analytics)
    What value can be gained by using ML in game analytics?
    ● Certain ML models can help us understand the importance and influence of specific game design elements in the context of a given problem
    How is this method different from standard data analysis?
    ● When correctly applied, ML methods are much less prone to cognitive biases
    ● Operating with a wider range of inputs, ML methods often reveal unexpected causality chains that wouldn’t even occur to a human analyst
  • 43. From the issue to the solution (7. Machine learning in big data game analytics)
    So how should we proceed?
    1. Understanding the business issue
    2. Preparing your data
    3. Feature engineering and selection
    4. Training and tuning
    5. Evaluation and validation
    6. Presentation and visualization
    Source: Udacity.com, (2016), Cross-industry standard process for data mining
  • 44. 8. A real world ML example
  • 45. Understanding the business issue (8. A real world ML example)
    So what is our problem?
    ● If players don’t come back to the game, they can’t generate revenue
    ● This is called the player churn problem, the #1 issue in f2p (free-to-play) games
    What do churn and retention mean?
    ● Churn tells us the percentage of players leaving over time
    ● Retention is the percentage of players retained in the game over time
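The two definitions above can be made concrete with a small sketch. It assumes one common convention that the slides do not spell out: day-N retention is the share of an install cohort that is active exactly N days after installing, and churn is its complement. The player ids and dates are made up.

```python
from datetime import date
from typing import Dict, Set


def day_n_retention(
    installs: Dict[str, date], activity: Dict[str, Set[date]], n: int
) -> float:
    """Share of installed players who were active exactly n days after install."""
    if not installs:
        return 0.0
    retained = sum(
        1
        for player, installed in installs.items()
        if any((day - installed).days == n for day in activity.get(player, set()))
    )
    return retained / len(installs)


def day_n_churn(
    installs: Dict[str, date], activity: Dict[str, Set[date]], n: int
) -> float:
    """Complement of day-n retention under the same convention."""
    return 1.0 - day_n_retention(installs, activity, n)


# Made-up cohort: four players who all installed on the same day.
installs = {p: date(2018, 2, 1) for p in ("a", "b", "c", "d")}
activity = {
    "a": {date(2018, 2, 2)},
    "b": {date(2018, 2, 2), date(2018, 2, 8)},
    "d": {date(2018, 2, 3)},
}
print(day_n_retention(installs, activity, 1), day_n_churn(installs, activity, 1))
```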
  • 46. Understanding the business issue (8. A real world ML example)
    What are the specific questions we want answered?
    ● What are the main features separating a retained player from a churned one?
    ● How do these particular features affect a player’s retention rate?
    ● How does each feature’s importance change over time spent in-game?
  • 47. Understanding the business issue (8. A real world ML example)
    What do YOU think is the most important factor for retaining a player?
    A. Bring the player back every single successive day
    B. Make the player watch at most 5 ads a day
    C. Reward the player for every significant achievement in-game
  • 48. Preparing our data (8. A real world ML example)
    How do we ensure that the model is based on sound data we can trust?
    ● Exploration and quality control of data
    ● Additional pre-processing and filtering
  • 49. Preparing our data (8. A real world ML example)
    What should we look at during exploration and quality control of data?
    ● Negative values
      ○ e.g. how can a player’s total playtime be negative?
    ● Distributions
      ○ expect a lot of log-normal or exponential distributions
    ● Outlier treatment
      ○ e.g. players with 1 million coins on hand, while 99% have < 1k and the mean is 300
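A minimal sketch of two of the checks above: dropping negative values and treating outliers. Capping values at a high percentile (winsorizing) is an assumed treatment chosen for illustration, since the slides do not say which method was used; the function names and sample numbers are made up.

```python
import math
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def clean_metric(values: List[float], cap_percentile: float = 99.0) -> List[float]:
    """Drop negative values, then cap the rest at the given percentile."""
    non_negative = [v for v in values if v >= 0]
    if not non_negative:
        return []
    cap = percentile(non_negative, cap_percentile)
    return [min(v, cap) for v in non_negative]


# Made-up coin balances: mostly small, one extreme outlier, one invalid value.
coins = [100.0] * 98 + [900.0, 1_000_000.0, -5.0]
cleaned = clean_metric(coins)
print(len(cleaned), max(cleaned))
```

Capping keeps the sample in the dataset while limiting its influence; removing such players entirely is an equally valid choice and depends on whether the extreme values are believed to be genuine.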
  • 50. Preparing our data (8. A real world ML example)
    What does the additional pre-processing and filtering entail?
    ● Selection
      ○ remove players who don’t finish the tutorial and have positive play time in other game modes
    ● Profiling
      ○ understand the quality of the data
      ○ e.g. the relation between sessions and matches
    ● Segmentation
      ○ find new relations between groups of players with similar characteristics
    ● Cleansing
      ○ remove samples with empty/negative values, fill in partially with mean values
    ● Validation
      ○ central tendencies - mean, median, ...
      ○ variability - standard deviation, ranges, ...
    ● Target balancing
      ○ to avoid neglecting important segments, we upscale/downscale the sampling of groups
      ○ in our case we would like a split of around 50/50
    ● Transformation
      ○ some methods require normal instead of log-normal distributions to avoid bias
      ○ not needed in our case
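The target balancing step can be sketched as follows. This version downsamples the larger class to reach the 50/50 split mentioned above; upsampling the smaller class is the alternative, and the slides do not say which was used. The label convention (1 = churned, 0 = retained) and the function name are assumptions.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, float], int]  # (features, label); 1 = churned, 0 = retained


def downsample_to_balance(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Randomly drop samples from the larger class until both classes are equal."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    churned = [s for s in samples if s[1] == 1]
    retained = [s for s in samples if s[1] == 0]
    target = min(len(churned), len(retained))
    balanced = rng.sample(churned, target) + rng.sample(retained, target)
    rng.shuffle(balanced)
    return balanced


# Made-up dataset: 80 churned players and 20 retained players.
data: List[Sample] = [({"matches": float(i)}, 1) for i in range(80)]
data += [({"matches": float(i)}, 0) for i in range(20)]
balanced = downsample_to_balance(data)
print(len(balanced), sum(label for _, label in balanced))
```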
  • 51. Feature engineering and selection (8. A real world ML example)
    What are our features and how do we find them?
    ● Features are selected metrics representing the underlying problem
    ● We pick the metrics to use based on common sense
    And what features are those for our problem?
    ● Gameplay statistics - matches played, number of sessions, …
    ● Total statistics - achieved rank, total playtime, …
    ● Resources used - coins, gems
    ● Rewards earned - resources, chests, skins, …
    ● Social stats, Ad stats, Geo data, ...
  • 52. Feature engineering and selection (8. A real world ML example)
    Sounds like an awful lot of features, no?
    ● Yes, to be exact we use around 120 features for this problem
    ● Many already have MVs, as we use them in standard analytics
    Why so many? Wouldn’t, say, 20 suffice?
    ● We use as many relevant features as we can to avoid scoping bias
    ● The best features can be extracted using appropriate methods
  • 53. Training and tuning (8. A real world ML example)
    So how do we extract the best features?
    ● We use the random forest method to construct randomized decision trees
    ● The trees map the various options players can take during gameplay
    ● We average the results to find the best option
    That has to be a whole rainforest, with 120 features, right?
    ● About 100 random decision trees should be enough
    ● Even a higher number of random trees could be biased
  • 54. Training and tuning (8. A real world ML example)
    How do we know that our results are not biased, then?
    ● We use a gradient boosted tree to cross-reference the results
    ● After training the tree on the same dataset, we get a comparable result
    Why do we only use methods that work with decision trees?
    ● For our purposes, we need to interpret the result and quantify the contribution of different game features to retention and churn
    ● Although other black-box models may be more successful, they can’t be interpreted in this manner
  • 55. Training and tuning (8. A real world ML example)
  • 56. Evaluation and validation (8. A real world ML example)
    That seems complicated, how do we interpret it?
    ● We don’t, it’s too complicated and also probably overfitted
    ● We proceed with pruning the tree
    But how do we prune the tree without losing information?
    ● We prune sections which provide little power to classify players
    Does the pruning have any other effects?
    ● We avoid overfitting the model and improve the generalization of knowledge
    ● It also makes the visualization much more readable
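One way to quantify how much "power to classify players" a tree section provides is the decrease in Gini impurity achieved by its split. The sketch below is an illustration under that assumption; the slides do not state which pruning criterion was actually used, and the function names and labels (1 = churned, 0 = retained) are made up.

```python
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels: 0.0 for a pure node, 0.5 for a 50/50 mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(
    parent: Sequence[int], left: Sequence[int], right: Sequence[int]
) -> float:
    """How much a split of `parent` into `left` and `right` reduces impurity."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = [1, 1, 1, 1, 0, 0, 0, 0]
# A split that separates churned from retained players perfectly.
print(impurity_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0]))
# A split that leaves both children as mixed as the parent: a pruning candidate.
print(impurity_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0]))
```

Splits whose decrease falls below a chosen threshold add depth to the tree without separating the classes, so removing them costs little accuracy and makes the tree easier to read.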
  • 57. Evaluation and validation (8. A real world ML example)
    Are we now 100% convinced that the model reflects reality?
    ● Let’s do a 10-fold cross-validation, just to be sure
    Okay, I fold. Anything else?
    ● That’s it! Now we just have to take a look at the result
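For readers unfamiliar with 10-fold cross-validation: the data is split into 10 parts, and the model is trained 10 times, each time holding out a different part for testing. The sketch below only generates the index splits; the function name is illustrative, and in practice the samples would be shuffled first, which is omitted here for brevity.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(25, k=10):
    print(len(train), test)
```

Averaging the evaluation metric over the 10 held-out folds gives a more reliable estimate than a single train/test split, because every sample is used for testing exactly once.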
  • 58. Evaluation and validation (8. A real world ML example)
    Figure legend: Churned, Retained
  • 59. Evaluation and validation (8. A real world ML example)
    So what did we learn?
    ● The most important thing is to bring the player back every successive day
    ● After just a single missed day, player churn increases greatly
    ● ...and also that “intuitive” doesn’t always mean true
  • 60. Evaluation and validation (8. A real world ML example)
    What should we do, now that we know the culprit for churn in our game?
    ● Design an A/B test and verify the hypothesis
    ● Make better use of existing assets to achieve higher day-to-day retention
    ● Find the next issue!
  • 61. 9. Questions and Answers
  • 62. Pssst, we’re hiring. If you’re interested in games & data, or know someone, drop us a message at [email protected] or [email protected]. Also check out job positions online here & here.

Editor's Notes

  • #30: nested structure of raw data
  • #59: The most important node is the root node, which separates players who are retained from those who churn. Most of the left subtree represents churned players, while the right subtree contains players who are mostly retained.