A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel

A Big(Query) frog
in a small pond
- Data Science Club #1
Summer 2017/2018
Speaker: Jakub Motýľ
Last updated: 19.2.2018

Join the discussion at slido.com with code #Q730
Today’s agenda
1. Introduction
2. How the tool came to be
   1. Game analytics in a nutshell
   2. Our in-house solution
3. Architecture overview
4. Collecting the data
   1. Gameplay events
   2. Marketing campaign attribution
   3. Ad-related data
5. Storing and processing
   1. Short intro to BigQuery
   2. Raw data
   3. Flattened views
   4. Materialized views
   5. CBDW - Our Swiss Army knife
6. Usable outputs
   1. Visualization for end users
   2. Iterative prototyping of new metrics
7. Machine learning in big data game analytics
   1. Business value
   2. From the issue to the solution
8. A real world ML example
   1. Understanding the business issue
   2. Preparing our data
   3. Feature engineering and selection
   4. Training and tuning
   5. Evaluation and validation
9. Q&A section

1. Introduction

Cellense & BuffPanel
BuffPanel
● Steam marketing analytics
● Big data attribution platform and PC+console gamedev consulting
● Product of Cellense
Cellense
● “Making mobile games grow”
● Games sector BI & analytics
● Formerly Infinario, a branch of Exponea focused on gaming
● 15 employees and growing

The founder & the speaker
Ivan Trančík
● Serial big data entrepreneur
● Co-founder of Infinario (now Exponea)
● Founder & CEO at Cellense & BuffPanel
Jakub Motýľ (that’s me)
● Former programmer at Vacuumlabs
● 2 years as game analyst at Cellense
● Now director of BuffPanel

Our partners & clients

Game analytics in a nutshell
● Data-driven, informed decisions instead of instinct-driven design
● Black-box solutions offer basic performance monitoring (KPIs)
● What about in-depth analytics? Onboarding, game economy, …
● AAA studios all have in-house analysts & keep their secrets
● What should a fast-growing mid-size studio do?

Our in-house solution should support
● Basic BI & in-depth analytics
● Custom metrics, reports and dashboards
● Scalable architecture for a reasonable price
● Secure storage & access to raw event data
● Integrations, integrations, integrations
● Usable by all stakeholders
So the platform was born...

Cellense Big Data Warehouse

Badland Brawl
● A brand-new game by Frogmind, part of Supercell, the largest mobile game publisher by revenue
● Currently in soft launch
Hill Climb Racing 2
● #1 iOS and Android app of Dec 2016
● 110 million players total
● Hundreds of millions of data points daily
● Daily operational costs under $50

3. Architecture overview

Architecture overview
[Diagram: data flows through three stages]
● Collecting the data: the game client SDK (Firebase SDK) reports to the Firebase backend; the attribution service (Appsflyer) and the ad providers report to the client’s backend
● Storing and processing: BigQuery receives a daily export of raw events from the Firebase backend and a daily export of processed attribution and ad data from the client’s backend; the Cellense Big Data Warehouse runs daily ETL jobs that process and combine raw data into materialized views
● Usable outputs: the processed materialized views are read by Periscope Data and the BigQuery console


Collecting the data

Gameplay events
● Event vehicle_purchase
● Event race_finished

Gameplay events
What do you mean by gameplay events?
● Gameplay events fire whenever a player does a specific action in-game
● Examples include: app start, in-app purchase, level completion, …
● Tracing the whole player journey enables in-depth analysis
How do we track them?
● Using Google’s Firebase SDK for Android & iOS (integrated in app)
● Events are sent straight to Firebase Analytics
How many events are we talking about?
● 100+ event types, totalling 300+ million events (~200 GB) daily

Marketing campaign attribution
What do you mean by marketing campaign attribution?
● When players install the game, we know whether they previously clicked an ad for it
● This enables marketers to evaluate the quality of acquisition sources
● Using the Appsflyer SDK dedicated to mobile attribution (integrated in app)
● Data is sent to the client’s backend for internal marketing evaluation
● Depending on the game, 5-70% of players come from ads, resulting in several thousand events daily

Ad-related data

What do you mean by ad-related data?
● Ads are a very common revenue stream in games, in addition to IAPs
● Ads vary in nature, ranging from game ads to TV oracle ads
● SDKs from different ad providers track in-app ad views and report them to the providers’ own servers
● Individual ad providers report the daily sums to the client’s backend via HTTP postbacks
● Very few events, usually aggregated daily revenue from different sources

Storing and processing

Short intro to BigQuery
What is BigQuery?
● Serverless database solution offered by Google Cloud Platform
● Big data database queried with SQL syntax
● Loading, copying & exporting data is free
● Very low storage costs: 2¢ per GB per month
● Very cheap querying: $5 per TB scanned
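To put these prices into perspective, here is a rough cost sketch in Python. It uses the figures from the slides (2¢ per GB stored, $5 per TB queried, ~200 GB of events per day). The monthly billing period for storage, the decimal unit conversion and all function names are assumptions made for illustration.

```python
# Rough BigQuery cost arithmetic based on the prices quoted in the talk.
# Assumptions (not from the talk): storage is billed per GB per month,
# 1 TB = 1000 GB, and all names here are illustrative.

STORAGE_USD_PER_GB_MONTH = 0.02
QUERY_USD_PER_TB = 5.0


def monthly_storage_cost(stored_gb: float) -> float:
    """Cost of keeping `stored_gb` gigabytes for one month."""
    return stored_gb * STORAGE_USD_PER_GB_MONTH


def query_cost(scanned_gb: float) -> float:
    """Cost of a query that scans `scanned_gb` gigabytes."""
    return scanned_gb / 1000 * QUERY_USD_PER_TB


if __name__ == "__main__":
    daily_gb = 200  # ~200 GB of raw events per day
    print(f"storing 30 days of raw events: ${monthly_storage_cost(daily_gb * 30):.2f} per month")
    print(f"one full scan of a single day: ${query_cost(daily_gb):.2f}")
```

The point of the sketch is the ratio: scanning raw data repeatedly adds up much faster than storing it, which is why the following slides focus on reducing the amount of data each query touches.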

Raw data
Why do we store raw data?
● Human and machine errors can lead to data discrepancies
● Many metrics and reports rely on preceding events and their attributes
● After fixing bugs, reprocessing historical data is often necessary
How do we store the raw data?
● As it comes: if an event body is malformed, the event can sometimes be recovered with some ETL magic and then reprocessed


Flattened views
What are flattened views?
● Raw data from Firebase is stored in a compressed format
● A transformation is needed before BigQuery can work with it effectively
[Figure: the same data before (raw) and after (flattened)]
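A minimal Python sketch of what flattening means in practice. The nested record below only loosely imitates an analytics export (one event carrying a repeated list of key/value parameters). The field names and the `flatten_event` helper are assumptions for illustration, not the actual Firebase schema or our production transformation.

```python
from typing import Any, Dict


def flatten_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one nested event into a single flat row.

    Repeated key/value parameters become ordinary columns prefixed
    with 'param_', so they can be filtered and aggregated directly.
    """
    row = {
        "user_id": event["user_id"],
        "event_name": event["name"],
        "timestamp": event["timestamp"],
    }
    for param in event.get("params", []):
        row[f"param_{param['key']}"] = param["value"]
    return row


# Hypothetical raw event with nested, repeated parameters.
raw = {
    "user_id": "u42",
    "name": "race_finished",
    "timestamp": 1519034400,
    "params": [
        {"key": "track", "value": "countryside"},
        {"key": "position", "value": 2},
    ],
}

flat = flatten_event(raw)
print(flat)
```

In the warehouse the same idea would be expressed as a SQL view rather than Python; the sketch only shows the shape of the change.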

Materialized views
Why materialized views?
● Querying raw data would be very costly (often terabytes of data)
● Minimized tables (MVs) allow for fast and cost-effective querying
How is it cost-effective?
● We process raw data daily and build MVs for different use cases
● We batch the data daily and calculate all metrics for these batches
● In case of a defect, we only reprocess the affected days of the materialized views
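The daily batching idea can be sketched in a few lines of Python. Everything here is hypothetical (the event tuples, the metric names, the `reprocess` helper); it only illustrates why a defect costs as much as the affected days and not the whole history.

```python
# Hypothetical flattened events: (day, user_id, revenue in USD).
events = [
    ("2018-02-17", "u1", 0.99),
    ("2018-02-17", "u2", 4.99),
    ("2018-02-18", "u1", 1.99),
]


def build_daily_batch(events, day):
    """Aggregate one day's events into a small per-day summary row."""
    day_events = [e for e in events if e[0] == day]
    return {
        "day": day,
        "active_users": len({user for _, user, _ in day_events}),
        "revenue": round(sum(rev for _, _, rev in day_events), 2),
    }


def reprocess(mv, events, affected_days):
    """Rebuild only the affected daily batches; other days stay untouched."""
    for day in affected_days:
        mv[day] = build_daily_batch(events, day)
    return mv


# Initial build: one batch per day.
mv = reprocess({}, events, ["2018-02-17", "2018-02-18"])

# A late fix changes one day's data; only that day is recomputed.
events.append(("2018-02-18", "u3", 9.99))
mv = reprocess(mv, events, ["2018-02-18"])
print(mv["2018-02-18"])
```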

[Diagram: raw data passes through the flattened view and the cheater tables into MV processing, which also takes the daily Appsflyer data and the lifetime attribution view; the result is a set of materialized views: user state, purchases, races, vehicles bought, ...]

What if I have cheaters in my raw data?
● First, we build cheater tables (MVs), which allow us to filter the events
● After that, we process the filtered raw data
So we need some kind of dependency system?
● Yes, besides cheaters we also enrich tables with marketing data, ad data, ...
● To ensure the MVs are processed in the correct order, another component joins in
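Processing MVs "in the correct order" is a topological sort of the job graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The job names and their dependencies are hypothetical, loosely based on the tables mentioned in this talk, and say nothing about how the real scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job graph: each job maps to the jobs it depends on.
dependencies = {
    "flattened_view": {"raw_data"},
    "cheater_tables": {"flattened_view"},
    "purchases": {"flattened_view", "cheater_tables"},
    "lifetime_attribution": {"daily_appsflyer_data"},
    "user_state": {"purchases", "lifetime_attribution"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A cycle in the graph (two MVs depending on each other) makes `static_order()` raise `graphlib.CycleError`, which is a useful safety check before anything is scheduled.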

CBDW - Our Swiss Army knife
Intro - Cellense Big Data Warehouse
● Full-stack TypeScript web application (Angular 5 + NestJS)
Why is there a need for another component?
● Monitoring the status of BigQuery jobs
● Dependency management and periodic job scheduling

Is that all?
● Monitoring event inflow
● Running ad-hoc jobs in case reprocessing is needed
● Query cost estimation and syntax validation
● User access management

6. Usable outputs


Visualization for end users
What is Periscope Data?
● A visualization tool tightly integrated with Google BigQuery
● It lets us build a wide range of visual reports, graphs and charts on top of data queried directly from the MVs in BigQuery
[Figure: Periscope example (sample data)]

Iterative prototyping of new metrics
Prototyping anything new must be a real hassle, right?
● No, not really
● By combining the trusty BigQuery console with our processed MVs, prototyping and testing are quite simple

Business value
What value can be gained by using ML in game analytics?
● Certain ML models can help us understand the importance and influence of
specific game design elements in context of a given problem
How is this method different from standard data analysis?
● When correctly applied, ML methods are much less prone to cognitive biases
● Working with a wider range of inputs, ML methods often reveal unexpected causal chains that wouldn’t even occur to a human analyst

From the issue to the solution
7. Use of machine learning
So how should we proceed?
1. Understanding the business issue
2. Preparing your data
3. Feature engineering and selection
4. Training and tuning
5. Evaluation and validation
6. Presentation and visualization
Source: Udacity.com, (2016),
Cross-industry standard process for data mining

Understanding the business issue
So what is our problem?
● If players don’t come back to the game, they can’t generate revenue
● This is called the player churn problem, the #1 issue in free-to-play (F2P) games
What do churn and retention mean?
● Churn is the percentage of players leaving over time
● Retention is the percentage of players retained in the game over time
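As a concrete illustration, here is a small Python sketch of day-N retention. The definition used (a player counts as retained if active exactly N days after install) and the simplification churn = 1 - retention are assumptions; studios define both in slightly different ways. The player data is made up.

```python
from datetime import date


def day_n_retention(install_dates, activity, n):
    """Share of installed players who were active exactly n days after install.

    install_dates: {player_id: install date}
    activity: {player_id: set of dates on which the player was active}
    """
    if not install_dates:
        return 0.0
    retained = sum(
        1
        for player, installed in install_dates.items()
        if any((d - installed).days == n for d in activity.get(player, ()))
    )
    return retained / len(install_dates)


# Hypothetical cohort of four players.
installs = {
    "a": date(2018, 2, 1),
    "b": date(2018, 2, 1),
    "c": date(2018, 2, 2),
    "d": date(2018, 2, 2),
}
activity = {
    "a": {date(2018, 2, 1), date(2018, 2, 2)},
    "b": {date(2018, 2, 1)},
    "c": {date(2018, 2, 2), date(2018, 2, 3)},
    "d": {date(2018, 2, 2), date(2018, 2, 3)},
}

d1_retention = day_n_retention(installs, activity, 1)
d1_churn = 1 - d1_retention
print(f"D1 retention: {d1_retention:.0%}, D1 churn: {d1_churn:.0%}")
```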

What are the specific questions we want answered?
● What are the main features separating a retained from a churned player?
● How do these particular features affect a player’s retention rate?
● How does each feature’s importance change over time spent in-game?

What do YOU think is the most important factor for retaining a player?
A. Bring the player back every single successive day
B. Make the player watch at most 5 ads a day
C. Reward the player for every significant achievement in-game

Preparing our data
How do we ensure that the model is based on sound data we can trust?
● Exploration and quality control of data
● Additional pre-processing and filtering

What to look at in exploration and quality control of data?
● Negative values
○ e.g. how can a player’s total playtime be negative?
● Distributions
○ expect a lot of log-normal or exponential distributions
● Outlier treatment
○ e.g. players with 1 million coins on hand, when 99% have < 1k and the mean is 300
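These checks are easy to automate. The sketch below uses only Python's standard `statistics` module. The 99th-percentile cut-off, the `quality_report` helper and the coin balances are illustrative assumptions, not the rules we actually apply.

```python
import statistics


def quality_report(values, cap_percentile=99):
    """Flag negative values and values above a percentile cut-off."""
    negatives = [v for v in values if v < 0]
    valid = sorted(v for v in values if v >= 0)
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cutoff = statistics.quantiles(valid, n=100)[cap_percentile - 1]
    outliers = [v for v in valid if v > cutoff]
    return {
        "negatives": negatives,
        "cutoff": cutoff,
        "outliers": outliers,
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
    }


# Hypothetical coin balances: mostly small, one broken value, one extreme value.
coins = [120, 300, 80, 450, 900, 60, 210, -5, 330, 1_000_000] + [200] * 190
report = quality_report(coins)
print(report["negatives"], report["outliers"], report["median"])
```

Comparing the mean with the median is a quick way to see how strongly a single extreme value distorts the picture.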

What does the additional pre-processing and filtering entail?
● Selection
○ remove players who don’t finish the tutorial and have positive playtime in other game modes
● Profiling
○ understand the quality of the data
○ e.g. the relation between sessions and matches
● Segmentation
○ find new relations between groups of players with similar characteristics
● Cleansing
○ remove samples with empty/negative values, partially fill in gaps with mean values
● Validation
○ central tendencies - mean, median, ...
○ variability - standard deviation, ranges, ...
● Target balancing
○ to avoid neglecting important segments, we upsample/downsample groups
○ in our case we would like a split of around 50/50
● Transformation
○ some methods require normal instead of log-normal distributions to avoid bias
○ not needed in our case
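Target balancing is the step that is easiest to show in code. A minimal sketch, assuming simple random downsampling of the larger class to reach a 50/50 split; the player records and the `downsample_to_balance` helper are hypothetical.

```python
import random


def downsample_to_balance(samples, label_of, seed=0):
    """Downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    retained = [s for s in samples if label_of(s) == "retained"]
    churned = [s for s in samples if label_of(s) == "churned"]
    size = min(len(retained), len(churned))
    balanced = rng.sample(retained, size) + rng.sample(churned, size)
    rng.shuffle(balanced)
    return balanced


# Hypothetical players: 20 retained, 80 churned.
players = [{"id": i, "label": "retained" if i < 20 else "churned"} for i in range(100)]
balanced = downsample_to_balance(players, lambda p: p["label"])
print(len(balanced), sum(p["label"] == "retained" for p in balanced))
```

Downsampling throws data away, so with a small minority class upsampling may be the better choice; the sketch shows only one of the two options.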

Feature engineering and selection
What are our features and how do we find them?
● Features are selected metrics representing the underlying problem
● We pick a set of candidate metrics based on common sense
And what features are those for our problem?
● Gameplay statistics - matches played, number of sessions, …
● Total statistics - achieved rank, total playtime, …
● Resources used - coins, gems
● Rewards earned - resources, chests, skins, …
● Social stats, Ad stats, Geo data, ...

Sounds like an awful lot of features, no?
● Yes, to be exact we use around 120 features for this problem
● Many already have MVs as we use them in standard analytics
Why so many? Wouldn’t, say, 20 suffice?
● We use as many relevant features as we can to avoid scoping bias
● The best features can be extracted using appropriate methods

Training and tuning
So how do we extract the best features?
● We use the random forest method to construct randomized decision trees
● The trees map the various options players can take during gameplay
● We average the results to find the best option
That has to be a whole rainforest, with 120 features, right?
● About 100 random decision trees should be enough
● Even a larger number of random trees could still be biased
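To show the mechanics without any dependencies, here is a deliberately tiny pure-Python stand-in: a forest of 100 one-split trees (stumps), each trained on a bootstrap sample with one randomly chosen feature, with the impurity decrease averaged per feature. This is a toy version of the random forest idea, not the model we trained; a real analysis would use a library implementation with deeper trees. The two features and the labelling rule are made up.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_gain(rows, labels, feature):
    """Largest impurity decrease achievable with one threshold on a feature."""
    parent = gini(labels)
    best_gain = 0.0
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best_gain = max(best_gain, parent - weighted)
    return best_gain


def forest_importance(rows, labels, n_trees=100, seed=0):
    """Average impurity decrease per feature over many randomized stumps."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    totals = [0.0] * n_features
    for _ in range(n_trees):
        # Bootstrap sample of players plus one randomly chosen feature per stump.
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        feature = rng.randrange(n_features)
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        totals[feature] += best_split_gain(sample_rows, sample_labels, feature)
    return [total / n_trees for total in totals]


# Hypothetical data: feature 0 = days played in the first week (drives the label),
# feature 1 = ads watched (pure noise). Label 1 means "retained".
data_rng = random.Random(1)
rows = [[data_rng.randint(1, 7), data_rng.randint(0, 10)] for _ in range(200)]
labels = [1 if days >= 5 else 0 for days, _ in rows]

importance = forest_importance(rows, labels)
print(importance)
```

The feature that actually drives the label ends up with a much higher average impurity decrease than the noise feature, which is the kind of ranking we are after.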

How do we know that our results are not biased, then?
● We use gradient boosted trees to cross-check the results
● After training them on the same dataset, we get a comparable result
Why do we only use methods working with decision trees?
● For our purposes, we need to interpret the result and quantify the contribution of different game features to retention and churn
● Although black-box models may be more successful, they can’t be interpreted in this manner


That seems complicated, how do we interpret it?
● We don’t, it’s too complicated and also probably overfitted
● We proceed with pruning the tree
But how do we prune the tree without losing information?
● We prune sections which provide little power to classify players
Does the pruning have any other effects?
● We avoid overfitting the model and improve how well it generalizes
● It also makes the visualization much more readable
Evaluation and validation

Are we now 100% convinced that the model reflects reality?
● Let’s do a 10-fold cross-validation, just to be sure
Okay, I fold. Anything else?
● That’s it! Now we just have to take a look at the result
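For reference, this is roughly what a 10-fold split looks like. The sketch only produces index sets and assumes the samples have already been shuffled; the `k_fold_indices` helper is illustrative, and in practice a library routine would do the same job.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for train, test in folds[:2]:
    print(len(train), test)
```

Each sample lands in the test set exactly once, so the model is always evaluated on data it did not see during training.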
[Figure legend: Churned, Retained]

So what did we learn?
● The most important thing is to bring back the player every successive day
● After just a single missed day, player churn increases greatly
● ...and also that “intuitive” doesn’t always mean true

What should we do, now that we know the culprit for churn in our game?
● Design an AB test and verify the hypothesis
● Make better use of existing assets to achieve higher day-to-day retention
● Find the next issue!
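One simple way to verify such a hypothesis is a two-proportion z-test on the retention rates of the control and variant groups. The sketch below is a generic textbook test using only the standard library; the group sizes and retention counts are invented, and the choice of test is an assumption, not a description of how our A/B tests are evaluated.

```python
import math


def two_proportion_z_test(retained_a, total_a, retained_b, total_b):
    """Two-sided z-test for a difference between two retention rates."""
    p_a, p_b = retained_a / total_a, retained_b / total_b
    pooled = (retained_a + retained_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: control (A) vs. a variant that nudges daily returns (B).
z, p_value = two_proportion_z_test(retained_a=400, total_a=1000, retained_b=460, total_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```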

9. Questions and Answers

Pssst, we’re hiring. If you’re interested in games
& data, or know someone, drop us a message
at jobs@cellense.com or jobs@buffpanel.com.
Also check out job positions online here & here.
