Scalding Big ADta
or: firing pots full of ads
Boris Trofimov
@b0ris_1
Agenda
• Two stories on how AD is served inside an AD company
• Awesome Scalding
The story about shoes
or
Big Brother is watching you
We will answer this question in a few slides
or be careful while buying shoes
How many websites with ads aboard do you open during the day?
Open any site with an ad
What can be simpler than loading a website in a web browser, huh?
However, that is a deceptive judgment
The first second…
Story Actors
• User
• Publisher (foxnews.com)
• Ad Server (Google's DoubleClick)
• SSP (Ad Exchange)
• DSP (decides which ad to show) ← we are here
• Advertiser (Nike)
Timeline of the first second (reconstructed from the slide diagram):
• 20 ms – publisher receives the request
• 100 ms – publisher sends the response
• 150 ms – content delivered to the user
• 170 ms – site sends a request to the Ad Server
• 200 ms – Ad Server receives the ad request and redirects to the Ad Exchange
• 210 ms – SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• all bidders must send their decision (participate? & price) back within the ~80 ms auction window
• 280 ms – SSP picks the winning bid and sends the redirect url back to the Ad Server
• 300 ms – Ad Server shows the user a page which redirects to the bidder's server
• 350 ms – the user's web page asks for the ad banner from a CDN
• 400 ms – showing the ad & the bidder's 1x1 pixel (impression)
• … 1 sec
The first second…
• ~70% of users have this cookie aboard
• Many (>>1) independent companies take part in this auction
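Each bidder's step in this timeline can be sketched as a tiny decision function. Everything below (type names, the segment lookup, the flat price) is illustrative, not the deck's real bidder:

```scala
// Hypothetical shape of a DSP bid decision. The whole lookup-and-price
// step must fit into the ~80 ms window the SSP gives the bidders.
case class BidRequest(sspCookieId: String, geo: String, siteUrl: String)
case class BidResponse(participate: Boolean, priceCpm: BigDecimal)

def decide(req: BidRequest, profileLookup: String => Set[String]): BidResponse = {
  val segments = profileLookup(req.sspCookieId) // user's known interests
  if (segments.nonEmpty) BidResponse(participate = true, priceCpm = BigDecimal("1.20"))
  else BidResponse(participate = false, priceCpm = BigDecimal(0))
}
```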
Under the hood
The pixel returns info about the new user's interests with special markers (segments) that indicate new facts about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>
Architecture (reconstructed from the slide diagram):
• Real time: a Pixel Tracking Farm and a Bidder Farm; the Bidder Farm handles auction requests from the SSP Ad Exchange, and the pixels collect impressions, clicks, and post-click activities
• Offline: Hadoop's HDFS serves as the warehouse, fed by hourly logs, 3rd-party data, household data, …
• Updating user profiles: Hive, Oozie, MapReduce, and Scalding; HBase keeps the user profiles, which are updated with new segments
• Data Scientists produce a brand-new feed about user interests
• Data export to Partners
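The <cookie_id – segment_id> feed can be folded into the user profiles with a short Scalding job. A hedged sketch, not the production pipeline: source paths and field names are illustrative, and a real job would join against the HBase-backed profiles rather than just rolling the feed up.

```scala
import com.twitter.scalding._

class UpdateProfilesJob(args: Args) extends Job(args) {
  // New facts from the Data Scientists: <cookie_id - segment_id> pairs.
  val newSegments = Tsv(args("segments"), ('cookie_id, 'segment_id)).read

  // Roll the feed up to one comma-separated segment list per cookie;
  // the real job would merge these into the existing HBase profiles.
  newSegments
    .groupBy('cookie_id) { _.mkString('segment_id -> 'segments, ",") }
    .write(Tsv(args("updated_profiles")))
}
```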
Why do we need all this
science?
• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog
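In segment terms, that targeting case reduces to a set check over the user's profile. A minimal sketch with made-up segment ids:

```scala
// Hypothetical segment ids for "man", "lives in NYC", "has iPhone", "has dog".
val required = Set("seg_male", "seg_nyc", "seg_iphone", "seg_dog")

// The campaign matches a user when every required segment is in the profile.
def matchesCampaign(userSegments: Set[String]): Boolean =
  required.subsetOf(userSegments)
```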
Facts about Data Scientists
• Data Scientists do:
– Audience Modeling: identifying new user interests [segments] and finding ways to track them
– Audience Bridging
– Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language
Facts about Realtime team
• Scala, Java
• RESTful Services
• Akka
• In-Memory Cache: Aerospike, Redis
Facts about Offline team
• The tasks we solve over Hadoop:
– As a Storage to keep all the logs we need
– As a Profile DB to keep all users and their interests [segments]
– As a MapReduce Engine to run data-transformation jobs
– As a Warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped in Oozie workflow(s)
• Developing a next-gen platform version based on Spark Streaming/Kafka
Scalding in a nutshell
• Concise DSL
• Configurable Source(s) and
sink(s)
• Data transform operations:
– map/flatMap
– pivot/unpivot
– project
– groupBy/reduce/foldLeft
Just one example (Java way)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Source
Just one example
(Scalding way)
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
// Split a piece of text into individual words.
def tokenize(text : String) : Array[String] = {
// Lowercase each word and remove punctuation.
text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
}
Sink
Transform
operations
Use Case 1
Split
• Motivation: reuse calculated streams
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output1"))
val branch2 = common.groupBy(..).write(Tsv("output2"))
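A fuller sketch of the split pattern; `forceToDisk` checkpoints the shared stream so each branch reuses the computed intermediate instead of recomputing it. Field and path names here are illustrative:

```scala
import com.twitter.scalding._

class SplitJob(args: Args) extends Job(args) {
  // Shared upstream work, materialized once with forceToDisk.
  val common = Tsv(args("input"), ('id, 'value)).read
    .map('value -> 'normalized) { v: String => v.trim.toLowerCase }
    .forceToDisk // checkpoint: both branches reuse this intermediate result

  // Branch 1: per-record transformation.
  common
    .project('id, 'normalized)
    .write(Tsv(args("output1")))

  // Branch 2: aggregation by key.
  common
    .groupBy('normalized) { _.size('count) }
    .write(Tsv(args("output2")))
}
```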
Use Case 2 Exotic Sources
JDBC (out of the box)
case object YourTableSource extends JDBCSource {
override val tableName = "tableName"
override val columns = List(
varchar("col1", 64),
date("col2"),
tinyint("col3"),
double("col4")
)
override def currentConfig = ConnectionSpec("www.gt.com", "username", "password",
"mysql")
}
YourTableSource.read.map(...) ...
Use Case 2 Exotic Sources
HBASE
HBaseSource
(https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL,
• GET_LIST,
• SCAN_RANGE
HBaseRawSource
(https://github.com/andry1/SpyGlass)
• Advanced filtering via base64Scan
val hbs3 = new HBaseSource(
tableName,
quorum,
'key,
List("data"),
List('data),
sourceMode = SourceMode.SCAN_ALL)
.read
val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
scvf.setFilterIfMissing(true)
scvf.setLatestVersionOnly(true)
val scvf2 = ...
List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
base64Scan = convertScanToBase64(scan)).read. ...
Use Case 3
Join
• Motivation: joining two streams by key
• Different join strategies:
– joinWithLarger
– joinWithSmaller
– joinWithTiny
• Inner, Left, and Right join modes
val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
Use Case 4
Distributed Caching and Counters
// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the path below can be passed into any Scalding job, e.g. via the Args object
val fileName = fl.path
...
class MyJob(val args: Args) extends Job(args) {
// once we receive fl.path we can read it like an ordinary file
val fileName = args("fileName")
lazy val data = readJSONFromFile(fileName)
...
Tsv(args("input")).read.map('line -> 'word) {
line: String => ... /* using the data json object */ ... }
}
// counter example
Stat("jdbc.call.counter","myapp").incBy(1)
Use Case 5
Bridging Profiles
Motivation: bridge information from different sources and build a
complete person profile
• imp – the company's own private cookie, obtained thanks to the 1x1-pixel impression
• Bridging two ssp_cookies (ssp_cookie_Id1, ssp_cookie_Id2) via the private cookie
• Bridging via IP address
Bridging Profiles
General task definition:
• Build a graph
• Vertices – users' interests
• Edges – bridging rules [cookies, IP, …]
• Task – identify connected components
Connected components
Let’s scalding it
/**
* The class represents just one iteration of the connected-components search algorithm.
* Somewhere outside the Job code we have to run this job iteratively (up to N [~20] times) and check the number inside the "count" file.
* If it is zero then we can stop running further iterations.
*/
class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
val vertexes = Tsv( args("vertexes"),('id,'gid)).read // by default gid is equal to id
val edges = Tsv( args("edges"), ('id_a,'id_b) ).read
val groups = vertexes.joinWithSmaller('id -> 'id_b, vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id ).rename('gid ->'gid_a))
.discard('id )
.rename('gid ->'gid_b)
.filter('gid_a, 'gid_b) {gid : (String, String) => gid._1 != gid._2 }
.project ('gid_a, 'gid_b)
.mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) {gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }
// if count=0 then we can stop running next iterations
groups.groupAll { _.size }.write(Tsv("count"))
val new_groups = groups.groupBy('gid_a) {_.min('gid_b)}.rename(('gid_a,'gid_b)->('source, 'target))
val new_vertexes = vertexes.joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin )
.mapTo( ('id,'gid,'source,'target)->('id, 'gid)) { param:(String, String, String, String) =>
val (id, gid, source,target) = param
if (target != null) ( id , min( gid, target ) ) else ( id, gid )
}
new_vertexes.write( Tsv( args("new_vertexes") ) )
}
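The header comment says this job must be driven iteratively from outside. A minimal driver sketch, assuming two hypothetical helpers: `runJob` (runs the Scalding job with the given args) and `readCount` (reads the single number out of the "count" file):

```scala
// Hypothetical driver: reruns the one-iteration job until convergence.
def runUntilConverged(maxIterations: Int = 20): Unit = {
  var iteration = 0
  var changed = true
  while (changed && iteration < maxIterations) {
    val in  = if (iteration == 0) "vertexes" else s"vertexes_$iteration"
    val out = s"vertexes_${iteration + 1}"
    runJob(classOf[ConnectedComponentsOneIterationJob],
           Map("vertexes" -> in, "edges" -> "edges", "new_vertexes" -> out))
    // count == 0 means no vertex changed its group id in this pass
    changed = readCount("count") > 0
    iteration += 1
  }
}
```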
Other nice things
• Typed pipes
• Elegant and fast Matrix operations
• Simple migration to Spark/Kafka
• More sources: e.g. retrieving data from Hive's HCatalog
Useful Resources
• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-
work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-
dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive
Thank you!