Scalding Big ADta
or: firing pots full of ads
Boris Trofimov
@b0ris_1
Agenda
• Two stories on how AD is served inside an AD company
• Awesome Scalding
The story about shoes
or
Big Brother is watching you
We will answer this question in a few slides
or be careful while buying shoes
How many websites with ads aboard do you open during the day?
Open any site with an ad
What can be simpler than loading a website in a web browser, huh?
However, that is a deceptive judgment
The first second…
Story Actors
• User
• Publisher (foxnews.com)
• Ad Server (Google's DoubleClick)
• SSP (Ad Exchange)
• DSP (decides which ad to show) ← we are here
• Advertiser (Nike)
Timeline of the first second (reconstructed from the slide diagram):
• 20 ms – publisher receives the request
• 100 ms – publisher sends the response
• 150 ms – content delivered to the user
• 170 ms – site sends a request to the Ad Server
• 200 ms – Ad Server receives the ad request and redirects to the Ad Exchange
• 210 ms – SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• all bidders must send their decision (participate? & price) back within the ~80 ms auction window
• 280 ms – SSP picks the winning bid and sends the redirect url back to the Ad Server
• 300 ms – Ad Server shows the user a page which redirects to the bidder's server
• 350 ms – the user's web page asks for the ad banner from a CDN
• 400 ms – showing the ad & the bidder's 1x1 pixel (impression)
• … 1 sec
The first second…
• ~70% of users have this cookie aboard
• Many (>>1) independent companies take part in this auction
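Each bidder's step in this timeline can be sketched as a tiny decision function. Everything below (type names, the segment lookup, the flat price) is illustrative, not the deck's real bidder:

```scala
// Hypothetical shape of a DSP bid decision. The whole lookup-and-price
// step must fit into the ~80 ms window the SSP gives the bidders.
case class BidRequest(sspCookieId: String, geo: String, siteUrl: String)
case class BidResponse(participate: Boolean, priceCpm: BigDecimal)

def decide(req: BidRequest, profileLookup: String => Set[String]): BidResponse = {
  val segments = profileLookup(req.sspCookieId) // user's known interests
  if (segments.nonEmpty) BidResponse(participate = true, priceCpm = BigDecimal("1.20"))
  else BidResponse(participate = false, priceCpm = BigDecimal(0))
}
```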
Under the hood
The pixel returns info about the new user's interests with special markers (segments) that indicate new facts about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>
Architecture (reconstructed from the slide diagram):
• Real time: a Pixel Tracking Farm and a Bidder Farm; the Bidder Farm handles auction requests from the SSP Ad Exchange, and the pixels collect impressions, clicks, and post-click activities
• Offline: Hadoop's HDFS serves as the warehouse, fed by hourly logs, 3rd-party data, household data, …
• Updating user profiles: Hive, Oozie, MapReduce, and Scalding; HBase keeps the user profiles, which are updated with new segments
• Data Scientists produce a brand-new feed about user interests
• Data export to Partners
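The <cookie_id – segment_id> feed can be folded into the user profiles with a short Scalding job. A hedged sketch, not the production pipeline: source paths and field names are illustrative, and a real job would join against the HBase-backed profiles rather than just rolling the feed up.

```scala
import com.twitter.scalding._

class UpdateProfilesJob(args: Args) extends Job(args) {
  // New facts from the Data Scientists: <cookie_id - segment_id> pairs.
  val newSegments = Tsv(args("segments"), ('cookie_id, 'segment_id)).read

  // Roll the feed up to one comma-separated segment list per cookie;
  // the real job would merge these into the existing HBase profiles.
  newSegments
    .groupBy('cookie_id) { _.mkString('segment_id -> 'segments, ",") }
    .write(Tsv(args("updated_profiles")))
}
```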
Why do we need all this
science?
• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog
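In segment terms, that targeting case reduces to a set check over the user's profile. A minimal sketch with made-up segment ids:

```scala
// Hypothetical segment ids for "man", "lives in NYC", "has iPhone", "has dog".
val required = Set("seg_male", "seg_nyc", "seg_iphone", "seg_dog")

// The campaign matches a user when every required segment is in the profile.
def matchesCampaign(userSegments: Set[String]): Boolean =
  required.subsetOf(userSegments)
```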
Facts about Data Scientists
• Data Scientists do:
– Audience Modeling: identifying new user interests [segments] and finding ways to track them
– Audience Bridging
– Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language
Facts about Realtime team
• Scala, Java
• RESTful Services
• Akka
• In-Memory Cache: Aerospike, Redis
Facts about Offline team
• The tasks we solve over Hadoop:
– As a Storage to keep all the logs we need
– As a Profile DB to keep all users and their interests [segments]
– As a MapReduce Engine to run data-transformation jobs
– As a Warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped in Oozie workflow(s)
• Developing a next-gen platform version based on Spark Streaming/Kafka
Scalding in a nutshell
• Concise DSL
• Configurable Source(s) and
sink(s)
• Data transform operations:
– map/flatMap
– pivot/unpivot
– project
– groupBy/reduce/foldLeft
Just one example (Java way)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Source
Just one example
(Scalding way)
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
// Split a piece of text into individual words.
def tokenize(text : String) : Array[String] = {
// Lowercase each word and remove punctuation.
text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
}
Sink
Transform
operations
Use Case 1
Split
• Motivation: reuse calculated streams
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output1"))
val branch2 = common.groupBy(..).write(Tsv("output2"))
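A fuller sketch of the split pattern; `forceToDisk` checkpoints the shared stream so each branch reuses the computed intermediate instead of recomputing it. Field and path names here are illustrative:

```scala
import com.twitter.scalding._

class SplitJob(args: Args) extends Job(args) {
  // Shared upstream work, materialized once with forceToDisk.
  val common = Tsv(args("input"), ('id, 'value)).read
    .map('value -> 'normalized) { v: String => v.trim.toLowerCase }
    .forceToDisk // checkpoint: both branches reuse this intermediate result

  // Branch 1: per-record transformation.
  common
    .project('id, 'normalized)
    .write(Tsv(args("output1")))

  // Branch 2: aggregation by key.
  common
    .groupBy('normalized) { _.size('count) }
    .write(Tsv(args("output2")))
}
```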
Use Case 2 Exotic Sources
JDBC (out of the box)
case object YourTableSource extends JDBCSource {
override val tableName = "tableName"
override val columns = List(
varchar("col1", 64),
date("col2"),
tinyint("col3"),
double("col4")
)
override def currentConfig = ConnectionSpec("www.gt.com", "username", "password",
"mysql")
}
YourTableSource.read.map(...) ...
Use Case 2 Exotic Sources
HBASE
HBaseSource
(https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL,
• GET_LIST,
• SCAN_RANGE
HBaseRawSource
(https://github.com/andry1/SpyGlass)
• Advanced filtering via base64Scan
val hbs3 = new HBaseSource(
tableName,
quorum,
'key,
List("data"),
List('data),
sourceMode = SourceMode.SCAN_ALL)
.read
val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
scvf.setFilterIfMissing(true)
scvf.setLatestVersionOnly(true)
val scvf2 = ...
List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
base64Scan = convertScanToBase64(scan)).read. ...
Use Case 3
Join
• Motivation: joining two streams by key
• Different join strategies:
– joinWithLarger
– joinWithSmaller
– joinWithTiny
• Inner, Left, and Right join modes
val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
Use Case 4
Distributed Caching and Counters
// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the path below can be passed into any Scalding job, e.g. via the Args object
val fileName = fl.path
...
class MyJob(val args: Args) extends Job(args) {
// once we receive fl.path we can read it like an ordinary file
val fileName = args("fileName")
lazy val data = readJSONFromFile(fileName)
...
Tsv(args("input")).read.map('line -> 'word) {
line: String => ... /* using the data json object */ ... }
}
// counter example
Stat("jdbc.call.counter","myapp").incBy(1)
Use Case 5
Bridging Profiles
Motivation: bridge information from different sources and build a
complete person profile
• imp – the company's own private cookie, obtained thanks to the 1x1-pixel impression
• Bridging two ssp_cookies (ssp_cookie_Id1, ssp_cookie_Id2) via the private cookie
• Bridging via IP address
Bridging Profiles
General task definition:
• Build a graph
• Vertices – users' interests
• Edges – bridging rules [cookies, IP, …]
• Task – identify connected components
Connected components
Let’s scalding it
/**
* The class represents just one iteration of the connected-components search algorithm.
* Somewhere outside the Job code we have to run this job iteratively (up to N [~20] times) and check the number inside the "count" file.
* If it is zero then we can stop running further iterations.
*/
class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
val vertexes = Tsv( args("vertexes"),('id,'gid)).read // by default gid is equal to id
val edges = Tsv( args("edges"), ('id_a,'id_b) ).read
val groups = vertexes.joinWithSmaller('id -> 'id_b, vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id ).rename('gid ->'gid_a))
.discard('id )
.rename('gid ->'gid_b)
.filter('gid_a, 'gid_b) {gid : (String, String) => gid._1 != gid._2 }
.project ('gid_a, 'gid_b)
.mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) {gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }
// if count=0 then we can stop running next iterations
groups.groupAll { _.size }.write(Tsv("count"))
val new_groups = groups.groupBy('gid_a) {_.min('gid_b)}.rename(('gid_a,'gid_b)->('source, 'target))
val new_vertexes = vertexes.joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin )
.mapTo( ('id,'gid,'source,'target)->('id, 'gid)) { param:(String, String, String, String) =>
val (id, gid, source,target) = param
if (target != null) ( id , min( gid, target ) ) else ( id, gid )
}
new_vertexes.write( Tsv( args("new_vertexes") ) )
}
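The header comment says this job must be driven iteratively from outside. A minimal driver sketch, assuming two hypothetical helpers: `runJob` (runs the Scalding job with the given args) and `readCount` (reads the single number out of the "count" file):

```scala
// Hypothetical driver: reruns the one-iteration job until convergence.
def runUntilConverged(maxIterations: Int = 20): Unit = {
  var iteration = 0
  var changed = true
  while (changed && iteration < maxIterations) {
    val in  = if (iteration == 0) "vertexes" else s"vertexes_$iteration"
    val out = s"vertexes_${iteration + 1}"
    runJob(classOf[ConnectedComponentsOneIterationJob],
           Map("vertexes" -> in, "edges" -> "edges", "new_vertexes" -> out))
    // count == 0 means no vertex changed its group id in this pass
    changed = readCount("count") > 0
    iteration += 1
  }
}
```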
Other nice things
• Typed pipes
• Elegant and fast Matrix operations
• Simple migration to Spark/Kafka
• More sources: e.g. retrieving data from Hive's HCatalog
Useful Resources
• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-
work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-
dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive
Thank you!