Lookup Stage
Lookup Stage
Introduction
DataStage sparse lookup is considered an expensive operation because of a round-trip database query for each
incoming row. It is appropriate if the following 2 conditions are met.
1. The size of reference table is huge, i.e., more than millions of rows. If the reference table is small enough to fit
into memory entirely, normal lookup is a better choice.
2. The number of input rows is less than 1% of the reference table. Otherwise, use a Join stage.
Is it possible to speedup sparse lookup by sending queries to database in batches of 10, 20 or 50 rows? In other words,
instead of sending the following SQL to database for each incoming row,
Solution
In order to make the 2nd query work, we need 2 tricks. First, we need to concatenate values from multiple rows into a
single string, separated by a delimiter (e.g. comma). I am not sure how to do this in DataStage v8.1 or earlier. Not
impossible, but rather complicated. Since v8.5, the Transformer stage has loop capability, which makes the task of
concatenating multiple rows much easier.
The 2nd trick is to, at database side, split the value list from a comma-delimited string into an array. If we simply plug the
original string into the 2nd query, the database will interpret it literally as a single value. If only there is a standard SQL
function equivalent to string.split() in C# or Java. Different databases use their own tricks to achieve
string.split(). I will use Oracle in my example implementation.
1 of 6 12/23/2015 7:15 PM
DataStage Batch Sparse Lookup - CodeProject http://www.codeproject.com/Articles/717353/DataStage-Batch-Sparse-L...
Implementation
Job Overview
The example implementation generates 50k rows using a Row Generator stage. Each row has a key column and a value
column. The rows are duplicated in Transformer_25. One copy is branched to Transformer_7, where multiple
rows of the keys are concatenated. The number of rows in each concatenation is set by a job parameter,
#BATCH_SIZE#. The concatenated keys are then sent to Lookup_0 for sparse lookup against an Oracle table with
5m rows. The lookup results are merged back to the original stream in Lookup_16.
2 of 6 12/23/2015 7:15 PM
DataStage Batch Sparse Lookup - CodeProject http://www.codeproject.com/Articles/717353/DataStage-Batch-Sparse-L...
3 of 6 12/23/2015 7:15 PM
DataStage Batch Sparse Lookup - CodeProject http://www.codeproject.com/Articles/717353/DataStage-Batch-Sparse-L...
Test run
A test run with BATCH_SIZE of 50 is shown below. DSLink4 indicated that 1,000 queries, instead of 50,000, were
sent to the database.
Performance Evaluation
Another job (shown below) using regular sparse lookup compares the performance of batch sparse lookup.
4 of 6 12/23/2015 7:15 PM
DataStage Batch Sparse Lookup - CodeProject http://www.codeproject.com/Articles/717353/DataStage-Batch-Sparse-L...
The result of the comparison is summarized in the chart below. Batch sparse lookup can cut down job running time by
~75%. The most effective batch size is between 20 to 50.
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
Share
5 of 6 12/23/2015 7:15 PM
DataStage Batch Sparse Lookup - CodeProject http://www.codeproject.com/Articles/717353/DataStage-Batch-Sparse-L...
No Biography provided
Permalink | Advertise | Privacy | Terms of Use | Mobile Article Copyright 2014 by dingjing
Select Language ▼
Web02 | 2.8.151126.1 | Last Updated 28 Jan 2014 Everything else Copyright © CodeProject, 1999-2015
6 of 6 12/23/2015 7:15 PM