Is Spark the right choice for Data Analysis?
Ahmed Kamal, Big Data Engineer
http://ahmedkamal.me
Resources?
●“Advanced Analytics with Spark”, a practical book!
●“The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets.” - Matei Zaharia, CTO at Databricks.
●It is all about developing data applications using Spark.
Data Applications, like what?
●Build a model to detect credit card fraud using thousands of
features and billions of transactions.
●Intelligently recommend millions of products to millions of
users.
●Estimate financial risk through simulations of portfolios
including millions of instruments.
●Easily manipulate data from thousands of human genomes
to detect genetic associations with disease.
Doing something useful with data
●Often, “doing something useful” means placing a schema over the data and using SQL to answer questions like:
●“Of the gazillion users who made it to the third page in our registration process, how many are over 25?” (a small SQL sketch follows)
●The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one.
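A minimal sketch of that idea with Spark SQL in Scala; the registrations dataset, its path, and the furthest_page and age column names are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a hypothetical "registrations" dataset with the
// user's age and the furthest registration page they reached.
val spark = SparkSession.builder().appName("registration-stats").getOrCreate()

// Place a schema over the raw data and query it with SQL.
val registrations = spark.read.json("hdfs:///data/registrations.json")
registrations.createOrReplaceTempView("registrations")

spark.sql(
  """SELECT COUNT(*)
    |FROM registrations
    |WHERE furthest_page >= 3 AND age > 25""".stripMargin
).show()
```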
A new superpower!
●When people say that we live in an age of “big data,” they
mean that we have tools for collecting, storing, and
processing information at a scale previously unheard of.
●There is a gap between having access to these tools and all
this data, and doing something useful with it.
Doing extra useful things
●Requirements:
a- A flexible programming model
b- Rich functionality in machine learning and statistics
●Existing tools: R, Python (the PyData stack), and Octave
Pros: little effort, easy to use
Cons: viable only for small data sets; too complex to redesign for working over clusters of computers
Why is it difficult?
●Some algorithms (like machine learning algorithms) have wide data dependencies.
• Data are partitioned across nodes.
• Network transfer is far slower than memory access.
●What about the probability of failures?
●Summary: We need a programming paradigm that is sensitive to the characteristics of the underlying system, encourages good choices, and makes it easy to write parallel code.
High Performance Computing
●Use case: processing a large file full of DNA sequencing reads in parallel
●1- Manually split the file into smaller files
●2- Submit a job for each file split to the scheduler
●3- Continuously monitor the jobs and resubmit any that fail
●All-to-all operations, like sorting the full data set, require either streaming the data through a single node or falling back to MPI.
●A relatively low level of abstraction and difficulty of use, in addition to the high cost.
The 3 truths about data science
●Successful data preprocessing is a must for successful analysis.
–Large data sets require special treatment.
–Feature engineering deserves more of your time than the algorithms themselves. (A model for fraud detection can use IP location info, login times, and click logs.)
–How would you convert those features into vectors suitable for ML algorithms? (See the sketch below.)
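As a rough illustration (not from the book), here is how a handful of hypothetical fraud-detection features might be packed into an MLlib vector in Scala; the feature names and values are made up:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical engineered features for one login event: the hour of the login,
// the distance (km) between the IP geolocation and the billing address,
// and the number of clicks in the last hour.
val loginHour      = 23.0
val ipDistanceKm   = 1250.0
val clicksLastHour = 42.0

// Pack the numeric features into a vector and label it (1.0 = fraud, 0.0 = legitimate).
val features = Vectors.dense(loginHour, ipDistanceKm, clicksLastHour)
val example  = LabeledPoint(1.0, features)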
The 3 truths about data science
●Iteration is the key.
–Popular optimization techniques like gradient descent require repeated scans over the input until convergence. (A sketch follows this list.)
–You won't get it right the first time. (Features/Algorithm/Test)
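A rough Scala sketch of that iteration pattern, assuming an RDD of MLlib LabeledPoints named data; the squared-error objective and the fixed step size are illustrative choices, not the only ones possible:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Every gradient-descent step is a full scan over the input, which is why
// keeping the data in memory across iterations matters so much.
def train(data: RDD[LabeledPoint], iterations: Int, stepSize: Double): Array[Double] = {
  val numFeatures = data.first().features.size
  var weights = Array.fill(numFeatures)(0.0)

  data.cache() // keep the input in memory so repeated scans avoid re-reading disk

  for (_ <- 1 to iterations) {
    // One distributed pass per iteration: gradient of the squared error.
    val gradient = data.map { p =>
      val prediction = p.features.toArray.zip(weights).map { case (x, w) => x * w }.sum
      p.features.toArray.map(_ * (prediction - p.label))
    }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

    weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g }
  }
  weights
}
```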
Analytics between lab and factory
A framework that makes modeling easy
but is also a good fit for production systems is a huge
win.
Apache Spark In Points
●Spark builds on what Hadoop shines at (linear scalability, fault tolerance).
●Spark supports a DAG (directed acyclic graph) of operators.
●It complements these capabilities with a rich set of transformations.
●In-memory processing, well suited to iteration. (See the sketch below.)
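A small sketch of the last two points in Scala, assuming an existing SparkContext named sc and a hypothetical log file; transformations only build the DAG, and caching lets a second action reuse the in-memory result:

```scala
// The input path and the "ERROR" filter are illustrative.
val lines = sc.textFile("hdfs:///logs/app.log")

// Transformations only build up the DAG of operators; nothing executes yet.
val errors = lines.filter(_.contains("ERROR")).map(_.toLowerCase)
errors.cache() // mark the result for in-memory reuse

val total = errors.count() // first action materializes the DAG
val byWord = errors
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect() // second action reuses the cached data instead of recomputing it
```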
Apache Spark In Points
●The most important bottleneck that Spark addresses is analyst productivity (versus juggling R, HDFS, MapReduce, etc.).
●Spark is better at being an operational system than most exploratory systems, and better for data exploration than the technologies commonly used in operational systems.
●It stands on top of the JVM, with good integration with the Hadoop ecosystem.
Spark From the Other Side!
●Still young compared to MapReduce.
●Its main components need a lot of work to mature (stream processing, SQL, machine learning, and graph processing):
–MLlib's pipeline and transformer API model is still in progress.
–Its statistics and modeling functionality comes nowhere near that of single-machine languages like R.
–Its SQL functionality is rich, but still lags far behind that of Hive.
Spark Programming Model
●A Spark program starts with a dataset, or a few, residing in distributed persistent storage (like HDFS).
●Writing a Spark program typically consists of a few related steps (sketched below):
–Defining a set of transformations on input data sets.
–Invoking actions that output the transformed data sets to persistent storage or return results to the driver's local memory.
–Running local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
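A minimal Scala sketch of those three steps; the input path and the record layout (comma-separated, with a numeric score in the second field) are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("programming-model-sketch"))

// 1. Define transformations on a data set in distributed storage.
val rawData = sc.textFile("hdfs:///data/records.csv")
val scores  = rawData.map(line => line.split(',')(1).toDouble)

// 2. Invoke an action that returns results to the driver's local memory.
val sample = scores.take(10)

// 3. Run a local computation on the driver to decide what to do next.
val sampleMean = sample.sum / sample.length
println(s"mean of the first ten scores: $sampleMean")
```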
Why should you consider Scala?
●Spark already has wrappers for other languages (Java, Python).
●Scala reduces performance overhead (no bridging a different language onto the JVM).
●It gives you access to the latest and greatest features first.
●It will help you understand the Spark philosophy.
–If you know how to use Spark in Scala, even if you primarily use it from other languages, you'll have a better understanding of the system and will be in a better position to “think in Spark.”
If you are immune to boredom,
there is literally nothing you cannot
accomplish.
—David Foster Wallace
Data Science's First Step
●Data cleansing is the first step in any data science project.
●Many clever analyses have been undone because the data analyzed had fundamental quality or bias problems.
●It is dull work that you have to do before you can get to the really cool machine learning algorithm you've been dying to apply to a new problem.
Our First Real Problem!
●Name: Record Linkage
●Description:
–We have a large collection of records from one or more source systems.
–It is likely that some of the records refer to the same underlying entity, such as a customer or a patient.
–Each of these entities has a number of attributes, such as a name or an address.
The Challenge
●Challenge:
–The values of these attributes aren't perfect.
–Values might have different formatting, typos, or missing information.
–A human can recognize a match at a glance, but it is difficult for a computer to learn. (A small sketch follows.)
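A tiny Scala illustration of the point, with made-up field names and values; exact comparison misses an obvious match, and even simple normalization only gets partway there:

```scala
// The two records below clearly describe the same person to a human,
// but an exact comparison on the raw values says otherwise.
case class PatientRecord(firstName: String, lastName: String, zip: String)

val a = PatientRecord("Jonathan", "Smith",  "94107")
val b = PatientRecord("Jon",      "smith ", "94107")

// Naive equality fails because of the nickname, the casing, and the trailing space.
val exactMatch = a == b // false

// Normalizing formatting before comparing fields recovers some of the signal,
// but fuzzy cases (nicknames, typos) still need a scored or learned approach.
def norm(s: String): String = s.trim.toLowerCase
val fieldsAgreeing = Seq(
  (a.firstName, b.firstName),
  (a.lastName,  b.lastName),
  (a.zip,       b.zip)
).count { case (l, r) => norm(l) == norm(r) } // 2 of 3 fields agree
```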
Steps we are going to take
●Bringing Data from the Cluster to the Client
●Shipping Code from the Client to the Cluster
●Structuring Data with Tuples and Case Classes
●Getting some numbers about our data. (See the sketch below.)
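A compact Scala sketch of those four steps, assuming hypothetical comma-separated linkage records of the form "id1,id2,score,is_match"; the path and field layout are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LinkageSketch {
  // Structuring data with a case class instead of raw strings or tuples.
  case class MatchData(id1: Int, id2: Int, score: Double, matched: Boolean)

  def parse(line: String): MatchData = {
    val f = line.split(',')
    MatchData(f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toBoolean)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("linkage-sketch"))
    val rawBlocks = sc.textFile("hdfs:///linkage/blocks.csv")

    // Bringing data from the cluster to the client: peek at a few lines locally.
    rawBlocks.take(5).foreach(println)

    // Shipping code from the client to the cluster: parse runs on the executors.
    val parsed = rawBlocks.map(parse)
    parsed.cache()

    // Getting some numbers about our data.
    println("records: " + parsed.count())
    println("matches: " + parsed.filter(_.matched).count())
    println(parsed.map(_.score).stats()) // count, mean, stdev, min, max

    sc.stop()
  }
}
```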
The End
Thanks a lot :)
