Is Spark the right choice for Data Analysis?
Ahmed Kamal, Big Data Engineer
http://ahmedkamal.me
Resources?
●“Advanced Analytics with Spark”, a practical book!
●“The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets.” - Matei Zaharia, CTO at Databricks.
●It is all about developing data applications using Spark.
Data Applications, like what?
●Build a model to detect credit card fraud using thousands of
features and billions of transactions.
●Intelligently recommend millions of products to millions of
users.
●Estimate financial risk through simulations of portfolios
including millions of instruments.
●Easily manipulate data from thousands of human genomes
to detect genetic associations with disease.
Doing something useful with data
●Often, “doing something useful” means placing a schema over the data and using SQL to answer questions like:
●“Of the gazillion users who made it to the third page in our registration process, how many are over 25?” (a small SQL sketch follows)
●The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one.
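A minimal sketch of that idea with Spark SQL in Scala; the registrations dataset, its path, and the furthest_page and age column names are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a hypothetical "registrations" dataset with the
// user's age and the furthest registration page they reached.
val spark = SparkSession.builder().appName("registration-stats").getOrCreate()

// Place a schema over the raw data and query it with SQL.
val registrations = spark.read.json("hdfs:///data/registrations.json")
registrations.createOrReplaceTempView("registrations")

spark.sql(
  """SELECT COUNT(*)
    |FROM registrations
    |WHERE furthest_page >= 3 AND age > 25""".stripMargin
).show()
```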
A new superpower!
●When people say that we live in an age of “big data,” they
mean that we have tools for collecting, storing, and
processing information at a scale previously unheard of.
●There is a gap between having access to these tools and all
this data, and doing something useful with it.
Doing extra useful things
●Requirements:
a- A flexible programming model
b- Rich functionality in machine learning and statistics
●Existing tools: R, Python (the PyData stack), and Octave
Pros: little effort, easy to use
Cons: viable only for small data sets; too complex to redesign for working over clusters of computers
Why is it difficult?
●Some algorithms (like machine learning algorithms) have wide data dependencies.
• Data are partitioned across nodes.
• Network transfer is far slower than memory access.
●What about the probability of failures?
●Summary: We need a programming paradigm that is sensitive to the characteristics of the underlying system, encourages good choices, and makes it easy to write parallel code.
High Performance Computing
●Use case: processing a large file full of DNA sequencing reads in parallel
●1- Manually split the file into smaller files
●2- Submit a job for each file split to the scheduler
●3- Continuously monitor the jobs and resubmit any that fail
●All-to-all operations, like sorting the full data set, require either streaming the data through a single node or falling back to MPI.
●A relatively low level of abstraction and difficulty of use, in addition to the high cost.
The 3 truths about data science
●Successful data preprocessing is a must for successful analysis.
–Large data sets require special treatment.
–Feature engineering deserves more of your time than the algorithms themselves. (A model for fraud detection can use IP location info, login times, and click logs.)
–How would you convert those features into vectors suitable for ML algorithms? (See the sketch below.)
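As a rough illustration (not from the book), here is how a handful of hypothetical fraud-detection features might be packed into an MLlib vector in Scala; the feature names and values are made up:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical engineered features for one login event: the hour of the login,
// the distance (km) between the IP geolocation and the billing address,
// and the number of clicks in the last hour.
val loginHour      = 23.0
val ipDistanceKm   = 1250.0
val clicksLastHour = 42.0

// Pack the numeric features into a vector and label it (1.0 = fraud, 0.0 = legitimate).
val features = Vectors.dense(loginHour, ipDistanceKm, clicksLastHour)
val example  = LabeledPoint(1.0, features)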
The 3 truths about data science
●Iteration is the key.
–Popular optimization techniques like gradient descent require repeated scans over the input until convergence. (A sketch follows this list.)
–You won't get it right the first time. (Features/Algorithm/Test)
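A rough Scala sketch of that iteration pattern, assuming an RDD of MLlib LabeledPoints named data; the squared-error objective and the fixed step size are illustrative choices, not the only ones possible:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Every gradient-descent step is a full scan over the input, which is why
// keeping the data in memory across iterations matters so much.
def train(data: RDD[LabeledPoint], iterations: Int, stepSize: Double): Array[Double] = {
  val numFeatures = data.first().features.size
  var weights = Array.fill(numFeatures)(0.0)

  data.cache() // keep the input in memory so repeated scans avoid re-reading disk

  for (_ <- 1 to iterations) {
    // One distributed pass per iteration: gradient of the squared error.
    val gradient = data.map { p =>
      val prediction = p.features.toArray.zip(weights).map { case (x, w) => x * w }.sum
      p.features.toArray.map(_ * (prediction - p.label))
    }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

    weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g }
  }
  weights
}
```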
Analytics between lab and factory
A framework that makes modeling easy
but is also a good fit for production systems is a huge
win.
Apache Spark In Points
●Spark builds on what Hadoop shines at (linear scalability, fault tolerance).
●Spark supports a DAG (directed acyclic graph) of operators.
●It complements these capabilities with a rich set of transformations.
●In-memory processing, well suited to iteration. (See the sketch below.)
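A small sketch of the last two points in Scala, assuming an existing SparkContext named sc and a hypothetical log file; transformations only build the DAG, and caching lets a second action reuse the in-memory result:

```scala
// The input path and the "ERROR" filter are illustrative.
val lines = sc.textFile("hdfs:///logs/app.log")

// Transformations only build up the DAG of operators; nothing executes yet.
val errors = lines.filter(_.contains("ERROR")).map(_.toLowerCase)
errors.cache() // mark the result for in-memory reuse

val total = errors.count() // first action materializes the DAG
val byWord = errors
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect() // second action reuses the cached data instead of recomputing it
```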
Apache Spark In Points
●The most important bottleneck that Spark addresses is analyst productivity (versus juggling R, HDFS, MapReduce, etc.).
●Spark is better at being an operational system than most exploratory systems, and better for data exploration than the technologies commonly used in operational systems.
●It stands on top of the JVM, with good integration with the Hadoop ecosystem.
Spark From the Other Side!
●Still young compared to MapReduce.
●Its main components need a lot of work to mature (stream processing, SQL, machine learning, and graph processing):
–MLlib's pipeline and transformer API model is still in progress.
–Its statistics and modeling functionality comes nowhere near that of single-machine languages like R.
–Its SQL functionality is rich, but still lags far behind that of Hive.
Spark Programming Model
●A Spark program starts with a dataset, or a few, residing in distributed persistent storage (like HDFS).
●Writing a Spark program typically consists of a few related steps (sketched below):
–Defining a set of transformations on input data sets.
–Invoking actions that output the transformed data sets to persistent storage or return results to the driver's local memory.
–Running local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
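A minimal Scala sketch of those three steps; the input path and the record layout (comma-separated, with a numeric score in the second field) are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("programming-model-sketch"))

// 1. Define transformations on a data set in distributed storage.
val rawData = sc.textFile("hdfs:///data/records.csv")
val scores  = rawData.map(line => line.split(',')(1).toDouble)

// 2. Invoke an action that returns results to the driver's local memory.
val sample = scores.take(10)

// 3. Run a local computation on the driver to decide what to do next.
val sampleMean = sample.sum / sample.length
println(s"mean of the first ten scores: $sampleMean")
```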
Why should you consider Scala?
●Spark already has wrappers for other languages (Java, Python).
●Scala reduces performance overhead (no bridging a different language onto the JVM).
●It gives you access to the latest and greatest features first.
●It will help you understand the Spark philosophy.
–If you know how to use Spark in Scala, even if you primarily use it from other languages, you'll have a better understanding of the system and will be in a better position to “think in Spark.”
If you are immune to boredom,
there is literally nothing you cannot
accomplish.
—David Foster Wallace
Data Science's First Step
●Data cleansing is the first step in any data science project.
●Many clever analyses have been undone because the data analyzed had fundamental quality or bias problems.
●It is dull work that you have to do before you can get to the really cool machine learning algorithm you've been dying to apply to a new problem.
Our First Real Problem!
●Name: Record Linkage
●Description:
–We have a large collection of records from one or more source systems.
–It is likely that some of the records refer to the same underlying entity, such as a customer or a patient.
–Each of these entities has a number of attributes, such as a name or an address.
The Challenge
●Challenge:
–The values of these attributes aren't perfect.
–Values might have different formatting, typos, or missing information.
–A human can recognize a match at a glance, but it is difficult for a computer to learn. (A small sketch follows.)
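A tiny Scala illustration of the point, with made-up field names and values; exact comparison misses an obvious match, and even simple normalization only gets partway there:

```scala
// The two records below clearly describe the same person to a human,
// but an exact comparison on the raw values says otherwise.
case class PatientRecord(firstName: String, lastName: String, zip: String)

val a = PatientRecord("Jonathan", "Smith",  "94107")
val b = PatientRecord("Jon",      "smith ", "94107")

// Naive equality fails because of the nickname, the casing, and the trailing space.
val exactMatch = a == b // false

// Normalizing formatting before comparing fields recovers some of the signal,
// but fuzzy cases (nicknames, typos) still need a scored or learned approach.
def norm(s: String): String = s.trim.toLowerCase
val fieldsAgreeing = Seq(
  (a.firstName, b.firstName),
  (a.lastName,  b.lastName),
  (a.zip,       b.zip)
).count { case (l, r) => norm(l) == norm(r) } // 2 of 3 fields agree
```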
Steps we are going to take
●Bringing Data from the Cluster to the Client
●Shipping Code from the Client to the Cluster
●Structuring Data with Tuples and Case Classes
●Getting some numbers about our data. (See the sketch below.)
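A compact Scala sketch of those four steps, assuming hypothetical comma-separated linkage records of the form "id1,id2,score,is_match"; the path and field layout are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LinkageSketch {
  // Structuring data with a case class instead of raw strings or tuples.
  case class MatchData(id1: Int, id2: Int, score: Double, matched: Boolean)

  def parse(line: String): MatchData = {
    val f = line.split(',')
    MatchData(f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toBoolean)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("linkage-sketch"))
    val rawBlocks = sc.textFile("hdfs:///linkage/blocks.csv")

    // Bringing data from the cluster to the client: peek at a few lines locally.
    rawBlocks.take(5).foreach(println)

    // Shipping code from the client to the cluster: parse runs on the executors.
    val parsed = rawBlocks.map(parse)
    parsed.cache()

    // Getting some numbers about our data.
    println("records: " + parsed.count())
    println("matches: " + parsed.filter(_.matched).count())
    println(parsed.map(_.score).stats()) // count, mean, stdev, min, max

    sc.stop()
  }
}
```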
The End
Thanks a lot :)
