ETL in Clojure
Dmitriy Morozov / JEEConf 2015
Dmitriy Morozov
Software engineer at Zoomdata
Functional programming junky
Occasional cyclist
Zoomdata.com
@argc
Plan of attack
ETL at Zoomdata
Cascalog
Spark
Demo
Conclusion
Zoomdata is a modern BI application focused on letting everyday business users
visually interact with and explore their data and discover insights in it.
What we do at Zoomdata
We did ETL in Hive/Impala
Using SQL for ETL
Hive is slow, and so is Hive on Tez
SQL is horrible for doing anything complicated
Code is hard to maintain, reuse and test
Lessons learned
Why Clojure?
Functional!
Runs on JVM
Interactive development
Zero delta between prototype code and production code
Cascalog
Datalog DSL in Clojure
Built on top of Hadoop and Cascading
Query compiles to Hadoop MapReduce jobs
Supports local execution for prototyping
Great testing story
Datalog
Logic programming language
Syntactically a subset of Prolog
Often used as a query language for deductive databases
Query statements can be stated in any order
Word Count using Hadoop API
Word count in Cascalog
Cascalog Query Structure
Cascalog / Generators
Cascalog / Operations
Cascalog / Joins
Cascalog / Operations
Cascalog / Aggregators
Cascalog / Troubleshooting
Cascalog / Testing
Cascalog / Troubleshooting
Flow Visualisation / DOT
Flow Visualisation / Driven
Demo
Cascalog Downsides
Hadoop < Spark **
No support for streaming data
What are the alternatives?
Java API for Spark
Flambo
Sparkling
Customer X
Customer X wants to do Data Science!
Drug Persistence
Determining whether a patient is persistent or not based on whether she
refilled the prescription in time.
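The definition above can be sketched in a few lines. This is a plain-Python illustration rather than the Cascalog or Spark code from the talk, and it rests on assumptions the slides do not state: each fill is a hypothetical (fill_date, days_supply) pair for one patient and one drug, and "in time" is taken to mean within a hypothetical grace period of 30 days after the previous supply runs out.

```python
from datetime import date, timedelta


def is_persistent(fills, grace_days=30):
    """Return True if every refill arrives before the allowed gap expires.

    fills: list of (fill_date, days_supply) tuples for one patient and drug
           (assumed record shape, not taken from the talk).
    grace_days: assumed tolerance after the previous supply runs out.
    """
    ordered = sorted(fills)  # order fills chronologically
    # Compare each fill with the one that follows it.
    for (prev_date, prev_supply), (next_date, _) in zip(ordered, ordered[1:]):
        run_out = prev_date + timedelta(days=prev_supply)
        if next_date > run_out + timedelta(days=grace_days):
            return False  # the refill came too late
    return True


on_time = [(date(2015, 1, 1), 30), (date(2015, 2, 5), 30), (date(2015, 3, 10), 30)]
late = [(date(2015, 1, 1), 30), (date(2015, 6, 1), 30)]
```

With these inputs, is_persistent(on_time) is True and is_persistent(late) is False. In a distributed job the same predicate would be applied per patient after grouping the fill records.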
Example: Drug Persistence
Things to check out
How Yieldbot does Data science in Clojure
Cascalog for the Impatient
Streaming MapReduce in Clojure
Sparkling
Flambo
Thank you!