Neo4j Data Loading with Kettle
Matt Casters
Chief Solutions Architect / Kettle Project Founder
Agenda
➢ What is Kettle?
➢ The Neo4j plugins
➢ Data loading performance tips
➢ Streaming data integration
➢ Metadata driven data possibilities
➢ Kettle Execution lineage in a graph
➢ Roadmap update
➢ Q&A
What is Kettle?
Kettle: Introduction
➢ Pentaho Data Integration from Hitachi Vantara
➢ One of the most widely used ETL tools
➢ Ready for the most demanding tasks
➢ Open source: Apache License 2.0
➢ Well maintained
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download: SourceForge / Pentaho / 8.2 / PDI-CE
Kettle: where is it used?
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless, ...
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
Kettle: Why is it used?
➢ Reduce costs!
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time, comparing build, buy, and Kettle]
Kettle: Architecture
➢ Metadata driven, engine based :
○ No code generation
○ Define what you need to happen
→ GUI, Web, code, rules, …
○ Clear and transparent, self documenting
➢ Types of work:
○ Jobs for workflows
○ Transformations for parallel data streaming
Kettle: Design
➢ 100% Exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging, data previewing, row sniff testing, …
Kettle: things of note
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Huge list of other plugins available, including from Neo4j, on a marketplace, …
➢ Support for the latest technology stacks
➢ Project on GitHub has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
Kettle: The Toolset
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
Neo4j Kettle Plugins
Neo4j Plugins: where to find?
➢ Started by the community, extended by Neo4j
➢ Releases/Download shortcut:
○ http://neo4j.kettle.be
➢ Project:
○ https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output
Give us feedback!
Neo4j Cypher
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parameters
➢ Return values
➢ Helpers
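A minimal sketch of what "Batching and UNWIND" with parameters can look like, written in Python purely for illustration. The helper name build_unwind_merge and the sample rows are hypothetical, and this is not the plugin's own implementation; it only shows the general Cypher pattern of sending many rows as one parameter and unwinding them in a single statement.

```python
from typing import Any, Dict, List, Tuple


def build_unwind_merge(
    label: str, key: str, rows: List[Dict[str, Any]]
) -> Tuple[str, Dict[str, Any]]:
    """Return one Cypher statement plus parameters that merges all rows at once."""
    # Labels and property keys cannot be passed as Cypher parameters, so they
    # go into the statement text; backticks guard against unusual characters.
    cypher = (
        f"UNWIND $rows AS row "
        f"MERGE (n:`{label}` {{`{key}`: row.`{key}`}}) "
        f"SET n += row"
    )
    # The row data itself travels as a single list parameter.
    return cypher, {"rows": rows}


cypher, params = build_unwind_merge(
    "Person", "id", [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
)
print(cypher)
print(params)
```

Because the statement text stays the same for every batch, only the parameter list changes between calls.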
Neo4j Output
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Dynamic labels
Neo4j Graph Output
➢ Update (parts of) a graph
➢ Using a logical model
➢ Using field mapping
➢ Auto-generate Cypher
Check Neo4j Connection
➢ Job Entry (workflow)
➢ Validate DBs are up
➢ Used in error diagnostics
➢ Defensive setup
➢ Pessimistic approach
Neo4j Cypher Script
➢ Job Entry (workflow)
➢ Executes a series of Cypher statements
Neo4j Kettle Plugins v4
Plugins v4
➢ Bulk loading steps
➢ Performance options
➢ Encrypted/obfuscated password in variables
➢ Bug fixes & UI improvements
Neo4j Generate CSVs
➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
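As a rough illustration of the header and escaping points above, the following Python sketch writes node rows as CSV text with an import-style header (an ":ID" column and a ":LABEL" column). The function name nodes_to_import_csv and the sample data are hypothetical; the actual step's options and output are not shown in these slides.

```python
import csv
import io
from typing import Dict, List


def nodes_to_import_csv(rows: List[Dict[str, str]], id_field: str, label: str) -> str:
    """Render node rows as CSV text with an import-style header."""
    # Every field except the ID becomes a plain property column.
    props = [k for k in rows[0] if k != id_field]
    buf = io.StringIO()
    # The csv module handles quoting and doubles embedded quote characters.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    # Header: the ID column is tagged ":ID", the label column is ":LABEL".
    writer.writerow([f"{id_field}:ID"] + props + [":LABEL"])
    for row in rows:
        writer.writerow([row[id_field]] + [row[p] for p in props] + [label])
    return buf.getvalue()


text = nodes_to_import_csv(
    [{"id": "1", "name": 'Ann "AJ" Jones'}, {"id": "2", "name": "Smith, Bob"}],
    id_field="id",
    label="Person",
)
print(text)
```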
Neo4j Split Graph
➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
Neo4j Importer
➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files
Data loading performance tips
Pre-processing in Kettle
➢ Do work in Kettle so that it can be avoided in Neo4j
➢ Calculate unique nodes
➢ Do required data conversions
➢ Data cleaning
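The idea of calculating unique nodes before loading can be sketched as follows. This is an assumption-laden Python illustration, not how Kettle does it internally: the function split_unique and the cleaning rule (trim and lowercase) are hypothetical choices made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def split_unique(
    edges: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (unique node keys, unique relationships), both in first-seen order."""
    nodes: Dict[str, None] = {}
    rels: Dict[Tuple[str, str], None] = {}
    for source, target in edges:
        # Light cleaning: trim whitespace and normalise case before comparing.
        s, t = source.strip().lower(), target.strip().lower()
        nodes.setdefault(s)
        nodes.setdefault(t)
        rels.setdefault((s, t))
    return list(nodes), list(rels)


nodes, rels = split_unique([("Ann", "bob"), (" ann ", "Bob"), ("bob", "Cid")])
print(nodes)
print(rels)
```

With the duplicates removed up front, each node only has to be created once in the database instead of being merged repeatedly.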
Parallel loading & batching
➢ Parallel node creation
➢ Limit high parallelism in the general case
➢ UNWIND in Neo4j Cypher step
➢ Create option in Neo4j Output step
➢ Use larger batch sizes (>1000)
➢ Create indexes up-front or with the options
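To make the batch size advice concrete, here is a small Python sketch that cuts a stream of rows into fixed-size batches. The generator name batches and the row count are hypothetical; in Kettle the batch size is a step option rather than code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows without loading all rows first."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 2,500 rows with a batch size of 1,000 means three round trips instead of 2,500.
sizes = [len(batch) for batch in batches(range(2500), 1000)]
print(sizes)
```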
Importing data
➢ Bulk loading with import is much faster
➢ A few orders of magnitude faster
➢ Collect all the data in CSV files
➢ Use the new steps to load
➢ Seamless path to incremental loads
Streaming data loads
Streaming options
➢ Micro-batching (every X minutes)
➢ Kafka, Event Hubs, Queues,... (never ending)
Streaming options
➢ Transformations can be never ending
➢ Any operation is possible
➢ Can collect data in other data platforms
➢ Is transactionally safe if it is supported (Kafka, …)
➢ Can be parallelized & scaled out
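Micro-batching can be pictured as grouping incoming events by time window. The Python sketch below is only an illustration under assumed inputs: the function micro_batches, the (timestamp, payload) event shape, and the 60 second window are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[int, List[str]]:
    """Group (timestamp, payload) events by the start of their time window."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        # Round the timestamp down to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        grouped[window_start].append(payload)
    return dict(grouped)


result = micro_batches(
    [(0, "a"), (59, "b"), (60, "c"), (185, "d")], window_seconds=60
)
print(result)
```

Each group could then be loaded as one batch, while a never-ending source such as a message queue would keep feeding new windows.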
Metadata driven data possibilities
Metadata FTW
➢ Kettle transformations & jobs are metadata
➢ ETL Metadata Injection: transformation templates
➢ Neo4j is a great metadata database
➢ Kettle can make use of this
Metadata driven loads
➢ Loading hundreds of types of files
➢ Processing data from hundreds of databases
➢ Automatic data standardization and normalisation
→ Massive time gains!
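The principle behind metadata driven loads is that one generic routine is configured by metadata instead of being rewritten per file type. The Python sketch below illustrates that principle only; FILE_SPECS, the load function, and the column names are hypothetical and do not represent Kettle's ETL Metadata Injection API.

```python
import csv
import io
from typing import Dict, List

# Hypothetical metadata: one entry per file type, describing how to read it
# and how to map its columns onto standard names.
FILE_SPECS = {
    "customers": {"delimiter": ";", "rename": {"cust_id": "id", "nm": "name"}},
    "products": {"delimiter": ",", "rename": {"sku": "id", "title": "name"}},
}


def load(file_type: str, text: str) -> List[Dict[str, str]]:
    """Parse text using the spec for file_type and standardise the column names."""
    spec = FILE_SPECS[file_type]
    reader = csv.DictReader(io.StringIO(text), delimiter=spec["delimiter"])
    return [
        {spec["rename"].get(col, col): value.strip() for col, value in row.items()}
        for row in reader
    ]


customers = load("customers", "cust_id;nm\n1; Ann \n")
products = load("products", "sku,title\nA1,Kettle\n")
print(customers)
print(products)
```

Supporting another file type then means adding one metadata entry, not another loader.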
Metadata driven extracts
➢ Without hardcoded sources, selections and targets
➢ Sourcing selections from users, processes, ...
➢ Using the possibilities of the Kettle engine
→ Flexibility, performance, without coding
Kettle Execution Lineage
Kettle Logging Architecture
➢ Unique ID per execution
➢ Precise sourcing of logging records
➢ Very “graphy” data
[Diagram: execution, metadata, and impact records linked by parent/child relations]
The Kettle Neo4j Logging plugin
➢ Stores operational metadata in a graph
➢ https://github.com/mattcasters/kettle-neo4j-logging
➢ Tools
○ View execution information: log, duration, errors
○ Find error paths
○ Jump to error location
○ Find execution path of a step
○ Get time window: “since last successful execution”
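Two of the tools listed above, finding an error path and getting the time window since the last successful execution, can be sketched over parent/child execution records. This Python example is an illustration under assumptions: the record layout, the IDs, and both function names are hypothetical, and the plugin itself stores and queries this information in Neo4j.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical execution records, keyed by the unique ID of each execution.
EXECUTIONS: Dict[str, dict] = {
    "job-1": {"parent": None, "name": "nightly job"},
    "trans-7": {"parent": "job-1", "name": "load customers"},
    "step-3": {"parent": "trans-7", "name": "Neo4j Output"},
}


def error_path(execution_id: str) -> List[str]:
    """Follow parent links from a failed execution up to the root."""
    path: List[str] = []
    current: Optional[str] = execution_id
    while current is not None:
        record = EXECUTIONS[current]
        path.append(record["name"])
        current = record["parent"]
    # Reverse so the path reads from the root down to the failing element.
    return list(reversed(path))


def window_since_last_success(
    runs: List[Tuple[datetime, int]], now: datetime
) -> Tuple[Optional[datetime], datetime]:
    """Return (start, end); start is the end time of the latest error-free run."""
    successes = [end for end, errors in runs if errors == 0]
    return (max(successes) if successes else None, now)


runs = [
    (datetime(2019, 1, 1, 2, 0), 0),
    (datetime(2019, 1, 2, 2, 0), 0),
    (datetime(2019, 1, 3, 2, 0), 4),  # this run had errors
]
path = error_path("step-3")
window = window_since_last_success(runs, datetime(2019, 1, 4, 2, 0))
print(path)
print(window)
```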
Execution lineage in a graph
➢ Documents the execution process
○ Log text, metadata, times, ...
Roadmap update
Roadmap Neo4j plugin
➢ 25 releases in 2018
➢ Major 4.0 release next week
➢ Then:
○ New Neo4j Output step
○ More graph data type operations
○ <Insert YOUR suggestion!>
➢ Tuning options for Neo4j steps running in the initial Kettle Apache Beam implementation:
→ DataFlow, Spark, Flink, …
Roadmap Neo4j Logging plugin
➢ Generic impact information logging
➢ Store data lineage in Neo4j
➢ Git revision graph loading (new step)
➢ Storing and viewing unit testing results
➢ Operational “dashboard”
Q&A
