Kettle & Neo4j
Matt Casters, matt.casters@neo4j.com
mattcasters
Agenda
•What is Kettle?
•The Neo4j plugins
•Loading data into Neo4j (with demo)
•Extracting data from Neo4j (with demo)
•Recap
•Q&A
What is Kettle?
Kettle: Introduction
•a.k.a. Pentaho Data Integration (PDI)
•One of the most widely used ETL tools
•Ready for the most demanding tasks
•Open source under the Apache License 2.0
•Well maintained
•Large community, marketplace, ...
•Easy to embed, install, package, rebrand
•Download from SourceForge / Pentaho / PDI-CE
Kettle: Introduction
• Kettle
• Extraction
• Transformation
•Transportation
• Loading
• Environment
• PDI: Pentaho Data Integration @ Hitachi Vantara
Kettle: Architecture
•Metadata-driven, engine-based:
•No code generation
•Define what you need to happen
-> GUI, Web, code, rules, …
•Execute wherever you need to
-> From Raspberry Pi to Hadoop
•Types of work:
● Jobs for workflows
● Transformations for parallel streaming
Kettle: Design
• 100% Exposure of our engine through UI elements
• Everyone should be able to play along: plugins!
•We built integration points for others: run everywhere!
• Allow the user to avoid programming anything
• Allow the user to program anything: JavaScript, Java,
SQL, RegEx, Rules, Python, Ruby, R, OO Formula, Pig, …
• Transparency wins: top class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
Kettle: Cool things
• SpoonGit: UI integration with git
• WebSpoon: web interface to the full Spoon UI
•Data Sets: build transformation unit tests
• Large marketplace:
http://www.pentaho.com/marketplace/
• Project on GitHub has over 1,000 forks:
https://github.com/pentaho/pentaho-kettle
Kettle: Quick Spoon intro
Neo4j Kettle Plugins
Plugins: Neo4j Cypher
•For reading and writing
•Dynamic Cypher
•Batching and UNWIND
•Parallel execution
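The batching and UNWIND idea can be sketched outside Kettle. What follows is a minimal Python illustration, not the plugin's own code: the helper names `batches` and `unwind_requests`, the `Person` label, and the `id` and `name` properties are all assumptions made for this example. Each batch of rows becomes the `$rows` parameter of a single statement, so one round trip to the database writes many rows.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]

# Hypothetical statement: the label and property names are illustrative only.
UNWIND_MERGE = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group a stream of rows into lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def unwind_requests(
    rows: Iterable[Row], size: int
) -> Iterator[Tuple[str, Dict[str, List[Row]]]]:
    """Yield one (statement, parameters) pair per batch of rows."""
    for batch in batches(rows, size):
        yield UNWIND_MERGE, {"rows": batch}
```

With five input rows and a batch size of two, this yields three requests instead of five single-row statements. Each pair would then be handed to whatever client executes Cypher.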
Plugins: Neo4j Output
•Easy node creation
•Create/Merge of ()-[]-()
•Batching and UNWIND
•Parallel execution
•Dynamic labels
Plugins: Neo4j Graph Output
•Update parts of a graph
•Auto-generate Cypher
•Using model
•Using field mapping
Plugins: Check Neo4j Connection
•Job Entry
•Validate DBs are up
•Used in error diagnostics
•Defensive setup
Plugins: Neo4j Cypher Script
•Job Entry
•Executes series of Cypher statements
Loading data into Neo4j
Loading Neo4j: loading nodes
•Demonstrates the Neo4j Output step
•Read a CSV file in parallel
•Load the data into nodes in parallel
Loading Neo4j: remove all data
•Demonstrates the Neo4j Cypher step
•Calls procedures
•Uses dynamic Cypher statements
•Reads and updates Neo4j
•Removes all nodes and edges in batches
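Deleting a graph in batches keeps each transaction small. The sketch below shows one commonly used Cypher pattern for this, driven by a loop in Python. It is an assumption about how such a cleanup can be written, not a copy of the demo transformation: `delete_all` and the `run` callback are hypothetical names, and `run` stands in for whatever executes one statement in its own transaction and returns the `deleted` count.

```python
from typing import Callable, Dict

# Commonly used pattern: delete at most $batchSize nodes (with their
# relationships) per statement and report how many were removed.
DELETE_BATCH = (
    "MATCH (n) WITH n LIMIT $batchSize "
    "DETACH DELETE n "
    "RETURN count(*) AS deleted"
)


def delete_all(
    run: Callable[[str, Dict[str, int]], int], batch_size: int = 10000
) -> int:
    """Call `run` until a batch deletes nothing; return the total deleted.

    `run` is a hypothetical callback that executes one statement in its
    own transaction and returns the value of the `deleted` column.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = 0
    while True:
        deleted = run(DELETE_BATCH, {"batchSize": batch_size})
        if deleted == 0:
            return total
        total += deleted
```

The loop stops on the first batch that deletes nothing, so an already empty database costs a single statement.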
Loading Neo4j: update graphs
•Demonstrates the Neo4j Graph Output step
•Updates multiple nodes and relationships at once
•Takes key values into account to ignore nodes
•Automatically generates MERGE statements
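To make "automatically generates MERGE statements" concrete, here is a minimal Python sketch of turning a field mapping into Cypher. It is not the Graph Output step's actual generator: the function name `merge_statement` and the example label and fields are assumptions for illustration. Key fields go into the MERGE pattern so that existing nodes are matched rather than duplicated, and the remaining fields are applied with SET.

```python
from typing import List


def merge_statement(
    label: str, key_fields: List[str], other_fields: List[str]
) -> str:
    """Build a MERGE on the key properties, then SET the remaining ones.

    Names are interpolated as-is, so this sketch assumes the label and
    field names are trusted identifiers; values travel as parameters.
    """
    if not key_fields:
        raise ValueError("at least one key field is required")
    keys = ", ".join(f"{f}: ${f}" for f in key_fields)
    cypher = f"MERGE (n:{label} {{{keys}}})"
    if other_fields:
        sets = ", ".join(f"n.{f} = ${f}" for f in other_fields)
        cypher += f" SET {sets}"
    return cypher
```

For example, `merge_statement("Customer", ["id"], ["name", "city"])` returns `MERGE (n:Customer {id: $id}) SET n.name = $name, n.city = $city`.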
Loading Neo4j: Kafka updating Neo4j
• Demonstrates Kafka integration
• Stream data using a Kafka consumer
• Continuously update Neo4j
Extracting data with Kettle
Sourcing Neo4j: simple reading
● Read using a Cypher query
● Write to an Excel file
Sourcing Neo4j: Kettle JDBC
● Expose Neo4j queries as a virtual SQL table
● Allow SQL queries to run against Neo4j
Recap
Take-aways
With Kettle & Neo4j plugins:
•Work faster, tackle harder problems
•Reduce risk by showing results faster
•Keep maintenance costs under control
Kettle & Neo4j : Q&A

GraphDay Paris - Integrating data flows into Neo4j with the open source ETL tool Kettle
