Kettle - Neo4J
Bart Maertens,
Neo4J meetup
Brussels
2018-06-12
# Content
#What is PDI
#Kettle and know.bi - history
#How does this relate to Neo4J
#CREATE (Kettle)-[r:LOAD_TO]->(Neo4J)
#Demo
# KETTLE - know.bi
#BM: involved with Kettle/Pentaho from the start
#know.bi
• founded in early 2012, Pentaho partner from day 1
• PDI/Kettle experts, involved in all major projects in Benelux, internationally
• heavily involved in community (PCM14, PCM16)
• Neo4J partner since Q2 2018
# What is ETL
#ETL: Extract Transform Load, later broadened to data integration
#ETL:
• Extract: read data from source (RDBMS, files, applications, REST, ...)
• Transform: enrich (join, lookup, ...), clean, ...
• Load: load to DWH or other data formats
#History in loading data warehouses, now much broader use
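The three phases above can be sketched in a few lines of plain Python. This is only an illustration of the Extract / Transform / Load idea, not how Kettle does it: the CSV data, the `COUNTRIES` lookup, the `dim_customer` table and all function names are hypothetical, and an in-memory SQLite database stands in for the DWH.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be an RDBMS, a file or a REST API.
SOURCE_CSV = """customer_id,name,country_code
1, Alice ,BE
2,Bob,NL
3,Carol,XX
"""

# Hypothetical lookup table used to enrich rows during the transform phase.
COUNTRIES = {"BE": "Belgium", "NL": "Netherlands"}


def extract(text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean the name and enrich each row with a country lookup."""
    result = []
    for row in rows:
        result.append(
            {
                "customer_id": int(row["customer_id"]),
                "name": row["name"].strip(),
                "country": COUNTRIES.get(row["country_code"], "Unknown"),
            }
        )
    return result


def load(rows, connection):
    """Load: write the transformed rows to the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    connection.executemany(
        "INSERT INTO dim_customer VALUES (:customer_id, :name, :country)", rows
    )
    connection.commit()


def run_etl():
    """Run the three phases against an in-memory SQLite target."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract(SOURCE_CSV)), connection)
    return connection.execute(
        "SELECT customer_id, name, country FROM dim_customer ORDER BY customer_id"
    ).fetchall()
```

Calling `run_etl()` returns the loaded rows, which makes it easy to check that the cleaning and the lookup behaved as intended.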
# What is KETTLE
#Open Source Project behind Pentaho Data Integration (now Hitachi)
#Visual ETL development
#Modular: +/- 30 plugin/extension points
#Handles small (e.g. real time) to big/huge data
#300+ ‘steps’ out of the box, many more in marketplace
#Active developer community
# KETTLE - a bit of history
#Kettle project: “Kettle E.T.T.L. Environment”
#Dec 2005: Matt Casters open sources KETTLE
#April 2006: Pentaho acquires KETTLE (PDI: Pentaho Data Integration)
#Feb 2015: Pentaho acquired by Hitachi
#Early 2018: Matt leaves Hitachi
AND...
# KETTLE - Concepts
#Jobs vs Transformations:
• transformation:
  • read from source, modify/enrich/join, write to target
  • parallel processing
• job:
  • orchestration, environment checks, error handling, ...
  • sequential processing (by default)
#Components:
• Spoon: visually develop, test, debug jobs/transformations
• kitchen/pan: CLI interfaces for job/transformation execution
• Carte: lightweight web server for remote execution
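The difference between the two execution models can be sketched in Python. This is a conceptual analogy only, not Kettle code: a transformation is modelled as steps running concurrently in their own threads and passing rows through queues, a job as entries run one after another with error handling. All names (`run_step`, `run_transformation`, `run_job`, `END`) are hypothetical.

```python
import queue
import threading

END = object()  # sentinel marking the end of a row stream


def run_step(func, inbox, outbox):
    """Run one step: read rows from inbox, apply func, pass results to outbox."""
    while True:
        row = inbox.get()
        if row is END:
            outbox.put(END)
            return
        outbox.put(func(row))


def run_transformation(rows, step_funcs):
    """Transformation: all steps run concurrently, connected by row queues."""
    queues = [queue.Queue() for _ in range(len(step_funcs) + 1)]
    threads = [
        threading.Thread(target=run_step, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(step_funcs)
    ]
    for thread in threads:
        thread.start()
    # Feed the first queue; downstream steps start working while rows still arrive.
    for row in rows:
        queues[0].put(row)
    queues[0].put(END)
    output = []
    while True:
        row = queues[-1].get()
        if row is END:
            break
        output.append(row)
    for thread in threads:
        thread.join()
    return output


def run_job(entries):
    """Job: run entries one after another and stop at the first failure."""
    log = []
    for name, entry in entries:
        try:
            entry()
            log.append((name, "ok"))
        except Exception as error:
            log.append((name, f"failed: {error}"))
            break
    return log
```

In the transformation, a row can already be in the last step while the first step is still reading input; in the job, the second entry never starts before the first has finished.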
# KETTLE - Key Strengths
#Flexible:
• engine-based
• highly configurable
• handles streaming, small and huge data well
#Extensible:
• pluggable, extensible architecture (+/- 30 plugin points, everything is a plugin)
• marketplace
#Active community
# KETTLE - Tips & Tricks
#Organize flow visually
#Use meaningful step names
#Write modular ETL code
#Enable logging, check the logs
#Treat your ETL project as software development!! (git, CI, documentation, ...)
# KETTLE - Tips & Tricks
#Aim for
# KETTLE - Tips & Tricks
#Avoid!!
# Neo4J and ETL
#Neo4J options to load nodes, relationships
• CSV Import
• Neo4J ETL
• APIs: .NET, Python, Java, ...
#Powerful options, but too much:
• scripting
• scripting
• scripting
• ...
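To give an idea of the scripting involved, here is a minimal Python sketch that builds batched Cypher statements by hand. The helper names, the `Tool`/`Database` labels and the `name` key are hypothetical; the sketch only builds the statement text and parameters and assumes they would then be sent to Neo4J through one of the driver APIs. Labels and relationship types cannot be passed as Cypher parameters, so they are interpolated and must come from trusted input.

```python
def build_node_load(label, key, rows):
    """Return a Cypher statement and parameters that MERGE one node per row."""
    query = (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        "SET n += row"
    )
    return query, {"rows": rows}


def build_relationship_load(from_label, to_label, rel_type, key, pairs):
    """Return a Cypher statement that MERGEs a relationship per source/target pair."""
    query = (
        "UNWIND $pairs AS pair "
        f"MATCH (a:{from_label} {{{key}: pair.source}}) "
        f"MATCH (b:{to_label} {{{key}: pair.target}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"pairs": pairs}


# Hypothetical data echoing the CREATE (Kettle)-[r:LOAD_TO]->(Neo4J) slide title.
node_query, node_params = build_node_load("Tool", "name", [{"name": "Kettle"}])
rel_query, rel_params = build_relationship_load(
    "Tool", "Database", "LOAD_TO", "name", [{"source": "Kettle", "target": "Neo4J"}]
)
```

Every new label, key or relationship type means another helper or another hand-written statement, which is the overhead the slide points at.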
# Neo4J - Kettle options
#KETTLE Neo4J options:
• Neo4J JDBC driver (Matt Burgess):
  • write Cypher queries, process (tabular) results
• Neo4J Output step (know.bi):
  • prepare data in Kettle, use step to create nodes, relationships
  • create graphs without boilerplate code or scripting
#Future: additional Neo4J optimizations in development
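Conceptually, "prepare data, then create nodes and relationships" means mapping fields of a tabular row onto graph elements. The Python sketch below shows that mapping idea only; it is not the implementation or configuration of the Neo4J Output step, which the slides do not describe, and `rows_to_graph` and the field names are hypothetical.

```python
def rows_to_graph(rows, from_field, to_field, rel_type):
    """Map tabular rows to unique node keys and (from, type, to) relationships."""
    nodes = []  # unique node keys, in first-seen order
    seen = set()
    relationships = []
    for row in rows:
        for field in (from_field, to_field):
            value = row[field]
            if value not in seen:
                seen.add(value)
                nodes.append(value)
        relationships.append((row[from_field], rel_type, row[to_field]))
    return nodes, relationships


# Hypothetical prepared rows, one relationship per row.
prepared_rows = [
    {"tool": "Kettle", "target": "Neo4J"},
    {"tool": "Kettle", "target": "DWH"},
]
nodes, relationships = rows_to_graph(prepared_rows, "tool", "target", "LOAD_TO")
```

Each row yields one relationship, while a node that appears in several rows (here `Kettle`) is kept only once.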
# Demo
# Thank you!
www.know.bi
info@know.bi
@know_bi
