Document Classification with Neo4j 
(graphs)-[:are]->(everywhere) 
© All Rights Reserved 2014 | Neo Technology, Inc. 
@kennybastani 
Neo4j Developer Evangelist
Agenda 
• Introduction to Neo4j 
• Introduction to Graph-based Document Classification 
• Graph-based Hierarchical Pattern Recognition 
• Generating a Vector Space Model for Recommendations 
• Graphify for Neo4j 
• U.S. Presidential Speech Transcript Analysis 
Introduction to Neo4j
The Property Graph Data Model
[Figure: a graph with three nodes: John, Sally, and the book Graph Databases.]
[Figure: the same graph with properties attached to its nodes and relationships.]
Nodes:
• name: John, age: 27
• name: Sally, age: 32
• title: Graph Databases, authors: Ian Robinson, Jim Webber
Relationships:
• FRIEND_OF, since: 01/09/2013 (appears twice in the figure)
• HAS_READ, on: 2/03/2013, rating: 5
• HAS_READ, on: 02/09/2013, rating: 4
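The property graph on this slide can be sketched in a few lines of Python. This is only an illustration of the data model, not Neo4j's API. The `Person` label, the direction of each relationship, and which reader gave which rating are assumptions, since the extracted figure does not preserve them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node          # relationships are directed: start -> end
    type: str            # every relationship has a type
    end: Node
    properties: dict = field(default_factory=dict)


# Nodes with their key/value properties ("Person" is an assumed label).
john = Node("Person", {"name": "John", "age": 27})
sally = Node("Person", {"name": "Sally", "age": 32})
book = Node("Book", {"title": "Graph Databases",
                     "authors": ["Ian Robinson", "Jim Webber"]})

# Assumption: John holds the rating of 5 and Sally the rating of 4.
relationships = [
    Relationship(john, "FRIEND_OF", sally, {"since": "01/09/2013"}),
    Relationship(sally, "FRIEND_OF", john, {"since": "01/09/2013"}),
    Relationship(john, "HAS_READ", book, {"on": "2/03/2013", "rating": 5}),
    Relationship(sally, "HAS_READ", book, {"on": "02/09/2013", "rating": 4}),
]


def readers_of(target, rels):
    """Return the names of everyone with a HAS_READ relationship to target."""
    return [r.start.properties["name"]
            for r in rels
            if r.type == "HAS_READ" and r.end is target]
```

Calling `readers_of(book, relationships)` walks the relationship list and collects the start node of every HAS_READ relationship that ends at the book.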
The Relational Table Model 
[Figure: three relational tables: Customers, Customer_Accounts, and Accounts.]
The Neo4j Browser 
Neo4j Browser - finding help 
http://localhost:7474/
Execute Cypher, Visualize 
Introduction to Document Classification 
Document Classification 
Automatically assign a document to one or more classes 
Documents may be classified according to their subjects or 
according to other attributes 
Automatically classify unlabeled documents into a set of relevant
classes using labeled training data
Example Use Cases for Document Classification
Sentiment Analysis for Movie Reviews 
Scenario: A movie website allows users to submit reviews describing what they 
either liked or disliked about a particular movie. 
Problem: The user reviews are unstructured text. 
How do I automatically generate a score indicating whether the review was 
positive or negative? 
Solution: Train a natural language parsing model on a dataset of previous
reviews that have been labeled as either positive or negative.
Recommend Relevant Tags 
Scenario: A Q/A website allows users to submit questions and receive answers 
from other users. 
Problem: Users sometimes do not know which tags to apply to their questions to
make them easier to discover and answer.
Solution: Automatically recommend the most relevant tags for a question by
classifying its text with a model trained on previous questions.
Recommend Similar Articles 
Scenario: A news website provides hundreds of new articles a day to users on a 
broad range of topics. 
Problem: The site needs to increase user engagement and time spent on the site. 
Solution: Train natural language parsing models for daily articles in order to 
provide recommendations for highly relevant articles at the bottom of each page. 
How Automated Document Classification Works 
Step 1: Create a Training Dataset
Supervised Learning
Assign a set of labels that describes the document’s text
[Figure: documents connected to the labels X, Y, and Z.]
Step 2: Train a Natural Language Parsing Model 
Deep Learning
Deep feature representations are selected and learned using an evolutionary algorithm
State machines represent predicates that evaluate to 0 or 1 for a text match
State machines map to classes of document labels that matched text during training
[Figure: a hierarchy of nodes marked p (p = State Machine) connected to the classes X, Y, and Z.]
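The idea of a state machine acting as a 0-or-1 predicate that maps to classes can be sketched as follows. This is a minimal illustration that uses a compiled regular expression as a stand-in for the state machine. The class name `PatternFeature`, the example patterns, and the `positive`/`negative` classes are hypothetical and do not come from the slides.

```python
import re


class PatternFeature:
    """A feature that acts as a predicate over text: 1 on a match, else 0."""

    def __init__(self, pattern, classes):
        # A compiled regular expression stands in for the state machine.
        self.regex = re.compile(pattern, re.IGNORECASE)
        # Classes (labels) of the training documents this feature matched.
        self.classes = set(classes)

    def matches(self, text):
        return 1 if self.regex.search(text) else 0


# Hypothetical features, as if learned from labeled movie reviews.
features = [
    PatternFeature(r"\bgreat\s+\w+", {"positive"}),
    PatternFeature(r"\bwaste of\s+\w+", {"negative"}),
]


def matched_classes(text, features):
    """Collect the classes of every feature whose predicate fires on text."""
    found = set()
    for feature in features:
        if feature.matches(text):
            found |= feature.classes
    return found
```

For example, `matched_classes("A great film", features)` fires only the first predicate, so only its classes are returned.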
Step 3: Classify Unlabeled Documents
The natural language parsing model is used to classify other unlabeled documents
[Figure: an unlabeled document compared with the classes X, Y, and Z by cos(θ), with example similarities 0.99, 0.67, and 0.01.]
Hierarchical Pattern Recognition (HPR)
What is Hierarchical Pattern Recognition (HPR)? 
HPR is a graph-based deep learning algorithm I created that learns deep
feature representations in linear time.
I created the algorithm to do graph-based traversals using a hierarchy of
finite state machines (FSM).
Designed for scalable performance in P time: [formula image not preserved]
Influences & Inspirations 
Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition
[Figure: a hierarchy of nodes marked p over the classes X, Y, and Z.]
How does feature extraction work?
HPR uses a probabilistic model in combination with an evolutionary algorithm
to generate hierarchies of deep feature representations.
“Deep” feature representations are learned and associated with the labels of
the documents in which each feature was discovered.
The feature hierarchy is translated into a Vector Space Model, which is used to
classify feature vectors generated from unlabeled text.
[Figure: a hierarchy of nodes marked p over the classes X, Y, and Z.]
Graph-based feature learning 
Learning new features from 
matches on training data 
Cost Function for the Generations of Features 
Reproduction occurs after a threshold of matches has been exceeded for a
feature.
After replication, the cost function is applied to increase that threshold
every time the feature reproduces.
Cost function: [formula image not preserved]
[symbol not preserved] is the current threshold on the feature node.
[symbol not preserved] is the minimum threshold, which I chose as 5 for new features.
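The reproduction rule described on this slide can be sketched as below. The slide's actual cost function did not survive extraction, so `illustrative_cost` is a hypothetical placeholder and not the author's formula. The `FeatureNode` class, the cumulative match counter, and the pluggable `cost` parameter are also assumptions made for illustration. Only the minimum threshold of 5 comes from the slide.

```python
MIN_THRESHOLD = 5  # minimum threshold for new features, as stated on the slide


def illustrative_cost(threshold, min_threshold):
    """HYPOTHETICAL cost function: raise the threshold by the minimum.

    The slide's real formula is not preserved; any function that increases
    the threshold could be passed in instead.
    """
    return threshold + min_threshold


class FeatureNode:
    def __init__(self, pattern):
        self.pattern = pattern
        self.matches = 0                 # assumed to be a cumulative count
        self.threshold = MIN_THRESHOLD   # current threshold on the node
        self.generations = 0             # how often the feature reproduced

    def record_match(self, cost=illustrative_cost):
        """Count one match; return True if the feature reproduces."""
        self.matches += 1
        if self.matches > self.threshold:
            self.generations += 1
            # Each reproduction makes the next one more expensive.
            self.threshold = cost(self.threshold, MIN_THRESHOLD)
            return True
        return False
```

With this placeholder cost, a new feature reproduces on its 6th match, after which the threshold rises to 10 and the next reproduction happens on the 11th match.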
Vector Space Model
Generating Feature Vectors 
The natural language parsing model created during training can be 
turned into a global feature index. 
This global feature index is a list of Neo4j internal IDs for every feature 
in the hierarchy. 
Using that global feature index, a multi-dimensional vector space is
created with one dimension for each feature in the hierarchy.
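A global feature index and the vectors built from it can be sketched as follows. The function names are hypothetical, the IDs are made-up stand-ins for Neo4j internal IDs, and counting matches (rather than recording a binary 0 or 1) is an assumption.

```python
def build_feature_index(feature_ids):
    """Map each feature ID to a fixed position in the vector."""
    return {fid: pos for pos, fid in enumerate(sorted(feature_ids))}


def to_vector(matched_ids, index):
    """Turn the feature IDs matched in a text into a fixed-length vector."""
    vector = [0] * len(index)          # one dimension per known feature
    for fid in matched_ids:
        if fid in index:               # ignore IDs outside the hierarchy
            vector[index[fid]] += 1
    return vector


# Made-up IDs standing in for the features of a small hierarchy.
index = build_feature_index([42, 7, 19])
example = to_vector([19, 19, 42, 99], index)
```

Because every document is projected onto the same index, any two vectors have the same length and can be compared position by position.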
Relevance Rankings 
“Relevance rankings of documents in a keyword search can be 
calculated, using the assumptions of document similarities theory, by 
comparing the deviation of angles between each document vector and 
the original query vector where the query is represented as the same 
kind of vector as the documents.” - Wikipedia 
Vector-based Cosine Similarity Measure 
In practice, it is easier to calculate the cosine of the angle between the 
vectors, instead of the angle itself: 
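The formula image from this slide did not survive extraction. What it presumably showed is the standard definition of cosine similarity, written here for two feature vectors A and B with n components each (the symbols A, B, and n are chosen for this sketch):

```latex
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
             = \frac{\sum_{i=1}^{n} A_i B_i}
                    {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

The numerator is the dot product of the two vectors and the denominator is the product of their lengths, so the result depends only on the angle between them and not on their magnitudes.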
Cosine Similarity & Vector Space Model 
Vector-based Cosine Similarity Measure 
“The resulting similarity ranges from -1 meaning exactly opposite, to 1 
meaning exactly the same, with 0 usually indicating independence, 
and in-between values indicating intermediate similarity or 
dissimilarity.” 
via Wikipedia
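The measure quoted above can be sketched in plain Python. The function names and the choice to return 0.0 for a zero-length vector are assumptions of this sketch, not something taken from Graphify.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty vector as unrelated
    return dot / (norm_a * norm_b)


def rank_classes(doc_vector, class_vectors):
    """Sort class names by similarity to the document, best match first."""
    scores = {name: cosine_similarity(doc_vector, vec)
              for name, vec in class_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1. When the vectors hold non-negative match counts, no component product is negative, so the scores stay between 0 and 1.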
Graphify for Neo4j 
Graphify for Neo4j 
Graphify is a Neo4j unmanaged extension used for 
document and text classification using graph-based 
hierarchical pattern recognition. 
https://github.com/kbastani/graphify
Example Project 
Head over to the GitHub project page and clone it to your 
local machine. 
Follow the directions listed in the README.md to install the 
extension. 
Navigate to the /examples directory of the project. 
Run:
examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java
U.S. Presidential Speech Transcript Analysis
Identify the Political Affiliation of a Presidential Speech 
In the training phase, this example ingests a set of presidential speech
texts, each labeled with the author of that speech. After the training models
are built, unlabeled presidential speeches are classified in the test phase.
The Presidents 
• Ronald Reagan
  • labels: liberal, republican, ronald-reagan
• George H.W. Bush
  • labels: conservative, republican, bush41
• Bill Clinton
  • labels: liberal, democrat, bill-clinton
• George W. Bush
  • labels: conservative, republican, bush43
• Barack Obama
  • labels: liberal, democrat, barack-obama
Training
Each of the presidents in the example has 6 speeches to analyze.
4 of the speeches are used to build a natural language parsing model.
2 of the speeches are used to test the validity of that model.
Get Similar Labels/Classes 
Ronald Reagan
Class           Similarity
republican      0.7182046285385341
liberal         0.644281223102398
democrat        0.4854114595950056
conservative    0.4133639188595147
bill-clinton    0.4057969121945167
barack-obama    0.323947855372623
bush41          0.3222644898334092
bush43          0.3161309849153592
George H.W. Bush
Class           Similarity
conservative    0.7032274806766954
republican      0.6047256274615608
liberal         0.4439742461594541
democrat        0.39114918238853674
bill-clinton    0.3234223107986785
ronald-reagan   0.3222644898334092
barack-obama    0.2929260544514002
bush43          0.29106733975087984
Bill Clinton
Class           Similarity
democrat        0.8375678825642422
liberal         0.7847858060182163
republican      0.5561860529059708
conservative    0.45365774896422445
barack-obama    0.4507676679770066
ronald-reagan   0.4057969121945167
bush43          0.365042482383354
bush41          0.3234223107986785
George W. Bush
Class           Similarity
conservative    0.820636570272315
republican      0.7056890956512284
liberal         0.5075788396061254
democrat        0.4505424322086937
bill-clinton    0.365042482383354
barack-obama    0.33801949243378965
ronald-reagan   0.3161309849153592
bush41          0.29106733975087984
Barack Obama
Class           Similarity
democrat        0.7668017370739147
liberal         0.7184792203867296
republican      0.4847680475425114
bill-clinton    0.4507676679770066
conservative    0.4149264161292232
bush43          0.33801949243378965
ronald-reagan   0.323947855372623
bush41          0.2929260544514002
Get involved in the Neo4j community 
http://stackoverflow.com/questions/tagged/neo4j
http://groups.google.com/group/neo4j
https://github.com/neo4j/neo4j/issues
http://neo4j.meetup.com/
(Thank You)
Get in touch
Twitter www.twitter.com/kennybastani
LinkedIn www.linkedin.com/in/kennybastani
GitHub www.github.com/kbastani


Document Classification with Neo4j

  • 1. Document Classification with Neo4j (graphs)-[:are]->(everywhere) © All Rights Reserved 2014 | Neo Technology, Inc. @kennybastani Neo4j Developer Evangelist
  • 2. © All Rights Reserved 2014 | Neo Technology, Inc. Agenda • Introduction to Neo4j • Introduction to Graph-based Document Classification • Graph-based Hierarchical Pattern Recognition • Generating a Vector Space Model for Recommendations • Graphify for Neo4j • U.S. Presidential Speech Transcript Analysis 2
  • 3. Introduction to Neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 3
  • 4. The Property Graph Data Model © All Rights Reserved 2014 | Neo Technology, Inc. 4
  • 5. © All Rights Reserved 2014 | Neo Technology, Inc. John Sally Graph Databases Book 5
  • 6. © All Rights Reserved 2014 | Neo Technology, Inc. name: John age: 27 name: Sally age: 32 FRIEND_OF since: 01/09/2013 title: Graph Databases authors: Ian Robinson, Jim Webber HAS_READ on: 2/03/2013 rating: 5 HAS_READ on: 02/09/2013 rating: 4 FRIEND_OF since: 01/09/2013 6
  • 7. The Relational Table Model © All Rights Reserved 2014 | Neo Technology, Inc. 7
  • 8. Customers Customer_Accounts Accounts © All Rights Reserved 2014 | Neo Technology, Inc. 8
  • 9. The Neo4j Browser © All Rights Reserved 2014 | Neo Technology, Inc. 9
  • 10. Neo4j Browser - finding help © All Rights Reserved 2014 | Neo Technology, Inc. http://localhost:7474/ 10
  • 11. Execute Cypher, Visualize © All Rights Reserved 2014 | Neo Technology, Inc. 11
  • 12. Introduction to Document Classification © All Rights Reserved 2014 | Neo Technology, Inc. 12
  • 13. © All Rights Reserved 2014 | Neo Technology, Inc. Document Classification Automatically assign a document to one or more classes Documents may be classified according to their subjects or according to other attributes Automatically classify unlabeled documents to a set of relevant classes using labeled training data 13
  • 14. Example Use Cases for Document © All Rights Reserved 2014 | Neo Technology, Inc. Classification 14
  • 15. Sentiment Analysis for Movie Reviews Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie. © All Rights Reserved 2014 | Neo Technology, Inc. Problem: The user reviews are unstructured text. How do I automatically generate a score indicating whether the review was positive or negative? Solution: Train a natural language parsing model on a dataset that has been labeled in previous reviews as either positive or negative. 15
  • 16. Recommend Relevant Tags Scenario: A Q/A website allows users to submit questions and receive answers from other users. Problem: Users sometime do not know what tags to apply to their questions in order to increase discoverability for receiving answers. Solution: Automatically recommend the most relevant tags for questions by classifying the text from training on previous questions. © All Rights Reserved 2014 | Neo Technology, Inc. 16
  • 17. Recommend Similar Articles Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics. Problem: The site needs to increase user engagement and time spent on the site. Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page. © All Rights Reserved 2014 | Neo Technology, Inc. 17
  • 18. How Automated Document Classification Works © All Rights Reserved 2014 | Neo Technology, Inc. 18
  • 19. Label © All Rights Reserved 2014 | Neo Technology, Inc. X Y Document Document Document Document Label Label Assign a set of labels that describes the document’s text Supervised Learning Step 1: Create a Training Dataset Z 19
  • 20. Step 2: Train a Natural Language Parsing Model p X Y = State Machine © All Rights Reserved 2014 | Neo Technology, Inc. Deep feature representations are selected and learned using an evolutionary algorithm State machines represent predicates that evaluate to 0 or 1 for a text match State machines map to classes of document labels that matched text during training Deep Learning p p p p p Class Class Z Class 20
  • 21. cos(θ) © All Rights Reserved 2014 | Neo Technology, Inc. Unlabeled Document The natural language parsing model is used to classify other unlabeled documents X Class Y Class Z Class 0.99 0.67 0.01 cos(θ) cos(θ) Step 3: Classify Unlabeled Documents 21
  • 22. Hierarchical Pattern Recognition © All Rights Reserved 2014 | Neo Technology, Inc. (HPR) 22
  • 23. What is Hierarchical Pattern Recognition (HPR)? HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time — I created the algorithm to do graph-based traversals using a hierarchy of finite state machines (FSM). Designed for scalable performance in P time: © All Rights Reserved 2014 | Neo Technology, Inc. 23
  • 24. Influences & Inspirations + = p p p p p p X Y Z © All Rights Reserved 2014 | Neo Technology, Inc. 24 Ray Kurzweil (Pattern Recognition Theory of Mind) Jeff Hawkins (Hierarchical Temporal Memory) Hierarchical Pattern Recognition
  • 25. How does feature extraction work? p © All Rights Reserved 2014 | Neo Technology, Inc. 25 Hierarchical Pattern Recognition “Deep” feature representations are learned and associated with labels that are mapped to documents that the feature was discovered in. The feature hierarchy is translated into a Vector Space Model for classification on feature vectors generated from unlabeled text. p p p p p X Y Z HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations.
  • 26. Graph-based feature learning © All Rights Reserved 2014 | Neo Technology, Inc. 26
  • 27. Learning new features from matches on training data © All Rights Reserved 2014 | Neo Technology, Inc. 27
  • 28. Cost Function for the Generations of Features Reproduction occurs after a threshold of matches has been exceeded for a feature. After replication the cost function is applied to increase that threshold every time the feature reproduces. is the current threshold on the feature node. is the minimum threshold, which I chose as 5 for new features. © All Rights Reserved 2014 | Neo Technology, Inc. Cost function: 28
  • 29. © All Rights 29 Reserved 2014 | Neo Technology, Inc.
  • 30. Vector Space Model © All Rights Reserved 2014 | Neo Technology, Inc. 30
  • 31. Generating Feature Vectors The natural language parsing model created during training can be turned into a global feature index. This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy. Using that global feature index, a multi-dimensional vector space is created with a length equal to the number of features in the hierarchy. © All Rights Reserved 2014 | Neo Technology, Inc. 31
  • 32. Relevance Rankings “Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” - Wikipedia © All Rights Reserved 2014 | Neo Technology, Inc. 32
  • 33. Vector-based Cosine Similarity Measure In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself: © All Rights Reserved 2014 | Neo Technology, Inc. 33
  • 34. Cosine Similarity & Vector Space Model © All Rights Reserved 2014 | Neo Technology, Inc. 34
  • 35. Vector-based Cosine Similarity Measure “The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.” © All Rights Reserved 2014 | Neo Technology, Inc. via Wikipedia 35
  • 36. Graphify for Neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 36
  • 37. Graphify for Neo4j Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition. © All Rights Reserved 2014 | Neo Technology, Inc. https://github.com/kbastani/graphify 37
  • 38. Example Project Head over to the GitHub project page and clone it to your local machine. Follow the directions listed in the README.md to install the extension. Navigate to the /examples directory of the project. © All Rights Reserved 2014 | Neo Technology, Inc. Run: examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java 38
  • 39. U.S. Presidential Speech Transcript Analysis © All Rights Reserved 2014 | Neo Technology, Inc. 39
  • 40. Identify the Political Affiliation of a Presidential Speech This example ingests a set of texts from presidential speeches with labels from the author of that speech in training phase. After building the training models, unlabeled presidential speeches are classified in the test phase. © All Rights Reserved 2014 | Neo Technology, Inc. 40
  • 41. The Presidents © All Rights Reserved 2014 | Neo Technology, Inc. • Ronald Reagan • labels: liberal, republican, ronald-reagan • George H.W. Bush • labels: conservative, republican, bush41 • Bill Clinton • labels: liberal, democrat, bill-clinton • George W. Bush • labels: conservative, republican, bush43 • Barack Obama • labels: liberal, democrat, barack-obama 41
  • 42. © All Rights Reserved 2014 | Neo Technology, Inc. Training Each of the presidents in the example have 6 speeches to analyze. 4 of the speeches are used to build a natural language parsing model. 2 of the speeches are used to test the validity of that model. 42
  • 43. Get Similar Labels/Classes © All Rights Reserved 2014 | Neo Technology, Inc. 43
  • 44. Class Similarity: Ronald Reagan. republican 0.7182046285385341, liberal 0.644281223102398, democrat 0.4854114595950056, conservative 0.4133639188595147, bill-clinton 0.4057969121945167, barack-obama 0.323947855372623, bush41 0.3222644898334092, bush43 0.3161309849153592
  • 45. Class Similarity: George H.W. Bush. conservative 0.7032274806766954, republican 0.6047256274615608, liberal 0.4439742461594541, democrat 0.39114918238853674, bill-clinton 0.3234223107986785, ronald-reagan 0.3222644898334092, barack-obama 0.2929260544514002, bush43 0.29106733975087984
  • 46. Class Similarity: Bill Clinton. democrat 0.8375678825642422, liberal 0.7847858060182163, republican 0.5561860529059708, conservative 0.45365774896422445, barack-obama 0.4507676679770066, ronald-reagan 0.4057969121945167, bush43 0.365042482383354, bush41 0.3234223107986785
  • 47. Class Similarity: George W. Bush. conservative 0.820636570272315, republican 0.7056890956512284, liberal 0.5075788396061254, democrat 0.4505424322086937, bill-clinton 0.365042482383354, barack-obama 0.33801949243378965, ronald-reagan 0.3161309849153592, bush41 0.29106733975087984
  • 48. Class Similarity: Barack Obama. democrat 0.7668017370739147, liberal 0.7184792203867296, republican 0.4847680475425114, bill-clinton 0.4507676679770066, conservative 0.4149264161292232, bush43 0.33801949243378965, ronald-reagan 0.323947855372623, bush41 0.2929260544514002
  • 49. Get involved in the Neo4j community
  • 53. http://neo4j.meetup.com/
  • 54. (Thank You)
  • 55. Get in touch. Twitter: www.twitter.com/kennybastani; LinkedIn: www.linkedin.com/in/kennybastani; GitHub: www.github.com/kbastani
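
The class similarity scores on slides 44 to 48 are symmetric (for example, ronald-reagan and bill-clinton score 0.4057969121945167 in both directions), which is what one would expect from cosine similarity in a vector space model. The slides do not state the exact formula, so the following Python sketch is only an illustration under that assumption. The function names and the small feature vectors are hypothetical and are not taken from Graphify or from the speech data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_similar_classes(target, class_vectors):
    """Return (label, score) pairs for every other class, highest score first."""
    scores = [
        (label, cosine_similarity(class_vectors[target], vector))
        for label, vector in class_vectors.items()
        if label != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature weights per class (assumption: one weight per
# extracted text pattern). These are NOT the values behind the slides.
class_vectors = {
    "democrat": [3.0, 1.0, 0.0, 2.0],
    "republican": [1.0, 3.0, 2.0, 0.0],
    "barack-obama": [2.0, 1.0, 0.0, 2.0],
}

for label, score in rank_similar_classes("barack-obama", class_vectors):
    print(label, score)
```

With these made-up vectors, "democrat" ranks above "republican" for "barack-obama", mirroring the ordering pattern seen on the slides without reproducing the actual scores.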

Editor's Notes

  • #6: When we think about data, we tend to think about how things are connected. This is a natural part of how we talk about things, and also of the graph model. “This is also a graph, but with some data attached. Here: we’ve attached names to the nodes and described the type of the relationships.”
  • #7: “We can take this further, and attach arbitrary key/value pairs.” This is the Property Graph Model, which has the following characteristics: It contains Nodes and Relationships, both of which can contain properties (key-value pairs). Relationships are always between exactly 2 nodes. They have a type, and they are directed. “There are other graph models; however, everyone in the industry has converged on the idea that this model is the most obvious and the most useful for real humans and the applications we’re building.”
  • #8: Let’s review the relational table model, to see how it differs from the property graph model.
  • #9: Start with Customers and Accounts “We have a customer, Alice.” “She’s got 3 accounts” “To keep track of which accounts Alice owns, we need a 3rd table, to store the mapping. Typically called a join table.”
  • #11: Dashboard, for monitoring of key stats. Node, Relationship and Property “counts” are just estimates (they actually represent the allocated ID space for each graph entity).
  • #12: “The Console is where you can run graph queries, written in Cypher.” We’ll be using this starting... now.
  • #24: Disclaimer: This is a graph-based approach to text classification and pattern recognition. This can be done in many different ways, including SVM, Bayesian networks, belief networks, and many other approaches. I chose to create this on top of Neo4j because, first, it’s a database and, second, it’s already formatted as a network. This gives me the advantage of not worrying about data storage.
  • #27: Explain how the genetic algorithm works.
  • #41: I chose this example project because it’s easy to get presidential speeches online and it seemed like a good example to get others going with Graphify.
  • #50: “Get involved with the community, attend meetups, browse our open source code libraries, including Neo4j, by visiting us on GitHub.”
  • #51: “Visit stackoverflow.com with the tag Neo4j to get fast answers to your questions. We have a very active community of contributors who provide thorough answers 24/7. If you get stuck, make sure you head there.”
  • #52: “The same goes for Google Groups, if you prefer that format over Stack Overflow.”
  • #53: “You can visit us on GitHub to submit or browse issues.”
  • #54: “Finally, I urge you to check out our website’s meetup page to find out where meetups are happening all around the world. Also we encourage you to share your experience with Neo4j, your applications, and your use cases by speaking at a local meetup. If you’re interested, please reach out to me, my contact details are in the next slide.”
  • #55: “Thank you for spending some time with me and learning about Neo4j and Cypher.”
  • #56: “Get in touch with me about meetups and Neo4j community events happening around the world.” “I’ll now open up the floor to questions.”