Searching for Meaning:
The hidden structure in unstructured data
Trey Grainger
SVP of Engineering, Lucidworks
Southern Data Science Conference
2018.04.13
About Me
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Advisor to Presearch, the decentralized search engine
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene / Solr contributor
Based in San Francisco, offices
and employees worldwide
Fusion, the platform for building
data-driven, smart apps
Over 400 customers running our
commercial software
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary commercial
contributor to the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
Fusion powers search for the brightest companies in the world.
most often used in
reference to
My Three Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Assertion 1:
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Southern
Data Science
Structured Data

Employees Table
id      name            company   start_date
lw100   Trey Grainger   1234      2016-02-01
dis2    Mickey Mouse    9123      1928-11-28
tsla1   Elon Musk       5678      2003-07-01

Companies Table
id      name         start_date
1234    Lucidworks   2016-02-01
5678    Tesla        2003-07-01
9123    Disney       1928-11-28

Callouts: Discrete Values, Continuous Values, Foreign Key (Employees.company → Companies.id)
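As a minimal sketch of what the foreign key buys you, the snippet below resolves each employee's `company` value against the Companies table with an exact lookup. The dictionary layout and the helper name `employer_of` are illustrative assumptions, not part of the talk.

```python
# Rows from the Employees table on this slide (start dates omitted for brevity).
employees = [
    {"id": "lw100", "name": "Trey Grainger", "company": 1234},
    {"id": "dis2", "name": "Mickey Mouse", "company": 9123},
    {"id": "tsla1", "name": "Elon Musk", "company": 5678},
]

# Companies table keyed by its primary key.
companies = {1234: "Lucidworks", 5678: "Tesla", 9123: "Disney"}


def employer_of(employee):
    """Follow the foreign key: an exact match on Companies.id (hypothetical helper)."""
    return companies[employee["company"]]


joined = {e["name"]: employer_of(e) for e in employees}
print(joined)
```

The join only works because the key matches exactly; the following slides ask what the equivalent of that key is when the same facts are written as free text.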
Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018. Southern Data
Science Conference (SDSC) is being held in Atlanta
April 12-14, 2018. Trey got his masters from
Georgia Tech.
Unstructured Data
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018.
Trey’s Voicemail
Foreign Key?
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018.
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018.
Fuzzy Foreign Key? (Entity Resolution)
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018.
Fuzzier Foreign Key? (metadata, latent features)
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Not so Fast!
Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
He is speaking at the SDSC 2018.
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Assertion 1 (Summary):
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Assertion 2:
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Semantic Data Encoded into Free Text Content
Node Type: Character Sequence
    e, en, eng, engi, engineer, engineers
Node Type: Term
    engineer, engineers, engineering, software, …
Node Type: Term Sequence
    software engineer, software engineers, electrical engineering, …
Node Type: Document
    id: 1
    text: looking for a software engineer with degree in computer science or electrical engineering
    id: 2
    text: apply to be a software engineer and work with other great software engineers
    id: 3
    text: start a great career in electrical engineering
    …
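The node types on this slide can be sketched in a few lines: character sequences as prefixes of a term, terms as whitespace-separated tokens, and term sequences as word n-grams. The function names and the choice of n=2 are assumptions made for illustration.

```python
def char_sequences(term):
    """All prefixes of a term, e.g. e, en, eng, ... (character-sequence nodes)."""
    return [term[:i] for i in range(1, len(term) + 1)]


def terms(text):
    """Lowercased whitespace tokens (term nodes)."""
    return text.lower().split()


def term_sequences(text, n=2):
    """Consecutive runs of n terms (term-sequence nodes)."""
    tokens = terms(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "apply to be a software engineer and work with other great software engineers"
print(char_sequences("engineer"))
print(term_sequences(doc))
```

Every document therefore contributes nodes at several granularities at once, which is the sense in which free text is "hyper-structured".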
How do we easily harness this
“semantic graph” of relationships
within unstructured information?
Search Engines are really good at querying
across character sequences, term sequences,
and documents
Example Queries:
c?o                          CTO, CEO, CFO, …
"VP Engineering"~2           “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word   “MS Word”, “Microsoft Word”
An inverted index (“how a search engine works”)

What you SEND to Lucene/Solr:
Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):
Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
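A conceptual sketch of the indexing step, using the five documents from this slide. The tokenizer (lowercase, alphanumeric runs only) is an assumption; real Lucene/Solr analysis chains are configurable and do considerably more.

```python
import re
from collections import Counter, defaultdict

docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately simple analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


index = build_inverted_index(docs)
print(index["the"])   # term frequencies per document for "the"
print(index["once"])
```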
Matching queries to documents
/solr/collection/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: apache only: doc5; solr only: doc7, doc8; apache AND solr: doc1, doc3, doc4]
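Matching then reduces to set operations over postings lists. The sketch below uses the three postings lists shown on this slide; `match_all` and `match_any` are hypothetical helper names, and real engines use sorted postings with skip lists rather than Python sets.

```python
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms):
    """Documents containing every term (AND): intersect the postings lists."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms):
    """Documents containing at least one term (OR): union the postings lists."""
    return set().union(*(postings.get(t, set()) for t in terms))


print(sorted(match_all(["apache", "solr"])))
print(sorted(match_any(["apache", "solr"])))
```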
Search engines also do relevancy ranking (query to doc)

$$\mathrm{Score}(q, d) = \sum_{t \text{ in } q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

Where:
t = term; d = document; q = query; i = index

$$\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$|d| = \sum_{t \text{ in } d} 1$$

$$\mathrm{avgdl} = \frac{\sum_{d \text{ in } i} |d|}{\sum_{d \text{ in } i} 1}$$

k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency
saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document
normalization.
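The scoring formula can be traced with a short sketch that follows the definitions on this slide literally (square-root tf, the idf shown above). The toy corpus, the function name `score`, and the default k and b values are assumptions; this is not a reproduction of Lucene's own similarity code.

```python
import math

# Toy corpus: each document is a list of terms (punctuation already removed).
corpus = [
    "once upon a time in a land far far away".split(),
    "the cow jumped over the moon".split(),
    "the quick brown fox jumped over the lazy dog".split(),
    "the cat in the hat".split(),
    "the brown cow said moo once".split(),
]


def score(query_terms, doc, docs, k=1.2, b=0.75):
    """Sum the per-term contributions defined on the slide."""
    num_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / num_docs
    total = 0.0
    for t in query_terms:
        occurrences = doc.count(t)
        if occurrences == 0:
            continue  # terms absent from the document contribute nothing
        tf = math.sqrt(occurrences)
        doc_freq = sum(1 for d in docs if t in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        length_norm = 1 - b + b * len(doc) / avgdl
        total += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return total


query = ["brown", "cow"]
ranked = sorted(range(len(corpus)), key=lambda i: score(query, corpus[i], corpus), reverse=True)
print(ranked)  # document positions, best match first
```

Raising k lets repeated occurrences keep adding to the score for longer before saturating; raising b penalizes documents that are longer than average more heavily.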
DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
What is the Semantic Knowledge Graph?
• “A compact, auto-generated model for real-time traversal and
ranking of any relationship within a domain”
• A multi-dimensional term-to-term (vs. term-to-document) search
engine
• A tool which enables knowledge modeling and reasoning, natural language
processing, anomaly detection, data cleansing, semantic search, analytics,
data classification, root cause analysis, and recommendation systems
• It’s kind of like Word2Vec, but vectors (or matrices) are generated
on the fly and are better suited for interpreting the nuanced intent of
typical search queries
Open Sourced!
Knowledge
Graph
Documents
id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy
id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Docs-Terms Forward Index
field       doc   terms
desc        1     a, at, company, engineer, great, software
desc        2     a, at, doing, hard, hospital, nurse, registered, work
desc        3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Terms-Docs Inverted Index
field       term             postings list
                             doc   pos
desc        a                1     4
                             2     1
                             3     1, 5
            at               1     3
                             2     4
            company          1     6
            doing            2     6
                             3     8
            engineer         1     2
                             3     3, 7
            great            1     5
            hard             2     7
            hospital         2     5
            java             3     6
            nurse            2     3
            or               3     4
            registered       2     2
            software         1     1
                             3     2
            work             2     8
                             3     9
job_title   java developer   3     1
…           …                …     …
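A positional postings list like the one above can be derived directly from the three `desc` values in the example documents. This is a minimal sketch assuming whitespace tokenization and 1-based positions; the function name is hypothetical.

```python
from collections import defaultdict

descriptions = {
    1: "software engineer at a great company",
    2: "a registered nurse at hospital doing hard work",
    3: "a software engineer or a java engineer doing work",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [1-based positions of the term in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term][doc_id].append(position)
    return index


positional = build_positional_index(descriptions)
print(dict(positional["engineer"]))
print(dict(positional["a"]))
```

Keeping positions is what makes phrase and proximity queries possible, since the engine can check that matching terms sit next to (or near) each other.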
How the Graph Traversal Works
[Figure: the same skill-to-document relationships in three views: Data Structure View, Set-theory View, and Graph View. Nodes: skill: Java, skill: Scala, skill: Hibernate, skill: Oncology, and doc 1 through doc 6. Groupings shown: Java with docs 1, 2, 6; Scala and Hibernate with docs 3, 4; Oncology with doc 5.]
Multi-level Traversal
[Figure: Data Structure View and Graph View. Data Structure View: skill nodes (Java, Scala, Hibernate, Oncology), doc 1 through doc 6, and job_title nodes (Software Engineer, Data Scientist, Java Developer, …), connected by alternating Inverted Index Lookup and Forward Index Lookup steps. Graph View: Java, Scala, and Hibernate skill nodes linked to Java Developer, Software Engineer, and Data Scientist by has_related_job_title edges.]
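One hop of the traversal can be sketched as an inverted index lookup (node to documents) followed by a forward index lookup (documents to nodes in another field). The six documents below are hypothetical data loosely modeled on the figure, and `traverse` is an illustrative helper, not the Semantic Knowledge Graph API.

```python
from collections import defaultdict

# Forward index: doc id -> field -> values (hypothetical example data).
forward = {
    1: {"skill": ["java"], "job_title": ["software engineer"]},
    2: {"skill": ["java"], "job_title": ["java developer"]},
    3: {"skill": ["java", "scala", "hibernate"], "job_title": ["java developer"]},
    4: {"skill": ["java", "scala", "hibernate"], "job_title": ["data scientist"]},
    5: {"skill": ["oncology"], "job_title": ["registered nurse"]},
    6: {"skill": ["java"], "job_title": ["software engineer"]},
}

# Inverted index: (field, value) -> set of doc ids.
inverted = defaultdict(set)
for doc_id, fields in forward.items():
    for field, values in fields.items():
        for value in values:
            inverted[(field, value)].add(doc_id)


def traverse(field, value, target_field):
    """Inverted lookup to find docs, then forward lookup to count related nodes."""
    counts = defaultdict(int)
    for doc_id in inverted.get((field, value), set()):
        for related in forward[doc_id].get(target_field, []):
            counts[related] += 1
    return dict(counts)


print(traverse("skill", "scala", "job_title"))
print(traverse("skill", "java", "skill"))
```

Chaining calls (skill to skill, then skill to job_title) gives the multi-level traversal; the raw co-occurrence counts here stand in for the edge scores discussed on the next slide.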
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
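The z formula above translates directly into code. The counts in the usage lines are made-up numbers for illustration. The relatedness values in the JSON all fall between -1 and 1, so some normalization of z is presumably applied afterwards; that step is not shown on the slide and is not attempted here.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """z for a term x, given:
    count_fg      - foreground docs containing x (countFG)
    total_docs_fg - total docs in the foreground set (totalDocsFG)
    prob_bg       - probability of x in the background corpus (probBG)
    """
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    return (count_fg - expected) / std_dev


# Hypothetical numbers: 1,000 foreground docs, background probability 5%.
print(z_score(300, 1000, 0.05))  # far more frequent than expected: large positive z
print(z_score(50, 1000, 0.05))   # as frequent as in the background: z near 0
print(z_score(10, 1000, 0.05))   # rarer than expected: negative z
```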
Foreground Query: "Hadoop" (results ordered from most positively related, +, to most negatively related, -)
Multi-level Graph Traversal with Scores
[Figure: a starting node, software engineer* (materialized node), connects through has_related_skill edges to skill nodes (Java, C#, .NET, Hibernate, Scala, VB.NET) and through has_related_job_title edges to job title nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist). Each edge carries a relatedness score; the scores shown range from 0.34 to 0.93.]
Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg
Who’s in Love with Jean Grey?
Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Assertion 3:
Every instance of a word or phrase you
ever encounter has a unique meaning.
Thought Exercise
What do you think of when I say the
word “driver”?
Ambiguity
Example     Related Keywords (representing multiple meanings)
driver      truck driver, linux, windows, courier, embedded, cdl, delivery
architect   autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…           …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Use Case: Query Disambiguation
Disambiguated meanings (represented as term vectors)
Example     Related Keywords (Disambiguated Meanings)
architect   1: enterprise architect, java architect, data architect, oracle, java, .net
            2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver      1: linux, windows, embedded
            2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer    1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
            2: graphic, web designer, design, web design, graphic design, graphic designer
            3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…           …
Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
• User searched for “windows AND driver” vs. “courier OR driver”
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
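The three rules above can be sketched as a simple overlap test: gather context terms from the user's profile and the rest of the query, pick the meaning whose term vector shares the most terms with that context, and fall back to a default when nothing overlaps. The sketch assumes the meanings are listed most common first, which the slide does not state; `pick_sense` is a hypothetical helper.

```python
senses = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms):
    """Return the index of the meaning that best matches the context.

    context_terms may mix prior user knowledge (past searches, profile)
    with the other terms in the current query. With no overlap at all,
    fall back to index 0 (assumed to be the most common meaning).
    """
    context = {c.lower() for c in context_terms}
    candidates = senses[term]
    overlaps = [len(sense & context) for sense in candidates]
    best = max(range(len(candidates)), key=lambda i: overlaps[i])
    return best if overlaps[best] > 0 else 0


print(pick_sense("driver", ["windows"]))       # query context: windows AND driver
print(pick_sense("driver", ["courier"]))       # query context: courier OR driver
print(pick_sense("driver", ["c++", "linux"]))  # user history: past searches
```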
Thought Exercise
What do you think of when I say the
word “Apple”?
Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Why do we care?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
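A rough sketch of the final expansion step: once the query has been split into concepts and each concept has a list of related terms, the expanded query is assembled by boosting the original phrase and OR-ing in the alternatives. The related terms are passed in by hand here (in the talk they would come from the knowledge graph), the helper names are hypothetical, and the geo-filter clause is left out.

```python
import re


def fmt(term):
    """Quote anything that is not a single plain word."""
    return term if re.fullmatch(r"[A-Za-z0-9]+", term) else f'"{term}"'


def expand_clause(original, related, boost=10):
    """Boost the user's own phrase and OR it with related terms."""
    parts = [f"{fmt(original)}^{boost}"] + [fmt(r) for r in related]
    return "(" + " OR ".join(parts) + ")"


def expand_query(concepts, boost=10):
    """AND together one expanded clause per recognized concept."""
    return " AND ".join(expand_clause(o, r, boost) for o, r in concepts)


concepts = [
    ("machine learning", ["data scientist", "data mining", "artificial intelligence"]),
    ("research and development", ["r&d"]),
    ("software engineer", ["software developer"]),
    ("hadoop", ["big data", "hbase", "hive"]),
    ("java", ["j2ee"]),
]
print(expand_query(concepts))
```

Boosting the original phrase keeps documents that match the user's exact wording ranked above those that only match an expansion.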
Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: ctwsdsc18