Slide 1 Copyright © 2012 MarkLogic® Corporation. All rights reserved.
Evolving MyBuzzMetrics with Text Analytics
September 2012
Eric Austvold – Insights Executive
Fernando Mesa – WW Director of Enterprise Solution
Pete Aven – Systems Engineer
Slide 2
Agenda
§ Introductions
§ NM Incite goals for text analytics
§ MarkLogic evolving MyBuzzMetrics with Text Analytics
§ Entity Extraction
§ Topic Discovery / Theme extraction
§ Data Faceting
§ Trend spotting
§ Visualization
§ Use Cases and Demos
§ Next Steps
Slide 3
Goals for Text Analytics
§ What’s your goal?
§ What are your clients asking you for?
§ How do you want to service your internal clients? Analysts,
researchers, account managers?
§ How do you want to service your external clients? Self service
reporting? Ad-hoc analysis? Integration with their data?
§ How do you envision your new solution complementing other
Nielsen services?
Slide 4
Text Analytics - Evolution
Traditional Methods
§ Reliant on relational data structures that are challenging to manage, silos of data
§ Not indexed immediately, not possible to query in real time
§ New Parses = Re-ingestion
§ Re-ingestion = new schema design – creates delays
§ Not real time – difficult to determine buzz
§ Impossible at 30+ billion docs
§ Pre-processing required to handle batches of data
§ Extraction methods lose context and full perspective
MarkLogic Enabled Methods
§ Flexible – Built on an infrastructure that can integrate text mining output
§ Context Aware – Without schema redesigns, context of original document persists as text miners enrich that content, preserving relationships to the original data
§ Scales – Can accommodate real time ad-hoc queries and reports across a corpus of 30+ billion documents
§ Enrichment – a better method of leveraging text mining work
Slide 5
The Parse
§ The Parse
§ Actor, Action, Object
§ Fact
§ Entity
§ Qualifier
§ Etc.
§ Basis Entity Enrichment
§ Open Enrichment Framework
§ Calais
§ Temis
§ Data Harmony
§ NetOwl
What it means… We can integrate with all enrichment engines.
Slide 6
The Platform
§ The Platform
§ Flexibility
§ Speed
§ Scale
§ Delivery of Insight
What it means… Clients can rapidly deliver insight in real time to help users discover new insights.
Slide 7 Copyright © 2009 Mark Logic Corporation. All rights reserved.
MarkLogic and Text Analytics
[Architecture diagram. Labels recovered from the figure:]
§ Web Services
§ ETL Connector (*)
§ Social Media Connector (*)
§ RDBMS connector
§ Search
§ Unified Index for all data structures
§ Transactional Database
§ Data Retrieval
§ Repository
§ Classification
§ Concept Extraction
§ Entity Enrichment
§ Web Applications
§ Decision Support
§ APIs/Services
§ Taxonomies
§ App Server
§ Third-party Partners
§ Analytics
Leverage value generated from text mining
Generate Opinions (in the form of data)
Slide 8
Traditional Enrichment = Extraction
First Name | Last Name | Other | Comments
Chris | Smith | Data | Chris bought an upgrade package for his black, 2011, Honda Pilot on 9/16. Car returned for service on 9/21. The bolt on the undercarriage cracked due to heat. He doesn’t think it’s the transmission however as …..

Actor | Action | Object
Chris | buy | package

Fact
package-buy
car-return
bolt-cracked

Entity | Type
Chris | person
Honda | organization
9/16 | date

Qualifiers
upgrade
black
More Parsing = More Tables/Rows = More Joins = Does Not Scale!
And What About Context?
Slide 9
Enrichment with MarkLogic
<actor><person>Chris </person></actor><action>bought
</action>an <qualifier>upgrade
</qualifier><object>package</object> for his
<qualifier>black </qualifier>, <qualifier>2011 </qualifier>,
<organization>Honda </organization> Pilot on
<date>9/16</date>. Car returned for service on <date>
9/21 </date>. The bolt on the undercarriage cracked due to
heat. <person name="Chris">He </person> doesn’t think
it’s the transmission however as …..
Pepsi<name> </name><brand> </brand><drink> </drink>
Markup Inline!
Every Tag Becomes a Candidate For an Index!
What it means… Enrichment persists context and scales without a schema redesign, saves time and resources as client needs evolve.
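To make the "markup inline" point concrete, here is a minimal sketch in plain Python (standard library only, not MarkLogic code). It assumes the enriched sentence from the slide is wrapped in a hypothetical <comment> root element, and shows that the original text survives intact while each tag name can be queried like an index.

```python
import xml.etree.ElementTree as ET

# Enriched sentence adapted from the slide; the <comment> wrapper is an assumption
# added so the fragment has a single root element.
ENRICHED = (
    "<comment><actor><person>Chris</person></actor> <action>bought</action> an "
    "<qualifier>upgrade</qualifier> <object>package</object> for his "
    "<qualifier>black</qualifier>, <qualifier>2011</qualifier>, "
    "<organization>Honda</organization> Pilot on <date>9/16</date>.</comment>"
)


def full_text(root):
    """Context persists: joining all text nodes gives back the original sentence."""
    return "".join(root.itertext())


def values(root, tag):
    """Treat a tag name as a lookup key and collect the text of every such element."""
    return [element.text for element in root.iter(tag)]


root = ET.fromstring(ENRICHED)
print(full_text(root))
print(values(root, "qualifier"))
print(values(root, "date"))
```

Adding a new kind of tag later only changes the markup, not a table layout, which is the contrast with the extraction approach on the previous slide.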
Slide 10
Text Mining is part of the big picture
Words and phrases
... Semantic Web is a collaborative
movement led by the World Wide Web
Consortium (W3C) ...
Structure (element labels from the figure: Label, Author, Ing, Comp, ID, Para, Org)
Data/Metadata
name:sorbitol
date:2012-06-04
company:Roche
Entities in Context
... diabetes, since the risk of
blindness is very high in such
patients...
Geospatial
<location>
<lat>46.946584</lat>
<lng>93.076172</lng>
</location>
Universal Index
Slide 11
Demo
Slide 12
Agenda
§ Introductions
§ Nielsen’s goals and challenges related to unstructured data
§ MarkLogic Beyond Big Data Search
§ Entity Extraction
§ Topic Discovery / Theme extraction
§ Data Faceting
§ Trend spotting
§ Visualization
§ Use Cases and Demos
§ Next Steps
Slide 15
MarkLogic Analytics
§ Why Use MarkLogic Analytics?
§ Term list analytics
§ Range index analytics
§ Combining term lists and range indexes
§ Range index best practices & references
Slide 16
MarkLogic Analytics – why use it?
Applications increasingly combine structured and unstructured
information (e.g., electronic healthcare records)
Show me male patients that are under the age of 45 with an
ADMITTING DIAGNOSIS that included Chest Pain, or with a
HISTORY OF PRESENT ILLNESS including symptoms for Chest
Pain, Shortness of Breath, or Dizziness. Additionally, identify patients
within this population with regular alcohol consumption in the SOCIAL
HISTORY, alcoholism in the FAMILY HISTORY, and one of the
following 17 synonyms for stress diagnoses in the ASSESSMENT
AND TREATMENT PLAN.
Structured Unstructured/Contextual
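The first half of the request above can be sketched as one predicate that mixes structured fields with free-text sections. This is an illustration in plain Python, not a MarkLogic query; the Record class, the sample records, and the substring matching are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    sex: str
    age: int
    sections: dict = field(default_factory=dict)  # section name -> free text


def mentions(text, phrases):
    """Case-insensitive test: does the free text contain any of the phrases?"""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def matches(record):
    # Structured part: male patients under 45.
    structured = record.sex == "M" and record.age < 45
    # Unstructured/contextual part: phrases scoped to a named section.
    contextual = (
        mentions(record.sections.get("ADMITTING DIAGNOSIS", ""), ["chest pain"])
        or mentions(
            record.sections.get("HISTORY OF PRESENT ILLNESS", ""),
            ["chest pain", "shortness of breath", "dizziness"],
        )
    )
    return structured and contextual


records = [
    Record("M", 39, {"ADMITTING DIAGNOSIS": "Chest pain, rule out MI"}),
    Record("M", 52, {"ADMITTING DIAGNOSIS": "Chest pain"}),
    Record("F", 30, {"HISTORY OF PRESENT ILLNESS": "Reports dizziness"}),
    Record("M", 41, {"HISTORY OF PRESENT ILLNESS": "Shortness of breath on exertion"}),
    Record("M", 28, {"ADMITTING DIAGNOSIS": "Fractured wrist"}),
]
hits = [index for index, record in enumerate(records) if matches(record)]
print(hits)
```

A real system would answer this from indexes rather than a linear scan; the sketch only shows how the structured and contextual conditions combine.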
Slide 32
Agenda
§ Text Enrichment
Slide 33
Text Enrichment – with Entities
§ Load, manipulate, query content as-is
§ … then enrich the content over time
§ Entity extraction
§ Specialized technology
§ Identifies people, places, things in free text
§ Entity extraction -> Entity enrichment
§ Entities are marked-up in-line
§ Gives you
§ More focused search (includes proximity, structure)
§ Analytics
§ Alerting
Slide 34
Enrich Your Content … With
Entities: Example
<Article xmlns:e="http://marklogic.com/entity">
<title><e:person>John Louis</e:person></title>
<acknowledgement><e:gpe>Wikipedia</e:gpe>, the free encyclopedia</acknowledgement>
<section>
<para>"Tiger" <e:person>John Louis</e:person> (born <e:date>14 June
1941</e:date>)[<refto ID="1">1</refto>] was an <e:gpe>England</e:gpe> international speedway
rider who rode for <e:organization>Ipswich Witches</e:organization>. He is the father of
<e:gpe>Great Britain</e:gpe> international <e:person>Chris Louis</e:person>.
<e:person>John</e:person> rode a weslake for most of his career.</para>
</section>
<section>
<title>Career history</title>
<para><e:person>John</e:person> finished third in the 1975 Speedway World
Championship and was part of the <e:organization>England Speedway World
Cup</e:organization> winning teams of 1972, 1974 and 1975. He was also World Pairs Champion
in 1976 with <e:person>Malcolm Simmons</e:person>. He also captained
<e:gpe>Ipswich</e:gpe> when they were <e:nationality>British</e:nationality> Champions in
1976. <e:person>John</e:person> won the <e:nationality>British</e:nationality> Speedway
Championship in 1975. He was also <e:organization>National League Riders</e:organization>
champion in 1971 and <e:organization>British League Riders</e:organization> champion in
1979.</para>
<para>He retired in 1984 and is now the promoter of <e:organization>Ipswich
Witches</e:organization>.</para>
</section>
</Article>
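Once entities are marked up in-line, simple analytics such as counts per entity type fall out of the markup. The sketch below is plain Python over a shortened copy of the article, not MarkLogic code; the function name entity_facets is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ENTITY_NS = "http://marklogic.com/entity"

# Shortened excerpt of the enriched article shown on the slide.
DOC = (
    '<Article xmlns:e="http://marklogic.com/entity">'
    "<para><e:person>John Louis</e:person> (born <e:date>14 June 1941</e:date>) "
    "rode for <e:organization>Ipswich Witches</e:organization>. He is the father of "
    "<e:person>Chris Louis</e:person>.</para></Article>"
)


def entity_facets(xml_text):
    """Count elements in the entity namespace, keyed by local name (the entity type)."""
    prefix = "{" + ENTITY_NS + "}"
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(prefix):
            counts[element.tag[len(prefix):]] += 1
    return counts


facets = entity_facets(DOC)
print(facets)
```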
Slide 35
Entity Enrichment With
MarkLogic Server
1. Rule Based using Built-in function
§ Can leverage a taxonomy to drive entity definition
§ Uses the Content Processing Framework to automate the process
2. Statistical Analysis using built-in Entity Enricher
§ Licensed BASIS for enrichment
§ For automated entity enrichment
3. External Using Partner Network
§ Seamless integration using Open Enrichment Framework
§ Can use a combination of tools (Best of Breed)
§ Can leverage both internal and external solutions
Three Approaches to Entity Enrichment
Slide 36
Entity Enrichment: Built-in
§ Take an XML node, and markup entities in that node
§ Substitute $expr for each entity in $node
§ Use any style of markup using $expr plus these variables:
§ $cts:node
§ $cts:text
§ $cts:entity-type
§ Advantage: the most flexible
§ Choose your style of markup
§ Choose which parts you want to markup
§ Choose which entities you want to use/ignore
cts:entity-highlight(
$node as node(),
$expr as item()*
) as node()
Slide 37
Entity Enrichment: Built-in:
Example(2)
cts:entity-highlight(
<a>John went to England</a>,
<entity>{
element {$cts:entity-type} {$cts:text}
}
</entity>
)
=>
<a>
<entity><PERSON>John</PERSON></entity>
went to
<entity><GPE>England</GPE></entity>
</a>
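The callback idea behind cts:entity-highlight (substitute an expression for each entity found) can be imitated outside the server. The following is a toy sketch in plain Python, not the MarkLogic implementation: the GAZETTEER dictionary stands in for a real entity extractor, and entity_highlight and as_entity_element are hypothetical names.

```python
import re

# Hypothetical lookup table standing in for a real entity extractor.
GAZETTEER = {"John": "PERSON", "England": "GPE"}


def entity_highlight(text, render):
    """Replace each known entity in text with render(entity_text, entity_type)."""
    # Longest names first so a longer entity wins over a shorter one it contains.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(name) for name in names) + r")\b")
    return pattern.sub(lambda m: render(m.group(1), GAZETTEER[m.group(1)]), text)


def as_entity_element(entity_text, entity_type):
    """Same markup style as the slide: <entity><TYPE>text</TYPE></entity>."""
    return f"<entity><{entity_type}>{entity_text}</{entity_type}></entity>"


result = "<a>" + entity_highlight("John went to England", as_entity_element) + "</a>"
print(result)
```

Passing a different render function changes the markup style without touching the matching step, which is the flexibility the slide describes. The sketch does no XML escaping, so it is only safe for plain text like the example.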
Slide 38
BASIS Enrichment: What Gets
Tagged?
With the built-in entity enrichment, you can tag:
person
organization
location
GPE (geopolitical entity)
facility
religion
nationality
credit card number
email
latitude/longitude
money
percent
ID (personal ID number)
phone number
URL
UTM
date
time
Slide 39
Entity Enrichment Framework
§ You have a choice …
§ There are several Entity Extraction engines available
§ No engine is best-of-breed for all knowledge domains, all
languages
§ The Open Enrichment Framework lets you choose an engine that
suits your needs to extract more domain-specific entities and/or
support additional languages
§ Pipelines available
§ Temis Luxid
§ Open Calais
§ Data Harmony
§ NetOWL
§ Add other pipelines yourself
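The "choose an engine, or add your own pipeline" idea can be pictured as a small plug-in registry. This is a sketch in plain Python under stated assumptions: it is not the Open Enrichment Framework API, and the two toy engines are invented for the example rather than wrappers around any listed product.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]                  # (matched text, entity type)
Engine = Callable[[str], List[Entity]]    # an engine maps text to entities

PIPELINES: Dict[str, Engine] = {}


def pipeline(name: str):
    """Decorator that registers an engine under a pipeline name."""
    def register(engine: Engine) -> Engine:
        PIPELINES[name] = engine
        return engine
    return register


@pipeline("toy-brands")
def brand_engine(text: str) -> List[Entity]:
    return [(word, "brand") for word in ("Honda", "Pepsi") if word in text]


@pipeline("toy-parts")
def parts_engine(text: str) -> List[Entity]:
    return [(word, "part") for word in ("bolt", "transmission") if word in text]


def enrich(text: str, names: List[str]) -> List[Entity]:
    """Run the chosen engines in order and merge their entities without duplicates."""
    found: List[Entity] = []
    for name in names:
        for entity in PIPELINES[name](text):
            if entity not in found:
                found.append(entity)
    return found


entities = enrich("The bolt on the Honda cracked", ["toy-brands", "toy-parts"])
print(entities)
```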
Slide 40
Agenda
§ Classification
Slide 41
Classification With MarkLogic
Server
1. Rule Based using Reverse Queries
§ Match documents against a pre-defined rule and automatically tag
content
§ Can use both Forward and Reverse queries for sophisticated
scenarios. We call it Match-making
2. Statistical Classification using built-in SVM Classifier
3. External Using Partner Network
§ Seamless integration using Open Enrichment Framework
§ Can use a combination of tools (Best of Breed)
§ Can leverage both internal and external solutions
Three Approaches to Classification
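Rule-based classification with reverse queries inverts the usual search: the rules are stored, and each incoming document is matched against them. A minimal sketch in plain Python follows; the rule format (an "all" list and an "any" list of terms) and the tag names are assumptions for illustration, not MarkLogic's query representation.

```python
import re

# Hypothetical stored rules: a document gets the tag when every "all" term
# and at least one "any" term appears in it.
RULES = {
    "automotive-service": {"all": ["service"], "any": ["car", "honda"]},
    "purchase": {"all": ["bought"], "any": ["package", "upgrade"]},
}


def tokens(text):
    """Lowercased alphanumeric word set of a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted tags of every stored rule the document satisfies."""
    words = tokens(text)
    return sorted(
        tag
        for tag, rule in rules.items()
        if all(term in words for term in rule["all"])
        and any(term in words for term in rule["any"])
    )


print(classify("Chris bought an upgrade package for his Honda Pilot"))
print(classify("Car returned for service on 9/21"))
```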
Slide 42
Agenda
§ Trend Spotting
Slide 43
Trend Spotting With MarkLogic
Server
1. Co-Occurrences with Frequency Rules
§ Spot trends in Business Entities and their relationship to other
concepts as they bubble up and surface above the noise
§ Use Co-Occurrence Analytical Indexes paired with Alerting to signal
trends and anomalies in real-time
Analytics + Alerting
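Co-occurrence with a frequency rule can be sketched as counting how often two entities appear in the same document and flagging pairs that reach a threshold. The code below is plain Python, not the MarkLogic co-occurrence or alerting API; the sample documents and the threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations


def co_occurrences(docs):
    """Count, per unordered pair of entities, the documents that contain both."""
    counts = Counter()
    for entities in docs:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def alerts(counts, min_frequency):
    """Frequency rule: report the pairs seen together at least min_frequency times."""
    return sorted(pair for pair, n in counts.items() if n >= min_frequency)


# Each inner list is the set of entities tagged in one document (made-up data).
docs = [
    ["Honda", "recall"],
    ["Honda", "recall", "bolt"],
    ["Pepsi", "launch"],
    ["Honda", "bolt"],
]
counts = co_occurrences(docs)
print(alerts(counts, min_frequency=2))
```

In a real-time setting the counts would be updated as documents arrive and the rule evaluated on each update; the batch version here only shows the logic.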
Slide 44
Agenda
§ Other Text Analytics
Slide 45
Additional Text Analytics
1. Linking of unstructured Information
§ cts:similar-query to find related pieces of information in unstructured
documents
§ External Tools for finger-printing (find loose associations)
2. Query Expansion using Synonyms and Taxonomies
§ Narrow / Broaden Analytics
§ Parent / Child
§ Associative & Equivalent
3. Type-Ahead using Lexicons
§ Support for high-speed distinct values in entire database or in a
segment
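Two of the items above are easy to picture in a few lines: query expansion looks a term up in a synonym table, and type-ahead is a prefix scan over a sorted list of distinct values. This is a sketch in plain Python with made-up data, not the MarkLogic lexicon or thesaurus API.

```python
import bisect

# Hypothetical sorted list of distinct values and a tiny synonym table.
LEXICON = sorted(["honda", "honda pilot", "hybrid", "pepsi", "pilot"])
SYNONYMS = {"car": ["automobile", "vehicle"]}


def type_ahead(prefix, lexicon=LEXICON, limit=10):
    """Return up to limit values starting with prefix, using binary search to find the start."""
    start = bisect.bisect_left(lexicon, prefix)
    suggestions = []
    for value in lexicon[start:]:
        if not value.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(value)
    return suggestions


def expand(term, synonyms=SYNONYMS):
    """Broaden a query term with its equivalent terms, keeping the original first."""
    return [term] + synonyms.get(term, [])


print(type_ahead("ho"))
print(expand("car"))
```

Because the values are kept sorted, finding the first candidate costs a binary search rather than a scan, which is why sorted distinct-value lists suit type-ahead.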
Slide 46
Math and Statistical analytical
functions
1. math:variance-p
2. cts:variance-p
3. math:variance
4. cts:variance
5. math:stddev-p
6. cts:stddev-p
7. math:stddev
8. cts:stddev
9. math:covariance-p
10. cts:covariance-p
11. math:covariance
12. cts:covariance
13. math:correlation
14. cts:correlation
15. math:linear-model
16. cts:linear-model
17. math:median
18. cts:median
19. math:percentile
20. cts:percentile
21. math:mode
22. math:rank
23. cts:rank
24. math:percent-rank
25. cts:percent-rank
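For readers unfamiliar with the paired names, the sketch below shows the population versus sample distinction using Python's statistics module. It assumes, based on the naming, that the "-p" functions are the population forms and the unsuffixed ones the sample forms; the sample values are made up, and none of this calls MarkLogic.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population forms divide by n (assumed counterpart of the "-p" functions).
population_variance = statistics.pvariance(values)
population_stddev = statistics.pstdev(values)

# Sample forms divide by n - 1 (assumed counterpart of the unsuffixed functions).
sample_variance = statistics.variance(values)
sample_stddev = statistics.stdev(values)

median = statistics.median(values)
mode = statistics.mode(values)

print(population_variance, population_stddev)
print(sample_variance, sample_stddev)
print(median, mode)
```

The sample forms are always a little larger than the population forms on the same data, which matters when the values are a subset rather than the whole corpus.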
Slide 47
Next Steps
Slide 48
The Only Operational Database Technology for
Mission-Critical Big Data Applications
