Text mining and the Semantic Web
Dr Diana Maynard
NLP Group
Department of Computer Science
University of Sheffield
http://nlp.shef.ac.uk
Structure of this lecture
Text Mining and the Semantic Web
Text Mining Components / Methods
Information Extraction
Evaluation
Visualisation
Summary
University of Manchester 15 March
Introduction to Text Mining and
the Semantic Web
What is Text Mining?
Text mining is about knowledge discovery from
large collections of unstructured text.
It's not the same as data mining, which is
more about discovering patterns in structured
data stored in databases.
Similar techniques are sometimes used,
but text mining has many additional
constraints caused by the unstructured nature
of the text and the use of natural language.
Information extraction (IE) is a major
component of text mining.
IE is about extracting facts and structured
information from unstructured text.
Challenge of the Semantic Web
The Semantic Web requires machine-processable,
repurposable data to complement
hypertext
Such metadata can be divided into two types of
information: explicit and implicit. IE is mainly
concerned with implicit (semantic) metadata.
More on this later
Text mining components and
methods
Text mining stages
Document selection and filtering (IR
techniques)
Document pre-processing (NLP
techniques)
Document processing (NLP / ML /
statistical techniques)
Stages of document processing
Document selection involves identification and retrieval of
potentially relevant documents from a large set (e.g. the
web) in order to reduce the search space. Standard or
semantically-enhanced IR techniques can be used for this.
Document pre-processing involves cleaning and preparing
the documents, e.g. removal of extraneous information,
error correction, spelling normalisation, tokenisation, POS
tagging, etc.
Document processing consists mainly of information
extraction
For the Semantic Web, this is realised in terms of metadata
extraction
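The pre-processing stage can be illustrated with a minimal sketch. It assumes plain English text and uses naive regular expressions for cleaning, sentence splitting and tokenisation; the function names are hypothetical, and a real system would use proper NLP components (and add steps such as POS tagging).

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

def preprocess(raw):
    """Clean the text, then return one list of tokens per sentence."""
    # Cleaning step: collapse runs of whitespace left over from markup removal.
    cleaned = re.sub(r"\s+", " ", raw)
    return [tokenise(s) for s in split_sentences(cleaned)]

sentences = preprocess("Dr Head  became the new director.\nHe starts in March!")
for tokens in sentences:
    print(tokens)
```

The splitter would wrongly break after abbreviations such as "Dr." written with a full stop, which is one reason this stage is normally handled by dedicated tools.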
Metadata extraction
Metadata extraction consists of two types:
Explicit metadata extraction involves information
describing the document, such as that contained
in the header information of HTML documents
(titles, abstracts, authors, creation date, etc.)
Implicit metadata extraction involves semantic
information deduced from the material itself, i.e.
endogenous information such as names of entities
and relations contained in the text. This essentially
involves Information Extraction techniques, often
with the help of an ontology.
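Explicit metadata extraction is the simpler of the two cases, and a minimal sketch may help. It assumes the metadata sits in the title and meta elements of an HTML header; the class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()

def extract_explicit_metadata(html):
    """Return a dict of the explicit metadata found in the document header."""
    parser = HeaderMetadataParser()
    parser.feed(html)
    return parser.metadata

# Hypothetical sample page, for illustration only.
page = """<html><head>
<title>Text mining and the Semantic Web</title>
<meta name="author" content="Diana Maynard">
<meta name="date" content="15 March">
</head><body><p>Lecture notes.</p></body></html>"""

metadata = extract_explicit_metadata(page)
print(metadata)
```

Implicit metadata cannot be read off the markup in this way; it has to be deduced from the text itself.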
Information Extraction (IE)
IE is not IR
IR pulls documents
from large text
collections (usually the
Web) in response to
specific keywords or
queries. You analyse
the documents.
IE pulls facts and
structured information
from the content of large
text collections. You
analyse the facts.
IE for Document Access
With traditional query engines, getting the facts can
be hard and slow
Where has the Queen visited in the last year?
Which places on the East Coast of the US
have had cases of West Nile Virus?
Which search terms would you use to get this kind
of information?
How can you specify you want someone's home
page?
IE returns information in a structured way
IR returns documents containing the relevant
information somewhere (if you're lucky)
IE as an alternative to IR
IE returns knowledge at a much deeper
level than traditional IR
Constructing a database through IE and
linking it back to the documents can
provide a valuable alternative search tool.
Even if results are not always accurate,
they can be valuable if linked back to the
original text
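The idea of a database built through IE and linked back to the documents can be sketched as follows. The table layout, the sample document and the extractor output are all hypothetical; the point is only that each stored fact keeps a document identifier and character offsets, so a result can always be shown in its original context.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE fact (
           doc_id TEXT, entity TEXT, entity_type TEXT,
           start_offset INTEGER, end_offset INTEGER)"""
)

# Hypothetical document collection.
documents = {"doc1": "Dr Head became the new director of Shiny Rockets Corp."}

# Hypothetical output of an IE system: (doc_id, entity string, entity type).
extracted = [
    ("doc1", "Dr Head", "Person"),
    ("doc1", "Shiny Rockets Corp", "Organisation"),
]

for doc_id, entity, entity_type in extracted:
    start = documents[doc_id].find(entity)
    conn.execute(
        "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
        (doc_id, entity, entity_type, start, start + len(entity)),
    )

def find_in_context(entity_type):
    """Return each matching fact with the source text span it came from."""
    rows = conn.execute(
        "SELECT doc_id, entity, start_offset, end_offset FROM fact "
        "WHERE entity_type = ? ORDER BY doc_id, start_offset",
        (entity_type,),
    )
    return [(doc_id, entity, documents[doc_id][start:end])
            for doc_id, entity, start, end in rows]

people = find_in_context("Person")
print(people)
```

Because the offsets point back into the text, a user can check an extracted fact against the source even when the extraction is not fully accurate.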
Some example applications
HaSIE
KIM
Threat Trackers
HaSIE
Application developed by University of
Sheffield, which aims to find out how
companies report about health and safety
information
Answers questions such as:
How many members of staff died or had accidents
in the last year?
Is there anyone responsible for health and safety?
What measures have been put in place to improve
health and safety in the workplace?
HaSIE
Identification of such information is too
time-consuming and arduous to be done
manually
IR systems can't cope with this because
they return whole documents, which could
be hundreds of pages
System identifies relevant sections of each
document, pulls out sentences about
health and safety issues, and populates a
database with relevant information
HaSIE
KIM
KIM is a software platform developed by
Ontotext for semantic annotation of text.
KIM performs automatic ontology
population and semantic annotation for
Semantic Web and KM applications
Indexing and retrieval (an IE-enhanced
search technology)
Query and exploration of formal
knowledge
KIM
Ontotext's KIM query and results
Threat tracker
Application developed by Alias-I which finds and
relates information in documents
Intended for use by Information Analysts who
use unstructured news feeds and standing
collections as sources
Used by DARPA for tracking possible
information about terrorists etc.
Identification of entities, aliases, relations etc.
enables you to build up chains of related people
and things
Threat tracker
What is Named Entity
Recognition?
Identification of proper names in texts,
and their classification into a set of
predefined categories of interest
Persons
Organisations (companies, government
organisations, committees, etc)
Locations (cities, countries, rivers, etc)
Date and time expressions
Various other types as appropriate
Why is NE important?
NE provides a foundation from which to build
more complex IE systems
Relations between NEs can provide tracking,
ontological information and scenario building
Tracking (co-reference): "Dr Head", "John", "he"
Ontologies: "Manchester, CT"
Scenario: "Dr Head became the new director
of Shiny Rockets Corp"
Two kinds of approaches
Knowledge Engineering:
rule based
developed by experienced language engineers
make use of human intuition
require only small amount of training data
development can be very time consuming
some changes may be hard to accommodate
Learning Systems:
use statistics or other machine learning
developers do not need LE expertise
require large amounts of annotated training data
some changes may require re-annotation of the entire training corpus
Typical NE pipeline
Pre-processing (tokenisation, sentence
splitting, morphological analysis, POS
tagging)
Entity finding (gazetteer lookup, NE
grammars)
Coreference (alias finding, orthographic
coreference etc.)
Export to database / XML
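The entity finding and coreference stages can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ANNIE implementation: the gazetteer entries, the single title-based grammar rule and the surname heuristic for orthographic coreference are all hypothetical.

```python
import re

# Hypothetical gazetteer: surface string -> entity type.
GAZETTEER = {
    "Shiny Rockets Corp": "Organisation",
    "Manchester": "Location",
}

# Hypothetical NE grammar rule: a title followed by a capitalised word is a Person.
PERSON_RULE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.? [A-Z][a-z]+")

def find_entities(text):
    """Return (start, end, type, string) annotations sorted by position."""
    found = []
    for name, entity_type in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), entity_type, m.group()))
    for m in PERSON_RULE.finditer(text):
        found.append((m.start(), m.end(), "Person", m.group()))
    return sorted(found)

def add_aliases(text, entities):
    """Orthographic coreference: a later bare surname refers to the same Person."""
    aliases = []
    for start, end, entity_type, string in entities:
        if entity_type != "Person":
            continue
        surname = string.split()[-1]
        for m in re.finditer(r"\b" + re.escape(surname) + r"\b", text):
            if m.start() >= end:  # skip the surname inside the full name itself
                aliases.append((m.start(), m.end(), "Person", m.group()))
    return sorted(entities + aliases)

text = "Dr Head joined Shiny Rockets Corp in Manchester. Head starts in March."
annotations = add_aliases(text, find_entities(text))
for annotation in annotations:
    print(annotation)
```

Each annotation carries its offsets, so the result can be exported to a database or serialised as XML in a final step.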
GATE and ANNIE
GATE (Generalised Architecture for Text Engineering)
is a framework for language processing
ANNIE (A Nearly New Information Extraction system)
is a suite of language processing tools, which
provides NE recognition
GATE also includes:
plugins for language processing, e.g. parsers,
machine learning tools, stemmers, IR tools, IE
components for various languages etc.
tools for visualising and manipulating ontologies
ontology-based information extraction tools
evaluation and benchmarking tools
GATE
Information Extraction for the Semantic Web
Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.
For the Semantic Web, we need information in a
hierarchical structure
Idea is that we attach semantic metadata to the
documents, pointing to concepts in an ontology
Information can be exported as an ontology
annotated with instances, or as text annotated
with links to the ontology
Richer NE Tagging
Attachment of
instances in the text to
concepts in the
domain ontology
Disambiguation of
instances, e.g.
Cambridge, MA vs
Cambridge, UK
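Both ideas, attaching an instance to a concept in a hierarchy and disambiguating between candidate instances, can be sketched together. Everything here is a hypothetical toy: the ontology, the instance identifiers, the clue words and the context-overlap heuristic are assumptions for illustration, and real systems use much richer evidence.

```python
import re

# Hypothetical toy ontology: concept -> parent concept.
PARENT = {"City": "Location", "Location": "Entity"}

# Hypothetical instance store: surface string -> candidate instances.
CANDIDATES = {
    "Cambridge": [
        {"id": "Cambridge_UK", "concept": "City",
         "clues": {"uk", "england", "cam"}},
        {"id": "Cambridge_MA", "concept": "City",
         "clues": {"ma", "massachusetts", "boston", "mit"}},
    ]
}

def concept_path(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def disambiguate(mention, context):
    """Pick the candidate whose clue words overlap most with the context.

    Ties go to the first candidate listed; None means no candidate is known.
    """
    words = set(re.findall(r"\w+", context.lower()))
    candidates = CANDIDATES.get(mention, [])
    return max(candidates, key=lambda c: len(c["clues"] & words), default=None)

sentence = "She moved to Cambridge, near Boston, to work at MIT."
instance = disambiguate("Cambridge", sentence)
print(instance["id"], concept_path(instance["concept"]))
```

The annotation that results is no longer a flat "Location" label but a link to a specific instance, with its place in the concept hierarchy available through the ontology.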
Magpie
Developed by the Open University
Plugin for standard web browser
Automatically associates an ontology-based
semantic layer to web resources, allowing
relevant services to be linked
Provides means for a structured and informed
exploration of the web resources
e.g. looking at a list of publications, we can find
information about an author such as projects
they work on, other people they work with, etc.
MAGPIE in action
MAGPIE in action
Evaluation
Evaluation metrics and tools
Evaluation metrics mathematically define how to
measure the system's performance against a human-annotated gold standard
Scoring program implements the metric and
provides performance measures
for each document and over the entire corpus
for each type of NE
may also evaluate changes over time
A gold standard reference set also needs to be
provided; this may be time-consuming to produce
Visualisation tools show the results graphically and
enable easy comparison
Methods of evaluation
Traditional IE is evaluated in terms of Precision
and Recall
Precision - how accurate were the answers the
system produced?
correct answers/answers produced
Recall - how good was the system at finding
everything it should have found?
correct answers/total possible correct answers
There is usually a tradeoff between precision
and recall, so a weighted harmonic mean of the two (the F-measure) is generally also used.
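The three measures can be computed as in the following sketch. It assumes strict matching, where an answer counts as correct only if its span and type both agree with the gold standard; the function name and the sample annotations are hypothetical. The `beta` parameter weights recall relative to precision, and `beta=1.0` gives the balanced F1.

```python
def precision_recall_f(system, gold, beta=1.0):
    """Score a set of system annotations against a gold standard set.

    Both arguments are sets of (start, end, type) tuples; an annotation is
    correct only if it appears in both sets (strict matching).
    """
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

# Hypothetical gold standard and system output for one document.
gold = {(0, 7, "Person"), (15, 33, "Organisation"),
        (37, 47, "Location"), (49, 53, "Person")}
system = {(0, 7, "Person"), (15, 33, "Organisation"), (37, 47, "Person")}

p, r, f = precision_recall_f(system, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Here the system produced three answers, two of them correct, out of four possible, so precision is 2/3 and recall is 2/4.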
GATE AnnotationDiff Tool
Metrics for Richer IE
Precision and Recall are not sufficient for
ontology-based IE, because the distinction
between right and wrong is less obvious
Recognising a Person as a Location is clearly
wrong, but recognising a Research Assistant as a
Lecturer is not so wrong
Similarity metrics also need to be integrated,
so that a wrong answer closer to the correct concept in the
hierarchy is given a higher score
Also possible is a cost-based approach, where
different weights can be given to each concept in
the hierarchy, and to different types of error, and
combined to form a single score
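One way to give partial credit by position in the hierarchy is sketched below. It uses a Wu-Palmer-style similarity (twice the depth of the lowest common ancestor, divided by the sum of the two concept depths), which is a stand-in chosen for illustration and not necessarily the metric intended in the lecture. The toy ontology is hypothetical.

```python
# Hypothetical toy ontology: concept -> parent concept.
PARENT = {
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "Research Assistant": "Academic",
}

def path_to_root(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(concept):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))

def lowest_common_ancestor(a, b):
    """Return the deepest concept that is an ancestor of (or equal to) both."""
    ancestors_of_b = set(path_to_root(b))
    for concept in path_to_root(a):
        if concept in ancestors_of_b:
            return concept
    raise ValueError("concepts do not share a root")

def similarity(predicted, gold):
    """Wu-Palmer-style score in (0, 1]; 1.0 means the concepts are identical."""
    lca = lowest_common_ancestor(predicted, gold)
    return 2 * depth(lca) / (depth(predicted) + depth(gold))

print(similarity("Lecturer", "Lecturer"))            # exact match
print(similarity("Research Assistant", "Lecturer"))  # near miss
print(similarity("Person", "Location"))              # clearly wrong
```

With this toy hierarchy the near miss scores higher than the Person/Location confusion, because the two academic roles share a deeper common ancestor. A cost-based approach would instead assign explicit weights to concepts and error types.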
Visualisation of Results
Cluster Map example
Traditionally used to show documents classified
according to topic
Here it shows instances classified according to
concept
Enables analysis, comparison and querying of
results
Examples here created by Marta Sabou (Free
University of Amsterdam) using Aduna software
The principle: Venn Diagrams
Documents
classified
according to topic
Jobs by region
Instances
classified by
concept
Concept distribution
Shows the
relative
importance of
different concepts
Correct and
incorrect
instances
attached to
concepts
Summary
Introduction to text mining and the
semantic web
How traditional information extraction
techniques, including visualisation and
evaluation, can be extended to deal with
the complexity of the Semantic Web
How text mining can help the progression
of the Semantic Web
Research questions
Automatic annotation tools are currently
mainly domain- and ontology-dependent,
and work best on a small scale
Tools designed for large scale applications
lose out on accuracy
Ontology population works best when the
ontology already exists, but how do we
ensure accurate ontology generation?
Need large scale evaluation programs
Some useful links
NaCTeM (National Centre for Text Mining)
http://www.nactem.ac.uk
GATE
http://gate.ac.uk
KIM
http://www.ontotext.com/kim/
h-TechSight
http://www.h-techsight.org
Magpie
http://www.kmi.open.ac.uk/projects/magpie