
Text Mining and The Semantic Web: DR Diana Maynard NLP Group Department of Computer Science University of Sheffield

The document discusses text mining and information extraction for the semantic web. It describes how text mining involves knowledge discovery from large collections of unstructured text using natural language processing, machine learning, and statistical techniques. Information extraction is a key component and aims to extract structured facts and information from unstructured text. The document outlines common text mining components like document selection, preprocessing, and processing. It provides examples of information extraction systems like HaSIE, KIM, and Threat Tracker that extract structured metadata that can populate databases or ontologies.


Text mining and the Semantic Web

Dr Diana Maynard
NLP Group
Department of Computer Science
University of Sheffield

http://nlp.shef.ac.uk

Structure of this lecture

Text Mining and the Semantic Web


Text Mining Components / Methods
Information Extraction
Evaluation
Visualisation
Summary

University of Manchester 15 March

Introduction to Text Mining and the Semantic Web

What is Text Mining?


Text mining is about knowledge discovery from large collections of unstructured text.
It is not the same as data mining, which is more about discovering patterns in structured data stored in databases.
Similar techniques are sometimes used, but text mining has many additional constraints caused by the unstructured nature of the text and the use of natural language.
Information extraction (IE) is a major component of text mining.
IE is about extracting facts and structured information from unstructured text.

Challenge of the Semantic Web


The Semantic Web requires machine-processable, repurposable data to complement hypertext.
Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata.
More on this later.

Text mining components and methods

Text mining stages


Document selection and filtering (IR techniques)
Document pre-processing (NLP techniques)
Document processing (NLP / ML / statistical techniques)

Stages of document processing


Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this.
Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc.
Document processing consists mainly of information extraction.
For the Semantic Web, this is realised in terms of metadata extraction.
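
As a rough illustration of the three stages above, here is a minimal Python sketch. The keyword filter, the regex tokeniser and the capitalised-word "extractor" are stand-ins chosen for illustration only; they are assumptions, not the techniques used by any system mentioned in this lecture.

```python
import re

def select_documents(docs, keywords):
    """Stage 1 (IR stand-in): keep documents that mention any keyword."""
    wanted = [k.lower() for k in keywords]
    return [d for d in docs if any(k in d.lower() for k in wanted)]

def preprocess(doc):
    """Stage 2 (NLP stand-in): strip markup remnants, normalise whitespace, tokenise."""
    text = re.sub(r"<[^>]+>", " ", doc)
    text = re.sub(r"\s+", " ", text).strip()
    return re.findall(r"\w+|[^\w\s]", text)

def extract_candidates(tokens):
    """Stage 3 (IE stand-in): treat runs of capitalised tokens as candidate names."""
    candidates, current = [], []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

# Hypothetical two-document collection.
docs = [
    "<p>Diana Maynard works at the University of Sheffield.</p>",
    "<p>nothing relevant here</p>",
]
for doc in select_documents(docs, ["Sheffield"]):
    print(extract_candidates(preprocess(doc)))
```

Note that the toy extractor splits "University of Sheffield" at the lower-case "of", which is exactly the kind of case that real NE grammars and gazetteers are there to handle.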

Metadata extraction
Metadata extraction consists of two types:
Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.).
Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
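
Explicit metadata is the easier of the two to collect. The sketch below uses Python's standard html.parser module to read the title and named meta tags from an HTML header. The class name, function name and sample page are hypothetical, and real pages would need more robust handling.

```python
from html.parser import HTMLParser

class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data

def extract_explicit_metadata(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    return parser.metadata

# Hypothetical page used only for illustration.
page = """<html><head>
<title>Text mining and the Semantic Web</title>
<meta name="author" content="Diana Maynard">
</head><body><p>Body text is ignored here.</p></body></html>"""
print(extract_explicit_metadata(page))
```

Implicit metadata cannot be read off the markup in this way, which is why it needs the IE techniques described in the rest of the lecture.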

Information Extraction (IE)

IE is not IR
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.
IE pulls facts and structured information from the content of large text collections. You analyse the facts.

IE for Document Access


With traditional query engines, getting the facts can be hard and slow:
Where has the Queen visited in the last year?
Which places on the East Coast of the US have had cases of West Nile Virus?
Which search terms would you use to get this kind of information?
How can you specify that you want someone's home page?
IE returns information in a structured way.
IR returns documents containing the relevant information somewhere (if you're lucky).

IE as an alternative to IR
IE returns knowledge at a much deeper level than traditional IR.
Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool.
Even if results are not always accurate, they can be valuable if linked back to the original text.
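
One way to picture "a database linked back to the documents" is a table of facts in which every row keeps the document identifier and character offsets of the text it came from. The sketch below is an assumption-laden toy: the Fact record, the year pattern and the sample reports are all hypothetical, and a real IE system would use far richer rules or models.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    doc_id: str
    kind: str
    value: str
    start: int  # character offset of the match in the source document
    end: int

# Toy extraction pattern: four-digit years.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def build_fact_table(docs):
    """Extract facts and remember where each one was found."""
    facts = []
    for doc_id, text in docs.items():
        for m in YEAR.finditer(text):
            facts.append(Fact(doc_id, "Year", m.group(), m.start(), m.end()))
    return facts

def show_in_context(fact, docs, window=20):
    """Follow the link back to the original text so a reader can check the fact."""
    text = docs[fact.doc_id]
    return text[max(0, fact.start - window):fact.end + window]

# Hypothetical documents.
docs = {
    "report-a": "The policy was revised in 2004 after an audit.",
    "report-b": "No incidents were recorded.",
}
for fact in build_fact_table(docs):
    print(fact, "|", show_in_context(fact, docs))
```

Because each fact carries its provenance, a user can verify a doubtful result against the source, which is what makes imperfect extraction still useful.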

Some example applications


HaSIE
KIM
Threat Tracker

HaSIE
Application developed by the University of Sheffield, which aims to find out how companies report health and safety information.
Answers questions such as:
How many members of staff died or had accidents in the last year?
Is there anyone responsible for health and safety?
What measures have been put in place to improve health and safety in the workplace?

HaSIE
Identification of such information is too time-consuming and arduous to be done manually.
IR systems can't cope with this because they return whole documents, which could be hundreds of pages long.
The system identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information.

KIM
KIM is a software platform developed by Ontotext for semantic annotation of text.
KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications.
Indexing and retrieval (an IE-enhanced search technology)
Query and exploration of formal knowledge

KIM
Ontotext's KIM query and results

Threat tracker
Application developed by Alias-I which finds and relates information in documents.
Intended for use by Information Analysts who use unstructured news feeds and standing collections as sources.
Used by DARPA for tracking possible information about terrorists etc.
Identification of entities, aliases, relations etc. enables you to build up chains of related people and things.

What is Named Entity Recognition?
Identification of proper names in texts, and their classification into a set of predefined categories of interest:
Persons
Organisations (companies, government organisations, committees, etc.)
Locations (cities, countries, rivers, etc.)
Date and time expressions
Various other types as appropriate

Why is NE important?
NE provides a foundation from which to build more complex IE systems.
Relations between NEs can provide tracking, ontological information and scenario building:
Tracking (co-reference): Dr Head, John, he
Ontologies: Manchester, CT
Scenario: Dr Head became the new director of Shiny Rockets Corp

Two kinds of approaches


Knowledge Engineering
rule based
developed by experienced language engineers
make use of human intuition
require only small amount of training data
development can be very time consuming
some changes may be hard to accommodate

Learning Systems
use statistics or other machine learning
developers do not need LE expertise
require large amounts of annotated training data
some changes may require re-annotation of the entire training corpus

Typical NE pipeline
Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
Entity finding (gazetteer lookup, NE grammars)
Coreference (alias finding, orthographic coreference etc.)
Export to database / XML
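
The pipeline above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how ANNIE or any other system implements it: the gazetteer entries, the single title-plus-surname rule and the function names are all hypothetical, and the pre-processing stage is skipped because the regular expressions work directly on the raw string.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {
    "Sheffield": "Location",
    "Manchester": "Location",
    "Shiny Rockets Corp": "Organisation",
}
# One grammar-like rule: a title followed by a capitalised surname.
TITLE_RULE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.? ([A-Z][a-z]+)")

def find_entities(text):
    """Entity finding: gazetteer lookup plus the title rule."""
    found = []
    for name, kind in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), kind, m.group()))
    for m in TITLE_RULE.finditer(text):
        found.append((m.start(), m.end(), "Person", m.group()))
    return sorted(found)

def link_aliases(text, entities):
    """Orthographic coreference: tag a bare surname that matches a titled mention."""
    linked = list(entities)
    for start, end, kind, surface in entities:
        if kind != "Person":
            continue
        surname = surface.split()[-1]
        for m in re.finditer(r"\b" + re.escape(surname) + r"\b", text):
            if not any(s <= m.start() < e for s, e, _, _ in linked):
                linked.append((m.start(), m.end(), "Person", m.group()))
    return sorted(linked)

def to_xml(text, entities):
    """Export: wrap each entity in an inline XML element named after its type."""
    out, pos = [], 0
    for start, end, kind, surface in entities:
        out.append(escape(text[pos:start]))
        out.append(f"<{kind}>{escape(surface)}</{kind}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

text = "Dr Head became the new director of Shiny Rockets Corp. Head starts in Manchester."
print(to_xml(text, link_aliases(text, find_entities(text))))
```

The export step assumes the entities do not overlap; a fuller implementation would need to resolve overlapping matches before writing inline markup.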

GATE and ANNIE


GATE (General Architecture for Text Engineering) is a framework for language processing.
ANNIE (A Nearly-New Information Extraction system) is a suite of language processing tools, which provides NE recognition.
GATE also includes:
plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
tools for visualising and manipulating ontologies
ontology-based information extraction tools
evaluation and benchmarking tools

Information Extraction for the Semantic Web


Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.
For the Semantic Web, we need information in a hierarchical structure.
The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology.
Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology.
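
The difference between flat and hierarchical annotation, and the two export views, can be illustrated with a small sketch. The mini-ontology, the sample sentence and the offsets below are hypothetical and chosen only to show the idea.

```python
# Hypothetical mini-ontology: each concept maps to its parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "City": "Location",
    "Country": "Location",
}

def ancestors(concept):
    """Return the path from a concept up to the root, most specific first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def is_a(concept, candidate_ancestor):
    """True if the concept equals the candidate or lies below it in the hierarchy."""
    return candidate_ancestor in ancestors(concept)

# View 1: text annotated with links to the ontology, as (start, end, concept).
text = "Sheffield is a city in the UK."
annotations = [(0, 9, "City"), (27, 29, "Country")]

# View 2: the same information as an ontology populated with instances.
instances = {}
for start, end, concept in annotations:
    instances.setdefault(concept, []).append(text[start:end])

print(instances)
# A query on the general concept also finds instances of its sub-concepts.
print([text[s:e] for s, e, c in annotations if is_a(c, "Location")])
```

A flat tagger would have labelled both mentions simply as Location; the hierarchy keeps the finer distinction while still answering the general query.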

Richer NE Tagging
Attachment of instances in the text to concepts in the domain ontology
Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
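
A very simple way to disambiguate instances such as the two Cambridges is to compare the surrounding words with cue words stored for each candidate. The sketch below assumes this cue-overlap heuristic and a hand-written, hypothetical candidate table; it is an illustration, not the method used by any particular system.

```python
import re

# Hypothetical candidate instances for one ambiguous name, each with context cue words.
CANDIDATES = {
    "Cambridge": {
        "Cambridge, UK": {"england", "uk", "cam", "fens"},
        "Cambridge, MA": {"massachusetts", "boston", "mit", "harvard"},
    }
}

def disambiguate(name, context):
    """Pick the candidate whose cue words overlap most with the context.

    Returns None when no cue word matches, rather than guessing.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {inst: len(cues & words) for inst, cues in CANDIDATES[name].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("Cambridge", "She moved to Cambridge to work near Boston at MIT."))
print(disambiguate("Cambridge", "The river Cam runs through Cambridge in England."))
```

Returning None for contexts with no cues keeps the sketch from silently attaching an instance to the wrong concept.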

Magpie
Developed by the Open University
Plugin for a standard web browser
Automatically associates an ontology-based semantic layer with web resources, allowing relevant services to be linked
Provides a means for structured and informed exploration of the web resources
e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.

MAGPIE in action

Evaluation

Evaluation metrics and tools


Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard.
A scoring program implements the metric and provides performance measures:
for each document and over the entire corpus
for each type of NE
may also evaluate changes over time
A gold standard reference set also needs to be provided; this may be time-consuming to produce.
Visualisation tools show the results graphically and enable easy comparison.
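
The core of a scoring program is a comparison of system annotations against the gold standard, broken down by NE type. The sketch below assumes exact-match scoring and a simple tuple format for annotations; both are simplifications chosen for illustration, and the sample data is hypothetical.

```python
from collections import Counter, defaultdict

def score_by_type(gold, system):
    """Count correct, spurious and missing annotations per NE type.

    Each annotation is a (doc_id, start, end, type) tuple.
    Only exact matches on all four fields count as correct.
    """
    gold, system = set(gold), set(system)
    counts = defaultdict(Counter)
    for ann in system & gold:
        counts[ann[3]]["correct"] += 1
    for ann in system - gold:
        counts[ann[3]]["spurious"] += 1
    for ann in gold - system:
        counts[ann[3]]["missing"] += 1
    return {ne_type: dict(c) for ne_type, c in counts.items()}

# Hypothetical gold standard and system output.
gold = [("d1", 0, 7, "Person"), ("d1", 35, 53, "Organisation"), ("d2", 10, 20, "Location")]
system = [("d1", 0, 7, "Person"), ("d1", 35, 53, "Location"), ("d2", 10, 20, "Location")]

for ne_type, counts in sorted(score_by_type(gold, system).items()):
    print(ne_type, counts)
```

With exact matching, the mistyped organisation is counted twice, once as a spurious Location and once as a missing Organisation. Per-document scores could be obtained the same way by grouping on the doc_id field instead of the type.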

Methods of evaluation
Traditional IE is evaluated in terms of Precision and Recall.
Precision: how accurate were the answers the system produced?
Precision = correct answers / answers produced
Recall: how good was the system at finding everything it should have found?
Recall = correct answers / total possible correct answers
There is usually a tradeoff between precision and recall, so a weighted average of the two (the F-measure, a weighted harmonic mean) is generally also used.
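
The definitions above translate directly into code. The sketch assumes the usual F-beta form of the F-measure, where beta = 1 weights precision and recall equally; the function name and the example counts are made up for illustration.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Return (precision, recall, F-measure) from raw counts.

    correct:  answers the system got right
    produced: all answers the system gave
    possible: all answers in the gold standard
    beta:     values above 1 favour recall, values below 1 favour precision
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

# Hypothetical run: 10 answers produced, 8 of them correct, 16 in the gold standard.
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
print(f"P={p:.2f} R={r:.2f} F1={f:.3f}")
```

In this example the system is fairly precise but misses half of what it should find, and the F-measure sits between the two figures, closer to the lower one.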

GATE AnnotationDiff Tool

Metrics for Richer IE


Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious.
Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong.
Similarity metrics also need to be integrated, so that a wrong answer scores higher the closer it lies in the hierarchy to the correct concept.
A cost-based approach is also possible, where different weights are given to each concept in the hierarchy and to different types of error, and then combined to form a single score.
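
To make the "less wrong" idea concrete, the sketch below scores a response by how deep the lowest common ancestor of the two concepts lies in the hierarchy. It uses a Wu-Palmer-style measure with the root at depth 0, which is a technique swapped in for illustration and not necessarily the metric used in the work this lecture describes. The mini-ontology is hypothetical.

```python
# Hypothetical mini-ontology: each concept maps to its parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "Research Assistant": "Academic",
}

def path_to_root(concept):
    """Return the concept and all its ancestors, most specific first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    """Depth in the hierarchy, with the root at depth 0."""
    return len(path_to_root(concept)) - 1

def similarity(key, response):
    """1.0 for an exact match; otherwise 2 * depth(LCA) / (depth(key) + depth(response))."""
    if key == response:
        return 1.0
    response_path = path_to_root(response)
    common = next(c for c in path_to_root(key) if c in response_path)  # lowest common ancestor
    return 2 * depth(common) / (depth(key) + depth(response))

print(similarity("Lecturer", "Research Assistant"))
print(similarity("Person", "Location"))
```

Confusing a Research Assistant with a Lecturer earns partial credit because both sit under the same specific concept, whereas Person and Location share only the root and score zero.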

Visualisation of Results

Cluster Map example
Traditionally used to show documents classified according to topic
Here it shows instances classified according to concept
Enables analysis, comparison and querying of results
Examples here created by Marta Sabou (Free University of Amsterdam) using Aduna software
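
The grouping that underlies this kind of display can be sketched as follows: each instance is placed in the cluster defined by the exact set of concepts it belongs to, much like the regions of a Venn diagram. This is an assumption about the general principle, not a description of how the Aduna software works, and the instance and concept names are invented.

```python
from collections import defaultdict

def cluster_map(memberships):
    """Group instances by the exact set of concepts they are attached to."""
    clusters = defaultdict(list)
    for instance, concepts in memberships.items():
        clusters[frozenset(concepts)].append(instance)
    return dict(clusters)

# Hypothetical instances and the concepts they were classified under.
memberships = {
    "job-1": {"London"},
    "job-2": {"London", "South East"},
    "job-3": {"South East"},
    "job-4": {"London"},
}

for concepts, members in cluster_map(memberships).items():
    print(sorted(concepts), members)
```

Instances attached to more than one concept end up in their own cluster, which corresponds to the overlap between two circles in the diagram.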

The principle: Venn diagrams

Documents classified according to topic

Jobs by region

Instances classified by concept

Concept distribution

Shows the relative importance of different concepts

Correct and incorrect instances attached to concepts

Summary
Introduction to text mining and the Semantic Web
How traditional information extraction techniques, including visualisation and evaluation, can be extended to deal with the complexity of the Semantic Web
How text mining can help the progression of the Semantic Web

Research questions
Automatic annotation tools are currently mainly domain and ontology-dependent, and work best on a small scale.
Tools designed for large scale applications lose out on accuracy.
Ontology population works best when the ontology already exists, but how do we ensure accurate ontology generation?
Need large scale evaluation programs.

Some useful links


NaCTeM (National Centre for Text Mining): http://www.nactem.ac.uk
GATE: http://gate.ac.uk
KIM: http://www.ontotext.com/kim/
h-TechSight: http://www.h-techsight.org
Magpie: http://www.kmi.open.ac.uk/projects/magpie