Web Intelligence: What Is Webintelligence?
What is WebIntelligence?
WebIntelligence allows you to access, analyze, and share
reports on corporate
data over intranets and extranets. WebIntelligence is
installed on a web server
on your corporate network.
To use WebIntelligence from your local computer, you log
into the business
intelligence portal InfoView via your Internet browser.
Depending on your security
rights, you can interact with the reports in corporate
documents or edit or build
DATA MINING
Background
The manual extraction of patterns from data has occurred for centuries.
Early methods of identifying patterns in data include Bayes' theorem (1700s)
and regression analysis (1800s). The proliferation, ubiquity and increasing
power of computer technology have increased data collection, storage and
manipulation. As data sets have grown in size and complexity, direct
hands-on data analysis has increasingly been augmented with indirect,
automatic data processing. This has been aided by other discoveries in
computer science, such as neural networks, clustering, genetic algorithms
(1950s), decision trees (1960s) and support vector machines (1980s). Data
mining is the process of applying these methods to data with the intention of
uncovering hidden patterns.[5] It has been used for many years by
businesses, scientists and governments to sift through volumes of data such
as airline passenger trip records, census data and supermarket scanner data
to produce market research reports. (Note, however, that reporting is not
always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of
collections of observations of behavior. Such data are vulnerable to
collinearity because of unknown interrelations. An unavoidable fact of data
mining is that the (sub-)set(s) of data being analyzed may not be
representative of the whole domain, and therefore may not contain examples
of certain critical relationships and behaviors that exist across other parts of
the domain. To address this sort of issue, the analysis may be augmented
using experiment-based and other approaches, such as Choice Modeling for
human-generated data. In these situations, inherent correlations can be either
controlled for, or removed altogether, during the construction of the
experimental design.
There have been some efforts to define standards for data mining, for
example the 1999 European Cross Industry Standard Process for Data
Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM
1.0). These are evolving standards; later versions of these standards are
under development. Independent of these standardization efforts, freely
available open-source software systems like the R Project, Weka, KNIME,
RapidMiner, jHepWork and others have become an informal standard for
defining data-mining processes. Notably, all these systems are able to import
and export models in PMML (Predictive Model Markup Language) which
provides a standard way to represent data mining models so that these can be
shared between different statistical applications.[6] PMML is an XML-based
language developed by the Data Mining Group (DMG),[7] an independent
group composed of many data mining companies. PMML version 4.0 was
released in June 2009.[7][8][9]
Research and evolution
In addition to industry driven demand for standards and interoperability,
professional and academic activity have also made considerable
contributions to the evolution and rigor of the methods and models; an
article published in a 2008 issue of the International Journal of Information
Technology and Decision Making summarizes the results of a literature
survey which traces and analyzes this evolution.[10]
The premier professional body in the field is the Association for Computing
Machinery's Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD). Since 1989 they have hosted an annual
international conference and published its proceedings,[11] and since 1999
have published a biannual academic journal titled "SIGKDD Explorations".
[12] Other Computer Science conferences on data mining include:
Process
Pre-processing
Before data mining algorithms can be used, a target data set must be
assembled. As data mining can only uncover patterns already present in the
data, the target dataset must be large enough to contain these patterns while
remaining concise enough to be mined in an acceptable timeframe. A
common source for data is a data mart or data warehouse. Pre-processing is
essential for analysing multivariate datasets before data mining.
The target set is then cleaned. Data cleaning removes the observations with
noise and missing data.
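As a minimal sketch of the cleaning step, the following Python snippet drops observations that contain missing values. The field names and sample rows are hypothetical, and the sketch handles only missing data; detecting noisy observations would need a domain-specific rule that the text does not specify.

```python
import math

def clean(rows):
    """Keep only observations with no missing fields.

    A field counts as missing if it is None or a float NaN.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [row for row in rows if not any(is_missing(v) for v in row.values())]

# Hypothetical observations: the second and third rows have missing values.
observations = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 51, "income": float("nan")},
    {"age": 28, "income": 39000.0},
]
cleaned = clean(observations)
print(len(cleaned))  # 2
```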
Data mining
Data mining commonly involves four classes of tasks:[16]
• Clustering – is the task of discovering groups and structures in the
data that are in some way or another "similar", without using known
structures in the data.
• Classification – is the task of generalizing known
structure to apply to new data. For example, an email program might
attempt to classify an email as legitimate or spam. Common
algorithms include decision tree learning, nearest neighbor, naive
Bayesian classification, neural networks and support vector machines.
• Regression – Attempts to find a function which models
the data with the least error.
• Association rule learning –
Searches for relationships between variables. For example a
supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which
products are frequently bought together and use this information for
marketing purposes. This is sometimes referred to as market basket
analysis.
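The supermarket example in the last task can be sketched in a few lines of Python. This is an illustrative simplification, not a full association rule learner: it only counts how often pairs of products appear in the same basket and keeps the pairs above a support threshold. The baskets and the `min_support` parameter are made up for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support is at least min_support.

    Support of a pair = fraction of baskets that contain both items.
    """
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical purchase records.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)  # {('bread', 'milk'): 0.75}
```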
Results validation
The final step of knowledge discovery from data is to verify the patterns
produced by the data mining algorithms occur in the wider data set. Not all
patterns found by the data mining algorithms are necessarily valid. It is
common for the data mining algorithms to find patterns in the training set
which are not present in the general data set; this is called overfitting. To
overcome this, the evaluation uses a test set of data which the data mining
algorithm was not trained on. The learnt patterns are applied to this test set
and the resulting output is compared to the desired output. For example, a
data mining algorithm trying to distinguish spam from legitimate emails
would be trained on a training set of sample emails. Once trained, the learnt
patterns would be applied to the test set of emails which it had not been
trained on; the accuracy of these patterns can then be measured from how
many emails they correctly classify. A number of statistical methods may be
used to evaluate the algorithm such as ROC curves.
If the learnt patterns do not meet the desired standards, then it is necessary to
reevaluate and change the pre-processing and data mining. If the learnt
patterns do meet the desired standards then the final step is to interpret the
learnt patterns and turn them into knowledge.
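The validation procedure described above can be illustrated with a deliberately tiny Python sketch. The classifier is a hypothetical one-feature threshold rule (for instance, the number of exclamation marks in an email), chosen only to keep the example short; the data are invented. The point is the comparison between accuracy on the training set and accuracy on a held-out test set.

```python
def train_threshold(samples):
    """Learn a cut-off halfway between the mean feature value of each class.

    samples: list of (feature_value, is_spam) pairs.
    """
    spam = [x for x, label in samples if label]
    ham = [x for x, label in samples if not label]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def accuracy(threshold, samples):
    """Fraction of samples classified correctly by 'spam if feature > threshold'."""
    correct = sum((x > threshold) == label for x, label in samples)
    return correct / len(samples)

# Hypothetical data: the model never sees test_set during training.
train_set = [(0, False), (1, False), (2, False), (6, True), (7, True), (9, True)]
test_set = [(1, False), (5, False), (8, True), (3, True)]

threshold = train_threshold(train_set)
train_acc = accuracy(threshold, train_set)
test_acc = accuracy(threshold, test_set)
print(train_acc, test_acc)  # 1.0 0.5
```

A large gap between the two figures, as here, is the kind of signal that would send the analyst back to the pre-processing and data mining steps.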
Notable uses
Games
Since the early 1960s, with the availability of oracles for certain
combinatorial games, also called tablebases (e.g. for 3x3-chess) with any
beginning configuration, small-board dots-and-boxes, small-board-hex, and
certain endgames in chess, dots-and-boxes, and hex, a new area for data
mining has been opened up. This is the extraction of human-usable strategies
from these oracles. Current pattern recognition approaches do not seem to
fully have the required high level of abstraction in order to be applied
successfully. Instead, extensive experimentation with the tablebases,
combined with an intensive study of tablebase-answers to well designed
problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is
used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John
Nunn in chess endgames are notable examples of researchers doing this
work, though they were not and are not involved in tablebase generation.
Business
Data mining in customer relationship management applications can
contribute significantly to the bottom line. Rather than randomly contacting a
prospect or customer through a call center or sending mail, a company can
concentrate its efforts on prospects that are predicted to have a high
likelihood of responding to an offer. More sophisticated methods may be
used to optimize resources across campaigns so that one may predict which
channel and which offer an individual is most likely to respond to—across
all potential offers. Additionally, sophisticated applications could be used to
automate the mailing. Once the results from data mining (potential
prospect/customer and channel/offer) are determined, this "sophisticated
application" can either automatically send an e-mail or regular mail. Finally,
in cases where many people will take an action without an offer, uplift
modeling can be used to determine which people will have the greatest
increase in responding if given an offer. Data clustering can also be used to
automatically discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but also
they recognize that the number of predictive models can quickly become
very large. Rather than one model to predict how many customers will
churn, a business could build a separate model for each region and customer
type. Then instead of sending an offer to all people that are likely to churn, it
may only want to send offers to customers. And finally, it may also want to
determine which customers are going to be profitable over a window of time
and only send the offers to those that are likely to be profitable. In order to
maintain this quantity of models, they need to manage model versions and
move to automated data mining.
Data mining can also be helpful to human-resources departments in
identifying the characteristics of their most successful employees.
Information obtained, such as universities attended by highly successful
employees, can help HR focus recruiting efforts accordingly. Additionally,
Strategic Enterprise Management applications help a company translate
corporate-level goals, such as profit and margin share targets, into
operational decisions, such as production plans and workforce levels.
Another example of data mining, often called market basket analysis,
relates to its use in retail sales. If a clothing store records the purchases of
customers, a data-mining system could identify those customers who favor
silk shirts over cotton ones. Although some relationships may be difficult
to explain, taking advantage of them is easier. The example deals with
association rules within transaction-based data. Not all data are transaction
based, and logical or inexact rules may also be present within a database. In
a manufacturing application, an inexact rule may state that 73% of products
which have a specific defect or problem will develop a secondary problem
within the next six months. Market basket analysis has also been used to
identify the purchase patterns of the Alpha consumer. Alpha consumers are
people that play a key role in connecting with the concept behind a product,
then adopting that product, and finally validating it for the rest of society.
Analyzing the data collected on this type of user has allowed companies to
predict future buying trends and forecast supply demands.
Data mining is a highly effective tool in the catalog marketing industry.
Catalogers have a rich history of customer transactions on
millions of customers dating back several years. Data mining tools can
identify patterns among customers and help identify the most likely
customers to respond to upcoming mailing campaigns.
Data mining for business applications is a component which needs to be
integrated into a complex modelling and decision-making process. Reactive
business intelligence (RBI) advocates a holistic approach that integrates data
mining, modeling and interactive visualization into an end-to-end discovery
and continuous innovation process powered by human and automated learning.
In the area of decision making, the RBI approach has been used to mine the
knowledge which is progressively acquired from the decision maker and
self-tune the decision method accordingly.
Related to an integrated-circuit production line, an example of data mining is
described in the paper "Mining IC Test Data to Optimize VLSI Testing." In
this paper the application of data mining and decision analysis to the
problem of die-level functional test is described. Experiments mentioned in
this paper demonstrate the ability of applying a system of mining historical
die-test data to create a probabilistic model of patterns of die failure which
are then utilized to decide in real time which die to test next and when to
stop testing. This system has been shown, based on experiments with
historical test data, to have the potential to improve profits on mature IC
products.
Science and engineering
In recent years, data mining has been widely used in areas of science and
engineering, such as bioinformatics, genetics, medicine, education and
electrical power engineering.
In the area of study on human genetics, an important goal is to understand
the mapping relationship between the inter-individual variation in human
DNA sequences and variability in disease susceptibility. In lay terms, it is to
find out how the changes in an individual's DNA sequence affect the risk of
developing common diseases such as cancer. This is very important to help
improve the diagnosis, prevention and treatment of the diseases. The data
mining technique that is used to perform this task is known as multifactor
dimensionality reduction.
In the area of electrical power engineering, data mining techniques have
been widely used for condition monitoring of high voltage electrical
equipment. The purpose of condition monitoring is to obtain valuable
information on the health status of the equipment's insulation. Data
clustering such as the self-organizing map (SOM) has been applied to the
vibration monitoring and
analysis of transformer on-load tap-changers (OLTCs). Using vibration
monitoring, it can be observed that each tap change operation generates a
signal that contains information about the condition of the tap changer
contacts and the drive mechanisms. Obviously, different tap positions will
generate different signals. However, there was considerable variability
amongst normal condition signals for exactly the same tap position. SOM
has been applied to detect abnormal conditions and to estimate the nature of
the abnormalities.
Data mining techniques have also been applied to dissolved gas analysis
(DGA) on power transformers. DGA, as a diagnostic for power transformers,
has been available for many years. Data mining techniques such as SOM have
been applied to analyze data and to determine trends which are not obvious to
the standard DGA ratio techniques such as the Duval Triangle.
A fourth area of application for data mining in science/engineering is within
educational research, where data mining has been used to study the factors
leading students to choose to engage in behaviors which reduce their
learning and to understand the factors influencing university student
retention. A similar example of the social application of data mining is its use
in expertise finding systems, whereby descriptors of human expertise are
extracted, normalized and
classified so as to facilitate the finding of experts, particularly in scientific
and technical fields. In this way, data mining can facilitate institutional
memory.
Other examples of applying data mining techniques are
biomedical data facilitated by domain ontologies, mining clinical trial
data, traffic analysis using SOM, et cetera.
In adverse drug reaction surveillance, data mining methods have been used
since 1998
to routinely screen for reporting patterns indicative of emerging drug safety
issues in the WHO global database of 4.6 million suspected adverse drug
reaction incidents. Recently, similar methodology has been developed to
mine large collections of electronic health records for temporal patterns
associating drug prescriptions to medical diagnoses.
Spatial data mining
Spatial data mining is the application of data mining techniques to spatial
data. Spatial data mining follows the same functions as data mining,
with the end objective of finding patterns in geography. So far, data mining
and Geographic Information Systems (GIS) have existed as two separate
technologies, each with its own
methods, traditions and approaches to visualization and data analysis.
Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data
occasioned by developments in IT, digital mapping, remote sensing, and the
global diffusion of GIS emphasises the importance of developing data driven
inductive approaches to geographical analysis and modeling.
Data mining, which is the partially automated search for hidden patterns in
large databases, offers great potential benefits for applied GIS-based
decision-making. Recently, the task of integrating these two technologies
has become critical, especially as various public and private sector
organizations possessing huge databases with thematic and geographically
referenced data begin to realise the huge potential of the information hidden
there. Among those organizations are:
• offices requiring analysis or dissemination of geo-referenced
statistical data
• public health services searching for explanations of disease clusters
• environmental agencies assessing the impact of changing land-use
patterns on climate change
• geo-marketing companies doing customer segmentation based on
spatial location.
Challenges
Geospatial data repositories tend to be very large. Moreover, existing GIS
datasets are often splintered into feature and attribute components, which are
conventionally archived in hybrid data management systems. Algorithmic
requirements differ substantially for relational (attribute) data management
and for topological (feature) data management. Related to this is the range
and diversity of geographic data formats, which also presents unique
challenges. The digital geographic data revolution is creating new types of
data formats beyond the traditional "vector" and "raster" formats.
Geographic data repositories increasingly include ill-structured data such as
imagery and geo-referenced multi-media.
There are several critical research challenges in geographic knowledge
discovery and data mining. Miller and Han offer the following list of
emerging research topics in the field:
• Developing and supporting geographic data warehouses – Spatial
properties are often reduced to simple aspatial attributes in
mainstream data warehouses. Creating an integrated geographic data warehouse (GDW) requires
solving issues in spatial and temporal data interoperability, including
differences in semantics, referencing systems, geometry, accuracy and
position.
• Better spatio-temporal representations in geographic knowledge
discovery – Current geographic knowledge discovery (GKD)
techniques generally use very simple representations of geographic
objects and spatial relationships. Geographic data mining techniques
should recognize more complex geographic objects (lines and
polygons) and relationships (non-Euclidean distances, direction,
connectivity and interaction through attributed geographic space such
as terrain). Time needs to be more fully integrated into these
geographic representations and relationships.
• Geographic knowledge discovery using diverse data types – GKD
techniques should be developed that can handle diverse data types
beyond the traditional raster and vector models, including imagery
and geo-referenced multimedia, as well as dynamic data types (video
streams, animation).
In four annual surveys of data miners (2007-2010), data mining practitioners
consistently identified that they faced three key challenges more than any
others:
• Dirty Data
• Explaining Data Mining to Others
• Unavailability of Data / Difficult Access to Data
In the 2010 survey, data miners also shared their experiences in
overcoming these challenges.
Surveillance
Previous data mining programs under the U.S. government intended to stop
terrorism include the Total Information Awareness (TIA) program, Secure
Flight (formerly known as the Computer-Assisted Passenger Prescreening
System), Analysis, Dissemination, Visualization, Insight, and Semantic
Enhancement (ADVISE), and
the Multi-state Anti-Terrorism Information Exchange (MATRIX). These
programs have been discontinued due to controversy over whether they
violate the Fourth Amendment to the US Constitution, although many programs that
were formed under them continue to be funded by different organizations, or
under different names.
Two plausible data mining techniques in the context of combating terrorism
include "pattern mining" and "subject-based data mining".
Pattern mining
"Pattern mining" is a data mining technique that involves finding existing
patterns in data. In this context patterns often means association rules. The
original motivation for searching association rules came from the desire to
analyze supermarket transaction data, that is, to examine customer behavior
in terms of the purchased products. For example, an association rule "beer ⇒
potato chips (80%)" states that four out of five customers that bought beer
also bought potato chips.
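The percentage attached to such a rule is its confidence: among the transactions that contain the left-hand item, the fraction that also contain the right-hand item. A minimal Python sketch follows; the transactions are invented so that the rule reproduces the 80% figure used in the example above.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule 'antecedent => consequent'.

    Computed as (transactions with both items) / (transactions with antecedent).
    """
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        raise ValueError("antecedent never occurs; confidence is undefined")
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# Hypothetical supermarket transactions: 5 contain beer, 4 of those contain chips.
transactions = [
    {"beer", "potato chips"},
    {"beer", "potato chips", "salsa"},
    {"beer", "potato chips"},
    {"beer", "potato chips", "bread"},
    {"beer", "diapers"},
    {"milk", "bread"},
]
rule_confidence = confidence(transactions, "beer", "potato chips")
print(rule_confidence)  # 0.8
```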
In the context of pattern mining as a tool to identify terrorist activity, the
National Research Council provides the following definition: "Pattern-based
data mining looks for patterns (including anomalous data patterns) that
might be associated with terrorist activity — these patterns might be
regarded as small signals in a large ocean of noise." Pattern mining includes
new areas such as music information retrieval (MIR), where patterns seen both
in the temporal and non-temporal domains are imported to classical knowledge
discovery search techniques.
Subject-based data mining
"Subject-based data mining" is a data mining technique involving the search
for associations between individuals in data. In the context of combating
terrorism, the National Research Council provides the following definition:
"Subject-based data mining uses an initiating individual or other datum that
is considered, based on other information, to be of high interest, and the goal
is to determine what other persons or financial transactions or movements,
etc., are related to that initiating datum."
Privacy concerns and ethics
Some people believe that data mining itself is ethically neutral. It is
important to note that the term data mining has no ethical implications. The
term is often associated with the mining of information in relation to
people's behavior. However, data mining is a statistical technique that is
applied to a set of information, or a data set. Associating these data sets with
people is an extreme narrowing of the types of data that are available in
today's technological society. Examples could range from a set of crash test
data for passenger vehicles, to the performance of a group of stocks. These
types of data sets make up a great proportion of the information available to
be acted on by data mining techniques, and rarely have ethical concerns
associated with them. However, the ways in which data mining can be used
can raise questions regarding privacy, legality, and ethics. In particular, data
mining government or commercial data sets for national security or law
enforcement purposes, such as in the Total Information Awareness Program or
in ADVISE, has raised privacy concerns.
Data mining requires data preparation which can uncover information or
patterns which may compromise confidentiality and privacy obligations. A
common way for this to occur is through data aggregation. Data
aggregation is when the data are accrued, possibly from various sources, and
put together so that they can be analyzed. This is not data mining per se, but
a result of the preparation of data before and for the purposes of the analysis.
The threat to an individual's privacy comes into play when the data, once
compiled, cause the data miner, or anyone who has access to the newly
compiled data set, to be able to identify specific individuals, especially when
originally the data were anonymous.
It is recommended that an individual is made aware of the following before
data are collected:
• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use them,
• the security surrounding access to the data, and in addition,
• how collected data can be updated.
In the United States, privacy concerns have been addressed to some extent by
Congress via the passage of regulatory controls such as the Health Insurance
Portability and Accountability Act (HIPAA). The
HIPAA requires individuals to give their "informed consent" regarding any
information that they provide and its intended future uses by the facility
receiving that information. According to an article in Biotech Business
Week, "In practice, HIPAA may not offer any greater protection than the
longstanding regulations in the research arena, says the AAHC. More
importantly, the rule's goal of protection through informed consent is
undermined by the complexity of consent forms that are required of patients
and participants, which approach a level of incomprehensibility to average
individuals." This underscores the necessity for data anonymity in data
aggregation practices.
One may additionally modify the data so that they are anonymous, so that
individuals may not be readily identified. However, even de-identified data
sets can contain enough information to identify individuals, as occurred
when journalists were able to find several individuals based on a set of
search histories that were inadvertently released by AOL.
Marketplace surveys
Several researchers and organizations have conducted reviews of data
mining tools and surveys of data miners. These identify some of the
strengths and weaknesses of the software packages. They also provide an
overview of the behaviors, preferences and views of data miners.
INFORMATION RETRIEVAL
Information retrieval (IR) is the area of study concerned with searching for
documents, for information within documents, and for metadata about
documents, as well as that of searching relational databases and the World
Wide Web. There is overlap in the usage of the terms data retrieval, document
retrieval, information retrieval, and text retrieval, but each also has its own
body of literature, theory, praxis, and technologies. IR is interdisciplinary,
based on computer science, mathematics, library science, information
science, information architecture, cognitive psychology, linguistics, and
statistics.
Automated information retrieval systems are used to reduce what has been
called "information overload". Many universities and public libraries use
IR systems to provide access to books, journals and other documents. Web
search engines are the most visible IR application.
The idea of using computers to search for relevant pieces of information was
popularized in the article "As We May Think" by Vannevar Bush in 1945. The
first automated information retrieval systems were introduced in the 1950s
and 1960s. By 1970 several different techniques had been shown to perform
well on small text corpora such as the Cranfield collection (several thousand
documents). Large-scale retrieval systems, such as the Lockheed Dialog
system, came into use early in the 1970s.
In 1992, the US Department of Defense, along with the National Institute of
Standards and Technology (NIST), cosponsored
the Text Retrieval Conference (TREC) as part of the TIPSTER text program.
The aim of this was to support
the information retrieval community by supplying the infrastructure that
was needed for evaluation of text retrieval methodologies on a very large
text collection. This catalyzed research on methods that scale to huge
corpora. The introduction of web search engines has boosted the need for
very large scale retrieval systems even further.
The use of digital methods for storing and retrieving information has led to
the phenomenon of digital obsolescence, where a digital resource ceases to be readable
because the physical media, the reader required to read the media, the
hardware, or the software that runs on it, is no longer available. The
information is initially easier to retrieve than if it were on paper, but is then
effectively lost.
Overview
An information retrieval process begins when a user enters a query into the
system. Queries are formal statements of information needs, for example
search strings in web search engines. In information retrieval a query does
not uniquely identify a single object in the collection. Instead, several
objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User
queries are matched against the database information. Depending on the
application the data objects may be, for example, text documents, images,
audio, mind maps or videos. Often the documents themselves are not kept or
stored directly in the IR system, but are instead represented in the system by
document surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the
database matches the query, and rank the objects according to this value. The
top ranking objects are then shown to the user. The process may then be
iterated if the user wishes to refine the query.
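The score-and-rank loop can be sketched in Python as follows. The scoring function here is an assumption made purely for illustration (it counts how many distinct query terms occur in the document); real systems use more refined weighting schemes that the text does not name. The document identifiers and texts are hypothetical.

```python
def score(query, document):
    """Count how many distinct query terms occur in the document (a crude score)."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)

def rank(query, documents, top_k=3):
    """Return the ids of the top_k matching documents, best score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    scored = [(s, d) for s, d in scored if s > 0]      # drop non-matching objects
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # high score first, ties by id
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical collection of document surrogates.
documents = {
    "d1": "data mining uncovers hidden patterns in data",
    "d2": "web search engines rank documents for a query",
    "d3": "semantic web lets machines interpret information",
    "d4": "search engines are the most visible retrieval application",
}
results = rank("web search engines", documents)
print(results)  # ['d2', 'd4', 'd3']
```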
Performance measures
Many different measures for evaluating the performance of information
retrieval systems have been proposed. The measures require a collection of
documents and a query. All common measures described here assume a
ground truth notion of relevancy: every document is known to be either
relevant or non-relevant to a particular query. In practice queries may be
ill-posed and there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the
user's information need.
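In set terms, precision is the number of retrieved documents that are relevant divided by the number of retrieved documents. A minimal Python sketch, with hypothetical document identifiers:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # convention chosen for this sketch when nothing is retrieved
    return len(retrieved & set(relevant)) / len(retrieved)

retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = {"d2", "d4", "d7"}          # ground-truth relevant documents
p = precision(retrieved, relevant)
print(p)  # 0.5
```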
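The balanced F-score, the harmonic mean of precision P and recall R (the
fraction of relevant documents that are retrieved), F = 2PR / (P + R), is
also known as the F1 measure, because recall and precision are
evenly weighted.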
The general formula for non-negative real β is:
\[ F_\beta = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R} \]
Two other commonly used F measures are the F2 measure, which weights
recall twice as much as precision, and the F0.5 measure, which weights
precision twice as much as recall.
The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures
the effectiveness of retrieval with respect to a user who attaches β times as
much importance to recall as precision". It is based on van Rijsbergen's
effectiveness measure E = 1 − (1 / (α / P + (1 − α) / R)). Their relationship is
Fβ = 1 − E where α = 1 / (β² + 1).
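The relationship between Fβ and van Rijsbergen's E can be checked numerically. The sketch below assumes the standard closed form Fβ = (1 + β²)·P·R / (β²·P + R), with P for precision and R for recall; the sample values of P and R are arbitrary.

```python
def f_beta(p, r, beta=1.0):
    """F-measure for precision p and recall r; beta > 1 favours recall."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def e_measure(p, r, alpha):
    """van Rijsbergen's effectiveness measure E = 1 - 1 / (alpha/P + (1 - alpha)/R)."""
    return 1 - 1 / (alpha / p + (1 - alpha) / r)

p, r, beta = 0.5, 0.25, 2.0
alpha = 1 / (beta ** 2 + 1)

f2 = f_beta(p, r, beta)
one_minus_e = 1 - e_measure(p, r, alpha)
print(abs(f2 - one_minus_e) < 1e-12)  # True
```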
Average precision
Precision and recall are single-value metrics based on the whole list of
documents returned by the system. For systems that return a ranked
sequence of documents, it is desirable to also consider the order in which the
returned documents are presented. Average precision emphasizes ranking
relevant documents higher. It is the average of precisions computed at the
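point of each of the relevant documents in the ranked sequence:
\[ \mathrm{AveP} = \frac{\sum_{r=1}^{N} P(r) \, \mathrm{rel}(r)}{\text{number of relevant documents}} \]
where r is the rank, N the number retrieved, rel() a binary function on the
relevance of a given rank, and P(r) precision at a given cut-off rank:
\[ P(r) = \frac{|\{\text{relevant retrieved documents of rank } r \text{ or less}\}|}{r} \]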
This metric is also sometimes referred to geometrically as the area under the
Precision-Recall curve.
Note that the denominator (number of relevant documents) is the number of
relevant documents in the entire collection, so that the metric reflects
performance over all relevant documents, regardless of a retrieval cutoff.
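A minimal Python sketch of this computation follows. The ranked list is represented as a sequence of 0/1 relevance flags (an encoding chosen for the example), and the denominator is passed in separately because it counts relevant documents in the whole collection, not just those retrieved.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision of one ranked result list.

    ranked_relevance: 0/1 flags for the retrieved documents, in rank order.
    total_relevant:   number of relevant documents in the entire collection.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # P(r) at each relevant rank
    return precision_sum / total_relevant

# Relevant documents at ranks 1 and 3; both relevant documents were retrieved.
ap = average_precision([1, 0, 1, 0], total_relevant=2)
# Same ranking, but the collection holds 4 relevant documents in total.
ap_with_misses = average_precision([1, 0, 1, 0], total_relevant=4)
```

In the first case the value is (1/1 + 2/3) / 2 = 5/6; in the second the two unretrieved relevant documents lower it to 5/12.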
R-Precision
Precision at rank R, where R is the total number of relevant documents.
This measure is highly correlated with average precision.
Mean average precision
Mean average precision for a set of queries is the mean of the average
precision scores for each query.
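As a sketch, mean average precision can be computed by averaging a per-query average precision helper. The helper below encodes each result list as 0/1 relevance flags, and the two queries are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision of one ranked list of 0/1 relevance flags."""
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant

def mean_average_precision(runs):
    """runs: list of (ranked_relevance, total_relevant) pairs, one per query."""
    return sum(average_precision(flags, total) for flags, total in runs) / len(runs)

# Query 1 has AP = 5/6, query 2 has AP = 1/2.
runs = [([1, 0, 1, 0], 2), ([0, 1], 1)]
map_score = mean_average_precision(runs)
```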
Discounted cumulative gain
Since result sets may vary in size among different queries or systems, the
normalized version of discounted cumulative gain (DCG) is used to compare
performance. It sorts the documents of a result list by relevance, producing an
ideal DCG at position p (IDCGp), which normalizes the score:
\[ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} \]
The nDCG values for all queries can be averaged to obtain a measure of the
average performance of a ranking algorithm. Note that in a perfect ranking
algorithm, the DCGp will be the same as the IDCGp producing an nDCG of
1.0. All nDCG values are then relative values on the interval 0.0 to 1.0
and so are cross-query comparable.
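The normalization can be sketched in Python. The text here does not state the DCG formula itself, so the sketch assumes one common form, in which the graded relevance at rank i is divided by log2(i + 1); the relevance grades are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain (assumed form: rel_i / log2(i + 1), i from 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the ideal DCG obtained by sorting the list by relevance."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])    # already in ideal order
imperfect = ndcg([0, 1, 2, 3])  # worst ordering of the same grades
print(perfect)  # 1.0
```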
SEMANTIC WEB
Purpose
The main purpose of the Semantic Web is driving the evolution of the
current Web by allowing users to use it to its full potential, thus allowing
them to find, share, and combine information more easily. Humans are
capable of using the Web to carry out tasks such as finding the Irish word for
"folder," reserving a library book, and searching for a low price for a DVD.
However, machines cannot accomplish all of these tasks without human
direction, because web pages are designed to be read by people, not
machines. The semantic web is a vision of information that can be
interpreted by machines, so machines can perform more of the tedious work
involved in finding, combining, and acting upon information on the web.
Tim Berners-Lee originally expressed the vision of the semantic web as
follows:
Components