
WEB INTELLIGENCE

What is WebIntelligence?
WebIntelligence allows you to access, analyze, and share reports on corporate data over intranets and extranets. WebIntelligence is installed on a web server on your corporate network.
To use WebIntelligence from your local computer, you log into the business intelligence portal InfoView via your Internet browser. Depending on your security rights, you can interact with the reports in corporate documents, or edit or build your own documents.

Creating and editing WebIntelligence documents

You create or edit WebIntelligence documents using one of two document editors:
• HTML Report Panel
• Java Report Panel

WebIntelligence Java Report Panel
Designed for users with sophisticated query, reporting, and analysis needs, the Java Report Panel offers powerful features, including the ability to include multiple queries on different data sources (universes) and to use the formula language to create custom calculations and variables. When you edit or create documents, the Java Report Panel launches in a separate browser window on your desktop. You select the Java Report Panel as your WebIntelligence document editor in your InfoView preferences.
NOTE

Web intelligence exploits artificial intelligence and information technology on the web in order to create the next generation of products, services and frameworks based on the internet. The term was coined in a paper written by Ning Zhong, Jiming Liu, Y.Y. Yao and S. Ohsuga at the Computer Software and Applications Conference in 2000.
The 21st century is the age of the Internet and the World Wide Web. The
Web revolutionizes the way we gather, process, and use information. At the
same time, it also redefines the meanings and processes of business,
commerce, marketing, finance, publishing, education, research,
development, as well as other aspects of our daily life. Although individual
Web based information systems are constantly being deployed, advanced issues and new techniques for developing and benefiting from Web intelligence still remain to be systematically studied. Roughly speaking, web
intelligence exploits artificial intelligence and advanced information
technology on the Web and Internet. It is the key and the most urgent
research field of IT for business intelligence.

DATA MINING

Data mining (also known as Knowledge Discovery in Data, or KDD) is the process, within the field of computer science, of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.
With recent tremendous advances in the processing power, storage capacity, and inter-connectivity of computer technology, data mining is seen as an increasingly important tool by modern businesses for transforming unprecedented quantities of digital data into business intelligence, giving an informational advantage. It is used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The growing consensus that data mining can bring real value has led to an explosion in demand for novel data mining technologies.
The related terms data dredging, data fishing and data snooping refer to the
use of data mining techniques to sample portions of the larger population
data set that are (or may be) too small for reliable statistical inferences to be
made about the validity of any patterns discovered. These techniques can,
however, be used in the creation of new hypotheses to test against the larger
data populations.

Background
The manual extraction of patterns from data has occurred for centuries.
Early methods of identifying patterns in data include Bayes' theorem (1700s)
and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation capabilities. As data sets have grown in size and complexity, direct
hands-on data analysis has increasingly been augmented with indirect,
automatic data processing. This has been aided by other discoveries in
computer science, such as neural networks, clustering, genetic algorithms
(1950s), decision trees (1960s) and support vector machines (1980s). Data
mining is the process of applying these methods to data with the intention of
uncovering hidden patterns.[5] It has been used for many years by
businesses, scientists and governments to sift through volumes of data such
as airline passenger trip records, census data and supermarket scanner data
to produce market research reports. (Note, however, that reporting is not
always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of
collections of observations of behavior. Such data are vulnerable to
collinearity because of unknown interrelations. An unavoidable fact of data
mining is that the (sub-)set(s) of data being analyzed may not be
representative of the whole domain, and therefore may not contain examples
of certain critical relationships and behaviors that exist across other parts of
the domain. To address this sort of issue, the analysis may be augmented
using experiment-based and other approaches, such as Choice Modeling for
human-generated data. In these situations, inherent correlations can be either
controlled for, or removed altogether, during the construction of the
experimental design.
There have been some efforts to define standards for data mining, for
example the 1999 European Cross Industry Standard Process for Data
Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM
1.0). These are evolving standards; later versions of these standards are
under development. Independent of these standardization efforts, freely
available open-source software systems like the R Project, Weka, KNIME,
RapidMiner, jHepWork and others have become an informal standard for
defining data-mining processes. Notably, all these systems are able to import
and export models in PMML (Predictive Model Markup Language) which
provides a standard way to represent data mining models so that these can be
shared between different statistical applications.[6] PMML is an XML-based
language developed by the Data Mining Group (DMG),[7] an independent
group composed of many data mining companies. PMML version 4.0 was released in June 2009.[7][8][9]
Research and evolution
In addition to industry driven demand for standards and interoperability,
professional and academic activity have also made considerable
contributions to the evolution and rigor of the methods and models; an
article published in a 2008 issue of the International Journal of Information
Technology and Decision Making summarizes the results of a literature
survey which traces and analyzes this evolution.[10]
The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[11] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[12] Other computer science conferences also address data mining.
Process
Pre-processing
Before data mining algorithms can be used, a target data set must be
assembled. As data mining can only uncover patterns already present in the
data, the target dataset must be large enough to contain these patterns while
remaining concise enough to be mined in an acceptable timeframe. A
common source for data is a datamart or data warehouse. Pre-process is
essential to analyse the multivariate datasets before data mining.
The target set is then cleaned. Data cleaning removes the observations with
noise and missing data.
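As an illustrative sketch of that cleaning step (the records and field names below are hypothetical, not from the text), the simplest approach drops observations with missing values before mining:

```python
def clean(records):
    """Data cleaning: drop observations that contain missing (None) values."""
    return [r for r in records if None not in r.values()]

raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},  # incomplete observation: removed
    {"age": 29, "income": 61000},
]
target_set = clean(raw)  # keeps only the two complete records
```

Real pre-processing is usually richer (imputation, outlier removal, normalization), but the goal of producing a clean target set is the same.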
Data mining
Data mining commonly involves four classes of tasks:[16]
• Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Classification – the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
• Regression – attempts to find a function which models the data with the least error.
• Association rule learning – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
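To make the classification task concrete, here is a minimal sketch of the nearest neighbor algorithm listed above, applied to made-up two-dimensional feature vectors (the data and labels are invented for illustration):

```python
import math

def nearest_neighbor_classify(train, point):
    """Assign `point` the label of its closest training example.
    `train` is a list of ((x, y), label) pairs."""
    _, label = min(train, key=lambda item: math.dist(item[0], point))
    return label

# Hypothetical feature vectors for emails (e.g. two word frequencies).
train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((5.0, 5.0), "legitimate"), ((4.8, 5.2), "legitimate")]

nearest_neighbor_classify(train, (1.1, 0.9))  # -> "spam"
```

The same interface generalizes to the other listed algorithms: each learns some structure from `train` and applies it to new points.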
Results validation
The final step of knowledge discovery from data is to verify the patterns
produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured by how many emails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
If the learnt patterns do not meet the desired standards, then it is necessary to
reevaluate and change the pre-processing and data mining. If the learnt
patterns do meet the desired standards then the final step is to interpret the
learnt patterns and turn them into knowledge.
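The validation step can be sketched as follows; the spam rule and the tiny test set are invented for illustration, standing in for patterns learnt on a real training set:

```python
def accuracy(classifier, test_set):
    """Fraction of held-out examples the learnt patterns classify correctly."""
    correct = sum(1 for features, label in test_set
                  if classifier(features) == label)
    return correct / len(test_set)

# A toy "learnt pattern": flag an email as spam if it mentions "winner".
classify = lambda words: "spam" if "winner" in words else "legitimate"

# Held-out test emails the rule was not trained on.
test_set = [({"winner", "prize"}, "spam"),
            ({"meeting", "agenda"}, "legitimate"),
            ({"winner"}, "spam"),
            ({"invoice"}, "legitimate")]

accuracy(classify, test_set)  # -> 1.0 on this tiny test set
```

If the measured accuracy falls below the desired standard, the pre-processing and mining steps are revisited, as described above.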
Notable uses
Games
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex) with any beginning configuration, a new area for data mining has been opened up: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to
fully have the required high level of abstraction in order to be applied
successfully. Instead, extensive experimentation with the tablebases,
combined with an intensive study of tablebase-answers to well designed
problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is
used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John
Nunn in chess endgames are notable examples of researchers doing this
work, though they were not and are not involved in tablebase generation.
Business
Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or by sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Additionally, sophisticated applications could be used to automate the mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can automatically send an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than using one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may want to send offers only to those customers. Finally, it may also want to determine which customers are going to be profitable over a window of time and send offers only to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.
Data mining can also be helpful to human-resources departments in
identifying the characteristics of their most successful employees.
Information obtained, such as universities attended by highly successful
employees, can help HR focus recruiting efforts accordingly. Additionally,
Strategic Enterprise Management applications help a company translate
corporate-level goals, such as profit and margin share targets, into
operational decisions, such as production plans and workforce levels.
Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months. Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people that play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.
Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on
millions of customers dating back several years. Data mining tools can
identify patterns among customers and help identify the most likely
customers to respond to upcoming mailing campaigns.
Data mining for business applications is a component which needs to be integrated into a complex modelling and decision-making process. RBI advocates a holistic approach that integrates data mining, modeling and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning. In the area of decision making, the RBI approach has been used to mine the knowledge which is progressively acquired from the decision maker and to self-tune the decision method accordingly.
Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing." In this paper the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in
this paper demonstrate the ability of applying a system of mining historical
die-test data to create a probabilistic model of patterns of die failure which
are then utilized to decide in real time which die to test next and when to
stop testing. This system has been shown, based on experiments with
historical test data, to have the potential to improve profits on mature IC
products.
Science and engineering
In recent years, data mining has been widely used in areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.
In the area of study on human genetics, an important goal is to understand
the mapping relationship between the inter-individual variation in human
DNA sequences and variability in disease susceptibility. In lay terms, it is to
find out how the changes in an individual's DNA sequence affect the risk of
developing common diseases such as cancer. This is very important to help
improve the diagnosis, prevention and treatment of these diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining techniques have
been widely used for condition monitoring of high voltage electrical
equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as self-organizing maps (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration
monitoring, it can be observed that each tap change operation generates a
signal that contains information about the condition of the tap changer
contacts and the drive mechanisms. Obviously, different tap positions will
generate different signals. However, there was considerable variability
amongst normal condition signals for exactly the same tap position. SOM
has been applied to detect abnormal conditions and to estimate the nature of
the abnormalities.
Data mining techniques have also been applied to dissolved gas analysis (DGA) of power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyze the data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle.
A fourth area of application for data mining in science/engineering is within
educational research, where data mining has been used to study the factors
leading students to choose to engage in behaviors which reduce their
learning and to understand the factors influencing university student
retention. A similar example of the social application of data mining is its use whereby descriptors of human expertise are extracted, normalized and
classified so as to facilitate the finding of experts, particularly in scientific
and technical fields. In this way, data mining can facilitate Institutional
memory.
Other examples of data mining applications include the analysis of biomedical data facilitated by domain ontologies, mining clinical trial data, traffic analysis using SOM, et cetera.
In adverse drug reaction surveillance, data mining methods have been used routinely since 1998 to screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to
mine large collections of electronic health records for temporal patterns
associating drug prescriptions to medical diagnoses.
Spatial data mining
Spatial data mining is the application of data mining techniques to spatial
data. Spatial data mining follows the same functions as general data mining, with the end objective of finding patterns in geography. So far, data mining and geographic information systems (GIS) have existed as two separate technologies, each with its own
methods, traditions and approaches to visualization and data analysis.
Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data
occasioned by developments in IT, digital mapping, remote sensing, and the
global diffusion of GIS emphasises the importance of developing data driven
inductive approaches to geographical analysis and modeling.
Data mining, which is the partially automated search for hidden patterns in
large databases, offers great potential benefits for applied GIS-based
decision-making. Recently, the task of integrating these two technologies
has become critical, especially as various public and private sector
organizations possessing huge databases with thematic and geographically
referenced data begin to realise the huge potential of the information hidden
there. Among those organizations are:
• offices requiring analysis or dissemination of geo-referenced
statistical data
• public health services searching for explanations of disease clusters
• environmental agencies assessing the impact of changing land-use
patterns on climate change
• geo-marketing companies doing customer segmentation based on
spatial location.
Challenges
Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management. Related to this, the range and diversity of geographic data formats also present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats.
Geographic data repositories increasingly include ill-structured data such as
imagery and geo-referenced multi-media.
There are several critical research challenges in geographic knowledge
discovery and data mining. Miller and Han offer the following list of
emerging research topics in the field:
• Developing and supporting geographic data warehouses – Spatial
properties are often reduced to simple aspatial attributes in
mainstream data warehouses. Creating an integrated GDW requires
solving issues in spatial and temporal data interoperability, including
differences in semantics, referencing systems, geometry, accuracy and
position.
• Better spatio-temporal representations in geographic knowledge
discovery – Current geographic knowledge discovery (GKD)
techniques generally use very simple representations of geographic
objects and spatial relationships. Geographic data mining techniques
should recognize more complex geographic objects (lines and
polygons) and relationships (non-Euclidean distances, direction,
connectivity and interaction through attributed geographic space such
as terrain). Time needs to be more fully integrated into these
geographic representations and relationships.
• Geographic knowledge discovery using diverse data types – GKD
techniques should be developed that can handle diverse data types
beyond the traditional raster and vector models, including imagery
and geo-referenced multimedia, as well as dynamic data types (video
streams, animation).
In four annual surveys of data miners (2007-2010), data mining practitioners consistently identified three key challenges they faced more than any others:
• Dirty data
• Explaining data mining to others
• Unavailability of data / difficult access to data
In the 2010 surveys, data miners also shared their experiences in overcoming these challenges.
Surveillance
Previous data mining programs intended to stop terrorists under the U.S. government include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), and the Multi-state Anti-Terrorism Information Exchange (MATRIX). These
programs have been discontinued due to controversy over whether they
violate the US Constitution's 4th amendment, although many programs that
were formed under them continue to be funded by different organizations, or
under different names.
Two plausible data mining techniques in the context of combating terrorism
include "pattern mining" and "subject-based data mining".
Pattern mining
"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.
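The 80% in that rule is its confidence: among the transactions containing the antecedent, the fraction that also contain the consequent. A minimal sketch, using invented baskets that reproduce the four-out-of-five example:

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent: of the transactions
    containing the antecedent, the fraction also containing the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

baskets = [
    {"beer", "potato chips"}, {"beer", "potato chips"},
    {"beer", "potato chips"}, {"beer", "potato chips"},
    {"beer", "bread"},
]
confidence(baskets, {"beer"}, {"potato chips"})  # -> 0.8, the 80% in the rule
```

Association rule miners such as Apriori search over many candidate rules and keep those whose support and confidence exceed chosen thresholds.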
In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise." Pattern mining includes new areas such as music information retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported into classical knowledge discovery search techniques.
Subject-based data mining
"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition:
"Subject-based data mining uses an initiating individual or other datum that
is considered, based on other information, to be of high interest, and the goal
is to determine what other persons or financial transactions or movements,
etc., are related to that initiating datum."
Privacy concerns and ethics
Some people believe that data mining itself is ethically neutral; the term data mining has no ethical implications in itself. The term is often associated with the mining of information in relation to people's behavior. However, data mining is a statistical technique that is applied to a set of information, or a data set. Associating these data sets with people is an extreme narrowing of the types of data that are available in today's technological society. Examples could range from a set of crash test data for passenger vehicles to the performance of a group of stocks. These types of data sets make up a great proportion of the information available to be acted on by data mining techniques, and rarely have ethical concerns associated with them. However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness program or in ADVISE, has raised privacy concerns.
Data mining requires data preparation which can uncover information or
patterns which may compromise confidentiality and privacy obligations. A
common way for this to occur is through Data aggregation . Data
aggregation is when the data are accrued, possibly from various sources, and
put together so that they can be analyzed. This is not data mining per se, but
a result of the preparation of data before and for the purposes of the analysis.
The threat to an individual's privacy comes into play when the data, once
compiled, cause the data miner, or anyone who has access to the newly
compiled data set, to be able to identify specific individuals, especially when
originally the data were anonymous.
It is recommended that an individual is made aware of the following before
data are collected:
• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use them,
• the security surrounding access to the data, and in addition,
• how collected data can be updated.
In the USA, privacy concerns have been somewhat addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to be given "informed consent" regarding any information that they provide and its intended future uses by the facility receiving that information. According to an article in Biotech Business Week, "In practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena, says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation practices.
One may additionally modify the data so that they are anonymous, so that individuals may not be readily identified. However, even de-identified data sets can contain enough information to identify individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.
Marketplace surveys
Several researchers and organizations have conducted reviews of data
mining tools and surveys of data miners. These identify some of the
strengths and weaknesses of the software packages. They also provide an
overview of the behaviors, preferences and views of data miners.

INFORMATION RETRIEVAL

Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as with searching relational databases and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.
Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

The idea of using computers to search for relevant pieces of information was
popularized in the 1945 article "As We May Think" by Vannevar Bush. The
first automated information retrieval systems were introduced in the 1950s
and 1960s. By 1970 several different techniques had been shown to perform
well on small text corpora such as the Cranfield collection (several thousand
documents). Large-scale retrieval systems, such as the Lockheed Dialog
system, came into use early in the 1970s.
In 1992, the US Department of Defense, along with the National Institute of
Standards and Technology (NIST), cosponsored the Text REtrieval
Conference (TREC) as part of the TIPSTER text program. Its aim was to
support the information retrieval community by supplying the infrastructure
needed for the evaluation of text retrieval methodologies on a very large
text collection. This catalyzed research on methods that scale to huge
corpora. The introduction of web search engines has boosted the need for
very large scale retrieval systems even further.
The use of digital methods for storing and retrieving information has led to
the phenomenon of digital obsolescence, where a digital resource ceases to
be readable because the physical media, the reader required to read the
media, the hardware, or the software that runs on it is no longer available.
The information is initially easier to retrieve than if it were on paper, but is
then effectively lost.

Overview
An information retrieval process begins when a user enters a query into the
system. Queries are formal statements of information needs, for example
search strings in web search engines. In information retrieval a query does
not uniquely identify a single object in the collection. Instead, several
objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User
queries are matched against the database information. Depending on the
application, the data objects may be, for example, text documents, images,
audio, mind maps or videos. Often the documents themselves are not kept or
stored directly in the IR system, but are instead represented in the system by
document surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the
database matches the query, and rank the objects according to this value.
The top-ranking objects are then shown to the user. The process may then be
iterated if the user wishes to refine the query.
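The score-rank-show loop described above can be sketched in a few lines of Python. This is a toy illustration, not a production system: the document collection, the ids, and the plain term-overlap scoring function are all invented for the example, whereas real engines use weighted models such as TF-IDF or BM25.

```python
# Toy sketch of the retrieval process: score every object in a small
# in-memory collection against the query, rank by score, and show the
# top results. Scoring is plain term overlap (hypothetical; real systems
# use weighted schemes such as TF-IDF or BM25).

def score(query_terms, doc_terms):
    """Numeric score: how many query terms the document contains."""
    return len(set(query_terms) & set(doc_terms))

def retrieve(query, documents, top_k=2):
    """Rank documents by score and return the ids of the matching top_k."""
    terms = query.lower().split()
    ranked = sorted(
        documents.items(),
        key=lambda item: score(terms, item[1].lower().split()),
        reverse=True,
    )
    return [doc_id for doc_id, text in ranked[:top_k]
            if score(terms, text.lower().split()) > 0]

documents = {
    "d1": "information retrieval systems rank documents",
    "d2": "cats sleep most of the day",
    "d3": "web search engines are retrieval systems",
}

print(retrieve("retrieval systems", documents))  # ['d1', 'd3']
```

Refining the query, as the text describes, simply means calling `retrieve` again with an updated search string.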
Performance measures
Many different measures for evaluating the performance of information
retrieval systems have been proposed. The measures require a collection of
documents and a query. All common measures described here assume a
ground truth notion of relevancy: every document is known to be either
relevant or non-relevant to a particular query. In practice queries may be
ill-posed and there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the
user's information need:

    precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

In binary classification, precision is analogous to positive predictive value.


Precision takes all retrieved documents into account. It can also be evaluated
at a given cut-off rank, considering only the topmost results returned by the
system. This measure is called precision at n or P@n.
Note that the meaning and usage of "precision" in the field of Information
Retrieval differs from the definition of accuracy and precision within other
branches of science and technology.
Recall
Recall is the fraction of the documents that are relevant to the query that are
successfully retrieved:

    recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

In binary classification, recall is called sensitivity, so it can be looked at as
the probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response
to any query. Therefore recall alone is not enough but one needs to measure
the number of non-relevant documents also, for example by computing the
precision.
Fall-Out
The proportion of non-relevant documents that are retrieved, out of all non-
relevant documents available:

    fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|

In binary classification, fall-out is closely related to specificity and is equal
to 1 − specificity. It can be looked at as the probability that a non-relevant
document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in response
to any query.
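The three set-based measures above can be computed directly from sets of document ids. A minimal sketch, with hypothetical ids and relevance judgements:

```python
# Precision, recall and fall-out as set ratios, following the definitions
# above. "relevant" is the ground-truth set, "retrieved" the system's
# answer, "collection" the whole corpus (all ids are hypothetical).

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

def fall_out(retrieved, relevant, collection):
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

collection = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d1", "d2"}
retrieved = {"d1", "d3"}

print(precision(retrieved, relevant))            # 1 of 2 retrieved relevant -> 0.5
print(recall(retrieved, relevant))               # 1 of 2 relevant found -> 0.5
print(fall_out(retrieved, relevant, collection)) # 1 of 3 non-relevant retrieved
```

Returning every document in the collection would give recall 1.0 but fall-out 1.0 as well, which is why these measures are reported together.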
F-measure
Main article: F-score
The weighted harmonic mean of precision and recall, the traditional F-
measure or balanced F-score, is:

    F = 2 · (precision · recall) / (precision + recall)

This is also known as the F1 measure, because recall and precision are
evenly weighted.
The general formula for non-negative real β is:

    Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Two other commonly used F measures are the F2 measure, which weights
recall twice as much as precision, and the F0.5 measure, which weights
precision twice as much as recall.
The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures
the effectiveness of retrieval with respect to a user who attaches β times as
much importance to recall as precision". It is based on van Rijsbergen's
effectiveness measure E = 1 − (1 / (α / P + (1 − α) / R)). Their relationship is
Fβ = 1 − E, where α = 1 / (β² + 1).
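A short sketch of the Fβ formula above; the precision and recall values are invented for illustration.

```python
# General F-measure: F_beta = (1 + b^2) * P * R / (b^2 * P + R).
# beta=1 gives the balanced F1 score; beta=2 weights recall more,
# beta=0.5 weights precision more.

def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0          # guard against division by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.8
print(round(f_beta(p, r), 4))       # F1   -> 0.6154
print(round(f_beta(p, r, 2), 4))    # F2   -> 0.7143
print(round(f_beta(p, r, 0.5), 4))  # F0.5 -> 0.5405
```

As the output shows, the same precision/recall pair yields a higher score when the measure favours the stronger of the two (here, recall).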

Average precision
Precision and recall are single-value metrics based on the whole list of
documents returned by the system. For systems that return a ranked
sequence of documents, it is desirable to also consider the order in which the
returned documents are presented. Average precision emphasizes ranking
relevant documents higher. It is the average of the precisions computed at
the point of each relevant document in the ranked sequence:

    AP = ( Σ from r=1 to N of P(r) × rel(r) ) / (number of relevant documents)

where r is the rank, N the number of documents retrieved, rel(r) a binary
function of the relevance of the document at rank r, and P(r) the precision at
cut-off rank r:

    P(r) = (number of relevant documents among the top r results) / r
This metric is also sometimes referred to geometrically as the area under the
Precision-Recall curve.
Note that the denominator (number of relevant documents) is the number of
relevant documents in the entire collection, so that the metric reflects
performance over all relevant documents, regardless of a retrieval cutoff.
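The average-precision computation above can be sketched as follows; the binary relevance list is hypothetical.

```python
# Average precision: mean of P(r) taken at each relevant rank r, divided
# by the number of relevant documents in the entire collection (not just
# the retrieved ones), per the note above.

def average_precision(relevance, total_relevant):
    score, hits = 0.0, 0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / r      # P(r): precision at cut-off rank r
    return score / total_relevant

# Ranked result list: relevant, non-relevant, relevant, non-relevant;
# both relevant documents in the collection were retrieved.
print(average_precision([1, 0, 1, 0], total_relevant=2))  # (1/1 + 2/3) / 2
```

Averaging these per-query scores over a set of queries yields the mean average precision described below.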
R-Precision
Precision at rank R. Where R is the total number of relevant documents.
This measure is highly correlated to Average Precision.
Mean average precision
Mean average precision for a set of queries is the mean of the average
precision scores for each query.

Discounted cumulative gain


DCG uses a graded relevance scale of documents from the result set to
evaluate the usefulness, or gain, of a document based on its position in the
result list. The premise of DCG is that highly relevant documents appearing
lower in a search result list should be penalized as the graded relevance
value is reduced logarithmically proportional to the position of the result.
The DCG accumulated at a particular rank position p is defined as:

    DCG_p = rel_1 + Σ from i=2 to p of ( rel_i / log2(i) )
Since result sets may vary in size among different queries or systems, to
compare performances the normalised version of DCG uses an ideal DCG.
To this end, it sorts the documents of a result list by relevance, producing an
ideal DCG at position p (IDCG_p), which normalizes the score:

    nDCG_p = DCG_p / IDCG_p
The nDCG values for all queries can be averaged to obtain a measure of the
average performance of a ranking algorithm. Note that for a perfect ranking
algorithm, DCG_p will be the same as IDCG_p, producing an nDCG of 1.0.
All nDCG calculations are then relative values on the interval 0.0 to 1.0 and
so are cross-query comparable.
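A sketch of the DCG and nDCG computation described above, using the classical log-discount form DCG_p = rel_1 + Σ rel_i / log2(i); the graded relevance values are hypothetical.

```python
import math

def dcg(relevances):
    """DCG_p: first result undiscounted, the rest discounted by log2(rank)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideally (descending) sorted list."""
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

ranking = [3, 2, 3, 0, 1]                   # graded relevance by position
print(round(ndcg(ranking), 4))              # 0.9435
print(ndcg(sorted(ranking, reverse=True)))  # perfect ranking -> 1.0
```

The second call confirms the point made above: when the result list is already ideally ordered, DCG equals IDCG and nDCG is exactly 1.0.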

SEMANTIC WEB

The Semantic Web is a "web of data" that enables machines to understand


the semantic, or meaning, of information on the World wide web. It extends
the network of hyperlinked human-readable web pages by inserting
machine-readable meta data about pages and how they are related to each
other, enabling automated agents to access the Web more intelligently and
perform tasks on behalf of users. The term was coined by Tim lee
berners,the inventor of the World Wide Web and director of the world wide
web consortiun which oversees the development of proposed Semantic Web
standards. He defines the Semantic Web as "a web of data that can be
processed directly and indirectly by machines."
The term "Semantic Web" is often used more specifically to refer to the
formats and technologies that enable it.These technologies include the RDF
a variety of data interchange formats and notations such as RDFS and the
OWL all of which are intended to provide a formal description of concepts,
terms, and relation ship within a given knoledge domain.
Many of the technologies proposed by the W3C already exist and are used in
various contexts, particularly those dealing with information that
encompasses a limited and defined domain, and where sharing data is a
common necessity, such as scientific research or data exchange among
businesses. In addition, other technologies with similar goals have emerged,
such as microformats. However, the Semantic Web as originally envisioned,
a system that enables machines to understand and respond to complex
human requests based on their meaning, has remained largely unrealized and
its critics have questioned its feasibility.

Purpose
The main purpose of the Semantic Web is driving the evolution of the
current Web by allowing users to use it to its full potential, thus allowing
them to find, share, and combine information more easily. Humans are
capable of using the Web to carry out tasks such as finding the Irish word for
"folder," reserving a library book, and searching for a low price for a DVD.
However, machines cannot accomplish all of these tasks without human
direction, because web pages are designed to be read by people, not
machines. The semantic web is a vision of information that can be
interpreted by machines, so machines can perform more of the tedious work
involved in finding, combining, and acting upon information on the web.
Tim Berners-Lee originally expressed the vision of the semantic web as
follows:

Semantic Web application areas are experiencing intensified interest due to
the rapid growth in the use of the Web, together with the innovation and
renovation of information content technologies. The Semantic Web is
regarded as an integrator across different content, information applications
and systems; it also provides mechanisms for the realisation of Enterprise
Information Systems. The rapidity of the growth experienced provides the
impetus for researchers to focus on the creation and dissemination of
innovative Semantic Web technologies, where the envisaged ’Semantic
Web’ is long overdue. Often the terms ’Semantics’, ’metadata’, ’ontologies’
and ’Semantic Web’ are used inconsistently. In particular, these terms are
used as everyday terminology by researchers and practitioners, spanning a
vast landscape of different fields, technologies, concepts and application
areas. Furthermore, there is confusion with regard to the current status of the
enabling technologies envisioned to realise the Semantic Web. In a paper
presented by Gerber, Barnard and Van der Merwe the Semantic Web
landscape is charted and a brief summary of related terms and enabling
technologies is presented. The architectural model proposed by Tim
Berners-Lee is used as a basis to present a status model that reflects current
and emerging technologies.
Semantic Publishing
Semantic publishing will greatly benefit from the semantic web. In
particular, the semantic web is expected to revolutionize scientific
publishing, such as real-time publishing and sharing of experimental data on
the Internet. This simple but radical idea is now being explored by the W3C
HCLS group's scientific publishing task force.
Semantic Blogging
Semantic blogging, like semantic publishing, will change the way blogs are
read. Currently "the process of blogging inherently emphasizes metadata
creation more than traditional Web publishing methodologies". Some blog
users already tag their entries with topics, allowing for easier migration into
a semantic web environment. Entries are saved not only in a human-readable
format but also in a machine-readable format, as the tags can be linked
easily to other blogs containing similar information. When a release
of a game or movie occurs, bloggers tend to rate them using their own
system. If there were to be a unified system, these blogs could easily become
assimilated using similar semantics and give a user a score when searching
using a semantic search. RSS feeds are another way that blogs already have
machine-readable data that is easily accessible by the semantic web.
Web 3.0
Main article: Web 3.0
Tim Berners-Lee has described the semantic web as a component of 'Web
3.0'.
The internet community as a whole tends to find the two terms "Semantic
Web" and "Web 3.0" to be at least synonymous in concept, if not completely
interchangeable. The definition continues to vary depending on whom you
ask. The overwhelming consensus is that Web 3.0 is most assuredly the
"next big thing", but there is only speculation as to just what that might be.
It will be an improvement in the respect that it will still contain Web 2.0
properties while continuing to add to its ever expanding lexicon and library
of applications. There are some who claim that Web 3.0 will be more
application based and center its efforts towards more graphically capable
environments, "non-browser applications and non-computer based
devices...geographic or location-based information retrieval" and even more
applicable use and growth of Artificial Intelligence.For example, Conard
Wolfram, has argued that Web 3.0 is where "the computer is generating new
information", rather than humans.
Others simply state their belief that Web 3.0 will primarily focus on
dramatically improving the functionality and usability of search engines. An
important factor that users must continue to keep in mind is that the
transition from "Web 1.0" to Web 2.0 took approximately ten years. Given
the same time frame, this next transition will not be complete until around
the year 2015.

Semantic Web solutions


The Semantic Web takes the solution further. It involves publishing in
languages specifically designed for data: the Resource Description
Framework (RDF), the Web Ontology Language (OWL), and Extensible
Markup Language (XML). HTML describes documents and the links
between them. RDF, OWL, and XML, by contrast, can describe arbitrary
things such as people, meetings, or airplane parts. Tim Berners-Lee calls the
resulting network of linked data the Giant Global Graph, in contrast to the
HTML-based World Wide Web.
These technologies are combined in order to provide descriptions that
supplement or replace the content of Web documents. Thus, content may
manifest itself as descriptive data stored in Web-accessible databases or as
markup within documents (particularly, in Extensible HTML (XHTML)
interspersed with XML, or, more often, purely in XML, with layout or
rendering cues stored separately). The machine-readable descriptions enable
content managers to add meaning to the content, i.e., to describe the
structure of the knowledge we have about that content. In this way, a
machine can process knowledge itself, instead of text, using processes
similar to human deductive reasoning and inference, thereby obtaining more
meaningful results and helping computers to perform automated information
gathering and research.
An example of a tag that would be used in a non-semantic web page:

<item>cat</item>

Encoding similar information in a semantic web page might look like this:

<item rdf:about="http://dbpedia.org/resource/Cat">Cat</item>
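To illustrate why the semantic form is more useful to a machine, the sketch below parses a semantic variant of the tag and pulls out the identifier. The snippet adds the rdf namespace declaration itself (an assumption for self-containment; in a real page it would be declared on an enclosing element).

```python
# The rdf:about attribute carries an unambiguous identifier (a DBpedia
# URI) that a program can extract and dereference, whereas the plain
# <item>cat</item> tag carries only an opaque string.

import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

snippet = ('<item xmlns:rdf="%s" '
           'rdf:about="http://dbpedia.org/resource/Cat">Cat</item>' % RDF_NS)

elem = ET.fromstring(snippet)
uri = elem.get("{%s}about" % RDF_NS)   # namespaced attribute lookup
print(uri)        # http://dbpedia.org/resource/Cat
print(elem.text)  # Cat
```

An automated agent given the URI can follow it to a machine-readable description of the concept, which is exactly the behaviour the surrounding text describes.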
Skeptical reactions
Practical feasibility
Critics question the basic feasibility of a complete or even partial fulfillment
of the semantic web. Cory Doctorow's critique ("Metacrap") is from the
perspective of human behavior and personal preferences. For example,
people lie: they may include spurious metadata in Web pages in an attempt
to mislead Semantic Web engines that naively assume the metadata's
veracity. This phenomenon was well known with meta tags that fooled the
AltaVista ranking algorithm into elevating the ranking of certain Web pages:
the Google indexing engine specifically looks for such attempts at
manipulation. Peter Gärdenfors and Timo Honkela point out that logic-
based semantic web technologies cover only a fraction of the relevant
phenomena related to semantics.
Where semantic web technologies have found a greater degree of practical
adoption, it has tended to be among core specialized communities and
organizations for intra-company projects. The practical constraints on
adoption have appeared less challenging where the domain and scope are
more limited than those of the general public and the World Wide Web.
The potential of an idea in fast progress
The original 2001 Scientific American article by Berners-Lee described an
expected evolution of the existing Web to a Semantic Web. A complete
evolution as described by Berners-Lee has yet to occur. In 2006, Berners-
Lee and colleagues stated that: "This simple idea, however, remains largely
unrealized." While the idea is still in the making, it seems to evolve quickly
and to inspire many. Between 2007 and 2010 several scholars explored the
first applications and the social potential of the semantic web in the
business and health sectors, and for social networking and even for the
broader evolution of democracy, specifically, how a society forms its
common will in a democratic manner through a semantic web.
Censorship and privacy
Enthusiasm about the semantic web could be tempered by concerns
regarding censorship and privacy. For instance, text-analyzing techniques
can now be easily bypassed by using other words, metaphors for instance, or
by using images in place of words. An advanced implementation of the
semantic web would make it much easier for governments to control the
viewing and creation of online information, as this information would be
much easier for an automated content-blocking machine to understand. In
addition, the issue has also been raised that, with the use of FOAF files and
geolocation metadata, there would be very little anonymity associated with
the authorship of articles on things such as a personal blog. Some of these
concerns were addressed in the "Policy Aware Web" project, and this
remains an active research and development topic.
Doubling output formats
Another criticism of the semantic web is that it would be much more time-
consuming to create and publish content because there would need to be two
formats for one piece of data: one for human viewing and one for machines.
However, many web applications in development are addressing this issue
by creating a machine-readable format upon the publishing of data or the
request of a machine for such data. The development of microformats has
been one reaction to this kind of criticism. Another argument in defense of
the feasibility of the semantic web is the likely falling price of human
intelligence tasks in digital labor markets such as Amazon Mechanical
Turk.
Specifications such as eRDF and RDFa allow arbitrary RDF data to be
embedded in HTML pages. The GRDDL (Gleaning Resource Descriptions
from Dialects of Languages) mechanism allows existing material (including
microformats) to be automatically interpreted as RDF, so publishers only
need to use a single format, such as HTML.
Need
The idea of a semantic web, able to describe and associate meaning with
data, necessarily involves more than simple XHTML mark-up code. It is
based on an assumption that, in order for it to be possible to endow machines
with an ability to accurately interpret web-homed content, far more than the
mere ordered relationships involving letters and words is necessary as
underlying infrastructure (attendant to semantic issues). Otherwise, most of
the supportive functionality would have been available in Web 2.0 (and
before) and it would have been possible to derive a semantically capable
Web with minor, incremental additions.
Additions to the infrastructure to support semantic functionality include
latent dynamic network models that can, under certain conditions, be
'trained' to appropriately 'learn' meaning based on order data, in the process
'learning' relationships with order (a kind of rudimentary working grammar).
See for example latent semantic web.

Components
