Apache Solr crash course
Tommaso Teofili
Agenda

• IR
• Solr
• Tips&tricks
• Case study
• Extras
Information Retrieval

• “Information Retrieval (IR) is finding
  material (usually documents) of an
  unstructured nature (usually text) that
  satisfies an information need from within
  large collections (usually stored on
  computers)” - P. Nayak, Stanford University
Inverted index
• Each document has an id
  and a list of terms
• For each term t we must
  store a list of all documents
  that contain t
• Identify each document by
  its id
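The inverted index described above can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene stores its index; the `build_inverted_index` helper, the sample documents and the naive lowercase/whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it.

    `docs` maps a document id to its text. Tokenization is a naive
    lowercase + whitespace split (an assumption for illustration only).
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents, identified by their id
docs = {
    1: "Apache Solr is a search server",
    2: "Apache Lucene is a search library",
    3: "Solr builds on Lucene",
}
index = build_inverted_index(docs)
print(index["solr"])    # ids of the documents containing "solr"
print(index["search"])  # ids of the documents containing "search"
```

Looking up a term is then a single dictionary access, which is why the structure suits term-based search.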
IR Metrics

• How good is an IR system?
• Precision: Fraction of retrieved docs that
  are relevant to user’s information need
• Recall: Fraction of relevant docs in
  collection that are retrieved
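As a worked example of the two metrics, here is a minimal Python sketch. The `precision_recall` helper and the document ids are hypothetical, chosen only to make the definitions concrete.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned docs 1-4; docs 2, 4 and 5 are the truly relevant ones.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall about 0.67).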
Apache Lucene

• Information Retrieval library
• Inverted index of documents
• Vector space model
• Advanced search options (synonyms,
  stopwords, similarity, proximity)
Lucene API - indexing
•   Lucene indexes are built on a Directory

•   Directory can be accessed by IndexReaders and
    IndexWriters

•   IndexSearchers are built on top of Directories and
    IndexReaders

•   IndexWriters can write Documents inside the index

•   Documents are made of Fields

•   Fields have values

•   Directory > IndexReader/Writer > Document > Field
Lucene API - searching
•   Open an IndexSearcher on top of an IndexReader over a
    Directory

•   Many query types: TermQuery, MultiTermQuery,
    BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery,
    MultiPhraseQuery, FuzzyQuery, NumericRangeQuery, ...

•   Get results from a TopDocs object
Apache Solr
• Ready to use enterprise search server
• REST (and programmatic) API
• Results in XML, JSON, PHP, Ruby, etc...
• Exploit Lucene power
• Scaling capabilities (replication, distributed search, ...)
• Administration interface
• Customizable via plugins
Apache Solr 3.1.0
Apache Solr - admin UI
Solr - project status

• Solr 3.1.0 version released in March 2011
• Lucene/Solr is now a single project
• Huge community
• Backed by Lucid Imagination
Solr basic configuration
• schema.xml
 • contains type definitions for field
    analysis (field type + tokenizers + filters)
 • contains field definitions
• solrconfig.xml
 • contains the Solr instance configuration
Solr - schema.xml

• Types (with index/query Analyzers)
• Fields with name, type and options
• Unique key
• Dynamic fields
• Copy fields
Solr - content analysis
• define documents’ model
• each document consists of fields
• each field
 • has attributes telling Solr how to handle
    its contents
  • contains free text, keywords, dates,
    numbers, etc.
Solr - content analysis
• Analyzer: creates tokens using a Tokenizer
  and, optionally, some filters (TokenFilters)
• Each field can define an Analyzer at ‘query’
  time and another at ‘index’ time, or the
  same in both cases
• Each field can be indexed (searchable),
  stored (possibly fetched with results),
  multivalued, required, etc.
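The tokenizer-plus-filters idea can be illustrated with a small Python sketch. It only mimics the shape of an analysis chain and does not use Solr or Lucene classes; the function names and the tiny stop list are assumptions made for the example.

```python
STOPWORDS = {"a", "an", "the", "is", "of"}  # tiny illustrative stop list

def whitespace_tokenizer(text):
    """Split the raw text into tokens on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, tokenizer=whitespace_tokenizer,
            filters=(lowercase_filter, stop_filter)):
    """Run the tokenizer, then each filter in order, like an analysis chain."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("The Power of Apache Solr")
print(tokens)
```

Filter order matters: lowercasing runs before the stop filter here, so "The" is matched against the lowercase stop list. Using different chains at index and query time amounts to calling `analyze` with different `filters`.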
Solr - content analysis
• Commonly used tokenizers:
 • WhitespaceTokenizerFactory
 • StandardTokenizerFactory
 • KeywordTokenizerFactory
 • PatternTokenizerFactory
 • HTMLStripWhitespaceTokenizerFactory
 • HTMLStripStandardTokenizerFactory
Solr - content analysis
•   Commonly used TokenFilters:

    •   SnowballPorterFilterFactory

    •   StopFilterFactory

    •   LengthFilterFactory

    •   LowerCaseFilterFactory

    •   WordDelimiterFilterFactory

    •   SynonymFilterFactory

    •   PatternReplaceFilterFactory

    •   ReverseWildcardFilterFactory

    •   CharFilterFactories (MappingCharFilterFactory, HTMLStripCharFilterFactory)
Debugging analysis
Solr - solrconfig.xml

•   Data directory (where Solr will write the Lucene index)

•   Caches configuration: documents, query results, filters

•   Request handlers definition (search/update handlers)

•   Update request processor chains definition

•   Event listeners (newSearcher, firstSearcher)

•   Fine tuning parameters

•   ...
Solr - indexing
•   Update requests on the index are sent as XML commands via
    HTTP POST

•   <add> to insert and update

    •   <add><doc boost="2.5">
          <field name="employeeId">05991</field>
        </doc></add>

•   <delete> to remove by unique key or query

    •   <delete><id>05991</id></delete>

    •   <delete><query>office:Bridgewater</query></delete>

•   <commit/> reopens readers on the new index version

•   <optimize/> optimizes the index's internal structure for faster access
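The XML update messages above can be generated programmatically rather than written by hand. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the `build_add` and `build_delete_by_query` helpers are hypothetical names, and sending the result to Solr via HTTP POST is left out.

```python
import xml.etree.ElementTree as ET

def build_add(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def build_delete_by_query(query):
    """Build a <delete><query>...</query></delete> message."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

add_xml = build_add({"employeeId": "05991"}, boost=2.5)
delete_xml = build_delete_by_query("office:Bridgewater")
print(add_xml)
print(delete_xml)
```

Building the message with an XML library also takes care of escaping special characters in field values.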
Solr - basic indexing
• REST call - XML/JSON
 • curl 'http://localhost:8983/solr/update?commit=true' \
     -H 'Content-Type: text/xml' \
     --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'
 • curl 'http://localhost:8983/solr/update/json?commit=true' \
     -H 'Content-Type: application/json' \
     -d '{"add": {"doc": {"id": "TestDoc1", "title": "test1"}}}'
Solr - binary files indexing
 • Many documents are produced in
   (proprietary) binary formats: PDF, RTF,
   XLS, etc.
 • Apache Tika is integrated in the Solr REST service
   for indexing such documents
 • curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
     -F 'myfile=@tutorial.html'
Solr - index analysis
• Luke is a tool for navigating Lucene indexes
• For each field : top terms, distinct terms,
  terms histogram, etc.
• LukeRequestHandler:
 • http://localhost:8983/solr/admin/luke?wt=xslt&tr=luke.xsl
Solr - data import handler

 • DBMS
 • FileSystem
 • HTTP
Solr - searching
Solr - query syntax
• query fields with fieldname:value
• + - AND OR NOT operators
• Range queries on date or numeric fields, ex:
  timestamp:[* TO NOW]
• Boost terms, ex: people^4 profits
• Fuzzy search, ex: roam~0.6
• Proximity search, ex: "apache solr"~2
• ...
Solr - basic search
• parameters:
 • q: the query
 • start: offset of the first result
 • rows: max no. of results returned
 • fl: comma separated list of fields to return
 • defType: specify the query parser
 • debugQuery: enable query debugging
 • wt: result format (xml, json, php, ruby, javabin, etc)
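As an illustration of how these parameters combine into a request, here is a small Python sketch that only builds the URL (it does not contact a server). The `select_url` helper and the field names in the example query are assumptions; the base URL matches the localhost examples used elsewhere in these slides.

```python
from urllib.parse import urlencode

def select_url(base, **params):
    """Build a Solr select URL, URL-encoding every parameter value."""
    return f"{base}/select?{urlencode(params)}"

url = select_url(
    "http://localhost:8983/solr",
    q="title:solr",        # the query
    start=0,               # offset of the first result
    rows=10,               # max number of results returned
    fl="id,title,score",   # fields to return
    wt="json",             # response format
)
print(url)
```

Encoding the values matters because query syntax characters such as `:` or spaces are not safe in a raw URL.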
Solr - query parsers

• Most used:
 • Default Lucene query parser
 • DisMax query parser
 • eDisMax query parser
Solr - highlighting

• can be done on fields with stored="true"
• returns a snippet containing the highlighted
  terms for each doc
• enabled with
  hl=true&hl.fl=fieldname1,fieldname2
Solr - sorting results
• Sorting can be done on the "score" of the
  document, or on any
  multiValued="false" indexed="true" field
  provided that field is either non-tokenized (ie:
  has no Analyzer) or uses an Analyzer that only
  produces a single term
• add parameter &sort=score desc,
  inStock desc, price asc
• can sort on function queries (see later)
Solr - filter queries
• get a subset of the index
• place it in a cache
• run queries for such a “filter” in memory
• add parameter &fq=category:hardware
• if multiple fq parameters are given, the query is
  run against the intersection of the specified
  filters
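The intersection behaviour can be pictured with a toy model in Python. This is a conceptual sketch only, not Solr's implementation: each filter query is assumed to be cached as a set of matching document ids, and the `apply_filters` helper, the second filter and all ids are hypothetical.

```python
def apply_filters(query_hits, filter_cache, fqs):
    """Keep only the query hits that match every filter query in `fqs`."""
    result = set(query_hits)
    for fq in fqs:
        result &= filter_cache[fq]  # intersect with the cached doc id set
    return sorted(result)

# Hypothetical cached filters: filter query -> ids of matching documents
filter_cache = {
    "category:hardware": {1, 2, 3, 5},
    "inStock:true": {2, 3, 4},
}
hits = apply_filters(
    query_hits=[1, 2, 3, 4, 5],
    filter_cache=filter_cache,
    fqs=["category:hardware", "inStock:true"],
)
print(hits)
```

Because each filter is cached independently, a new combination of already cached filters needs no extra index lookups in this model.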
Solr - facets

• facet by:
 • field value
 • arbitrary queries
 • range
• can facet on fields with indexed="true"
Solr - function queries

• allow deep customization of ranking :
 • http://localhost:8983/solr/select/?fl=score,id&q=DDR&sort=termfreq(text,memory)%20desc
• functions: sum, sub, product, div, pow, abs,
  log, sqrt, map, scale, termfreq, ...
Solr - query elevation

• useful for “marketing”
• configure the top results for a given query
  regardless of the normal Lucene scoring
• http://localhost:8983/solr/elevate?q=best%20product&enableElevation=true
Solr - spellchecking
• collects suggestions about input query
• optionally corrects the user query with
  “suggested” terms
Solr - spellchecking
• build a spellcheck index dynamically
• return suggested results
• http://localhost:8983/solr/spell?q=hell%20ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
• useful to create custom query converters
 • <queryConverter name="queryConverter"
    class="org.apache.solr.spelling.SpellingQueryConverter"/>
Solr - similarity

• get documents “similar” to a given
  document or a set of documents
• Vector Space Model
• http://localhost:8983/solr/select?q=apache&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
Solr - geospatial search

• index location data
• query by spatial concepts and sort by distance
• find all documents whose store position is within
  5 km of a specified point
• http://localhost:8983/solr/select?indent=true&fl=name,store&q=*:*&fq={!geofilt%20sfield=store}&pt=45.15,-93.85&d=5
Solr - field collapsing
• group resulting documents on a per-field
  basis
  • http://localhost:8983/solr/select?indent=true&fl=id,name&q=solr+memory&group=true&group.field=manu_exact
• useful for displaying results in a smart way
• see SOLR-236
Solr - join
• new feature (SOLR-2272)
• many users ask for it
• quite a paradigm change
• http://localhost:8983/solr/select?q={!join+from=manu_id_s%20to=id}ipod&fl=id,manu_id&debugQuery=true
Solr statistics
Solr replication
Solr Architectures

• Simple
• Multicore
• Replication
• Sharded
Solr - MultiCore
• Define multiple Solr cores inside a single Solr
  instance
• Each core maintains its own index
• Unified administration interface
• Runtime commands to create, reload, load, unload,
  delete, swap cores
• Cores can be thought of as ‘collections’
• Allows deploying new features/bugfixes with
  no downtime
Solr - Replication
• With high traffic it’s useful to replicate a Solr
  instance and split the search load (optionally
  with a VIP in front)
• Master has the “original” index
• Slave polls the master asking for the latest version of the index
• If the slave has a different version of the index, it asks the
  master for the delta (rsync-like)
• In the meantime indexes remain available
• (Almost) no impact of indexing on search
Solr - Shards
•   When an index is too large, in terms of space or memory
    required, it can be useful to define two or more shards

•   A shard is a Solr instance and can be searched or indexed
    independently

•   At the same time it’s possible to query all the shards, with
    the results merged from the sub-results of each shard

•   http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&q=category:information

•   Note that the distribution of documents among indexes is up to
    the user (or whoever feeds the indexes)
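The merging of per-shard sub-results can be pictured with a short Python sketch. It is a conceptual illustration, not Solr's distributed search code: it assumes each shard returns its hits already sorted by descending score and that scores are comparable across shards. The `merge_shard_results` helper and the sample hits are hypothetical.

```python
import heapq

def merge_shard_results(shard_results, rows):
    """Merge per-shard hit lists into one global top-`rows` list.

    Each element of `shard_results` is a list of (score, doc_id) tuples
    sorted by descending score.
    """
    # heapq.merge expects inputs sorted ascending by key, hence the negated score
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return [hit for _, hit in zip(range(rows), merged)]

shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.5, "b2")]
top = merge_shard_results([shard_a, shard_b], rows=3)
print(top)
```

Only the first `rows` merged hits are consumed, so each shard needs to contribute at most `rows` candidates.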
Solr - Architectures
• When to use each?
• KISS principle
• High query load : replication
• Huge index : shard
Solr - Architectures
• High query load with large indexes: shard + replication
Solr - Architectures
• Tips & Tricks:
 • Don’t share indexes between Master and Slaves on
    distributed file systems (locking)
 • In any case, avoid distributed file systems (slow)
 • Lucene/Solr is I/O intensive and thus behaves better
    with fast disks
 • Always use MultiCore - hot deploy of changes/
    bugfixes
 • Replication is network intensive
 • Check replication poll time and indexing rates
Tips&Tricks

• Solr-based search engine development process
• Plugins
• Performance tuning
• Deploy
Process - t0 analysis
• Analyze content
• Analyze queries
• Analyze collections
• Pre-existing query/index load (if any)
• Expected query/index load
• Desired throughput/avg response time
• First architecture
Process - n-th iteration
•   index 10-15% content

•   search stress test (analyze peaks) - use SolrMeter

•   quality tests from stakeholders (accuracy, recall)

•   if needed, add/reconfigure features

•   check http://wiki.apache.org/solr/FieldOptionsByUseCase
    and make sure fields used for faceting/sorting/highlighting/
    etc. have proper options

•   if field types/analysis/options need to change, rebuild the
    index
Solr - Plugins

• QParserPlugin
• RequestHandler (Search/UpdateHandler)
• UpdateRequestProcessor
• ResponseWriter
• Cache
Performance tuning

• Much of the tuning is done in schema.xml
• Configure Solr caches
• Set auto commit where possible
• Play with mergeFactor
Performance tuning
• The number of indexed fields greatly increases
  memory usage during indexing, segment merge time,
  optimization times, index size
• Stored fields impact index size, search time, ...
• set omitNorms="true" where it makes sense
  (disabling length normalization and index time
  boosting)
• set omitTermFreqAndPositions="true" if no queries
  on the field use positions and term frequency should
  not influence the score
Performance tuning

•   FilterCache - unordered document ids for caching filter
    queries
•   QueryResultCache - ordered document ids for caching
    queries results (caching only the returned docs)
•   DocumentCache - caches stored fields (size should be at least
    <max_results> * <max_concurrent_queries>)
•   Setup autowarming - keep caches warm after commits
Performance tuning

•   Choose correct cache implementation
    FastLRUCache vs LRUCache

• FastLRUCache has faster gets and slower
    puts in single threaded operation and thus
    is generally faster than LRUCache when the
    hit ratio of the cache is high (> 75%)
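For readers unfamiliar with LRU eviction, the policy both implementations share can be sketched in Python as follows. This is a generic least-recently-used cache, not Solr's LRUCache or FastLRUCache code; the class and the cache keys are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None                      # cache miss
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("fq=category:hardware", {1, 2})
cache.put("fq=inStock:true", {2, 3})
cache.get("fq=category:hardware")      # touch: now the most recently used entry
cache.put("fq=price:[0 TO 10]", {3})   # exceeds capacity, evicts "fq=inStock:true"
```

Note that `get` updates the recency order, so reads are not free in this simple design; how that bookkeeping is handled is one place where cache implementations can differ.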
Performance tuning
•   Explicitly warm sorted fields

•   Check cache statistics often

•   JVM options - don’t leave the OS without memory!

•   mergeFactor - affects the number of index
    segments created on disk

    •   low mF : smaller number of index files, which speeds
        up searching, but more segment merges slow down
        indexing

    •   high mF : generally improves indexing speed, but merges
        are less frequent, resulting in a collection with
        more index files, which may slow searching
Performance tuning
• set autocommit where possible: this avoids
  closing and reopening IndexReaders every
  time a document is indexed - you can choose
  the max number of documents and/or the
  time to wait before a commit is done
  automatically
• finally... you need to get your hands dirty!
Deploy
• SolrPackager by Simone Tripodi!
• It’s a Maven archetype
• Create standalone/multicore project
• Each project will generate a master and a slave
  instance
• Define environment dependent properties without
  having to manage N config files
• 'mvn -Pdev package' creates a Tomcat package
  for the development environment
Case Study

• Architecture analysis
• Plugin development
• Testing and support
Challenges


• Architecture
• Schema design
Challenge
• Architecture
 • 4B docs of ~4k each
 • ~3 req/sec overall
 • 3 collections:
   • |archive| = 3B
   • |2010-2011| = 1M
   • |intranet| = 0.9B
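A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below assumes that "~4k" means roughly 4 KB (4,000 bytes) of raw content per document; that reading is an assumption, and the result is raw content size, not index size.

```python
# Collection sizes from the slide, in number of documents
collections = {
    "archive": 3_000_000_000,
    "2010-2011": 1_000_000,
    "intranet": 900_000_000,
}
total_docs = sum(collections.values())

AVG_DOC_BYTES = 4_000  # assumption: "~4k" read as about 4 KB per document
raw_terabytes = total_docs * AVG_DOC_BYTES / 1e12

REQUESTS_PER_SECOND = 3
requests_per_day = REQUESTS_PER_SECOND * 24 * 60 * 60

print(f"{total_docs:,} docs, ~{raw_terabytes:.1f} TB raw, "
      f"{requests_per_day:,} requests/day")
```

Under these assumptions the three collections add up to about 3.9 billion documents and roughly 15.6 TB of raw content, while the query load stays modest at about 259,200 requests per day: a huge index with low query traffic.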
Challenge
• Content analysis
• get the example Solr schema.xml
• optimize the schema in order to enable both
  stemmed and unstemmed versions of fields: author,
  title, text, cat
• add omitNorms="true" where possible
• add a field ‘html_content’ which will contain an
  HTML text but will be searched as clean text
• all string fields should be lowercased
Extras
• Clustering (Solr-Carrot2)
• Named entity extraction (Solr-UIMA)
• SolrCloud (Solr-Zookeeper)
• ManifoldCF
• Stanbol EntityHub
• Solandra (Solr-Cassandra)
THANKS!

Gihbli AI and Geo sitution |use/misuse of Ai Technology
zainkhurram1111
 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
 
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptxECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
Jasper Oosterveld
 
AI Emotional Actors: “When Machines Learn to Feel and Perform"
AI Emotional Actors:  “When Machines Learn to Feel and Perform"AI Emotional Actors:  “When Machines Learn to Feel and Perform"
AI Emotional Actors: “When Machines Learn to Feel and Perform"
AkashKumar809858
 
Palo Alto Networks Cybersecurity Foundation
Palo Alto Networks Cybersecurity FoundationPalo Alto Networks Cybersecurity Foundation
Palo Alto Networks Cybersecurity Foundation
VICTOR MAESTRE RAMIREZ
 
New Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDBNew Ways to Reduce Database Costs with ScyllaDB
New Ways to Reduce Database Costs with ScyllaDB
ScyllaDB
 
Grannie’s Journey to Using Healthcare AI Experiences
Grannie’s Journey to Using Healthcare AI ExperiencesGrannie’s Journey to Using Healthcare AI Experiences
Grannie’s Journey to Using Healthcare AI Experiences
Lauren Parr
 
Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025Kubernetes Cloud Native Indonesia Meetup - May 2025
Kubernetes Cloud Native Indonesia Meetup - May 2025
Prasta Maha
 
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath InsightsUiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights
UiPathCommunity
 
Evaluation Challenges in Using Generative AI for Science & Technical Content
Evaluation Challenges in Using Generative AI for Science & Technical ContentEvaluation Challenges in Using Generative AI for Science & Technical Content
Evaluation Challenges in Using Generative AI for Science & Technical Content
Paul Groth
 
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)
Peter Bittner
 
Jira Administration Training – Day 1 : Introduction
Jira Administration Training – Day 1 : IntroductionJira Administration Training – Day 1 : Introduction
Jira Administration Training – Day 1 : Introduction
Ravi Teja
 
Improving Developer Productivity With DORA, SPACE, and DevEx
Improving Developer Productivity With DORA, SPACE, and DevExImproving Developer Productivity With DORA, SPACE, and DevEx
Improving Developer Productivity With DORA, SPACE, and DevEx
Justin Reock
 
Dr Jimmy Schwarzkopf presentation on the SUMMIT 2025 A
Dr Jimmy Schwarzkopf presentation on the SUMMIT 2025 ADr Jimmy Schwarzkopf presentation on the SUMMIT 2025 A
Dr Jimmy Schwarzkopf presentation on the SUMMIT 2025 A
Dr. Jimmy Schwarzkopf
 
End-to-end Assurance for SD-WAN & SASE with ThousandEyes
End-to-end Assurance for SD-WAN & SASE with ThousandEyesEnd-to-end Assurance for SD-WAN & SASE with ThousandEyes
End-to-end Assurance for SD-WAN & SASE with ThousandEyes
ThousandEyes
 
Measuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI SuccessMeasuring Microsoft 365 Copilot and Gen AI Success
Measuring Microsoft 365 Copilot and Gen AI Success
Nikki Chapple
 
Data Virtualization: Bringing the Power of FME to Any Application
Data Virtualization: Bringing the Power of FME to Any ApplicationData Virtualization: Bringing the Power of FME to Any Application
Data Virtualization: Bringing the Power of FME to Any Application
Safe Software
 
STKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 versionSTKI Israel Market Study 2025 final v1 version
STKI Israel Market Study 2025 final v1 version
Dr. Jimmy Schwarzkopf
 
Co-Constructing Explanations for AI Systems using Provenance
Co-Constructing Explanations for AI Systems using ProvenanceCo-Constructing Explanations for AI Systems using Provenance
Co-Constructing Explanations for AI Systems using Provenance
Paul Groth
 
Cyber security cyber security cyber security cyber security cyber security cy...
Cyber security cyber security cyber security cyber security cyber security cy...Cyber security cyber security cyber security cyber security cyber security cy...
Cyber security cyber security cyber security cyber security cyber security cy...
pranavbodhak
 
Gihbli AI and Geo sitution |use/misuse of Ai Technology
Gihbli AI and Geo sitution |use/misuse of Ai TechnologyGihbli AI and Geo sitution |use/misuse of Ai Technology
Gihbli AI and Geo sitution |use/misuse of Ai Technology
zainkhurram1111
 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
 
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptxECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
ECS25 - The adventures of a Microsoft 365 Platform Owner - Website.pptx
Jasper Oosterveld
 
AI Emotional Actors: “When Machines Learn to Feel and Perform"
AI Emotional Actors:  “When Machines Learn to Feel and Perform"AI Emotional Actors:  “When Machines Learn to Feel and Perform"
AI Emotional Actors: “When Machines Learn to Feel and Perform"
AkashKumar809858
 

Apache Solr crash course

  • 1. Apache Solr crash course Tommaso Teofili
  • 2. Agenda • IR • Solr • Tips&tricks • Case study • Extras
  • 3. Information Retrieval • “Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)” - P. Nayak, Stanford University
  • 4. Inverted index • Each document has an id and a list of terms • For each term t we must store a list of all documents that contain t • Identify each document by its id
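The inverted index described on this slide can be sketched in a few lines of Python. This is a toy illustration, not Lucene's data structure: the sample documents are hypothetical, and "analysis" here is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analysis (an assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Apache Solr is a search server",
    2: "Apache Lucene is a search library",
    3: "Solr builds on Lucene",
}
index = build_inverted_index(docs)
print(index["solr"])    # ids of the documents containing "solr"
print(index["lucene"])  # ids of the documents containing "lucene"
```

Looking up a term is then a single dictionary access, which is why this structure suits term queries better than scanning every document.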
  • 5. IR Metrics • How good is an IR system? • Precision: Fraction of retrieved docs that are relevant to user’s information need • Recall: Fraction of relevant docs in collection that are retrieved
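As a small worked example of the two metrics above, the following sketch computes precision and recall from sets of document ids. The id lists are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against a relevance judgment."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs returned, 3 docs actually relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).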
  • 6. Apache Lucene • Information Retrieval library • Inverted index of documents • Vector space model • Advanced search options (synonyms, stopwords, similarity, proximity)
  • 7. Lucene API - indexing • Lucene indexes are built on a Directory • Directory can be accessed by IndexReaders and IndexWriters • IndexSearchers are built on top of Directories and IndexReaders • IndexWriters can write Documents inside the index • Documents are made of Fields • Fields have values • Directory > IndexReader/Writer > Document > Field
  • 8. Lucene API - searching • Open an IndexSearcher on top of an IndexReader over a Directory • Many query types: TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery, NumericRangeQuery, ... • Get results from a TopDocs object
  • 9. Apache Solr • Ready to use enterprise search server • REST (and programmatic) API • Results in XML, JSON, PHP, Ruby, etc... • Exploit Lucene power • Scaling capabilities (replication, distributed search, ...) • Administration interface • Customizable via plugins
  • 11. Apache Solr - admin UI
  • 12. Solr - project status • Solr 3.1.0 version released in March 2011 • Lucene/Solr is now a single project • Huge community • Backed by Lucid Imagination
  • 13. Solr basic configuration • schema.xml • contains type definitions for field analysis (field type + tokenizers + filters) • contains field definitions • solrconfig.xml • contains the Solr instance configuration
  • 14. Solr - schema.xml • Types (with index/query Analyzers) • Fields with name, type and options • Unique key • Dynamic fields • Copy fields
  • 15. Solr - content analysis • define documents’ model • each document consists of fields • each field • has attributes telling Solr how to handle its contents • contains free text, keywords, dates, numbers, etc.
  • 16. Solr - content analysis • Analyzer: creates tokens using a Tokenizer and, optionally, some filters (TokenFilters) • Each field can define one Analyzer at ‘query’ time and another at ‘index’ time, or the same in both cases • Each field can be indexed (searchable), stored (can be fetched with results), multivalued, required, etc.
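The tokenizer-plus-filters chain can be pictured with a toy pipeline. This is not Solr's implementation: the function names and the stopword list are hypothetical stand-ins for a whitespace tokenizer, a lowercase filter and a stop filter.

```python
STOPWORDS = {"a", "an", "the", "is", "of"}  # hypothetical stopword list

def whitespace_tokenizer(text):
    """Split the raw field value into tokens on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, tokenizer, filters):
    """Run one tokenizer, then each filter in order, as an analyzer chain would."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("The Power of Apache Solr", whitespace_tokenizer,
                 [lowercase_filter, stop_filter])
print(tokens)
```

Filter order matters in this sketch: the stop filter only matches lowercase words, so it has to run after the lowercase filter.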
  • 17. Solr - content analysis • Commonly used tokenizers: • WhitespaceTokenizerFactory • StandardTokenizerFactory • KeywordTokenizerFactory • PatternTokenizerFactory • HTMLStripWhitespaceTokenizerFactory • HTMLStripStandardTokenizerFactory
  • 18. Solr - content analysis • Commonly used TokenFilters: • SnowballPorterFilterFactory • StopFilterFactory • LengthFilterFactory • LowerCaseFilterFactory • WordDelimiterFilterFactory • SynonymFilterFactory • PatternReplaceFilterFactory • ReverseWildcardFilterFactory • CharFilterFactories (Mapping, HTMLStrip)
  • 20. Solr - solrconfig.xml • Data directory (where Solr will write the Lucene index) • Caches configuration: documents, query results, filters • Request handlers definition (search/update handlers) • Update request processor chains definition • Event listeners (newSearcher, firstSearcher) • Fine tuning parameters • ...
  • 21. Solr - indexing • Update requests on the index are given as XML commands via HTTP POST • <add> to insert and update: <add><doc boost="2.5"><field name="employeeId">05991</field></doc></add> • <delete> to remove by unique key or query: <delete><id>05991</id></delete> or <delete><query>office:Bridgewater</query></delete> • <commit/> to reopen readers on the new index version • <optimize/> to optimize the index internal structure for faster access
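A minimal sketch of how a client might build these XML update messages, using only Python's standard library. The helper names are hypothetical; the snippet only produces the payload strings, which would then be sent via HTTP POST as the slide describes.

```python
import xml.etree.ElementTree as ET

def add_message(fields, boost=None):
    """Build an <add> command for one document (hypothetical helper)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_by_id_message(doc_id):
    """Build a <delete> command that removes a document by unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

add_xml = add_message({"employeeId": "05991"}, boost=2.5)
delete_xml = delete_by_id_message("05991")
print(add_xml)
print(delete_xml)
```

Building the message with an XML library rather than string concatenation means field values containing `<` or `&` are escaped automatically.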
  • 22. Solr - basic indexing • REST call - XML/JSON • curl 'http://localhost:8983/solr/update?commit=true' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>' • curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '{ "add": {"doc": {"id" : "TestDoc1", "title" : "test1"} } }'
  • 23. Solr - binary files indexing • Many documents are produced in (proprietary) binary formats: PDF, RTF, XLS, etc. • Apache Tika is integrated in the Solr REST service for indexing such documents • curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfi[email protected]"
  • 24. Solr - index analysis • Luke is a tool for navigating Lucene indexes • For each field: top terms, distinct terms, terms histogram, etc. • LukeRequestHandler: • http://localhost:8983/solr/admin/luke?wt=xslt&tr=luke.xsl
  • 25. Solr - data import handler • DBMS • FileSystem • HTTP
  • 28. Solr - query syntax • query fields with fieldname:value • + - AND OR NOT operators • Range queries on date or numeric fields, ex: timestamp:[* TO NOW] • Boost terms, ex: people^4 profits • Fuzzy search, ex: roam~0.6 • Proximity search, ex: “apache solr”~2 • ...
  • 29. Solr - basic search • parameters: • q: the query • start: offset of the first result • rows: max no. of results returned • fl: comma separated list of fields to return • defType: specify the query parser • debugQuery: enable query debugging • wt: result format (xml, json, php, ruby, javabin, etc)
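The parameters listed above are plain query-string parameters, so a request URL can be assembled with a few lines of Python. The helper below is a hypothetical sketch; it only builds the URL (against the `select` endpoint used elsewhere in these slides) and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_SELECT = "http://localhost:8983/solr/select"  # endpoint as used in the slides

def select_url(q, start=0, rows=10, fl=None, wt="json", **extra):
    """Assemble a search URL from the basic parameters (hypothetical helper)."""
    params = {"q": q, "start": start, "rows": rows, "wt": wt}
    if fl:
        params["fl"] = ",".join(fl)  # comma-separated list of fields to return
    params.update(extra)             # e.g. defType or debugQuery
    return SOLR_SELECT + "?" + urlencode(params)

url = select_url("title:solr", rows=5, fl=["id", "score"], defType="edismax")
print(url)
print(parse_qs(urlsplit(url).query))  # decoded view of the same parameters
```

Letting `urlencode` do the escaping avoids hand-encoding characters such as `:` and spaces in the query.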
  • 30. Solr - query parsers • Most used: • Default Lucene query parser • DisMax query parser • eDisMax query parser
  • 31. Solr - highlighting • can be done on fields with stored="true" • returns a snippet containing the highlighted terms for each doc • enabled with hl=true&hl.fl=fieldname1,fieldname2
  • 32. Solr - sorting results • Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single term • add parameter &sort=score desc, inStock desc, price asc • can sort on function queries (see later)
  • 33. Solr - filter queries • get a subset of the index • place it in a cache • run queries for such a “filter” in memory • add parameter &fq=category:hardware • if multiple fq parameters the query will be run against the intersection of the specified filters
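The idea behind filter queries (cache the set of matching document ids per filter, then intersect) can be illustrated with a toy model. This is a sketch of the concept under simplified assumptions, not Solr's filter cache: the class, the exact-match filters and the sample documents are all hypothetical.

```python
class FilterCache:
    """Toy cache: one frozenset of matching doc ids per (field, value) filter."""

    def __init__(self, docs):
        self.docs = docs
        self.cache = {}
        self.misses = 0

    def doc_ids(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.misses += 1  # the filter is evaluated only on a cache miss
            self.cache[key] = frozenset(
                i for i, d in self.docs.items() if d.get(field) == value)
        return self.cache[key]

def search(docs, cache, matches, fqs):
    """Run the main query, then intersect with every cached filter set."""
    ids = {i for i, d in docs.items() if matches(d)}
    for field, value in fqs:
        ids &= cache.doc_ids(field, value)
    return sorted(ids)

docs = {  # hypothetical documents
    1: {"name": "ddr memory", "category": "hardware", "inStock": True},
    2: {"name": "hard drive", "category": "hardware", "inStock": False},
    3: {"name": "memory profiler", "category": "software", "inStock": True},
}
cache = FilterCache(docs)
has_memory = lambda d: "memory" in d["name"]
first = search(docs, cache, has_memory, [("category", "hardware")])
second = search(docs, cache, has_memory, [("category", "hardware"), ("inStock", True)])
print(first, second, cache.misses)
```

The second search reuses the cached `category:hardware` set and only computes the new `inStock` filter, which is the benefit of moving reusable restrictions out of `q` and into `fq`.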
  • 34. Solr - facets • facet by: • field value • arbitrary queries • range • can facet on fields with indexed="true"
  • 35. Solr - function queries • allow deep customization of ranking: • http://localhost:8983/solr/select/?fl=score,id&q=DDR&sort=termfreq(text,memory)%20desc • functions: sum, sub, product, div, pow, abs, log, sqrt, map, scale, termfreq, ...
  • 36. Solr - query elevation • useful for “marketing” • configure the top results for a given query regardless of the normal Lucene scoring • http://localhost:8983/solr/elevate?q=best%20product&enableElevation=true
  • 37. Solr - spellchecking • collects suggestions for the input query • optionally corrects the user query with “suggested” terms
  • 38. Solr - spellchecking • build a spellcheck index dynamically • return suggested results • http://localhost:8983/solr/spell?q=hell%20ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true • useful to create custom query converters • <queryConverter name="queryConverter" class="org.apache.solr.spelling.SpellingQueryConverter"/>
  • 39. Solr - similarity • get documents “similar” to a given document or a set of documents • Vector Space Model • http://localhost:8983/solr/select?q=apache&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
  • 40. Solr - geospatial search • index location data • query by spatial concepts and sort by distance • find all documents whose store position is within 5 km of a specified point • http://localhost:8983/solr/select?indent=true&fl=name,store&q=*:*&fq={!geofilt%20sfield=store}&pt=45.15,-93.85&d=5
  • 41. Solr - field collapsing • group resulting documents on a per-field basis • http://localhost:8983/solr/select?indent=true&fl=id,name&q=solr+memory&group=true&group.field=manu_exact • useful for displaying results in a smart way • see SOLR-236
  • 42. Solr - join • new feature (SOLR-2272) • many users ask for it • quite a paradigm change • http://localhost:8983/solr/select?q={!join+from=manu_id_s%20to=id}ipod&fl=id,manu_id&debugQuery=true
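To make the "within 5 km of a point" filter concrete, here is a standalone sketch using the haversine great-circle formula. It is an illustration of the concept only, not Solr's geofilt code; the store names and coordinates are hypothetical, and the Earth radius is the usual mean-radius approximation.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within(stores, pt, d_km):
    """Names of stores no farther than d_km from pt, nearest first."""
    hits = [(haversine_km(pt[0], pt[1], lat, lon), name)
            for name, (lat, lon) in stores.items()]
    return [name for dist, name in sorted(hits) if dist <= d_km]

# Hypothetical store positions; pt and d mirror the pt=45.15,-93.85&d=5 example.
stores = {"near": (45.17, -93.87), "far": (44.98, -93.27)}
nearby = within(stores, pt=(45.15, -93.85), d_km=5)
print(nearby)
```

The "near" store is roughly 2.7 km from the query point, while "far" is tens of kilometres away, so only the first passes the 5 km filter.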
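Conceptually, field collapsing buckets an already-ranked result list by the value of one field and keeps only the top few documents per bucket. The sketch below shows that idea on hypothetical in-memory results; it is not how Solr implements grouping.

```python
def group_by_field(docs, field, limit=1):
    """Group ranked docs by `field`, keeping at most `limit` docs per group."""
    groups = {}  # dicts preserve insertion order, so groups stay in rank order
    for doc in docs:  # docs are assumed to be sorted by relevance already
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

# Hypothetical ranked results for a query such as "solr memory".
results = [
    {"id": "a", "manu_exact": "Corsair"},
    {"id": "b", "manu_exact": "Kingston"},
    {"id": "c", "manu_exact": "Corsair"},
]
grouped = group_by_field(results, "manu_exact")
for manu, members in grouped.items():
    print(manu, [d["id"] for d in members])
```

With `limit=1` each manufacturer shows only its best hit, which is the "smart display" use case mentioned on the slide.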
  • 47. Solr Architectures • Simple • Multicore • Replication • Sharded
  • 48. Solr - MultiCore • Define multiple Solr cores inside a single Solr instance • Each core maintains its own index • Unified administration interface • Runtime commands to create, reload, load, unload, delete, swap cores • Cores can be thought of as ‘collections’ • Allows deploying new features/bugfixes with no downtime
  • 49. Solr - Replication • With high traffic it is useful to replicate a Solr instance and split the search load (possibly with a VIP in front) • The master has the “original” index • The slave polls the master for the latest index version • If the slave has a different index version, it asks the master for the delta (rsync-like) • Meanwhile, indexes remain available • (Almost) no impact of indexing on search
  • 50. Solr - Shards • When an index is too large, in terms of space or memory required, it can be useful to define two or more shards • A shard is a Solr instance and can be searched or indexed independently • At the same time it’s possible to query all the shards, with the result merged from the sub-results of each shard • http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&q=category:information • Note that the document distribution among indexes is up to the user (or whoever feeds the indexes)
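The merge step of a distributed query can be pictured as combining per-shard hit lists, each already sorted by descending score, into one global top-N list. This is a simplified sketch with made-up scores and ids, not Solr's distributed search code.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, rows=10):
    """Merge hit lists sorted by descending score into a single top-`rows` list."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, rows))

# Hypothetical (score, doc id) hits returned by two shards.
shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.5, "b2"), (0.1, "b3")]
top = merge_shard_results([shard_a, shard_b], rows=3)
print(top)
```

Because every shard returns its hits already sorted, the coordinator only needs a streaming merge and can stop as soon as `rows` results have been collected.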
  • 51. Solr - Architectures • When to use each? • KISS principle • High query load : replication • Huge index : shard
  • 52. Solr - Architectures • High queries w/ large indexes : shard + replication
  • 53. Solr - Architectures • Tips & Tricks: • Don’t share indexes between Master and Slaves on distributed file systems (locking) • Avoid distributed file systems anyway (slow) • Lucene/Solr is I/O intensive and thus performs better with fast disks • Always use MultiCore - hot deploy of changes/bugfixes • Replication is network intensive • Check replication poll time and indexing rates
  • 54. Tips&Tricks • Solr based SE development process • Plugins • Performance tuning • Deploy
  • 55. Process - t0 analysis • Analyze content • Analyze queries • Analyze collections • Pre-existing query/index load (if any) • Expected query/index load • Desired throughput/avg response time • First architecture
  • 56. Process - n-th iteration • index 10-15% of the content • search stress test (analyze peaks) - use SolrMeter • quality tests from stakeholders (accuracy, recall) • add/reconfigure features if needed • check http://wiki.apache.org/solr/FieldOptionsByUseCase and make sure fields used for faceting/sorting/highlighting/etc. have proper options • if field types/analysis/options need to change, rebuild the index
  • 57. Solr - Plugins • QParserPlugin • RequestHandler (Search/UpdateHandler) • UpdateRequestProcessor • ResponseWriter • Cache
  • 58. Performance tuning • Much of the tuning is done in schema.xml • Configure Solr caches • Set auto commit where possible • Play with mergeFactor
  • 59. Performance tuning • The number of indexed fields greatly increases memory usage during indexing, segment merge time, optimization times, index size • Stored fields impact index size, search time, ... • set omitNorms="true" where it makes sense (disabling length normalization and index time boosting) • set omitTermFreqAndPositions="true" if no queries on this field use positions, or if its term frequencies should not influence the score
  • 60. Performance tuning • FilterCache - unordered document ids for caching filter queries • QueryResultCache - ordered document ids for caching query results (caching only the returned docs) • DocumentCache - stores stored fields (at least <max_results> * <max_concurrent_queries>) • Set up autowarming - keep caches warm after commits
  • 61. Performance tuning • Choose correct cache implementation FastLRUCache vs LRUCache • FastLRUCache has faster gets and slower puts in single threaded operation and thus is generally faster than LRUCache when the hit ratio of the cache is high (> 75%)
  • 62. Performance tuning • Explicitly warm sorted fields • Check cache statistics often • JVM options - don’t leave the OS without memory! • mergeFactor - impacts the number of index segments created on disk • low mF: smaller number of index files, which speeds up searching, but more segment merges slow down indexing • high mF: generally improves indexing speed but merges are less frequent, resulting in a collection with more index files, which may slow searching
  • 63. Performance tuning • set autocommit where possible; this avoids closing and reopening IndexReaders every time a document is indexed - can choose the max number of documents and/or the time to wait before committing automatically • finally... you need to get your hands dirty!
  • 64. Deploy • SolrPackager by Simone Tripodi! • It’s a Maven archetype • Create standalone/multicore project • Each project will generate a master and a slave instance • Define environment dependent properties without having to manage N config files • ‘mvn -Pdev package’ // will create a Tomcat package for the development environment
  • 67. Case Study • Architecture analysis • Plugin development • Testing and support
  • 69. Challenge • Architecture • 4B docs of ~4k each • ~3 req/sec overall • 3 collections: • |archive| = 3B • |2010-2011| = 1M • |intranet| = 0.9B
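A quick back-of-the-envelope calculation shows why these numbers push towards a sharded architecture. The sketch assumes "~4k" means about 4 KiB of raw content per document; the result is raw content size, not the size of the resulting Lucene index, which depends on the schema.

```python
DOC_COUNT = 4_000_000_000
AVG_DOC_BYTES = 4 * 1024  # assumption: "~4k" read as roughly 4 KiB per document

raw_bytes = DOC_COUNT * AVG_DOC_BYTES
raw_tib = raw_bytes / 1024 ** 4  # bytes -> TiB

# Collection sizes as given on the slide.
collections = {
    "archive": 3_000_000_000,
    "2010-2011": 1_000_000,
    "intranet": 900_000_000,
}
total_docs = sum(collections.values())

print(f"raw content: about {raw_tib:.1f} TiB")
print(f"documents across collections: {total_docs:,}")
```

Under that assumption the raw content alone is on the order of 15 TiB, and the three collections add up to roughly the 4B documents quoted, with the archive dominating.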
  • 70. Challenge • Content analysis • get the example Solr schema.xml • optimize the schema in order to enable both stemmed and unstemmed versions of fields: author, title, text, cat • add omitNorms="true" where possible • add a field ‘html_content’ which will contain an HTML text but will be searched as clean text • all string fields should be lowercased
  • 71. Extras • Clustering (Solr-Carrot2) • Named entity extraction (Solr-UIMA) • SolrCloud (Solr-Zookeeper) • ManifoldCF • Stanbol EntityHub • Solandra (Solr-Cassandra)