Solr Update
                              code4lib conference, 13 February '13, Chicago
                                                   presented by
                                                   Erik Hatcher




© Copyright 2013 LucidWorks
Abstract

      Solr is continually improving. Solr 4 was recently released,
      bringing dramatic changes in the underlying Lucene library and
      Solr-level features. It's tough for us all to keep up with the
      various versions and capabilities.

      This talk will blaze through the highlights of new features and
      improvements in Solr 4 (and up). Topics will include: SolrCloud,
      direct spell checking, surround query parser, and many other
      features. We will focus on the features library coders really need
      to know about.




About: È  ℛ  Ỉ  Ķ  Ḫ  Ằ  Ţ  Ḉ  Ḣ  Ể  Ŕ

    • Co-author “Lucene in Action”
    • Lucene/Solr Committer and PMC, ASF Member
    • Senior Solutions Architect and co-founder, LucidWorks
       - (formerly Lucid Imagination)
    • Library Cred:
       - developer for Rossetti Archive and NINES
       - originator/namer of Blacklight




Lucene 4 Highlights

    • Flexible index formats
    • Pluggable scoring
    • String -> BytesRef
    • DWPT (Document Writer Per Thread)
       - faster, more consistent indexing speed
    • NRT (Near Real-Time)
       - per-segment loading of FieldCache, soft commits
    • Spatial overhaul
    • FST/FSA
       - FuzzyQuery over 100x faster
       - also reduces memory footprint for Terms index
    • And much much more!
       - See http://lucene.apache.org/core/4_1_0/changes/Changes.html


Flexible index formats

    •For terms, postings lists, stored fields, term vectors, etc
    •Several new posting list codecs
      - Pulsing (inlines low doc freq)
      - Block (packed int blocks)
      - SimpleText (debugging, transparency)
      - Bloom (experimental, also inlines low doc freq)
      - Appending (for append-only filesystems such as HDFS)
      - Memory (terms as FST)
    •Compressed stored fields




Pluggable scoring

    •Decoupled from traditional vector space (TF/IDF)
    •Additional index statistics
      - number of tokens for a term or field
      - number of postings for a field
      - number of documents with a posting for a field
    •Several built-in alternatives:
      - BM25
      - DFR – divergence from randomness
      - Information-based models
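To make one of the alternatives concrete, here is a minimal sketch of the textbook BM25 score for a single term in a single document. It is not Lucene's BM25Similarity code, which may differ in details such as how document length is stored; the function name and the defaults k1=1.2 and b=0.75 (the conventional values) are assumptions for illustration. Note how it uses the kind of index statistics listed above: document length and average document length.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    term_freq: occurrences of the term in the document
    doc_len / avg_doc_len: document length and collection average, in tokens
    num_docs / doc_freq: collection size and number of docs containing the term
    """
    # Rare terms get a higher weight; the "1 +" keeps the log non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalised, controlled by b.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates instead of growing linearly, controlled by k1.
    return idf * (term_freq * (k1 + 1.0)) / (term_freq + k1 * length_norm)
```

Unlike classic TF/IDF, repeating a term many times yields diminishing returns, which is one reason to try this model on long, repetitive documents.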




Indexing performance

    • http://people.apache.org/~mikemccand/lucenebench/indexing.html




QPS (primary key lookup)

    • http://people.apache.org/~mikemccand/lucenebench/PKLookup.html




FuzzyQuery

     • http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html




Solr 4 Highlights

     • Requires Java 1.6+
     • Pivot facets
        - http://wiki.apache.org/solr/SimpleFacetParameters#facet.pivot
     • DirectSpellChecker support
        - http://wiki.apache.org/solr/SpellCheckComponent
     • Improved document response
        - DocTransformer: [shard], [explain], [value], [docid]
        - Function query results
        - http://wiki.apache.org/solr/DocTransformers
     • Pseudo-join
        - http://wiki.apache.org/solr/Join
     • Surround query parser



More Solr 4 Highlights

     • Transaction log
     • Several new update processors, including a “script” one
        - http://wiki.apache.org/solr/ScriptUpdateProcessor
     • Spatial overhaul
        - http://wiki.apache.org/solr/SpatialSearch
     • Content-type savvy /update handler
     • SolrCloud
        - http://wiki.apache.org/solr/SolrCloud
     • And more!
        - See http://lucene.apache.org/solr/4_1_0/changes/Changes.html




Solr 4.1

     • Enhanced document routing (custom sharding)
     • Compressed stored fields
     • MoreLikeThis distributed capability
     • AnalyzingSuggester
        - http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html
        - via lookupImpl = org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory
        - and FuzzyLookupFactory
     • Many SolrCloud fixes and improvements
     • Stanford! - _query_ no longer needed to specify nested query
       parsers




Looks Good!




Pivot Faceting

     • Finds the top N constraints for field1, then for each of those,
       finds the top N constraints for field2, etc
     • Syntax: facet.pivot=field1,field2,field3,…

     facet.pivot=cat,inStock
                             #docs   #docs w/ inStock:true   #docs w/ inStock:false
     cat:electronics         14      10                      4
     cat:memory              3       3                       0
     cat:connector           2       0                       2
     cat:graphics card       2       0                       2
     cat:hard drive          2       2                       0
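The nested counting described above can be modelled in a few lines. This is a toy, in-memory sketch and not how Solr computes pivots; it assumes single-valued fields, and the function name and sample documents are hypothetical.

```python
from collections import Counter


def pivot_facet(docs, fields, top_n=10):
    """Count the top values of fields[0], then recurse into each value's
    documents for the remaining fields (toy model, single-valued fields)."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    counts = Counter(doc[field] for doc in docs if field in doc)
    result = []
    for value, count in counts.most_common(top_n):
        entry = {"field": field, "value": value, "count": count}
        if rest:
            # Only documents carrying this value feed the next pivot level.
            subset = [doc for doc in docs if doc.get(field) == value]
            entry["pivot"] = pivot_facet(subset, rest, top_n)
        result.append(entry)
    return result


# Hypothetical sample data, not the index behind the table above.
sample_docs = [
    {"cat": "memory", "inStock": True},
    {"cat": "memory", "inStock": True},
    {"cat": "connector", "inStock": False},
]
pivots = pivot_facet(sample_docs, ["cat", "inStock"])
```

Each entry carries its own count plus a nested "pivot" list for the next field, which mirrors the shape of the table: one row per cat value, broken down by inStock.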
DirectSpellChecker

     • Automaton-based
     • Candidates are presented directly from the term dictionary,
       based on Levenshtein distance.
     • A practical benefit of this spellchecker is that it requires no
       additional datastructures (neither in RAM nor on disk) to do its
       work.
        - http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
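To illustrate what "candidates from the term dictionary, based on Levenshtein distance" means, here is a brute-force sketch. It is a stand-in only: DirectSpellChecker is automaton-based and does not compare the input against every term as this code does. The function names, the max_edits default, and the sample terms are assumptions.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(term, dictionary, max_edits=2):
    """Return dictionary terms within max_edits of term, closest first."""
    scored = [(levenshtein(term, t), t) for t in dictionary if t != term]
    return [t for distance, t in sorted(scored) if distance <= max_edits]


# Hypothetical term dictionary.
candidates = suggest("solr", ["solar", "sole", "lucene", "solr"])
```

Because the candidates come straight from the indexed terms, there is no separate spelling index to build or keep in sync, which is the practical benefit noted above.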




Improved document response

     • Returns other info along with document stored fields
     • Function queries
        - fl=name,location,geodist(),add(myfield,10)
     • Fieldname globs
        - fl=id,attr_*
     • Multiple “fl” (field list) values
        - &fl=id,attr_*
        - &fl=geodist()
        - &fl=termfreq(text,'solr')
     • Aliasing
        - fl=id,location:loc,_dist_:geodist()
     • fl=id,[explain],[shard]



Improved document response example
     $ curl http://localhost:8983/solr/query?
         q=solr
         &fl=id,apache_mentions:termfreq(text,'apache')
         &fl=my_constant:"this is cool!"
         &fl=inStock,not(inStock)
         &fl=other_query_score:query($qq)
         &qq=text:search

     { "response":{"numFound":1,"start":0,"docs":[
           {
             "id":"SOLR1000",
             "apache_mentions":1,
             "my_constant":"this is cool!",
             "inStock":true,
             "not(inStock)":false,
             "other_query_score":0.84178084
           }]}}
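The request above is shown split across lines for readability; sent as-is, the spaces, quotes, and "$" would need URL encoding. A minimal sketch of assembling the same parameters with repeated fl values follows. The helper name build_query_url is hypothetical, and nothing is sent to a server here.

```python
from urllib.parse import urlencode


def build_query_url(base_url, q, field_lists, extra=None):
    """Build a query URL with one fl parameter per entry in field_lists."""
    params = [("q", q)] + [("fl", fl) for fl in field_lists]
    params += list((extra or {}).items())
    # urlencode percent-encodes quotes, spaces, '$' and ':' in the values.
    return base_url + "?" + urlencode(params)


url = build_query_url(
    "http://localhost:8983/solr/query",
    "solr",
    [
        "id,apache_mentions:termfreq(text,'apache')",
        'my_constant:"this is cool!"',
        "inStock,not(inStock)",
        "other_query_score:query($qq)",
    ],
    extra={"qq": "text:search"},
)
```

Passing a list of pairs rather than a dict is what allows the fl key to repeat.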




Query Parsing

     • _query_ no longer needed for nested queries
        - https://issues.apache.org/jira/browse/SOLR-4093
     • "surround" query parser
        - enables the use of Lucene's SpanQuery family, sophisticated proximity
          matching
        - Examples:
            »5n(dog cat)
            »dog 5w cat
        - http://wiki.apache.org/solr/SurroundQueryParse
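The difference between the two operators in the examples above can be sketched with a toy matcher over token positions. This is not the surround parser or SpanQuery; it assumes that "Nw" means the second term follows the first within N positions and "Nn" means the same in either order, which is worth checking against the surround documentation. The function name and sample text are hypothetical.

```python
def within(tokens, first, second, distance, ordered):
    """Toy proximity check: do first and second occur within `distance`
    positions of each other? ordered=True also requires first before second."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    for i in pos_first:
        for j in pos_second:
            gap = j - i
            if ordered and 0 < gap <= distance:
                return True
            if not ordered and 0 < abs(gap) <= distance:
                return True
    return False


tokens = "the cat chased the dog".split()
ordered_match = within(tokens, "dog", "cat", 5, ordered=True)     # like dog 5w cat
unordered_match = within(tokens, "dog", "cat", 5, ordered=False)  # like 5n(dog cat)
```

In this sample text "cat" precedes "dog", so only the unordered form matches.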




New Spatial Support

     • wiki.apache.org/solr/SpatialSearch
     • Multiple values per field
     • Index shapes other than points (circles, polygons, etc)
     • Indexing:
        - "geo":"43.17614,-90.57341"
        - "geo":"Circle(4.56,1.23 d=0.0710)"
        - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"


     • Searching:
        - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
        - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
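As a rough illustration of the rectangle form of Intersects, the sketch below tests a point against a bounding box. It assumes the point is written "lat,lon" and the rectangle "minX minY maxX maxY" with X as longitude and Y as latitude; that ordering is an assumption to verify against the SpatialSearch wiki page. It also ignores date-line wrapping and is not how Solr evaluates the filter. The function names are hypothetical.

```python
def parse_point(value):
    """Parse a 'lat,lon' string into a (lat, lon) tuple of floats."""
    lat, lon = (float(part) for part in value.split(","))
    return lat, lon


def point_in_rectangle(point, rectangle):
    """True if a 'lat,lon' point lies inside a 'minX minY maxX maxY' box
    (X = longitude, Y = latitude; assumed ordering, no date-line handling)."""
    lat, lon = parse_point(point)
    min_x, min_y, max_x, max_y = (float(part) for part in rectangle.split())
    return min_x <= lon <= max_x and min_y <= lat <= max_y


box = "-74.093 41.042 -69.347 44.558"
indexed_point_matches = point_in_rectangle("43.17614,-90.57341", box)
```

Under these assumptions the indexed point shown above falls west of the box, so the rectangle filter would not select it.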




Add and Retrieve document

     $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
     [
       { "id" : "book1",
         "title" : "Infinite Jest",
         "author" : "David Foster Wallace"
       }
     ]'




     $ curl http://localhost:8983/solr/get?id=book1
     {
       "doc": {
         "id" : "book1",
         "author": "David Foster Wallace",
         "title" : "Infinite Jest",
         "_version_": 1410390803582287872
       }
     }
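The same add and real-time get can be prepared from code. This sketch only builds the requests, using the URL and payload from the curl commands above; the helper names are hypothetical, and actually sending them assumes a Solr instance is running at that address.

```python
import json
from urllib.parse import quote
from urllib.request import Request

SOLR = "http://localhost:8983/solr"


def make_add_request(docs):
    """Build (but do not send) a JSON POST to /update for a list of docs."""
    body = json.dumps(docs).encode("utf-8")
    return Request(SOLR + "/update", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


def make_get_url(doc_id):
    """URL for the real-time /get handler for one document id."""
    return SOLR + "/get?id=" + quote(doc_id, safe="")


add_request = make_add_request([
    {"id": "book1", "title": "Infinite Jest", "author": "David Foster Wallace"}
])
get_url = make_get_url("book1")
```

To execute, pass add_request (or get_url) to urllib.request.urlopen. As the response above shows, /get returns the document together with its _version_ without a commit having been issued.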


Atomic Updates
         $ curl http://localhost:8983/solr/update \
                  -H 'Content-type:application/json' -d '
         [
           {"id"         : "book1",
             "pubyear_i" : { "add" : 2006 },
             "ISBN_s"    : { "add" : "0-380-97365-1"}
           }
         ]'




        $ curl http://localhost:8983/solr/update \
                 -H 'Content-type:application/json' -d '
        [
          {"id"         : "book1",
           "copies_i"   : { "inc" : 1},
           "cat"        : { "add" : "fiction"},
           "ISBN_s"     : { "set" : "0-316-92004-5"},
           "remove_s"   : { "set" : null } }
        ]'
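What the three modifiers do to a stored document can be modelled with a small function. This is a toy model of the semantics shown above (add, set, inc, and set-to-null as removal), not Solr's implementation; treating a missing field as 0 for "inc" is an assumption of the sketch, and the stored document is hypothetical.

```python
def apply_atomic_update(stored, update, key="id"):
    """Return a copy of `stored` with the atomic-update modifiers applied."""
    doc = dict(stored)
    for field, change in update.items():
        if field == key:
            continue  # the key only identifies which document to change
        if not isinstance(change, dict):
            doc[field] = change  # a plain value simply replaces the field
            continue
        for op, value in change.items():
            if op == "set":
                if value is None:
                    doc.pop(field, None)  # set to null removes the field
                else:
                    doc[field] = value
            elif op == "add":
                existing = doc.get(field)
                if existing is None:
                    doc[field] = value
                elif isinstance(existing, list):
                    doc[field] = existing + [value]
                else:
                    doc[field] = [existing, value]
            elif op == "inc":
                doc[field] = doc.get(field, 0) + value  # missing counts as 0 here
            else:
                raise ValueError("unsupported modifier: " + op)
    return doc


stored = {"id": "book1", "title": "Infinite Jest", "copies_i": 1,
          "ISBN_s": "0-380-97365-1", "remove_s": "obsolete"}
updated = apply_atomic_update(stored, {
    "id": "book1",
    "copies_i": {"inc": 1},
    "cat": {"add": "fiction"},
    "ISBN_s": {"set": "0-316-92004-5"},
    "remove_s": {"set": None},
})
```

The point of the feature is visible in the call: only the changed fields are sent, and untouched fields such as title survive.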


Pseudo-Join
id: blog1                                      id: post1
name: Blog 1                                   blog_id: blog1
owner: c4l                                     author: John Doe
started: 2007-10-26                            title: Pseudo-join can be handy!
                                               body: Here's how to use {!join....}


id: blog2                                      id: post2
name: Blog 2                                   blog_id: blog1
owner: zoia                                    author: John Doe
started: 2005-1-31                             title: Solr Update
                                               body: Live streaming today!


                                               id: post3
                                               blog_id: blog2
                                               author: Jane Doe
                                               title: What's New at code4lib

  Restrict to blogs mentioning code4lib:
    - How it works:
                               fq={!join from=blog_id to=id}body:code4lib
             - Finds all documents matching “code4lib”
             - Maps to different docs by following blog_id to id
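The two steps listed above can be sketched over an in-memory list of documents. This is a toy model with single-valued fields, not Solr's join implementation; the function name is hypothetical. The documents are the ones from the slide, and because "code4lib" appears there only in a title, the predicate in this sketch matches on title rather than body.

```python
def pseudo_join(docs, from_field, to_field, matches):
    """Toy {!join from=... to=...}: collect from_field values of matching
    docs, then return the docs whose to_field holds one of those values."""
    keys = {doc[from_field] for doc in docs
            if from_field in doc and matches(doc)}
    return [doc for doc in docs if doc.get(to_field) in keys]


docs = [
    {"id": "blog1", "name": "Blog 1", "owner": "c4l"},
    {"id": "blog2", "name": "Blog 2", "owner": "zoia"},
    {"id": "post1", "blog_id": "blog1", "title": "Pseudo-join can be handy!"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr Update"},
    {"id": "post3", "blog_id": "blog2", "title": "What's New at code4lib"},
]

blogs = pseudo_join(docs, "blog_id", "id",
                    lambda doc: "code4lib" in doc.get("title", ""))
```

Only post3 matches, its blog_id is blog2, so the result is the blog2 document. Nothing from the matching post is carried along, which is why this is a "pseudo" join: it filters one set of documents by another rather than merging their fields.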
Pseudo-Join Examples

• Only show posts from blogs started after 2010
         &fq={!join from=id to=blog_id}started:[2010 TO *]
• If any post in a blog mentions “Chicago”, then search all posts in
  that blog for “conference” (self-join)
         q=conference
         &fq={!join from=blog_id to=blog_id}Chicago
• If any blog post mentions “Chicago”, then search all emails with
  the same blog owner for “conference”
         q=email_body:conference
         &fq={!join from=owner_email_user to=email_user}{!join from=blog_id to=id}Chicago




Cross-Core Join

http://localhost:8983/solr/collection1/select?q=*:*
&fq={!join fromIndex=sec1 from=security_groups to=security}user:john



id: doc1                                      id: mary
security: managers                            security_groups: managers, employees
title: doc for managers only
body: …
                                              id: john
id: doc2                                      security_groups: employees
security: managers, employees
title: doc for everyone
body: …


                    collection1                               sec1

                                  Single Solr Server

New UpdateProcessors

     • FieldMutatingUpdateProcessor family:
        - ConcatField, CountField, FieldLength, HTMLStripField, IgnoreField,
          RegexReplace, RemoveBlankField, TrimField, TruncateField
     • ScriptUpdateProcessor
        - enables update processing code to be written in a scripting language. The
          script can be written in any scripting language supported by your JVM (such as
          JavaScript), and executed dynamically so no pre-compilation is necessary.
        - http://wiki.apache.org/solr/ScriptUpdateProcessor




SolrCloud: Solr 4’s scalability

     • Sharded leaders and replicas
     • ZooKeeper used for cluster management
     • Distributed indexing
        - Automatically distributes updates to appropriate shard
        - Facilitates Near Real-Time (NRT) searching
     • Distributed search
        - Automatically distributes to nodes of each shard
     • Robust, automatic update recovery
     • Real-time /get
        - Leverages transaction log
     • No single point of failure
     • Large scale NRT using soft commits
     • Transaction log uses:
       - Durability for updates that have not yet been committed
       - Peer syncing in SolrCloud
       - Real-time get


SolrCloud Visualization




                         Image from http://bit.ly/X4E5H9
Near Real Time (NRT) softCommit

      • softCommit opens a new view of the index without flushing +
        fsyncing files to disk
         - Decouples update visibility from update durability
      • commitWithin now implies a soft commit
      • Current autoCommit defaults from solrconfig.xml:

     <autoCommit>
       <maxTime>15000</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

     <!--
       <autoSoftCommit>
          <maxTime>1000</maxTime>
       </autoSoftCommit>
     -->
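The idea of decoupling visibility from durability can be modelled with two counters. This is a conceptual toy, not Solr's commit machinery: it leaves out the transaction log (which in real Solr protects updates before a hard commit) and segment handling, and the class and method names are hypothetical.

```python
class ToyIndex:
    """Tracks which added docs are searchable and which are flushed to disk."""

    def __init__(self):
        self.docs = []
        self.visible_count = 0   # docs a search can see
        self.durable_count = 0   # docs flushed and fsynced in this model

    def add(self, doc):
        self.docs.append(doc)

    def soft_commit(self):
        # New view of the index, nothing is flushed.
        self.visible_count = len(self.docs)

    def hard_commit(self, open_searcher=True):
        # Flush to disk; only open a new view if asked to.
        self.durable_count = len(self.docs)
        if open_searcher:
            self.visible_count = len(self.docs)

    def search(self):
        return self.docs[:self.visible_count]

    def durable(self):
        return self.docs[:self.durable_count]


index = ToyIndex()
index.add("doc1")
index.soft_commit()                      # doc1 searchable, not yet flushed
index.add("doc2")
index.hard_commit(open_searcher=False)   # both flushed, doc2 still not searchable
```

The last call mirrors the autoCommit block above: with openSearcher set to false, the periodic hard commit takes care of flushing while leaving search visibility to soft commits.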



Solr is NoSQL

     • Update durability
        - A transaction log ensures that even uncommitted documents are never lost.
     • Real-time Get
        - The ability to quickly retrieve the latest version of a document, without the need
          to commit or open a new searcher
     • Versioning and Optimistic Locking
        - combined with real-time get, this allows read-update-write functionality that
          ensures no conflicting changes were made concurrently by other clients.
     • Atomic updates
        - the ability to add, remove, change, and increment fields of an existing
          document without having to send in the complete document again.
     • Real-time /get combined with SolrCloud make a very powerful
       key/value pair database
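The read-update-write cycle that versioning enables can be sketched with a toy in-memory store. It models only the simplest case, where a write carrying an expected version is rejected unless that version is still current; Solr's _version_ handling has further cases that are not modelled here, and the class, exception, and version numbering are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the expected one."""


class ToyStore:
    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        """Real-time get: return the latest stored copy, or None."""
        doc = self._docs.get(doc_id)
        return dict(doc) if doc is not None else None

    def put(self, doc, expected_version=None):
        """Store doc; if expected_version is given it must match exactly."""
        current = self._docs.get(doc["id"])
        if expected_version is not None:
            current_version = current["_version_"] if current else None
            if current_version != expected_version:
                raise VersionConflict(doc["id"])
        self._clock += 1
        stored = dict(doc)
        stored["_version_"] = self._clock
        self._docs[doc["id"]] = stored
        return self._clock


store = ToyStore()
v1 = store.put({"id": "book1", "copies_i": 1})

# Client A reads, modifies, and writes back with the version it read.
doc = store.get("book1")
doc["copies_i"] += 1
v2 = store.put(doc, expected_version=doc["_version_"])

# Client B still holds v1, so its write is refused instead of clobbering A's.
try:
    store.put({"id": "book1", "copies_i": 10}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
```

On a conflict the client re-reads the document with a real-time get and retries, which is the read-update-write loop described above.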



Ṁ  Ȇ  Ʈ  Ẳ  Ḍ  Â  Ṭ  Ä




Future

     • JSON Query Parser
        - https://issues.apache.org/jira/browse/SOLR-4351
     • Shard splitting
        - https://issues.apache.org/jira/browse/SOLR-3755




Credits

     • LucidWorks
        - lucidworks.com
     • Manning Publications
        - manning.com/lucene
     • Apache Software Foundation
        - apache.org
     • Apache Lucene
        - lucene.apache.org




Contact Info

     •IRC: erikhatcher
     •erik dot hatcher @ lucidworks dot com
     •@ErikHatcher
     •http://searchhub.org/author/erik/
     •http://erikhatcher.tumblr.com/




Get at me...




                         @ErikHatcher





More Related Content

ODP
Linked Open Communism - c4l13
charper
 
PPT
Tthornton code4lib
trevorthornton
 
PDF
Mysql
Chris Henry
 
PDF
NoSQL store everyone ignored - Postgres Conf 2021
Zohaib Hassan
 
PDF
Spark Cassandra 2016
Duyhai Doan
 
PDF
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
PDF
Elastic Search Training#1 (brief tutorial)-ESCC#1
medcl
 
PDF
Perl Programming - 04 Programming Database
Danairat Thanabodithammachari
 
Linked Open Communism - c4l13
charper
 
Tthornton code4lib
trevorthornton
 
NoSQL store everyone ignored - Postgres Conf 2021
Zohaib Hassan
 
Spark Cassandra 2016
Duyhai Doan
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Elastic Search Training#1 (brief tutorial)-ESCC#1
medcl
 
Perl Programming - 04 Programming Database
Danairat Thanabodithammachari
 

What's hot (20)

PDF
MongoDB Advanced Topics
César Rodas
 
PDF
Beyond full-text searches with Lucene and Solr
Bertrand Delacretaz
 
ODP
Database Programming with Perl and DBIx::Class
Dave Cross
 
PPTX
MongoDB (Advanced)
TO THE NEW | Technology
 
PPTX
Mongodb hackathon 02
Vivek Aanand Ganesan
 
PDF
쉽게 이해하는 LOD
Myungjin Lee
 
PDF
Introduction to PostgreSQL
Mark Wong
 
PDF
Hands On Spring Data
Eric Bottard
 
PDF
Rapid Prototyping with Solr
Erik Hatcher
 
PPTX
Intro to Linked, Dutch Ships and Sailors and SPARQL handson
Victor de Boer
 
PPT
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Ecommerce Solution Provider SysIQ
 
PDF
第2回 Hadoop 輪読会
Toshihiro Suzuki
 
PDF
Apache Solr Workshop
Saumitra Srivastav
 
PDF
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Erik Hatcher
 
PDF
Building your own search engine with Apache Solr
Biogeeks
 
PDF
Solr Application Development Tutorial
Erik Hatcher
 
PPTX
Apache Solr
Minh Tran
 
PPTX
SuRf – Tapping Into The Web Of Data
cosbas
 
PDF
Solr Recipes Workshop
Erik Hatcher
 
PDF
Catmandu / LibreCat Project
Patrick Hochstenbach
 
MongoDB Advanced Topics
César Rodas
 
Beyond full-text searches with Lucene and Solr
Bertrand Delacretaz
 
Database Programming with Perl and DBIx::Class
Dave Cross
 
MongoDB (Advanced)
TO THE NEW | Technology
 
Mongodb hackathon 02
Vivek Aanand Ganesan
 
쉽게 이해하는 LOD
Myungjin Lee
 
Introduction to PostgreSQL
Mark Wong
 
Hands On Spring Data
Eric Bottard
 
Rapid Prototyping with Solr
Erik Hatcher
 
Intro to Linked, Dutch Ships and Sailors and SPARQL handson
Victor de Boer
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Ecommerce Solution Provider SysIQ
 
第2回 Hadoop 輪読会
Toshihiro Suzuki
 
Apache Solr Workshop
Saumitra Srivastav
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Erik Hatcher
 
Building your own search engine with Apache Solr
Biogeeks
 
Solr Application Development Tutorial
Erik Hatcher
 
Apache Solr
Minh Tran
 
SuRf – Tapping Into The Web Of Data
cosbas
 
Solr Recipes Workshop
Erik Hatcher
 
Catmandu / LibreCat Project
Patrick Hochstenbach
 
Ad

Viewers also liked (20)

PPTX
Gimme shelter: Tips on protecting proprietary and open source code
Rogue Wave Software
 
PPTX
Hackathon
Provectus
 
PDF
What's New in Solr 3.x / 4.0
Erik Hatcher
 
PPTX
Open source applied: Real-world uses
Rogue Wave Software
 
PDF
Solr Black Belt Pre-conference
Erik Hatcher
 
PDF
Solr 4
Erik Hatcher
 
PPT
Faceted Search – the 120 Million Documents Story
Sourcesense
 
PDF
Lucene for Solr Developers
Erik Hatcher
 
PPTX
Сергей Моренец: "Gradle. Write once, build everywhere"
Provectus
 
PDF
Apache Solr Changes the Way You Build Sites
Peter
 
PDF
Lucene for Solr Developers
Erik Hatcher
 
PDF
Solr Indexing and Analysis Tricks
Erik Hatcher
 
PDF
Why I want to Kazan
Provectus
 
PDF
Meet Solr For The Tirst Again
Varun Thacker
 
PDF
Solr Masterclass Bangkok, June 2014
Alexandre Rafalovitch
 
PPTX
Solr 6 Feature Preview
Yonik Seeley
 
PDF
Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
 
PDF
Multi faceted responsive search, autocomplete, feeds engine & logging
lucenerevolution
 
PDF
Solr Powered Libraries
Erik Hatcher
 
PDF
Lucene's Latest (for Libraries)
Erik Hatcher
 
Gimme shelter: Tips on protecting proprietary and open source code
Rogue Wave Software
 
Hackathon
Provectus
 
What's New in Solr 3.x / 4.0
Erik Hatcher
 
Open source applied: Real-world uses
Rogue Wave Software
 
Solr Black Belt Pre-conference
Erik Hatcher
 
Solr 4
Erik Hatcher
 
Faceted Search – the 120 Million Documents Story
Sourcesense
 
Lucene for Solr Developers
Erik Hatcher
 
Сергей Моренец: "Gradle. Write once, build everywhere"
Provectus
 
Apache Solr Changes the Way You Build Sites
Peter
 
Lucene for Solr Developers
Erik Hatcher
 
Solr Indexing and Analysis Tricks
Erik Hatcher
 
Why I want to Kazan
Provectus
 
Meet Solr For The Tirst Again
Varun Thacker
 
Solr Masterclass Bangkok, June 2014
Alexandre Rafalovitch
 
Solr 6 Feature Preview
Yonik Seeley
 
Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
 
Multi faceted responsive search, autocomplete, feeds engine & logging
lucenerevolution
 
Solr Powered Libraries
Erik Hatcher
 
Lucene's Latest (for Libraries)
Erik Hatcher
 
Ad

Similar to "Solr Update" at code4lib '13 - Chicago (20)

PPTX
What's new in Lucene and Solr 4.x
Grant Ingersoll
 
PDF
Building Lanyrd
Simon Willison
 
PPTX
Open Source Search FTW
Grant Ingersoll
 
PDF
Introduction to Solr
Erik Hatcher
 
PDF
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Lucidworks (Archived)
 
PDF
Apache Solr crash course
Tommaso Teofili
 
PDF
Introduction to Solr
Erik Hatcher
 
PDF
Solr Powered Lucene
Erik Hatcher
 
PDF
Needle in an enterprise haystack
Andrew Mleczko
 
PDF
Solr 3.1 and beyond
Lucidworks (Archived)
 
PPTX
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Kai Chan
 
PDF
Solr Recipes
Erik Hatcher
 
PPTX
Solr introduction
Lap Tran
 
PDF
Rapid Prototyping with Solr
Erik Hatcher
 
PDF
Rapid Prototyping with Solr
Erik Hatcher
 
PDF
NoSQL, Apache SOLR and Apache Hadoop
Dmitry Kan
 
PDF
Building a Real-time Solr-powered Recommendation Engine
lucenerevolution
 
PDF
Introduction to Solr
Erik Hatcher
 
PPTX
Introduction to Apache Lucene/Solr
Rahul Jain
 
PDF
Oslo Solr MeetUp March 2012 - Solr4 alpha
Cominvent AS
 
What's new in Lucene and Solr 4.x
Grant Ingersoll
 
Building Lanyrd
Simon Willison
 
Open Source Search FTW
Grant Ingersoll
 
Introduction to Solr
Erik Hatcher
 
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Lucidworks (Archived)
 
Apache Solr crash course
Tommaso Teofili
 
Introduction to Solr
Erik Hatcher
 
Solr Powered Lucene
Erik Hatcher
 
Needle in an enterprise haystack
Andrew Mleczko
 
Solr 3.1 and beyond
Lucidworks (Archived)
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Kai Chan
 
Solr Recipes
Erik Hatcher
 
Solr introduction
Lap Tran
 
Rapid Prototyping with Solr
Erik Hatcher
 
Rapid Prototyping with Solr
Erik Hatcher
 
NoSQL, Apache SOLR and Apache Hadoop
Dmitry Kan
 
Building a Real-time Solr-powered Recommendation Engine
lucenerevolution
 
Introduction to Solr
Erik Hatcher
 
Introduction to Apache Lucene/Solr
Rahul Jain
 
Oslo Solr MeetUp March 2012 - Solr4 alpha
Cominvent AS
 

More from Erik Hatcher (11)

PDF
Ted Talk
Erik Hatcher
 
PDF
Solr Payloads
Erik Hatcher
 
PDF
it's just search
Erik Hatcher
 
PDF
Solr Query Parsing
Erik Hatcher
 
PDF
Query Parsing - Tips and Tricks
Erik Hatcher
 
PDF
Lucene for Solr Developers
Erik Hatcher
 
PDF
Solr Flair
Erik Hatcher
 
PDF
Lucene for Solr Developers
Erik Hatcher
 
PDF
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
Erik Hatcher
 
PDF
Rapid Prototyping with Solr
Erik Hatcher
 
PDF
Solr Flair: Search User Interfaces Powered by Apache Solr
Erik Hatcher
 
Ted Talk
Erik Hatcher
 
Solr Payloads
Erik Hatcher
 
it's just search
Erik Hatcher
 
Solr Query Parsing
Erik Hatcher
 
Query Parsing - Tips and Tricks
Erik Hatcher
 
Lucene for Solr Developers
Erik Hatcher
 
Solr Flair
Erik Hatcher
 
Lucene for Solr Developers
Erik Hatcher
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
Erik Hatcher
 
Rapid Prototyping with Solr
Erik Hatcher
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Erik Hatcher
 

Recently uploaded (20)

PDF
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
DOCX
Top AI API Alternatives to OpenAI: A Side-by-Side Breakdown
vilush
 
PDF
Software Development Company | KodekX
KodekX
 
PDF
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
PDF
CIFDAQ's Teaching Thursday: Moving Averages Made Simple
CIFDAQ
 
PDF
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
PDF
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
PPTX
How to Build a Scalable Micro-Investing Platform in 2025 - A Founder’s Guide ...
Third Rock Techkno
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
PDF
How Onsite IT Support Drives Business Efficiency, Security, and Growth.pdf
Captain IT
 
PDF
Enable Enterprise-Ready Security on IBM i Systems.pdf
Precisely
 
PPT
L2 Rules of Netiquette in Empowerment technology
Archibal2
 
PDF
This slide provides an overview Technology
mineshkharadi333
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PPTX
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
PDF
Building High-Performance Oracle Teams: Strategic Staffing for Database Manag...
SMACT Works
 
PDF
Chapter 2 Digital Image Fundamentals.pdf
Getnet Tigabie Askale -(GM)
 
PDF
Doc9.....................................
SofiaCollazos
 
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
Top AI API Alternatives to OpenAI: A Side-by-Side Breakdown
vilush
 
Software Development Company | KodekX
KodekX
 
BLW VOCATIONAL TRAINING SUMMER INTERNSHIP REPORT
codernjn73
 
CIFDAQ's Teaching Thursday: Moving Averages Made Simple
CIFDAQ
 
AI Unleashed - Shaping the Future -Starting Today - AIOUG Yatra 2025 - For Co...
Sandesh Rao
 
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
How to Build a Scalable Micro-Investing Platform in 2025 - A Founder’s Guide ...
Third Rock Techkno
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
How Onsite IT Support Drives Business Efficiency, Security, and Growth.pdf
Captain IT
 
Enable Enterprise-Ready Security on IBM i Systems.pdf
Precisely
 
L2 Rules of Netiquette in Empowerment technology
Archibal2
 
This slide provides an overview Technology
mineshkharadi333
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
Building High-Performance Oracle Teams: Strategic Staffing for Database Manag...
SMACT Works
 
Chapter 2 Digital Image Fundamentals.pdf
Getnet Tigabie Askale -(GM)
 
Doc9.....................................
SofiaCollazos
 

"Solr Update" at code4lib '13 - Chicago

  • 1. Solr Update code4lib conference, 13 Feburary '13, Chicago presented by Erik Hatcher © Copyright 2013 LucidWorks
  • 2. Abstract Solr is continually improving. Solr 4 was recently released, bringing dramatic changes in the underlying Lucene library and Solr-level features. It's tough for us all to keep up with the various versions and capabilities. This talk will blaze through the highlights of new features and improvements in Solr 4 (and up). Topics will include: SolrCloud, direct spell checking, surround query parser, and many other features. We will focus on the features library coders really need to know about. © 2013 LucidWorks 2
  • 3. About: È  ℛ  Ỉ  Ķ  Ḫ  Ằ  Ţ  Ḉ  Ḣ  Ể  Ŕ • Co-author “Lucene in Action” • Lucene/Solr Committer and PMC, ASF Member • Senior Solutions Architect and co-founder, LucidWorks - (formerly Lucid Imagination) • Library Cred: - developer for Rossetti Archive and NINES - originator/namer of Blacklight © 2013 LucidWorks 3
  • 5. Lucene 4 Highlights • Flexible index formats • Pluggable scoring • String -> BytesRef • DWPT (Document Writer Per Thread) - faster, more consistent indexing speed • NRT (Near Real-Time) - per-segment loading of FieldCache, soft commits • Spatial overhaul • FST/FSA - FuzzyQuery over 100x faster - also reduces memory footprint for Terms index • And much much more! - See http://lucene.apache.org/core/4_1_0/changes/Changes.html © 2013 LucidWorks 5
  • 6. Flexible index formats •For terms, postings lists, stored fields, term vectors, etc •Several new posting list codecs - Pulsing (inlines low doc freq) - Block (packed int blocks) - SimpleText (debugging, transparency) - Bloom (experimental, also inlines low doc freq) - Appending (for append-only filesystems such as HDFS) - Memory (terms as FST) •Compressed stored fields © 2013 LucidWorks 6
  • 7. Pluggable scoring •Decoupled from traditional vector space (TF/IDF) •Additional index statistics - number of tokens for a term or field - number of postings for a field - number of documents with a posting for a field •Several built-in alternatives: - BM25 - DFR – divergence from randomness - Information-based models © 2013 LucidWorks 7
  • 8. Indexing performance • http://people.apache.org/~mikemccand/lucenebench/ indexing.html © 2013 LucidWorks 8
  • 9. QPS (primary key lookup) • http://people.apache.org/~mikemccand/lucenebench/ PKLookup.html © 2013 LucidWorks 9
  • 10. FuzzyQuery • http://people.apache.org/~mikemccand/lucenebench/ Fuzzy2.html © 2013 LucidWorks 10
  • 12. Solr 4 Highlights • Requires Java 1.6+ • Pivot facets - http://wiki.apache.org/solr/SimpleFacetParameters#facet.pivot • DirectSpellChecker support - http://wiki.apache.org/solr/SpellCheckComponent • Improved document response - DocTransformer: [shard], [explain], [value], [docid] - Function query results - http://wiki.apache.org/solr/DocTransformers • Pseudo-join - http://wiki.apache.org/solr/Join • Surround query parser © 2013 LucidWorks 12
  • 13. More Solr 4 Highlights • Transaction log • Several new update processors, including a “script” one - http://wiki.apache.org/solr/ScriptUpdateProcessor • Spatial overhaul - http://wiki.apache.org/solr/SpatialSearch • Content-type savvy /update handler • SolrCloud - http://wiki.apache.org/solr/SolrCloud • And more! - See http://lucene.apache.org/solr/4_1_0/changes/Changes.html © 2013 LucidWorks 13
  • 14. Solr 4.1 • Enhanced document routing (custom sharding) • Compressed stored fields • MoreLikeThis distributed capability • AnalyzingSuggester - http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing- suggester.html - via lookupImpl = org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory - and FuzzyLookupFactory • Many SolrCloud fixes and improvements • Stanford! - _query_ no longer needed to specify nested query parsers © 2013 LucidWorks 14
  • 15. Looks Good! © 2013 LucidWorks 15
  • 16. Pivot Faceting • Finds the top N constraints for field1, then for each of those, finds the top N constraints for field2, etc • Syntax: facet.pivot=field1,field2,field3,… facet.pivot=cat,inStock #docs #docs w/ #docs w/ inStock:true instock:false cat:electronics 14 10 4 cat:memory 3 3 0 cat:connector 2 0 2 cat:graphics card 2 0 2 cat:hard drive 2 2 0 © 2013 LucidWorks 16
  • 17. DirectSpellChecker
      • Automaton-based
      • Candidates are presented directly from the term dictionary, based on Levenshtein distance.
      • A practical benefit of this spellchecker is that it requires no additional data structures (neither in RAM nor on disk) to do its work.
         - http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
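To make the "based on Levenshtein distance" criterion concrete, here is a small sketch of the idea in Python. It is not how DirectSpellChecker is implemented: the real thing is automaton-based, while this version brute-forces every term. The `terms` list is a made-up stand-in for the index's term dictionary, and `suggest` is a hypothetical name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, term_dictionary, max_edits=2):
    """Return dictionary terms within max_edits of word, closest first."""
    scored = [(levenshtein(word, t), t) for t in term_dictionary if t != word]
    return [t for d, t in sorted(scored) if d <= max_edits]


# Made-up stand-in for the terms already present in the index.
terms = ["solr", "solar", "lucene", "search", "sole"]
print(suggest("slor", terms))
```

Because candidates come straight from terms already in the index, nothing extra has to be built or stored, which is the practical benefit the slide points out.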
  • 18. Improved document response
      • Returns other info along with document stored fields
      • Function queries
         - fl=name,location,geodist(),add(myfield,10)
      • Fieldname globs
         - fl=id,attr_*
      • Multiple “fl” (field list) values
         - &fl=id,attr_*
         - &fl=geodist()
         - &fl=termfreq(text,'solr')
      • Aliasing
         - fl=id,location:loc,_dist_:geodist()
      • fl=id,[explain],[shard]
  • 19. Improved document response example

        $ curl -G http://localhost:8983/solr/query \
            --data-urlencode "q=solr" \
            --data-urlencode "fl=id,apache_mentions:termfreq(text,'apache')" \
            --data-urlencode 'fl=my_constant:"this is cool!"' \
            --data-urlencode "fl=inStock,not(inStock)" \
            --data-urlencode 'fl=other_query_score:query($qq)' \
            --data-urlencode "qq=text:search"

        {
          "response":{"numFound":1,"start":0,"docs":[
            {
              "id":"SOLR1000",
              "apache_mentions":1,
              "my_constant":"this is cool!",
              "inStock":true,
              "not(inStock)":false,
              "other_query_score":0.84178084
            }]
          }
        }
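The same request can be assembled from code. This sketch only builds the URL, using the parameters from the slide; it does not contact Solr. Passing a list of pairs (rather than a dict) is what allows "fl" to repeat.

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs keeps the repeated "fl" parameters; a dict would collapse them.
params = [
    ("q", "solr"),
    ("fl", "id,apache_mentions:termfreq(text,'apache')"),
    ("fl", 'my_constant:"this is cool!"'),
    ("fl", "inStock,not(inStock)"),
    ("fl", "other_query_score:query($qq)"),
    ("qq", "text:search"),
]

query_string = urlencode(params)  # percent-encodes quotes, spaces, $ and so on
url = "http://localhost:8983/solr/query?" + query_string
print(url)
```

Encoding each value separately avoids the quoting problems that come up when pasting function queries and quoted constants into a shell.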
  • 20. Query Parsing
      • _query_ no longer needed for nested queries
         - https://issues.apache.org/jira/browse/SOLR-4093
      • "surround" query parser
         - enables the use of Lucene's SpanQuery family for sophisticated proximity matching
         - Examples:
            » 5n(dog cat)
            » dog 5w cat
         - http://wiki.apache.org/solr/SurroundQueryParse
  • 21. New Spatial Support
      • wiki.apache.org/solr/SpatialSearch
      • Multiple values per field
      • Index shapes other than points (circles, polygons, etc.)
      • Indexing:
         - "geo":"43.17614,-90.57341"
         - "geo":"Circle(4.56,1.23 d=0.0710)"
         - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
      • Searching:
         - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
         - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
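The search filters on the slide are plain strings, so they are easy to generate. A minimal sketch follows, assuming the four numbers in the rectangle form are minX minY maxX maxY (the slide does not say). Both helper names are hypothetical.

```python
def intersects_rect(field, min_x, min_y, max_x, max_y):
    """Build an Intersects filter for a rectangle (assumed order: minX minY maxX maxY)."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'


def intersects_polygon(field, points):
    """Build an Intersects filter for a polygon given as (x, y) pairs.

    The first and last point should be the same so the ring is closed.
    """
    ring = ", ".join(f"{x} {y}" for x, y in points)
    return f'{field}:"Intersects(POLYGON(({ring})))"'


print(intersects_rect("geo", -74.093, 41.042, -69.347, 44.558))
print(intersects_polygon(
    "geo", [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]))
```

The two calls reproduce the fq values shown on the slide; the resulting string still needs URL-encoding before it goes into a request.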
  • 22. Add and Retrieve document

        $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
        [
          { "id"     : "book1",
            "title"  : "Infinite Jest",
            "author" : "David Foster Wallace"
          }
        ]'

        $ curl http://localhost:8983/solr/get?id=book1
        {
          "doc": {
            "id"       : "book1",
            "author"   : "David Foster Wallace",
            "title"    : "Infinite Jest",
            "_version_": 1410390803582287872
          }
        }
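For library code that prefers not to shell out to curl, the same two requests can be prepared with Python's standard library. This is a sketch that only constructs the requests; the base URL is the one from the slide, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"  # base URL used on the slide


def build_add_request(docs):
    """Prepare a POST to /update carrying a JSON list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR + "/update",
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )


def build_get_url(doc_id):
    """Build the real-time get URL for one document id."""
    return SOLR + "/get?" + urllib.parse.urlencode({"id": doc_id})


book = {"id": "book1", "title": "Infinite Jest", "author": "David Foster Wallace"}
request = build_add_request([book])
print(request.get_method(), request.full_url)
print(build_get_url("book1"))
```

Sending is then a matter of passing the request (or the get URL) to `urllib.request.urlopen` against a running Solr instance.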
  • 23. Atomic Updates

        $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
        [
          { "id"        : "book1",
            "pubyear_i" : { "add" : 2006 },
            "ISBN_s"    : { "add" : "0-380-97365-1" }
          }
        ]'

        $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
        [
          { "id"       : "book1",
            "copies_i" : { "inc" : 1 },
            "cat"      : { "add" : "fiction" },
            "ISBN_s"   : { "set" : "0-316-92004-5" },
            "remove_s" : { "set" : null }
          }
        ]'
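The update documents above all follow one pattern: the id, plus one {operation: value} object per field. A small sketch of a builder for that pattern, limited to the three operations the slide shows (set, add, inc); `atomic_update` is a hypothetical helper, not part of any Solr client.

```python
import json

ALLOWED_OPS = ("set", "add", "inc")  # the operations used on the slide


def atomic_update(doc_id, **field_ops):
    """Build one atomic-update document from field=(op, value) pairs."""
    doc = {"id": doc_id}
    for field, (op, value) in field_ops.items():
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported atomic update operation: {op}")
        doc[field] = {op: value}
    return doc


update = atomic_update(
    "book1",
    copies_i=("inc", 1),
    cat=("add", "fiction"),
    ISBN_s=("set", "0-316-92004-5"),
    remove_s=("set", None),  # serialized as null, which removes the field
)
payload = json.dumps([update])
print(payload)
```

Generating the JSON this way also avoids hand-editing mistakes such as a missing comma between fields.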
  • 24. Pseudo-Join
      • Post documents:
         - id: post1, blog_id: blog1, author: John Doe, title: Pseudo-join can be handy!, body: Here's how to use {!join....}
         - id: post2, blog_id: blog1, author: John Doe, title: Solr Update, body: Live streaming today!
         - id: post3, blog_id: blog2, author: Jane Doe, title: What's New at code4lib
      • Blog documents:
         - id: blog1, name: Blog 1, owner: c4l, started: 2007-10-26
         - id: blog2, name: Blog 2, owner: zoia, started: 2005-1-31
      • Restrict to blogs with posts mentioning code4lib:
         - fq={!join from=blog_id to=id}body:code4lib
      • How it works:
         - Finds all documents matching “code4lib”
         - Maps to different docs by following blog_id to id
  • 25. Pseudo-Join Examples
      • Only show posts from blogs started after 2010
         - &fq={!join from=id to=blog_id}started:[2010 TO *]
      • If any post in a blog mentions “Chicago”, then search all posts in that blog for “conference” (self-join)
         - q=conference
         - &fq={!join from=blog_id to=blog_id}Chicago
      • If any blog post mentions “Chicago”, then search all emails with the same blog owner for “conference”
         - q=email_body:conference
         - &fq={!join from=owner_email_user to=email_user}{!join from=blog_id to=id}Chicago
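The two steps of a pseudo-join (match documents, then follow the "from" values to the "to" field) can be imitated in memory. This is only a model of the semantics, not Solr's implementation. It assumes single-valued fields, uses the sample posts and blogs from the previous slide, and stands in a Python predicate for the inner query.

```python
def join(docs, from_field, to_field, matches):
    """Model of {!join from=... to=...}: match, collect keys, map to other docs."""
    # Step 1: collect from_field values of every document the inner query matches.
    keys = {d[from_field] for d in docs if from_field in d and matches(d)}
    # Step 2: return the documents whose to_field holds one of those values.
    return [d for d in docs if d.get(to_field) in keys]


docs = [
    {"id": "post1", "blog_id": "blog1", "author": "John Doe", "title": "Pseudo-join can be handy!"},
    {"id": "post2", "blog_id": "blog1", "author": "John Doe", "title": "Solr Update"},
    {"id": "post3", "blog_id": "blog2", "author": "Jane Doe", "title": "What's New at code4lib"},
    {"id": "blog1", "name": "Blog 1", "owner": "c4l"},
    {"id": "blog2", "name": "Blog 2", "owner": "zoia"},
]

# Posts -> blogs: blogs that have a post whose title mentions code4lib.
blogs = join(docs, "blog_id", "id", lambda d: "code4lib" in d.get("title", ""))
# Blogs -> posts: posts belonging to blogs owned by c4l.
posts = join(docs, "id", "blog_id", lambda d: d.get("owner") == "c4l")
print([d["id"] for d in blogs], [d["id"] for d in posts])
```

Note that the result contains only the "to" side documents; nothing from the matched "from" documents is merged in, which is why it is called a pseudo-join.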
  • 26. Cross-Core Join

        http://localhost:8983/solr/collection1/select?q=*:*
          &fq={!join fromIndex=sec1 from=security_groups to=security}user:john

      • collection1:
         - id: doc1, security: managers, title: doc for managers only, body: …
         - id: doc1, security: managers, employees, title: doc for everyone, body: …
      • sec1:
         - id: mary, security_groups: managers, employees
         - id: john, security_groups: employees
      • Both cores live on a single Solr server
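Join filters are easy to get wrong by hand, so a tiny builder can help. This sketch only formats the local-params string using the three parameters that appear on the slides (fromIndex, from, to); `join_fq` is a hypothetical helper name.

```python
def join_fq(from_field, to_field, query, from_index=None):
    """Format a {!join ...} filter query; from_index makes it a cross-core join."""
    parts = ["join"]
    if from_index is not None:
        parts.append(f"fromIndex={from_index}")
    parts.append(f"from={from_field}")
    parts.append(f"to={to_field}")
    return "{!" + " ".join(parts) + "}" + query


print(join_fq("security_groups", "security", "user:john", from_index="sec1"))
print(join_fq("blog_id", "id", "body:code4lib"))
```

The first call reproduces the cross-core filter on the slide; the second is the same-core form from the pseudo-join slide.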
  • 27. New UpdateProcessors
      • FieldMutatingUpdateProcessor family:
         - ConcatField, CountField, FieldLength, HTMLStripField, IgnoreField, RegexReplace, RemoveBlankField, TrimField, TruncateField
      • ScriptUpdateProcessor
         - enables update processing code to be written in a scripting language. The script can be written in any scripting language supported by your JVM (such as JavaScript), and is executed dynamically, so no pre-compilation is necessary.
         - http://wiki.apache.org/solr/ScriptUpdateProcessor
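Conceptually, an update processor chain passes each incoming document through a series of small transformations before it is indexed. The sketch below models three of the field-mutating processors named above as plain Python functions. It is an illustration of the idea only: the real processors are Java classes configured in solrconfig.xml, and every function name here is hypothetical.

```python
def trim_fields(doc):
    """Like TrimField: strip leading and trailing whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def remove_blank_fields(doc):
    """Like RemoveBlankField: drop fields whose value is an empty string."""
    return {k: v for k, v in doc.items() if v != ""}


def truncate_field(name, max_length):
    """Like TruncateField: cut one string field down to max_length characters."""
    def processor(doc):
        out = dict(doc)
        if isinstance(out.get(name), str):
            out[name] = out[name][:max_length]
        return out
    return processor


def run_chain(doc, processors):
    """Apply each processor in order, as an update chain would."""
    for process in processors:
        doc = process(doc)
    return doc


incoming = {"id": "book1", "title": "  Infinite Jest  ", "note": "   "}
chain = [trim_fields, remove_blank_fields, truncate_field("title", 8)]
result = run_chain(incoming, chain)
print(result)
```

Order matters: trimming first is what turns the whitespace-only note into a blank value that the next step can remove.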
  • 28. SolrCloud: Solr 4’s scalability
      • Sharded leaders and replicas
      • ZooKeeper used for cluster management
      • Distributed indexing
         - Automatically distributes updates to appropriate shard
         - Facilitates Near Real-Time (NRT) searching
      • Distributed search
         - Automatically distributes to nodes of each shard
      • Robust, automatic update recovery
      • Real-time /get
         - Leverages transaction log
      • No single point of failure
      • Large scale NRT using soft commits
      • Transaction log uses:
         - Durability for updates that have not yet been committed
         - Peer syncing in SolrCloud
         - Real-time get
  • 29. SolrCloud Visualization
      • Image from http://bit.ly/X4E5H9
  • 30. Near Real Time (NRT) softCommit
      • softCommit opens a new view of the index without flushing + fsyncing files to disk
         - Decouples update visibility from update durability
      • commitWithin now implies a soft commit
      • Current autoCommit defaults from solrconfig.xml:

        <autoCommit>
          <maxTime>15000</maxTime>
          <openSearcher>false</openSearcher>
        </autoCommit>

        <!--
        <autoSoftCommit>
          <maxTime>1000</maxTime>
        </autoSoftCommit>
        -->
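Besides the solrconfig.xml settings above, a client can ask for visibility on a per-request basis. A minimal sketch that builds the /update URL with the commitWithin or softCommit request parameters; the base URL is the one used on the earlier slides, and `update_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def update_url(base="http://localhost:8983/solr", commit_within_ms=None, soft_commit=False):
    """Build an /update URL, optionally asking for a commit on this request."""
    params = {}
    if commit_within_ms is not None:
        # Ask Solr to make the update visible within this many milliseconds.
        params["commitWithin"] = str(commit_within_ms)
    if soft_commit:
        # Ask for an immediate soft commit once the update is applied.
        params["softCommit"] = "true"
    url = base + "/update"
    return url + "?" + urlencode(params) if params else url


print(update_url(commit_within_ms=5000))
print(update_url(soft_commit=True))
```

Using commitWithin lets many clients share commits instead of each forcing its own, which tends to suit NRT indexing better than explicit commits per request.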
  • 31. Solr is NoSQL
      • Update durability
         - A transaction log ensures that even uncommitted documents are never lost.
      • Real-time Get
         - The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher.
      • Versioning and Optimistic Locking
         - Combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients.
      • Atomic updates
         - The ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again.
      • Real-time /get combined with SolrCloud makes a very powerful key/value pair database
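The read-update-write cycle with optimistic locking can be modelled with a tiny in-memory store. This is a simplified illustration, not Solr code: it assumes only the rule that an update carrying a `_version_` must match the stored version exactly, and it uses a small counter where Solr assigns large version numbers like the one shown on the real-time get slide. `TinyStore` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 that Solr returns on a version mismatch."""


class TinyStore:
    """In-memory model of versioned documents with optimistic locking."""

    def __init__(self):
        self._docs = {}
        self._clock = 0  # simplification: a counter instead of Solr's version numbers

    def get(self, doc_id):
        """Return a copy of the latest stored document (like real-time get)."""
        return dict(self._docs[doc_id])

    def put(self, doc):
        """Store doc; if it carries _version_, that must match the stored version."""
        current = self._docs.get(doc["id"])
        expected = doc.get("_version_")
        if expected is not None and (current is None or current["_version_"] != expected):
            raise VersionConflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = dict(doc, _version_=self._clock)
        return self._clock


store = TinyStore()
store.put({"id": "book1", "copies_i": 1})

# Two clients read the same version, then both try to write.
first, second = store.get("book1"), store.get("book1")
first["copies_i"] += 1
store.put(first)  # succeeds and bumps the version

second["copies_i"] += 5
try:
    store.put(second)  # stale version: rejected
    conflicted = False
except VersionConflict:
    conflicted = True
print(conflicted, store.get("book1"))
```

On a conflict the losing client re-reads the document, reapplies its change to the fresh copy, and tries again.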
  • 33. Future
      • JSON Query Parser
         - https://issues.apache.org/jira/browse/SOLR-4351
      • Shard splitting
         - https://issues.apache.org/jira/browse/SOLR-3755
  • 34. Credits
      • LucidWorks
         - lucidworks.com
      • Manning Publications
         - manning.com/lucene
      • Apache Software Foundation
         - apache.org
      • Apache Lucene
         - lucene.apache.org
  • 35. Contact Info
      • IRC: erikhatcher
      • erik dot hatcher @ lucidworks dot com
      • @ErikHatcher
      • http://searchhub.org/author/erik/
      • http://erikhatcher.tumblr.com/
  • 36. Get at me... @ErikHatcher