Custom ES Analyzers

Original article · last modified 2025-03-26 22:18:13

License: CC 4.0 BY-SA


First published 2023-11-16 13:12:18


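This article describes the components of an Elasticsearch analyzer, covering the roles of character filters, the tokenizer, and token filters, and shows how to define a custom analyzer. Worked examples then show the analysis process and how different queries (term vs. match) behave.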


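1 Analyzer components and how they work

An ES analyzer consists of three main parts:

(1) Raw text processor: character filters

Preprocesses the raw text before tokenization, for example by adding, removing, or replacing certain characters. This step prepares the text for the tokenizer.

(2) Tokenizer

Splits the text into tokens according to a rule, for example on whitespace. Built-in tokenizers include the Standard Tokenizer (standard) and the Whitespace Tokenizer (whitespace).

(3) Token processor: token filters

Post-processes the tokens produced by the tokenizer, for example by converting case, removing stop words, or adding synonyms.

During analysis, the text passes through these three stages in order to produce the final tokens.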
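To make the order of the three stages concrete, here is a minimal pure-Python sketch of the pipeline. It does not call Elasticsearch. The helper names (analyze, replace_ampersand, whitespace_tokenizer, lowercase, remove_stopwords) are hypothetical and exist only for illustration, and the stop word set is a tiny sample rather than the real English list.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run the three stages in order: char filters, tokenizer, token filters."""
    for char_filter in char_filters:      # stage 1: rewrite the raw text
        text = char_filter(text)
    tokens = tokenizer(text)              # stage 2: split into tokens
    for token_filter in token_filters:    # stage 3: post-process the tokens
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real ES components (illustration only).
def replace_ampersand(text: str) -> str:
    return text.replace("&", " and ")


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    sample_stopwords = {"a", "an", "the"}  # small sample, not the full list
    return [t for t in tokens if t not in sample_stopwords]


tokens = analyze("The Cat & a Dog",
                 [replace_ampersand],
                 whitespace_tokenizer,
                 [lowercase, remove_stopwords])
print(tokens)  # ['cat', 'and', 'dog']
```

Note that the order of the token filters matters here: lowercasing runs first, so "The" becomes "the" before the stop word check.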

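2 Custom analyzer

The example below defines a custom analyzer and customizes each of the three components described above.

# custom_analyzer - name of the custom analyzer
# char_filter - preprocesses the raw text
# tokenizer - splits the text by the specified rule
# filter - post-processes the tokens produced by the tokenizer

# _english_ - English stop words, such as a, an, the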

PUT /test_analyzer_index_001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{ 
          "type":"custom", 
          "char_filter":["emoticons"],
          "tokenizer": "threeVerticalLine",
          "filter":["english_stop"]
        }
      },
      "char_filter": {
        "emoticons":{ 
          "type" : "mapping",
          "mappings" : [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "tokenizer": {
        "threeVerticalLine":{
          "type":"pattern",
          "pattern":"(\\|\\|\\|)"
        }
      },
      "filter": {
        "english_stop":{
          "type":"stop",
          "stopwords":"_english_"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "remark": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "search_analyzer": "custom_analyzer"
      }
    }
  }
}

analyzer specifies the analyzer applied to data as it is indexed into ES.

search_analyzer specifies the analyzer applied to the query input.

3 Testing the analyzer

3.1 Testing analysis

POST test_analyzer_index_001/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "进口药品|注册证H20130650|||进口药品注册证h20130650|||h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the"
}

Alternatively:
POST test_analyzer_index_001/_analyze
{
  "field":"remark",
  "text": "进口药品|注册证H20130650|||进口药品注册证h20130650|||h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the"
}

The analysis result is shown below.

{
  "tokens" : [
    {
      "token" : "进口药品|注册证H20130650",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "进口药品注册证h20130650",
      "start_offset" : 20,
      "end_offset" : 36,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "h20130650_haha_a_the",
      "start_offset" : 39,
      "end_offset" : 59,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "tom _happy_",
      "start_offset" : 62,
      "end_offset" : 68,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jack _sad_",
      "start_offset" : 71,
      "end_offset" : 78,
      "type" : "word",
      "position" : 4
    }
  ]
}

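The result shows that analysis first applied the emoticons rule to the raw text, then split it with the threeVerticalLine rule (that is, on "|||"), and finally removed English stop words with the english_stop rule. That is why the standalone a, an, and the tokens are missing while h20130650_haha_a_the survives: the stop filter compares whole tokens, not substrings.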
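The same three steps can be approximated in plain Python to check the token list by hand. This is only a rough simulation under stated assumptions, not how ES implements it: the mapping char filter is reduced to two string replacements, the pattern tokenizer to a regex split that drops empty pieces, and the _english_ list to the three stop words that appear in the sample text. The function name simulate_custom_analyzer is made up for this sketch, and offsets and positions are not computed.

```python
import re

# Assumption: only the two mappings from the "emoticons" char filter.
EMOTICON_MAPPINGS = {":)": "_happy_", ":(": "_sad_"}

# Assumption: a three-word subset of the _english_ stop word list.
ENGLISH_STOP_SUBSET = {"a", "an", "the"}


def simulate_custom_analyzer(text: str) -> list:
    # 1. char_filter "emoticons": replace emoticons in the raw text.
    for source, target in EMOTICON_MAPPINGS.items():
        text = text.replace(source, target)
    # 2. tokenizer "threeVerticalLine": split on "|||".
    #    No capturing group here, because Python's re.split would
    #    otherwise return the separators as well.
    pieces = [p for p in re.split(r"\|\|\|", text) if p]
    # 3. filter "english_stop": drop tokens that are exactly a stop word.
    return [p for p in pieces if p not in ENGLISH_STOP_SUBSET]


sample = ("进口药品|注册证H20130650|||进口药品注册证h20130650|||"
          "h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the")
tokens = simulate_custom_analyzer(sample)
for token in tokens:
    print(token)
```

The five printed tokens match the token values in the _analyze response above; a single "|" inside the first token is left alone because the pattern requires three consecutive pipes.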

3.2 Testing queries

Next, let's test how queries behave.

3.2.1 Inserting data

PUT /test_analyzer_index_001/_doc/1
{
  "remark": "进口药品|注册证H20130650|||进口药品注册证h20130650|||h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the"
}

3.2.2 Querying all data

GET /test_analyzer_index_001/_search
{
  "query": {
    "match_all": {}
  }
}

The result is as follows (response metadata omitted):

{
  "hits" : {
    "hits" : [
      {
        "_index" : "test_analyzer_index_001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "remark" : "进口药品|注册证H20130650|||进口药品注册证h20130650|||h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the"
        }
      }
    ]
  }
}

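3.2.3 term query

A term query does not analyze its input. It treats the input as a single unit and looks up that exact term in the inverted index.

(1) Scenario 1: the inserted document is returned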

GET /test_analyzer_index_001/_search
{
  "query": {
    "term": {
      "remark": "进口药品注册证h20130650"
    }
  }
}

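(2) Scenario 2: the result is empty

The following query returns nothing. When the document was indexed, the remark value "进口药品|注册证H20130650|||进口药品注册证h20130650|||h20130650_haha_a_the|||tom :)|||jack :(|||a|||an|||the" was analyzed, and the char filter turned "tom :)" into "tom _happy_". The inverted index therefore stores the analyzed term "tom _happy_", not the raw text. Because a term query does not analyze its input, searching for "tom :)" finds no matching term.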

GET /test_analyzer_index_001/_search
{
  "query": {
    "term": {
      "remark": "tom :)"
    }
  }
}
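The two term scenarios can be pictured with a toy inverted index in Python. This is a simplified model, not ES internals: the index is a plain dict whose keys are the five terms from the _analyze output in section 3.1, and term_query is a hypothetical helper that performs an exact lookup.

```python
# Terms stored at index time, copied from the _analyze output in 3.1.
INDEXED_TERMS = [
    "进口药品|注册证H20130650",
    "进口药品注册证h20130650",
    "h20130650_haha_a_the",
    "tom _happy_",
    "jack _sad_",
]

# Toy inverted index: term -> ids of the documents containing it.
inverted_index = {term: {"1"} for term in INDEXED_TERMS}


def term_query(term: str) -> set:
    """Exact lookup: the input is used as-is, with no analysis."""
    return set(inverted_index.get(term, set()))


hit = term_query("进口药品注册证h20130650")        # scenario 1
miss = term_query("tom :)")                       # scenario 2
analyzed_form = term_query("tom _happy_")         # the stored form
case_miss = term_query("进口药品注册证H20130650")  # uppercase H
print(hit, miss, analyzed_form, case_miss)
```

In this model, searching for the stored form "tom _happy_" does find the document. The uppercase variant misses because this analyzer has no lowercase filter, so terms are stored and compared case-sensitively.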

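3.2.4 match query

A match query analyzes the query text to produce a list of terms, runs a low-level query for each term, and then combines the results. With the default OR operator, a document that matches any one of the terms is returned.

(1) Scenario 1: the inserted document is returned

After analysis, the query terms are "进口药品|注册证H20130650", "tom _happy_", and "jack _sad_".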

GET /test_analyzer_index_001/_search
{
  "query": {
    "match": {
      "remark": "进口药品|注册证H20130650|||tom :)|||jack :("
    }
  }
}

(2) Scenario 2: the inserted document is returned

After analysis, the query term is "tom _happy_".

GET /test_analyzer_index_001/_search
{
  "query": {
    "match": {
      "remark": "tom :)"
    }
  }
}
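As a rough illustration of why both match scenarios return the document, the sketch below analyzes the query text first and then ORs the per-term lookups. It is a simplified model under stated assumptions: search_analyze is a hypothetical approximation of custom_analyzer (two emoticon replacements, a split on "|||", and a three-word stop list), the inverted index is a plain dict of the terms from section 3.1, and scoring is ignored.

```python
import re


def search_analyze(text: str) -> list:
    """Approximation of custom_analyzer applied to the query text."""
    text = text.replace(":)", "_happy_").replace(":(", "_sad_")
    tokens = [t for t in re.split(r"\|\|\|", text) if t]
    return [t for t in tokens if t not in {"a", "an", "the"}]


# Toy inverted index holding the terms of document "1" (see 3.1).
inverted_index = {
    "进口药品|注册证H20130650": {"1"},
    "进口药品注册证h20130650": {"1"},
    "h20130650_haha_a_the": {"1"},
    "tom _happy_": {"1"},
    "jack _sad_": {"1"},
}


def match_query(query_text: str) -> set:
    """Analyze the query, look up each term, and union the hits (OR)."""
    hits = set()
    for term in search_analyze(query_text):
        hits |= inverted_index.get(term, set())
    return hits


scenario_1 = match_query("进口药品|注册证H20130650|||tom :)|||jack :(")
scenario_2 = match_query("tom :)")
no_hit = match_query("tom")
print(scenario_1, scenario_2, no_hit)
```

The last call shows the flip side of splitting only on "|||": a query for "tom" alone finds nothing in this model, because the stored term is the whole segment "tom _happy_".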
