jieba Custom Dictionaries (Very Large Word Lists) and Model Reuse

Know_nothing_

Last modified 2023-10-10 19:23:09

Licensed under CC 4.0 BY-SA

Category: python. Tags: programming languages, python, Chinese word segmentation, natural language processing

First published 2023-10-10 19:12:34

Article link: https://blog.csdn.net/Know_nothing_/article/details/133754085

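Introduction

There are many articles online about jieba custom dictionaries.
Most of them, however, only paraphrase the official documentation and tell readers to use jieba.add_word or jieba.load_userdict.

In real production use, you run into the following:
1 The custom dictionary can be very large
2 Every program restart takes a long time
3 It is unclear how to reuse the dictionary model

This article addresses these problems.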

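Why build a custom dictionary

With the default dictionary, jieba often splits domain-specific terms into pieces, whereas we want those terms to appear whole.
Adding such terms to the dictionary model through a custom dictionary avoids this problem.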

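Two ways to build a custom dictionary

jieba.add_word or jieba.load_userdict. (Usage is covered in the official documentation, so it is not walked through in detail here.)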
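For orientation, here is a minimal sketch of both calls. The helper names `write_userdict` and `segment` are hypothetical and exist only for this illustration; the file layout (one entry per line: word, optional frequency, optional part-of-speech tag, separated by spaces, UTF-8 encoded) follows the load_userdict docstring. jieba is a third-party package and is imported lazily, so the file-writing helper runs without it.

```python
def write_userdict(entries, path):
    """Write entries in the plain-text layout that load_userdict reads.

    Each entry is a (word, freq, tag) tuple. freq and tag may be None,
    in which case that field is simply left off the line.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            fh.write(" ".join(parts) + "\n")


def segment(text, dict_path=None, extra_words=()):
    """Segment text after registering custom words (requires jieba)."""
    import jieba  # third-party; imported here so write_userdict works without it

    if dict_path is not None:
        jieba.load_userdict(dict_path)  # bulk-load entries from a file
    for word in extra_words:
        jieba.add_word(word)  # register a single word at runtime
    return jieba.lcut(text)
```

With jieba installed, something like `segment("自然语言处理和云计算", extra_words=["自然语言处理"])` returns the token list, and passing `dict_path` instead loads the same words from a file written by `write_userdict`.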

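The pitfall

That's right: it is jieba.load_userdict.

With this method, the dictionary has to be reloaded every time the program restarts, which is very slow. In my test with 4 million entries, loading took about 5 minutes.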
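The 5-minute figure is the author's own measurement and will vary with hardware and dictionary contents. To check the cost on your own dictionary, a small timing sketch like the one below may help. `timed` and `time_userdict_load` are hypothetical helper names introduced here; `jieba.initialize()` is called first so that loading the default dictionary is not counted in the result.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def time_userdict_load(dict_path):
    """Return the seconds spent inside jieba.load_userdict for dict_path."""
    import jieba  # third-party; imported lazily so timed() works without it

    jieba.initialize()  # build or load the default dictionary up front
    _, elapsed = timed(jieba.load_userdict, dict_path)
    return elapsed
```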

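The reason load_userdict is so slow is that loading a custom dictionary is a single-threaded I/O process that reads the file line by line. See the source: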

def load_userdict(self, f):
    '''
    Load personalized dict to improve detect rate.

    Parameter:
        - f : A plain text file contains words and their ocurrences.
              Can be a file-like object, or the path of the dictionary file,
              whose encoding must be utf-8.

    Structure of dict file:
    word1 freq1 word_type1
    word2 freq2 word_type2
    ...
    Word type may be ignored
    '''
    self.check_initialized()
    if isinstance(f, string_types):
        f_name = f
        f = open(f, 'rb')
    else:
        f_name = resolve_filename(f)
    for lineno, ln in enumerate(f, 1):
        line = ln.strip(