Text Language Detection with langid

This article introduces langid, a fast language identification library pre-trained on 97 languages. It is robust to domain-specific features, simple to deploy, and this article walks through how to use it and how to train your own model.


1. langid

GitHub source code

1.1 Features

(1) Fast
(2) Pre-trained over a large number of languages (currently 97)
(3) Not sensitive to domain-specific features (e.g. HTML/XML markup)
(4) Single .py file with minimal dependencies
(5) Deployable as a web service

1.2 Languages and Data

1.2.1 Supported Languages (currently 97)

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

1.2.2 Training Data Sources (5)

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

1.3 Usage

1.3.1 Running from the Command Line

langid.py command-line parameters:

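The full option list is printed by langid.py --help. Per the project README, commonly used options include (a non-exhaustive summary):

  -n          normalize confidence scores to proper probabilities
  -l LANGS    restrict detection to a comma-separated set of languages, e.g. -l it,ru
  -m MODEL    load a custom model file
  -b          batch mode: classify a list of files
  --line      classify each line of input separately
  -s          launch langid.py as a web service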

(1) With the package installed (pip install langid), run langid and type a test sentence at the prompt, as shown:

Note: the negative value in ('ru', -549.8846204280853) is not a probability but an unnormalized log score; run langid -n to convert the scores to probabilities.

(2) Without installing the package, run python langid.py directly; the interactive interface is the same.
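
A typical interactive session looks like this (the actual scores depend on the model and input, so only placeholders are shown):

$ pip install langid
$ langid
>>> I do not speak english
# prints a (language, log score) tuple, e.g. ('en', <negative log score>)
$ langid -n
>>> I do not speak english
# prints the same tuple with the score normalized to a probability in [0, 1]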

1.3.2 Importing in a Python Program

(1) Prediction
import langid
# Identify the language; returns (language, confidence), where confidence is a log-probability
print("classify", "\t", langid.classify("I do not speak english"))
# Restrict the candidate languages; set_languages(None) restores the full default set
langid.set_languages(['it', 'ru'])
print("set_languages", "\t", langid.classify("I do not speak english"))
# Rank all candidate languages by confidence
print("rank", "\t", langid.rank("I do not speak english"))


(2) Prediction with Probability Normalization

On the command line, passing -n is enough, but in Python you need to instantiate your own LanguageIdentifier, as follows:

# Probability Normalization
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print("Probability Normalization","\t",identifier.classify("I do not speak english"))

Usage note: normalization costs extra time; across the 97 supported languages, classifying a short sentence went from 2.38 ms to 3.05 ms in the author's test.
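
The overhead is easy to measure yourself; a minimal timing sketch with timeit (absolute numbers will differ by machine):

import timeit
from langid.langid import LanguageIdentifier, model

raw = LanguageIdentifier.from_modelstring(model, norm_probs=False)
normed = LanguageIdentifier.from_modelstring(model, norm_probs=True)
sentence = "I do not speak english"
n = 1000
for name, ident in [("raw", raw), ("normalized", normed)]:
    # Average per-call latency over n runs, in milliseconds
    per_call = timeit.timeit(lambda: ident.classify(sentence), number=n) / n
    print(name, round(per_call * 1000, 2), "ms")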

1.4 Training a Model

1.4.1 Overview

langid also ships tools for training your own model, based on a machine-learning Naive Bayes classifier. Run train.py, or follow the step-by-step scripts in the train directory, summarized below:
index.py → tokenize.py → DFfeatureselect.py → IGweight.py → LDfeatureselect.py → scanner.py → NBtrain.py
Among these steps, the information gain (IG) computation is by far the most expensive, accounting for over 90% of total training time.

1.4.2 Steps

(1) Data preparation
1. A monolingual corpus, with each document in its own file.
2. The directory layout is:
			./corpus/domain1/en/File1.txt    or
			./corpus/domainX/en/001-file.xml
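
A minimal sketch of preparing such a layout from in-memory strings (the domain names, language codes, and texts here are placeholders):

import os

samples = {  # {domain: {language_code: document_text}}
    "domain1": {"en": "This is an English document.",
                "fr": "Ceci est un document en français."},
}
for domain, langs in samples.items():
    for lang, text in langs.items():
        path = os.path.join("corpus", domain, lang)
        os.makedirs(path, exist_ok=True)
        # One document per file, under ./corpus/<domain>/<language>/
        with open(os.path.join(path, "File1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
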
(2) Training

1. index.py

build a list of training documents

python index.py ./corpus

This will create a directory corpus.model, and produces a list of paths to documents in the corpus, with their associated language and domain.

2. tokenize.py

tokenize the files using the default byte n-gram tokenizer

python tokenize.py corpus.model

This runs each file through the tokenizer, tabulating the frequency of each token according to language and domain. This information is distributed into buckets according to a hash of the token, such that all the counts for any given token will be in the same bucket.
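
This is not langid's actual implementation, but a minimal sketch of what byte n-gram tokenization with hash bucketing looks like (max_order and NUM_BUCKETS are illustrative values):

from collections import Counter

def byte_ngrams(text, max_order=4):
    # Yield every byte n-gram of order 1..max_order
    data = text.encode("utf-8")
    for n in range(1, max_order + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

NUM_BUCKETS = 8
buckets = [Counter() for _ in range(NUM_BUCKETS)]
for token in byte_ngrams("Hello world"):
    # A given token always hashes to the same bucket, so all its counts stay together
    buckets[hash(token) % NUM_BUCKETS][token] += 1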

3. DFfeatureselect.py

identify the most frequent tokens by document frequency

python DFfeatureselect.py corpus.model

This sums up the frequency counts per token in each bucket, and produces a list of the highest-df tokens for use in the IG calculation stage. Note that this implementation of DFfeatureselect assumes byte n-gram tokenization, and will thus select a fixed number of features per ngram order. If tokenization is replaced with a word-based tokenizer, this should be replaced accordingly.
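
Conceptually, the DF stage counts in how many documents each token appears and keeps the top tokens per n-gram order; a rough sketch (per_order is an illustrative cutoff, not langid's default):

from collections import Counter

def top_df_features(doc_token_sets, per_order=400):
    # doc_token_sets: one set of byte n-gram tokens per document
    df = Counter()
    for tokens in doc_token_sets:
        df.update(tokens)  # presence per document -> document frequency
    by_order = {}
    for token, count in df.items():
        by_order.setdefault(len(token), []).append((count, token))
    selected = []
    for order, items in sorted(by_order.items()):
        items.sort(reverse=True)
        # Keep a fixed number of highest-DF tokens for each n-gram order
        selected.extend(token for _, token in items[:per_order])
    return selected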

4. IGweight.py

compute the IG weights of each of the top features by DF. This is computed separately for domain and for language

python IGweight.py -d corpus.model # domain
python IGweight.py -lb corpus.model  # language
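
As a reminder of what is being computed: the information gain of a binary feature "token t occurs in the document" with respect to a label (language or domain) is the drop in label entropy once you condition on the feature. A self-contained sketch:

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def information_gain(docs, token):
    # docs: list of (label, token_set) pairs; the label is a language or a domain
    prior = Counter(label for label, _ in docs)
    present = Counter(label for label, toks in docs if token in toks)
    absent = Counter(label for label, toks in docs if token not in toks)
    p = sum(present.values()) / len(docs)
    conditional = p * entropy(present) + (1 - p) * entropy(absent)
    return entropy(prior) - conditional
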
5. LDfeatureselect.py

Based on the IG weights, we compute the LD score for each token:

python LDfeatureselect.py corpus.model

This produces the final list of LD features to use for building the NB model.
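
Roughly, per the langid.py papers, the LD (language/domain) score rewards features that are informative about language but not about domain:

	LD(t) = IG_language(t) - IG_domain(t)

so tokens whose distribution varies with the writing domain rather than with the language are filtered out.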

6. scanner.py

assemble the scanner

python scanner.py corpus.model

The scanner is a compiled DFA over the set of features that can be used to count the number of times each of the features occurs in a document in a single pass over the document. This DFA is built using Aho-Corasick string matching.
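
The compiled DFA is an optimization; functionally it is equivalent to this naive single-pass counter (a stand-in sketch, not the Aho-Corasick construction langid actually uses):

def count_features(text, features):
    # features: iterable of byte strings (the selected LD features)
    data = text.encode("utf-8")
    feature_set = set(features)
    lengths = sorted({len(f) for f in feature_set})
    counts = dict.fromkeys(feature_set, 0)
    for i in range(len(data)):
        # At each byte offset, test every feature length for a match
        for n in lengths:
            chunk = data[i:i + n]
            if chunk in feature_set:
                counts[chunk] += 1
    return counts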

7. NBtrain.py

learn the actual Naive Bayes parameters

python NBtrain.py corpus.model
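
What NBtrain.py produces is, in essence, the log-priors and smoothed log-likelihoods of a multinomial Naive Bayes model over the feature counts; a generic sketch (alpha is an illustrative smoothing constant, not langid's setting):

import math
from collections import Counter, defaultdict

def nb_train(docs, vocab, alpha=1.0):
    # docs: list of (language, {token: count}); Laplace-smoothed multinomial NB
    class_freq = Counter(lang for lang, _ in docs)
    token_counts = defaultdict(Counter)
    for lang, counts in docs:
        token_counts[lang].update(counts)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_freq.items()}
    log_likelihood = {}
    for c in class_freq:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((token_counts[c][t] + alpha) / total)
                             for t in vocab}
    return log_prior, log_likelihood
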
(3) Using the new model

The newly trained model lives at:
	./corpus.model/model
To use it, either:
	1. run python langid.py -m ./corpus.model/model, or
	2. replace the model string embedded in langid.py.
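
In Python, a trained model file can also be loaded directly via from_modelpath in langid.langid (a sketch, assuming the model trained above):

from langid.langid import LanguageIdentifier

# Load the custom model and normalize confidence scores to probabilities
identifier = LanguageIdentifier.from_modelpath('./corpus.model/model', norm_probs=True)
print(identifier.classify("I do not speak english"))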