Text Language Detection with langid

This article introduces langid, a fast language identification library pre-trained on 97 languages. It is robust to domain-specific features, simple to deploy, and this article walks through how to use it and how to train your own model.


1. langid

GitHub source code

1.1 Features

(1) Fast
(2) Pre-trained over a large number of languages (currently 97)
(3) Not sensitive to domain-specific features (e.g. HTML/XML markup)
(4) Single .py file with minimal dependencies
(5) Deployable as a web service

1.2 Languages and Data

1.2.1 Supported Languages (currently 97)

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

1.2.2 Training Data Sources (5)

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

1.3 Usage

1.3.1 Running from the Command Line

langid.py command-line parameters:

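The full option list is printed by langid.py --help. Per the project README, commonly used options include (a non-exhaustive summary):

  -n          normalize confidence scores to proper probabilities
  -l LANGS    restrict detection to a comma-separated set of languages, e.g. -l it,ru
  -m MODEL    load a custom model file
  -b          batch mode: classify a list of files
  --line      classify each line of input separately
  -s          launch langid.py as a web service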

(1) With the package installed (pip install langid), run langid and type a test sentence at the prompt, as shown:

Note: the negative value in ('ru', -549.8846204280853) is not a probability but an unnormalized log score; run langid -n to convert the scores to probabilities.

(2) Without installing the package, run python langid.py directly; the interactive interface is the same.
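
A typical interactive session looks like this (the actual scores depend on the model and input, so only placeholders are shown):

$ pip install langid
$ langid
>>> I do not speak english
# prints a (language, log score) tuple, e.g. ('en', <negative log score>)
$ langid -n
>>> I do not speak english
# prints the same tuple with the score normalized to a probability in [0, 1]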

1.3.2 Importing in a Python Program

(1) Prediction
import langid
# Identify the language; returns (language, confidence), where confidence is a log-probability
print("classify", "\t", langid.classify("I do not speak english"))
# Restrict the candidate languages; set_languages(None) restores the full default set
langid.set_languages(['it', 'ru'])
print("set_languages", "\t", langid.classify("I do not speak english"))
# Rank all candidate languages by confidence
print("rank", "\t", langid.rank("I do not speak english"))


(2) Prediction with Probability Normalization

On the command line, passing -n is enough, but in Python you need to instantiate your own LanguageIdentifier, as follows:

# Probability Normalization
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print("Probability Normalization","\t",identifier.classify("I do not speak english"))

Usage note: normalization costs extra time; across the 97 supported languages, classifying a short sentence went from 2.38 ms to 3.05 ms in the author's test.
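
The overhead is easy to measure yourself; a minimal timing sketch with timeit (absolute numbers will differ by machine):

import timeit
from langid.langid import LanguageIdentifier, model

raw = LanguageIdentifier.from_modelstring(model, norm_probs=False)
normed = LanguageIdentifier.from_modelstring(model, norm_probs=True)
sentence = "I do not speak english"
n = 1000
for name, ident in [("raw", raw), ("normalized", normed)]:
    # Average per-call latency over n runs, in milliseconds
    per_call = timeit.timeit(lambda: ident.classify(sentence), number=n) / n
    print(name, round(per_call * 1000, 2), "ms")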

1.4 Training a Model

1.4.1 Overview

langid also ships tools for training your own model, based on a machine-learning Naive Bayes classifier. Run train.py, or follow the step-by-step scripts in the train directory, summarized below:
index.py → tokenize.py → DFfeatureselect.py → IGweight.py → LDfeatureselect.py → scanner.py → NBtrain.py
Among these steps, the information gain (IG) computation is by far the most expensive, accounting for over 90% of total training time.

1.4.2 Steps

(1) Data preparation
1. A monolingual corpus, with each document in its own file.
2. The directory layout is:
			./corpus/domain1/en/File1.txt    or
			./corpus/domainX/en/001-file.xml
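
A minimal sketch of preparing such a layout from in-memory strings (the domain names, language codes, and texts here are placeholders):

import os

samples = {  # {domain: {language_code: document_text}}
    "domain1": {"en": "This is an English document.",
                "fr": "Ceci est un document en français."},
}
for domain, langs in samples.items():
    for lang, text in langs.items():
        path = os.path.join("corpus", domain, lang)
        os.makedirs(path, exist_ok=True)
        # One document per file, under ./corpus/<domain>/<language>/
        with open(os.path.join(path, "File1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
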
(2) Training

1. index.py

build a list of training documents

python index.py ./corpus

This will create a directory corpus.model, and produces a list of paths to documents in the corpus, with their associated language and domain.

2. tokenize.py

tokenize the files using the default byte n-gram tokenizer

python tokenize.py corpus.model

This runs each file through the tokenizer, tabulating the frequency of each token according to language and domain. This information is distributed into buckets according to a hash of the token, such that all the counts for any given token will be in the same bucket.
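
This is not langid's actual implementation, but a minimal sketch of what byte n-gram tokenization with hash bucketing looks like (max_order and NUM_BUCKETS are illustrative values):

from collections import Counter

def byte_ngrams(text, max_order=4):
    # Yield every byte n-gram of order 1..max_order
    data = text.encode("utf-8")
    for n in range(1, max_order + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

NUM_BUCKETS = 8
buckets = [Counter() for _ in range(NUM_BUCKETS)]
for token in byte_ngrams("Hello world"):
    # A given token always hashes to the same bucket, so all its counts stay together
    buckets[hash(token) % NUM_BUCKETS][token] += 1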

3. DFfeatureselect.py

identify the most frequent tokens by document frequency

python DFfeatureselect.py corpus.model

This sums up the frequency counts per token in each bucket, and produces a list of the highest-df tokens for use in the IG calculation stage. Note that this implementation of DFfeatureselect assumes byte n-gram tokenization, and will thus select a fixed number of features per ngram order. If tokenization is replaced with a word-based tokenizer, this should be replaced accordingly.
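
Conceptually, the DF stage counts in how many documents each token appears and keeps the top tokens per n-gram order; a rough sketch (per_order is an illustrative cutoff, not langid's default):

from collections import Counter

def top_df_features(doc_token_sets, per_order=400):
    # doc_token_sets: one set of byte n-gram tokens per document
    df = Counter()
    for tokens in doc_token_sets:
        df.update(tokens)  # presence per document -> document frequency
    by_order = {}
    for token, count in df.items():
        by_order.setdefault(len(token), []).append((count, token))
    selected = []
    for order, items in sorted(by_order.items()):
        items.sort(reverse=True)
        # Keep a fixed number of highest-DF tokens for each n-gram order
        selected.extend(token for _, token in items[:per_order])
    return selected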

4. IGweight.py

compute the IG weights of each of the top features by DF. This is computed separately for domain and for language

python IGweight.py -d corpus.model # domain
python IGweight.py -lb corpus.model  # language
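
As a reminder of what is being computed: the information gain of a binary feature "token t occurs in the document" with respect to a label (language or domain) is the drop in label entropy once you condition on the feature. A self-contained sketch:

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def information_gain(docs, token):
    # docs: list of (label, token_set) pairs; the label is a language or a domain
    prior = Counter(label for label, _ in docs)
    present = Counter(label for label, toks in docs if token in toks)
    absent = Counter(label for label, toks in docs if token not in toks)
    p = sum(present.values()) / len(docs)
    conditional = p * entropy(present) + (1 - p) * entropy(absent)
    return entropy(prior) - conditional
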
5. LDfeatureselect.py

Based on the IG weights, we compute the LD score for each token:

python LDfeatureselect.py corpus.model

This produces the final list of LD features to use for building the NB model.
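
Roughly, per the langid.py papers, the LD (language/domain) score rewards features that are informative about language but not about domain:

	LD(t) = IG_language(t) - IG_domain(t)

so tokens whose distribution varies with the writing domain rather than with the language are filtered out.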

6. scanner.py

assemble the scanner

python scanner.py corpus.model

The scanner is a compiled DFA over the set of features that can be used to count the number of times each of the features occurs in a document in a single pass over the document. This DFA is built using Aho-Corasick string matching.
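
The compiled DFA is an optimization; functionally it is equivalent to this naive single-pass counter (a stand-in sketch, not the Aho-Corasick construction langid actually uses):

def count_features(text, features):
    # features: iterable of byte strings (the selected LD features)
    data = text.encode("utf-8")
    feature_set = set(features)
    lengths = sorted({len(f) for f in feature_set})
    counts = dict.fromkeys(feature_set, 0)
    for i in range(len(data)):
        # At each byte offset, test every feature length for a match
        for n in lengths:
            chunk = data[i:i + n]
            if chunk in feature_set:
                counts[chunk] += 1
    return counts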

7. NBtrain.py

learn the actual Naive Bayes parameters

python NBtrain.py corpus.model
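
What NBtrain.py produces is, in essence, the log-priors and smoothed log-likelihoods of a multinomial Naive Bayes model over the feature counts; a generic sketch (alpha is an illustrative smoothing constant, not langid's setting):

import math
from collections import Counter, defaultdict

def nb_train(docs, vocab, alpha=1.0):
    # docs: list of (language, {token: count}); Laplace-smoothed multinomial NB
    class_freq = Counter(lang for lang, _ in docs)
    token_counts = defaultdict(Counter)
    for lang, counts in docs:
        token_counts[lang].update(counts)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_freq.items()}
    log_likelihood = {}
    for c in class_freq:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((token_counts[c][t] + alpha) / total)
                             for t in vocab}
    return log_prior, log_likelihood
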
(3) Using the new model

The newly trained model lives at:
	./corpus.model/model
To use it, either:
	1. run python langid.py -m ./corpus.model/model, or
	2. replace the model string embedded in langid.py.
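
In Python, a trained model file can also be loaded directly via from_modelpath in langid.langid (a sketch, assuming the model trained above):

from langid.langid import LanguageIdentifier

# Load the custom model and normalize confidence scores to probabilities
identifier = LanguageIdentifier.from_modelpath('./corpus.model/model', norm_probs=True)
print(identifier.classify("I do not speak english"))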