For a walkthrough of installation and usage, see https://blog.csdn.net/jingyi130705008/article/details/137153444. To freeze the pretrained word vectors, simply comment out the following lines in model.cc before building:
for (auto it = input.cbegin(); it != input.cend(); ++it) {
  wi_->addRow(grad_, *it, 1.0);
}
Note: if make fails with a C++ standard error, try changing c++17 to c++11 in the Makefile, like so:
CXXFLAGS = -pthread -std=c++11 -march=native
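The edit can also be scripted. Here is a minimal sketch that applies the substitution with sed to a scratch copy of the Makefile; when building for real, point it at fastText-IncrementalTraining/Makefile instead:

```shell
# Work on a scratch copy so the sketch is self-contained.
mkdir -p /tmp/ft_demo
printf 'CXXFLAGS = -pthread -std=c++17 -march=native\n' > /tmp/ft_demo/Makefile

# The same one-line substitution applies to fastText-IncrementalTraining/Makefile.
sed -i.bak 's/-std=c++17/-std=c++11/' /tmp/ft_demo/Makefile
cat /tmp/ft_demo/Makefile
```

The -i.bak form keeps a backup of the original Makefile and works on both GNU and BSD sed.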
Shell script for incremental training of word vectors
#!/bin/bash
# Set the model hyperparameters
word_dim=100
wordNgrams=1
lr=0.01
epoch=1
model_name=model_${word_dim}_${epoch}_${wordNgrams}_${lr}.vector
# Train the initial model
pre_file_num=00
./fastText-IncrementalTraining/fasttext cbow -input ./pretrain/data/data_00 -output ./models/${model_name}_${pre_file_num} -dim ${word_dim} -wordNgrams ${wordNgrams} -epoch ${epoch} -lr ${lr} -nepoch 0 -thread 10 -verbose 1
# Incrementally train on top of the previous round's model
for ((i=1; i<=10; i++))
do
file_num=$(printf "%02d" ${i})
echo start data ${file_num}
input=./pretrain/data/data_${file_num}
input_model=./models/${model_name}_${pre_file_num}.bin
output_model=./models/${model_name}_${file_num}
pre_file_num=${file_num}
./fastText-IncrementalTraining/fasttext cbow -input ${input} -output ${output_model} -dim ${word_dim} -wordNgrams ${wordNgrams} -epoch ${epoch} -lr ${lr} -thread 10 -verbose 1 -inputModel ${input_model} -nepoch 1
# Remove the previous round's model (optional)
rm -rf ${input_model}*
echo finished
done
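The loop's model chaining can be hard to follow at a glance: round i loads the .bin written by round i-1 (tracked through pre_file_num) and writes its own output, which the next round then loads. A dry-run sketch of just that bookkeeping, with no fasttext call:

```shell
# Echo which model each round would load and write (first three rounds only).
chain_models() {
  local pre=00 cur i
  for i in 1 2 3; do
    cur=$(printf '%02d' "$i")
    echo "round ${cur}: loads model_${pre}.bin, writes model_${cur}.bin"
    pre=${cur}
  done
}
chain_models
```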
Notes:
1. fastText-IncrementalTraining is the directory of the C++ fastText build installed above;
2. pretrain/data holds multiple files of preprocessed (word-segmented) text data.
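How the numbered data_00 ... data_10 shards get produced is not covered above; one option (an assumption on my part, not part of the original setup) is GNU split with numeric suffixes, sketched here on a toy corpus:

```shell
# Build a toy "tokenized" corpus, then shard it into data_00 ... data_10.
mkdir -p /tmp/pretrain/data
seq 1 110 | sed 's/^/token /' > /tmp/corpus.txt
# -d: numeric suffixes starting at 00; -n l/11: 11 shards, split on line
# boundaries so no line is cut in half (GNU coreutils only).
split -d -n l/11 /tmp/corpus.txt /tmp/pretrain/data/data_
ls /tmp/pretrain/data
```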
References:
https://github.com/facebookresearch/fastText/issues/681