For a walkthrough of installation and usage, see https://blog.csdn.net/jingyi130705008/article/details/137153444. To freeze the pretrained word vectors, simply comment out the following lines in model.cc before building:
for (auto it = input.cbegin(); it != input.cend(); ++it) {
  wi_->addRow(grad_, *it, 1.0);
}
Note: if make fails with a C++ standard error, try changing c++17 to c++11 in the Makefile, like so:
CXXFLAGS = -pthread -std=c++11 -march=native
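The edit can also be scripted. Here is a minimal sketch that applies the substitution with sed to a scratch copy of the Makefile; when building for real, point it at fastText-IncrementalTraining/Makefile instead:

```shell
# Work on a scratch copy so the sketch is self-contained.
mkdir -p /tmp/ft_demo
printf 'CXXFLAGS = -pthread -std=c++17 -march=native\n' > /tmp/ft_demo/Makefile

# The same one-line substitution applies to fastText-IncrementalTraining/Makefile.
sed -i.bak 's/-std=c++17/-std=c++11/' /tmp/ft_demo/Makefile
cat /tmp/ft_demo/Makefile
```

The -i.bak form keeps a backup of the original Makefile and works on both GNU and BSD sed.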
Shell script for incremental training of word vectors
#!/bin/bash
# Set the model hyperparameters
word_dim=100
wordNgrams=1
lr=0.01
epoch=1
model_name=model_${word_dim}_${epoch}_${wordNgrams}_${lr}.vector
# Train the initial model
pre_file_num=00
./fastText-IncrementalTraining/fasttext cbow -input ./pretrain/data/data_00 -output ./models/${model_name}_${pre_file_num} -dim ${word_dim} -wordNgrams ${wordNgrams} -epoch ${epoch} -lr ${lr} -nepoch 0 -thread 10 -verbose 1
# Incrementally train on top of the previous round's model
for ((i=1; i<=10; i++))
do
file_num=$(printf "%02d" ${i})
echo start data ${file_num}
input=./pretrain/data/data_${file_num}
input_model=./models/${model_name}_${pre_file_num}.bin
output_model=./models/${model_name}_${file_num}
pre_file_num=${file_num}
./fastText-IncrementalTraining/fasttext cbow -input ${input} -output ${output_model} -dim ${word_dim} -wordNgrams ${wordNgrams} -epoch ${epoch} -lr ${lr} -thread 10 -verbose 1 -inputModel ${input_model} -nepoch 1
# Remove the previous round's model (optional)
rm -rf ${input_model}*
echo finished
done
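The loop's model chaining can be hard to follow at a glance: round i loads the .bin written by round i-1 (tracked through pre_file_num) and writes its own output, which the next round then loads. A dry-run sketch of just that bookkeeping, with no fasttext call:

```shell
# Echo which model each round would load and write (first three rounds only).
chain_models() {
  local pre=00 cur i
  for i in 1 2 3; do
    cur=$(printf '%02d' "$i")
    echo "round ${cur}: loads model_${pre}.bin, writes model_${cur}.bin"
    pre=${cur}
  done
}
chain_models
```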
Notes:
1. fastText-IncrementalTraining is the directory of the C++ fastText build installed above;
2. pretrain/data holds multiple files of preprocessed (word-segmented) text data.
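How the numbered data_00 ... data_10 shards get produced is not covered above; one option (an assumption on my part, not part of the original setup) is GNU split with numeric suffixes, sketched here on a toy corpus:

```shell
# Build a toy "tokenized" corpus, then shard it into data_00 ... data_10.
mkdir -p /tmp/pretrain/data
seq 1 110 | sed 's/^/token /' > /tmp/corpus.txt
# -d: numeric suffixes starting at 00; -n l/11: 11 shards, split on line
# boundaries so no line is cut in half (GNU coreutils only).
split -d -n l/11 /tmp/corpus.txt /tmp/pretrain/data/data_
ls /tmp/pretrain/data
```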
References:
https://github.com/facebookresearch/fastText/issues/681