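AI Large Model Applications: Natural Language Processing (NLP), Xiaojiang Chatbot, BERT Sentence Vectors and Similarity, XLNet Sentence Vectors and Similarity, Text Classification (.zip resource, CSDN download)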

168 files in total, including 113 py, 25 txt, and 8 md files.

Uploaded 2024-07-15 11:16:57 · 2.18 MB ZIP · 85 views

Archive contents (168 files):

char 5B

sim_webank.csv 1007KB

dev.csv 2KB

train.csv 2KB

test.csv 1KB

people.dev 3KB

.gitignore 1KB

ccks_news_2020.json 17KB

params.json 285B

LICENSE 1KB

readme.md 8KB

README.md 5KB

README.md 4KB

README.md 912B

readme.md 849B

README.md 791B

readme.md 713B

README.md 415B

model_seq2seq.py 44KB

keras_bert_layer.py 36KB

enhance_simbert.py 27KB

enhance_roformer.py 20KB

keras_bert_classify_bi_lstm.py 19KB

sentence_sim_feature.py 16KB

bertWhiteTrain.py 15KB

TFServing_preprocess.py 14KB

keras_bert_classify_text_cnn.py 14KB

keras_bert_ner_bi_lstm.py 13KB

TFServing_save.py 12KB

distance_text_or_vec.py 11KB

chatbot_sentence_vec_by_word.py 9KB

data_utils.py 9KB

text_tools.py 8KB

enhance_eda.py 8KB

chatbot_sentence_vec_by_char.py 7KB

train_word_anti.py 7KB

enhance_marko.py 7KB

train_word_cg.py 7KB

enhance_word2vec.py 7KB

keyword_sim.py 6KB

translate_google.py 6KB

augment_mainpart.py 6KB

chatbot_fuzzy.py 6KB

extract_keras_bert_feature.py 6KB

extract_keras_xlnet_feature.py 6KB

train_char_anti.py 6KB

enhance_eda_v2.py 5KB

nmt_local.py 5KB

train_char_cg.py 5KB

mmr.py 5KB

word_sequence.py 5KB

path_config.py 4KB

extract_word_cg.py 4KB

normalization_util.py 4KB

extract_webank.py 4KB

layer_crf_bojone.py 4KB

translate_tencent_secret.py 4KB

keras_bert_embedding.py 4KB

cut_td_idf.py 4KB

keras_bert_embedding.py 4KB

chatbot_sentence_vec_by_bert.py 4KB

indexFaiss.py 4KB

keras_bert_layer.py 3KB

pred_word_cg.py 3KB

predict_word_anti.py 3KB

statistics_keyword.py 3KB

augment_constant.py 3KB

extract_char_cg.py 3KB

extract_char_webank.py 3KB

thread_generator.py 3KB

predict_char_cg.py 3KB

predict_char_anti.py 3KB

tet_bert_keras_sim.py 3KB

tet_xlnet_keras_sim.py 2KB

setup.py 2KB

bertWhiteConf.py 2KB

indexAnnoy.py 2KB

bertWhiteConf.py 2KB

word2vec_vector.py 2KB

distance_vec_TS_SS.py 2KB

bertWhiteTools.py 2KB

TFServing_postprocess.py 2KB

translate_translate.py 1KB

TFServing_tet_http.py 1KB

args.py 857B

layers_keras.py 837B

layers_keras.py 806B

args.py 767B

feature_config.py 722B

args.py 663B

tet_keras.py 541B

__init__.py 105B

# nlp_xiaojiang

# AugmentText
- Back-translation (works fairly well)
- EDA (synonym replacement, insertion, swap, and deletion) (works acceptably)
- HMM-Markov (poor quality)
- syntax (dependency parsing, syntactic parsing, syntax trees) (acceptable for simple sentences)
- seq2seq (deep-learning paraphrase generation; results are unsatisfactory. Most of the seq2seq code comes from [https://github.com/qhduan/just_another_seq2seq])
- Pretrained models (UNILM generation, back-translation with open-source models)

# ChatBot
- Retrieval-based ChatBot
  - Direct retrieval in the style of ES (for example with fuzzywuzzy); this only matches surface strings
  - Build sentence vectors and search the Q&A corpus; this can retrieve sentences that use synonyms
- Generative ChatBot (todo)
  - seq2seq
  - GAN

# ClassificationText
- bert + bi-lstm (keras): about 0.78~0.79 accuracy on the weBank Intelligent Customer Service Question Matching Competition
- bert + text-cnn (keras): about 0.78~0.79 accuracy on the weBank Intelligent Customer Service Question Matching Competition
- bert + r-cnn (keras): about 0.78~0.79 accuracy on the weBank Intelligent Customer Service Question Matching Competition
- bert + avt-cnn (keras): about 0.78~0.79 accuracy on the weBank Intelligent Customer Service Question Matching Competition

# Ner
- BERT named entity recognition (12-layer BERT embedding + bilstm + crf)
  - args.py (parameter configuration)
  - keras_bert_embedding.py (BERT embedding)
  - keras_bert_layer.py (layers, mainly CRF and NonMaskingLayer)
  - keras_bert_ner_bi_lstm.py (main script: model definition, data preprocessing, training, and prediction)
  - layer_crf_bojone.py (CRF layer, unused)

# FeatureProject
- BERT sentence vectors and text similarity
  - bert/extract_keras_bert_feature.py: extracts BERT sentence-vector features
  - bert/tet_bert_keras_sim.py: tests cosine similarity of BERT sentence vectors
- XLNet sentence vectors and text similarity
  - xlnet/extract_keras_xlnet_feature.py: extracts XLNet sentence-vector features
  - xlnet/tet_xlnet_keras_sim.py: tests cosine similarity of XLNet sentence vectors
- normalization_util: data normalization
  - 0-1 (min-max) normalization
  - mean normalization
  - sigmoid normalization
- sim feature (ML)
  - distance_text_or_vec: assorted text and vector distance measures
  - distance_vec_TS_SS: TS_SS distance between word vectors
  - cut_td_idf: combines the Xiaohuangji (小黄鸡) corpus with the gossip corpus
  - sentence_sim_feature: computes the similarity or distance between two texts, for example qq (question and question) or qa (question and answer)

# run (works on Windows 10 under PyCharm)
- 1. Create the tf-idf files and related data (step 2 requires step 1 first):
```
python cut_td_idf.py
```
- 2. Compute the various similarities between two sentences; a predefined pair is evaluated first, then you can enter your own (run step 1 first):
```
python sentence_sim_feature.py
```
- 3. Start chatbot_1 (fuzzy retrieval, no sentence vectors) (standalone):
```
python chatbot_fuzzy.py
```
- 4. Start chatbot_2 (sentence-vector retrieval, word level) (standalone):
```
python chatbot_sentence_vec_by_word.py
```
- 5. Start chatbot_3 (sentence-vector retrieval, character level) (standalone):
```
python chatbot_sentence_vec_by_char.py
```
- 6. Data augmentation (eda): `python enhance_eda.py`
- 7. Data augmentation (marko): `python enhance_marko.py`
- 8. Data augmentation (translate_account): `python translate_tencent_secret.py`
- 9. Data augmentation (translate_tools): `python translate_translate.py`
- 10. Data augmentation (translate_web): `python translate_google.py`
- 11. Data augmentation (augment_seq2seq): first run `python extract_char_webank.py` to generate the data, then `python train_char_anti.py`, then `python predict_char_anti.py`
- 12. Feature computation (bert) (extract features, then compute similarity):
```
python extract_keras_bert_feature.py
python tet_bert_keras_sim.py
```

# Data
- chinese_L-12_H-768_A-12 (Google's pretrained model). The GitHub project contains only part of the data; the rest is available at: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q extraction code: rket. Unzip it and it is ready to use.
- chinese_xlnet_mid_L-24_H-768_A-12 (Chinese XLNet trained by HIT (哈工大), mid, 24 layers, wiki corpus + general corpus)
  - Download: [https://github.com/ymcui/Chinese-PreTrained-XLNet](https://github.com/ymcui/Chinese-PreTrained-XLNet)
- chinese_vector. The GitHub project contains only part of the data; the rest is available at: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q extraction code: rket
  - A truncated subset of word2vec-trained word vectors (download the full set yourself for good results)
  - w2v_model_wiki_char.vec and w2v_model_wiki_word.vec are both partial; w2v_model_wiki_word.vec can be replaced with the file from [https://pan.baidu.com/s/14JP1gD7hcmsWdSpTvA3vKA](https://pan.baidu.com/s/14JP1gD7hcmsWdSpTvA3vKA)
- corpus. The GitHub project contains only part of the data; the rest is available at: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q extraction code: rket
  - ner (train, dev, test; People's Daily corpus)
  - webank (train, dev, test)
  - Xiaohuangji and gossip Q&A corpus (uncleaned), chicken_and_gossip.txt
  - WeBank and Alipay text-similarity competition data, sim_webank.csv
- sentence_vec_encode_char
  - 1.txt (the first 100000 sentence vectors built from character vectors)
- sentence_vec_encode_word
  - 1.txt (the first 100000 sentence vectors built from word vectors)
- tf_idf (tf-idf generated from chicken_and_gossip.txt)

# requirements.txt
- python_Levenshtein
  - Used for Levenshtein distance; my Python version is 3.6
  - Open the wheel source page: https://www.lfd.uci.edu/~gohlke/pythonlibs/
  - Find python_Levenshtein-0.12.0-cp36-cp36m-win_amd64.whl and download it
- pyemd
- pyhanlp
  - Download the dependency JPype1-0.6.3-cp36-cp36m-win_amd64.whl first

# References / Acknowledgements
* eda_chinese: [https://github.com/zhanlaoban/eda_nlp_for_Chinese](https://github.com/zhanlaoban/eda_nlp_for_Chinese)
* Subject-verb-object extractor: [https://github.com/hankcs/MainPartExtractor](https://github.com/hankcs/MainPartExtractor)
* HMM sentence generation: [https://github.com/takeToDreamLand/SentenceGenerate_byMarkov](https://github.com/takeToDreamLand/SentenceGenerate_byMarkov)
* Synonyms and related data: [https://github.com/fighting41love/funNLP/tree/master/data/](https://github.com/fighting41love/funNLP/tree/master/data/)
* NiuTrans (小牛翻译): [http://www.niutrans.com/index.html](http://www.niutrans.com/index.html)

# Other resources
* bert (keras): [https://github.com/CyberZHG/keras-bert](https://github.com/CyberZHG/keras-bert)
* NLP data augmentation roundup: [https://github.com/quincyliang/nlp-data-augmentation](https://github.com/quincyliang/nlp-data-augmentation)
* Zhihu thread on NLP data augmentation: [https://www.zhihu.com/question/305256736/answer/550873100](https://www.zhihu.com/question/305256736/answer/550873100)
* chatbot_seq2seq_seqGan (works fairly well): [https://github.com/qhduan/just_another_seq2seq](https://github.com/qhduan/just_another_seq2seq)
* Build-your-own chatbot tutorial: [https://github.com/warmheartli/ChatBotCourse](https://github.com/warmheartli/ChatBotCourse)
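
# Sketch: sentence-vector retrieval

The retrieval idea listed under ChatBot and FeatureProject (build sentence vectors, then rank the Q&A corpus by cosine similarity) can be sketched roughly as below. This is a minimal, dependency-free illustration and not the repository's code: the toy `WORD_VECS` table, the `sentence_vector`, `cosine`, and `retrieve` helpers, and the sample questions are all hypothetical, and averaging word vectors is only one simple, assumed way to pool them into a sentence vector.

```python
import math

DIM = 3  # dimensionality of the hypothetical toy vectors below

# Hypothetical toy word vectors; a real setup would load word2vec or BERT features.
WORD_VECS = {
    "open": [1.0, 0.0, 0.0],
    "account": [0.0, 1.0, 0.0],
    "card": [0.0, 0.9, 0.1],
    "loan": [0.0, 0.0, 1.0],
    "repay": [0.1, 0.0, 0.9],
}


def sentence_vector(tokens):
    """Average the vectors of known tokens (an assumed, simple pooling choice)."""
    known = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    if not known:
        return [0.0] * DIM
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_tokens, corpus):
    """Return (score, question) for the corpus question closest to the query."""
    q = sentence_vector(query_tokens)
    return max((cosine(q, sentence_vector(toks)), " ".join(toks)) for toks in corpus)


corpus = [["open", "account"], ["repay", "loan"]]
score, best = retrieve(["open", "card"], corpus)
print(best, round(score, 3))
```

Because "card" and "account" have nearby vectors in this toy table, the query matches "open account" even though it shares only one literal word with it, which is the advantage over purely literal matching that the ChatBot section describes.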
