NLP (III): n-gram Language Models

swy_swy_swy

Category: NLP. Tags: natural language processing, language models, artificial intelligence

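Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.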

Article link: https://blog.csdn.net/swy_swy_swy/article/details/129399243

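This article introduces the concept of the n-gram language model, trains unigram, bigram, and trigram models with the nltk library, discusses how to compute perplexity, and shows how to use the trained models to generate simulated text.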

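What is an n-gram language model?

This article explains it very well: 自然语言处理中N-Gram模型介绍 (An Introduction to N-Gram Models in Natural Language Processing)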

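Training n-gram language models with nltk

Here we train unigram, bigram, and trigram models. The data are the tweets preprocessed in the previous part of this series.
Note that an n-gram model is a sliding-window model, so each sentence must be padded at its start and end.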
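
To make the padding and sliding window concrete, here is a minimal, library-free sketch. The helper names `pad_tokens` and `ngrams` are hypothetical and exist only for illustration; the pad symbols `<s>` and `</s>` are assumed to match NLTK's defaults, and the toy sentence is made up.

```python
def pad_tokens(tokens, n, left="<s>", right="</s>"):
    """Pad a token list with n-1 boundary symbols on each side."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)


def ngrams(tokens, n):
    """Slide a window of width n over the tokens and collect each window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ["i", "like", "fish"]   # toy sentence, not from the tweet data
padded = pad_tokens(sentence, 2)   # bigram model: one pad symbol per side
bigrams = ngrams(padded, 2)

print(padded)
print(bigrams)
```

Without padding, the first and last words would never appear as the predicted word or as the context of a sentence boundary, so the model could not learn how sentences start or end.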

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

def trainNGramAddOneSmoothing(trainData, ngram):
  # Input: a list of tweet sentences, each element is a list of tokens; n for the n-gram model
  # Output: an n-gram model with add-one smoothing trained on the input data.

  # Pad every sentence and lazily generate all 1..n-grams plus the flat vocabulary stream.
  train, vocab = padded_everygram_pipeline(ngram, trainData)

  # Laplace implements add-one smoothing; MLE would leave unseen n-grams at probability zero.
  lm = Laplace(ngram)
  lm.fit(train, vocab)
  return lm


unigramFish = trainNGramAddOneSmoothing(preprocessedFishTrain, 1)
bigramFish = trainNGramAddOneSmoothing(preprocessedFishTrain, 2)
trigramFish = trainNGramAddOneSmoothing(preprocessedFishTrain, 3)
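
For intuition about what add-one smoothing does to the counts, the following is a small, library-free sketch for the bigram case. It is a simplification rather than NLTK's implementation: the function name `add_one_bigram_prob` is hypothetical, the vocabulary is taken to be every symbol seen in the padded toy corpus, and there is no special unknown-word token.

```python
from collections import Counter


def add_one_bigram_prob(corpus, word, prev, left="<s>", right="</s>"):
    """Return P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)."""
    bigram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sent in corpus:
        padded = [left] + list(sent) + [right]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigram_counts[(a, b)] += 1   # how often b follows a
            context_counts[a] += 1       # how often a is used as a context
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


toy_corpus = [["i", "like", "fish"], ["i", "like", "cats"]]   # made-up data
p_seen = add_one_bigram_prob(toy_corpus, "fish", "like")      # bigram occurs once
p_unseen = add_one_bigram_prob(toy_corpus, "i", "like")       # bigram never occurs

print(p_seen, p_unseen)
```

The unseen bigram still receives a small non-zero probability, which is what keeps perplexity finite when the test set contains word pairs that never appeared in training.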

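Computing perplexity

Next we analyze the trained models. One metric is the model's perplexity on the test set. For computing perplexity with nltk, see the official documentation: NLTK :: nltk.lm package.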

def computePerplexity(model,testData):
  # Input: your model; the testing data

  # Output: average perplexity of the model on your testing data.
  scoreSum =
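
The listing above breaks off in the source after `scoreSum =`, so the author's implementation is not shown. As a rough sketch of the quantity involved, and not a reconstruction of the missing code, perplexity can be computed from the probabilities a model assigns to each n-gram of a sentence and then averaged over sentences. The function names `perplexity` and `average_perplexity` and the input format (one list of probabilities per sentence) are assumptions made for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity = 2 ** (average negative log2 probability)."""
    if not probabilities:
        raise ValueError("need at least one probability")
    if any(p == 0 for p in probabilities):
        return float("inf")   # a zero-probability n-gram makes perplexity infinite
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy


def average_perplexity(per_sentence_probs):
    """Average the per-sentence perplexities over a test set."""
    scores = [perplexity(probs) for probs in per_sentence_probs]
    return sum(scores) / len(scores)


# A model that is uniformly unsure between 4 options has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Lower perplexity means the model is less surprised by the test data. The infinite case is why smoothing matters when evaluating on held-out text.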
