NLP Experiment: LDA Topic Model

Latest recommended article published on 2022-08-31 18:10:20

License: CC 4.0 BY-SA

Tags: #python #LDA #topic-model

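This article covers the principle and application of the LDA topic model, using an example to show how to tokenize text, build a dictionary, and fit a topic model with the gensim library. It walks through the whole process from data preprocessing to model training and shares the fix for an error encountered along the way.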

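Principle

I won't go through the theory in detail here; a few good blog posts already cover it.

My understanding is fairly simple: because a word can have several meanings, LDA links words to latent topics and infers those links, for example with Gibbs sampling. The result is dimensionality reduction: each document is reduced from a vector over the vocabulary to a vector over topics. (Note that gensim's LdaModel, used below, fits the model with online variational Bayes rather than Gibbs sampling.)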
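To make the Gibbs sampling idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is a toy illustration only, not what gensim runs, and everything in it is an assumption made for the example: the function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, the iteration count, and the tiny corpus of integer word ids.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []
    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Each document is now a num_topics-dimensional vector of topic proportions.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

# Hypothetical corpus: 4 documents over a vocabulary of 4 word ids.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
theta = gibbs_lda(docs, num_topics=2, vocab_size=4)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is the reduced representation of one document: two topic proportions instead of four word counts.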

Implementation with gensim

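In short, implementing LDA involves a few main steps:

Tokenize the text
Build the dictionary
doc2bow: convert each document's tokens into a sparse bag-of-words vector
Call the LDA model in gensim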
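The dictionary and doc2bow steps can be illustrated in plain Python. This is only a sketch of the idea, not gensim's implementation: `build_dictionary` and `doc_to_bow` are hypothetical helpers, the two toy documents are made up, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two already-tokenized toy documents (hypothetical data).
docs = [["topic", "model", "topic"], ["text", "model"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)  # {'topic': 0, 'model': 1, 'text': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The vectors are sparse because only tokens that actually occur in a document are listed, each with its count.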

The complete code is as follows:

# Import libraries
import os
import jieba  # Chinese word segmentation
from gensim import corpora, models  # gensim's dictionary/corpus and topic modeling modules

# Build the stop word set
def get_custom_stopwords(stop_words_file):
    with open(stop_words_file, encoding='utf-8') as f:
        # One stop word per line; skip blank lines
        return {line.strip() for line in f if line.strip()}

# Read every file in the folder; each file becomes one document
print('Reading files and extracting content...')
all_content = []  # one token list per file
for root, dirs, files in os.walk('../清洗过'):  # walk the tree: current directory, subdirectories, files
    for file in files:
        file_name = os.path.join(root, file)  # full path of the file
        with open(file_name, encoding='utf-8') as f:  # open in read-only text mode
            data = f.read()
        # jieba.cut returns a generator, so materialize it as a list
        all_content.append(list(jieba.cut(data)))

# To read a single file instead:
# with open('../result.txt', encoding='utf-8') as f:
#     all_content = [list(jieba.cut(f.read()))]

# Remove stop words and whitespace-only tokens from each document
stop_words_file = "E:/python练习文件/停用词/hit_stopwords.txt"
stopwords = set(get_custom_stopwords(stop_words_file))
docs = [[w for w in doc if w.strip() and w not in stopwords] for doc in all_content]

for doc in docs:
    print(doc)

# Build the dictionary: Dictionary expects a list of documents, each a list of tokens
dictionary = corpora.Dictionary(docs)
print('{:*^60}'.format('token and word mapping preview:'))
for i, w in list(dictionary.items())[:5]:  # first 5 (id, token) pairs
    print('token:%s -- word:%s' % (i, w))

# Build the corpus: one bag-of-words vector per document
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('{:*^60}'.format('bag of words review:'))  # print the first document of the corpus
print(corpus[0])

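# Train the LDA model with 3 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

# Print all topics, showing 5 words per topic
for topic in lda.print_topics(num_words=5):
    print(topic)

# Topic inference: inference() returns (gamma, sstats); each row of gamma holds
# the unnormalized topic weights of one document
print('{:*^60}'.format('Topic inference'))
print(lda.inference(corpus))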
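The gamma matrix printed by the inference step is unnormalized, so it can be hard to read. A small sketch of how each row could be turned into topic proportions follows. The helper names `normalize_rows` and `dominant_topic` and the gamma values are hypothetical, chosen only to illustrate the arithmetic; real values depend on the trained model.

```python
def normalize_rows(gamma):
    """Scale each row so that it sums to 1, giving per-document topic proportions."""
    return [[value / sum(row) for value in row] for row in gamma]

def dominant_topic(proportions):
    """Return the index of the topic with the largest share in one document."""
    return max(range(len(proportions)), key=lambda k: proportions[k])

# Hypothetical gamma for 2 documents and 3 topics.
gamma = [[2.0, 1.0, 1.0],
         [0.5, 0.5, 4.0]]

theta = normalize_rows(gamma)
for doc_id, row in enumerate(theta):
    print(doc_id, row, 'dominant topic:', dominant_topic(row))
```

With a real model, gamma would be the first element of the tuple returned by `lda.inference(corpus)`.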

Problems encountered

doc2bow expects an array of unicode tokens on input, not a single string
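In other words, doc2bow expects a list of Unicode tokens as input, not a single string. I first looked at some code on Stack Overflow.
Changing dictionary = corpora.Dictionary(words_ls) to dictionary = corpora.Dictionary([words_ls]) solved the problem: Dictionary expects a list of documents, each of which is a list of tokens, so a flat list of strings has to be wrapped in an outer list (or replaced by one token list per document).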
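The shape mismatch behind this error can be guarded against before calling gensim. The following is a minimal sketch under the assumption that the input is either a flat list of tokens or a list of token lists; `as_documents` is a hypothetical helper, not part of gensim.

```python
def as_documents(tokens_or_docs):
    """Return a list of token lists, wrapping a flat token list as one document."""
    if isinstance(tokens_or_docs, str):
        raise TypeError("expected tokens, not a single string; tokenize it first")
    items = list(tokens_or_docs)
    if all(isinstance(item, str) for item in items):
        return [items]  # flat list of tokens: treat it as a single document
    return [list(doc) for doc in items]  # already one token list per document

flat = ["topic", "model", "text"]
nested = [["topic", "model"], ["text"]]

print(as_documents(flat))    # [['topic', 'model', 'text']]
print(as_documents(nested))  # [['topic', 'model'], ['text']]
```

The returned value has the shape that corpora.Dictionary accepts. Note that wrapping a flat list produces a single-document corpus, which is enough to avoid the error but gives a topic model very little to learn from.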