Using a word cloud to look at everyone's love confessions (brace yourself for a big helping of couple sweetness)

Article link: https://blog.csdn.net/lzq603/article/details/111704771

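This article uses Python to analyze the confession messages that users typed into the WeChat mini program “表白码”. It first extracts the useful data from the Tomcat logs and preprocesses it, then uses the jieba library to segment the text and count word frequencies, and finally uses the wordcloud library to generate a word cloud showing which words the confessions use.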


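Recently on Douyin I came across a programmer who is a fan of a certain singer and had analyzed that singer's lyrics, with an unexpected finding: the most frequent word in the lyrics was “没有” (roughly "don't have"). That gave me a sudden idea: analyze other people's love confessions 😏 and see what they are made of.
Here is the result first: http://www.crazystone.work/love/wordcloud/example/optionKeywords.html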

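Data source: the WeChat mini program “表白码”

“表白码” is a small tool that generates a QR code from a piece of text and lets users customize the QR code's background image, which makes it quite fun. Most of its users are under 24 years old.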

Programming language: Python
Frameworks and modules: jieba, wordcloud

Getting started

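First, pull all the access logs out of Tomcat and look at their contents.
Analysis shows that everything after “/meetingroom/artqrcode?txt=” and before “&c” is what the user typed in. The sequences of % followed by hexadecimal digits are Chinese characters after URL encoding, so they have to be decoded during processing.
[Image: excerpt of the access log]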
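The extraction rule described above can be tried on a single line before running it over real logs. The following is only a minimal sketch: the sample log line is made up (the real layout depends on how the Tomcat access log is configured, and the `c=1` parameter is a placeholder), and `extract_text` is a hypothetical helper name.

```python
import re
from urllib import parse

# Hypothetical access-log line; real lines depend on the configured log pattern.
SAMPLE_LINE = (
    '127.0.0.1 - - [24/Dec/2020:20:00:00 +0800] '
    '"GET /meetingroom/artqrcode?txt=%E6%88%91%E7%88%B1%E4%BD%A0&c=1 HTTP/1.1" 200 1024'
)


def extract_text(line):
    """Return the URL-decoded text between 'txt=' and the next '&c', or None."""
    match = re.search(r'/meetingroom/artqrcode\?txt=(.*?)&c', line)
    if match is None:
        return None
    # unquote() turns %XX escapes back into UTF-8 characters
    return parse.unquote(match.group(1))


if __name__ == '__main__':
    print(extract_text(SAMPLE_LINE))
```

Lines that are not QR code requests simply return None, which makes them easy to skip.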

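Data preprocessing

The data in these logs is messy, and many entries are not QR code generation requests at all, so they need to be filtered out. The code below extracts the useful data from the log files and writes it to a new file, honey_words.txt.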

import re
import os
from urllib import parse

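def parse_out(line):
    '''Extract the text the user wanted to turn into a QR code from one log line.'''
    # Regex match; the non-greedy (.*?) stops at the first "&c" after "txt="
    match = re.search(r'/meetingroom/artqrcode\?txt=(.*?)&c', line)
    # A match object that is not None means the line matched
    if match is not None:
        # Take the first capture group (the part in parentheses in the regex)
        content = match.group(1)
        # Skip entries whose content is a link
        if not content.startswith('http'):
            # Open the output file in append mode
            with open('honey_words.txt', 'a', encoding='utf-8') as fout:
                # URL-decode
                content = parse.unquote(content)
                # Write to the output file
                fout.write(content + '\n')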


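if __name__ == '__main__':
    # Walk through every line of every file and hand it to parse_out
    filename_list = os.listdir('log')
    for filename in filename_list:
        # Open the log file read-only
        with open(os.path.join('log', filename), 'r', encoding='utf-8') as fo:
            # Iterate lazily instead of loading the whole file with readlines()
            for line in fo:
                # Parse the content and append it to the output file
                parse_out(line)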
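The script above reopens honey_words.txt for every matching line and stops with an error if a log file contains bytes that are not valid UTF-8. A possible variant is sketched below, under assumptions that are not from the original: `collect` is a hypothetical function name, the output file is overwritten rather than appended to, and undecodable bytes are replaced instead of raising.

```python
import os
import re
from urllib import parse

PATTERN = re.compile(r'/meetingroom/artqrcode\?txt=(.*?)&c')


def collect(log_dir, out_path):
    """Write each decoded, non-link txt value found under log_dir to out_path.

    Returns the number of lines written.
    """
    written = 0
    # Open the output once ('w' overwrites any earlier run)
    with open(out_path, 'w', encoding='utf-8') as fout:
        for filename in sorted(os.listdir(log_dir)):
            path = os.path.join(log_dir, filename)
            if not os.path.isfile(path):
                continue
            # errors='replace' keeps going when a line is not valid UTF-8
            with open(path, 'r', encoding='utf-8', errors='replace') as fin:
                for line in fin:
                    match = PATTERN.search(line)
                    if match is None:
                        continue
                    content = match.group(1)
                    # Same filter as the original: skip link content
                    if content.startswith('http'):
                        continue
                    fout.write(parse.unquote(content) + '\n')
                    written += 1
    return written


if __name__ == '__main__':
    if os.path.isdir('log'):
        print(collect('log', 'honey_words.txt'))
```

Returning the count gives a quick sanity check on how many confessions were actually recovered from the logs.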

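Word frequency statistics

To build a word cloud that reflects word usage, we need to count how many times each word appears in honey_words.txt. Each line of text is first segmented with the jieba library, then the occurrences of each word are counted, with stopwords filtered out.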

import jieba
import json

# Number of occurrences of each word
statistic = {}

# Load the stopwords into a set for fast membership tests
with open('stopwords.txt', encoding='utf-8') as fo:
    stopwords = set(fo.read().split())
stopwords.update(['', ' ', '\n', '，'])
print(stopwords)

# Go through honey_words.txt line by line, segment each line, and count
with open('honey_words.txt', encoding='utf-8') as fo:
    for line in fo:
        # cut_all=True is jieba's full mode: it emits every word it can find,
        # including overlapping ones
        seg_words = jieba.cut(line, cut_all=True)
        for word in seg_words:
            # Skip stopwords
            if word in stopwords:
                continue
            statistic[word] = statistic.get(word, 0) + 1

# Sort by number of occurrences, most frequent first
word_list = sorted(statistic, key=statistic.get, reverse=True)
print(word_list)

# Build the JSON
json_data = {}
for word in word_list:
    # Ignore words that occur too rarely (4 times or fewer)
    if statistic[word] > 4:
        json_data[word] = statistic[word]
print(json.dumps(json_data))
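The counting, sorting, and threshold steps above can also be written with `collections.Counter` from the standard library. This is a sketch of an alternative, not the author's code: `count_words` is a hypothetical helper, the demo tokens are hand-written stand-ins for what `jieba.cut()` would return, and it drops any whitespace-only token rather than only the few listed in the original.

```python
from collections import Counter
import json


def count_words(token_lists, stopwords, min_count=5):
    """Count tokens across all lines, skipping stopwords and blank tokens.

    Only words seen at least min_count times are kept (min_count=5 matches
    the "> 4" threshold used above).
    """
    counter = Counter()
    for tokens in token_lists:
        counter.update(t for t in tokens if t.strip() and t not in stopwords)
    # most_common() yields (word, count) pairs sorted by descending count
    return {word: n for word, n in counter.most_common() if n >= min_count}


if __name__ == '__main__':
    # Hand-written tokens standing in for jieba.cut() output
    demo = [['我', '喜欢', '你'], ['我', '爱', '你'], ['喜欢', '你']]
    result = count_words(demo, stopwords={'我'}, min_count=2)
    # ensure_ascii=False keeps the Chinese words readable in the output
    print(json.dumps(result, ensure_ascii=False))
```

Because dictionaries preserve insertion order, the returned mapping stays sorted from most to least frequent, the same order the original loop produces.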