Hands-On Natural Language Processing with NLTK and Deep Learning: NLTK Basics, Text Classification and Applications

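Text Classification: Getting Started with NLTK from Scratch

Learning Objectives

In this lab, learners will master the basic principles of text classification and see how to use the NLTK library for feature extraction, model training, and evaluation. Worked examples show how text classification is applied in natural language processing.

Learning Content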

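1 Text Classification with NLTK

1.1 Text Preprocessing

Text preprocessing is the first step in text classification. It includes cleaning the text, tokenizing it, and removing stop words, all of which matter for classification accuracy. This lab shows how to carry out each step with NLTK.

1.1.1 Text Cleaning

Text cleaning removes noise from the text, such as HTML tags and special characters, so that later processing steps work on clean input. In Python, this can be done with regular expressions.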

# Install the NLTK library (the % and ! prefixes are Jupyter notebook syntax)
%pip install nltk

# Download the data
!wget https://model-community-picture.obs.cn-north-4.myhuaweicloud.com/ascend-zone/notebook_datasets/4ac04110300911f0bbc0fa163edcddae/nltk_data.zip

!unzip nltk_data.zip

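import re

def clean_text(text):
    # Remove HTML tags (non-greedy match, so each tag is removed separately)
    text = re.sub(r'<.*?>', '', text)
    # Remove every character that is not an ASCII letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Example
text = "<p>This is a test text that contains some special characters.！@#</p>"
cleaned_text = clean_text(text)
print(cleaned_text)  # This is a test text that contains some special characters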
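
Note that the pattern `[^a-zA-Z0-9\s]` also deletes non-ASCII letters such as accented characters or Chinese text, and it does not decode HTML entities such as `&amp;`. The following is a minimal sketch of a variant for input where Unicode letters should be kept. The name `clean_text_unicode` and the sample string are assumptions made for illustration, not part of the original lab.

```python
import html
import re

def clean_text_unicode(text):
    """Strip HTML tags and punctuation while keeping Unicode letters and digits."""
    # Replace tags with a space so words on either side of a tag do not merge
    text = re.sub(r'<[^>]+>', ' ', text)
    # Decode entities such as &amp; and &eacute; after the markup is gone
    text = html.unescape(text)
    # \w matches Unicode letters, digits, and underscore; replace everything else
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse the runs of whitespace left behind by the substitutions
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_unicode("<p>Caf&eacute; &amp; bar</p><p>你好，世界！</p>"))
# Café bar 你好 世界
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the text is not mistaken for real markup.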

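1.1.2 Tokenization

Tokenization splits text into words or phrases. NLTK provides several tokenizers; the most commonly used is word_tokenize, which needs the Punkt tokenizer data (loaded here from the local nltk_data directory).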

import nltk
from nltk.tokenize import word_tokenize

# Make the locally unzipped data directory visible to NLTK
nltk.data.path.append('./nltk_data')


def tokenize(text):
    # Split the text into word and punctuation tokens
    return word_tokenize(text)

# Example
text = "This is a test text that contains some special characters.！"
tokens = tokenize(text)
print(tokens)
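
Since NLTK offers more than one tokenizer, it may help to compare two rule-based ones that need no downloaded data: `wordpunct_tokenize` and `RegexpTokenizer`. This is only a sketch; the sample sentence is made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Don't panic: NLTK's tokenizers differ!"

# wordpunct_tokenize splits on the pattern \w+|[^\w\s]+,
# so every run of punctuation becomes its own token
punct_tokens = wordpunct_tokenize(text)
print(punct_tokens)
# ['Don', "'", 't', 'panic', ':', 'NLTK', "'", 's', 'tokenizers', 'differ', '!']

# RegexpTokenizer returns only what the pattern matches;
# with \w+ the punctuation is dropped entirely
word_tokens = RegexpTokenizer(r"\w+").tokenize(text)
print(word_tokens)
# ['Don', 't', 'panic', 'NLTK', 's', 'tokenizers', 'differ']
```

Both split contractions such as "Don't" at the apostrophe, which is one reason word_tokenize is usually preferred for English prose when the Punkt data is available.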

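1.1.3 Stop-Word Removal

Stop words are words that occur frequently in text but contribute little to classification, such as "the" and "and" in English (or "的" and "和" in Chinese). Removing them reduces the dimensionality of the feature space and makes the model more efficient.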

from nltk.corpus import stopwords


def remove_stopwords(tokens):
    # Load the English stop-word list as a set for fast membership tests
    stop_words = set(stopwords.words('english'))
    # Compare in lowercase so that capitalized words such as 'This' are also removed
    return [word for word in tokens if word.lower() not in stop_words]

# Example
tokens = ['This', 'is', 'a', 'test', 'text', 'with', 'some', 'stop', 'words']
filtered_tokens = remove_stopwords(tokens)
print(filtered_tokens)
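
The three preprocessing steps are usually chained into one function. The sketch below shows the idea with the standard library only, so it runs without the NLTK data. `DEMO_STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')`, whitespace splitting stands in for `word_tokenize`, and the name `preprocess` is hypothetical.

```python
import re

# Illustrative stand-in for stopwords.words('english'); NOT the NLTK list
DEMO_STOP_WORDS = {"this", "is", "a", "that", "some", "with", "the", "and"}

def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Clean, tokenize, and filter a raw string; returns lowercase tokens."""
    text = re.sub(r'<.*?>', '', text)            # 1. strip HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # 2. strip special characters
    tokens = text.lower().split()                # 3. whitespace tokenization
    return [t for t in tokens if t not in stop_words]  # 4. drop stop words

print(preprocess("<p>This is a test text that contains some special characters.！@#</p>"))
# ['test', 'text', 'contains', 'special', 'characters']
```

Lowercasing before filtering means the stop-word check and any later word counts treat "This" and "this" as the same token.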

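1.2 Feature Extraction

Feature extraction turns text into useful information that will be used to train the classification model. Common methods include the bag-of-words model and TF-IDF. This lab implements both, using Python's collections.Counter for bag of words and scikit-learn for TF-IDF.

1.2.1 Bag-of-Words Model

The bag-of-words model is a simple feature extraction method that represents a text as a vector of term frequencies. Each word's position in the vector corresponds to its position in the vocabulary, and the value is the number of times the word occurs in the text; word order is discarded. The function below stores the counts in a dictionary keyed by word rather than in a fixed-length vector.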

from collections import Counter

def bag_of_words(tokens):
    return dict(Counter(tokens))

# Example
tokens = ['This', 'is', 'a', 'test', 'text', 'with', 'some', 'stop', 'words']
bow = bag_of_words(tokens)
print(bow)
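
To obtain the fixed-length vector described above, the counts have to be aligned with a shared vocabulary. The following is a minimal sketch of that step; `build_vocabulary`, `to_count_vector`, and the two toy documents are hypothetical names and data introduced for illustration.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    return {token: index for index, token in enumerate(vocab)}

def to_count_vector(tokens, vocabulary):
    """Return the term counts of one document, aligned with the vocabulary order."""
    counts = Counter(tokens)
    # Iterating a dict yields keys in insertion order, i.e. by column index;
    # Counter returns 0 for tokens that are absent from the document
    return [counts[token] for token in vocabulary]

docs = [
    ["test", "text", "test"],
    ["another", "text"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'another': 0, 'test': 1, 'text': 2}
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```

Every document now has the same number of columns, which is the shape a classifier expects.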

1.2.2 TF-IDF

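TF-IDF (Term Frequency-Inverse Document Frequency) is a more refined feature extraction method: it weighs how often a term occurs in a document against how common the term is across the document collection, so terms that appear in nearly every document receive lower weights. NLTK does not provide a TF-IDF vectorizer comparable to scikit-learn's, so this lab uses TfidfVectorizer from sklearn.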

from sklearn.feature_extraction.text import TfidfVectorizer

def tf_idf(texts):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)
    return tfidf_matrix, vectorizer

# Example
texts = [
    "This is a test text with some stop words",
    "Another test text with different words"
]
tfidf_matrix, vectorizer = tf_idf(texts)
print(tfidf_matrix.toarray())
print(vectorizer.get_feature_names_out())
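
To make the weighting concrete, here is a standard-library sketch of a smoothed TF-IDF computation with L2-normalized rows. It follows what I understand to be TfidfVectorizer's default settings (raw counts as term frequency, idf = ln((1 + N) / (1 + df)) + 1, then L2 normalization); treat that correspondence as an assumption and check the scikit-learn documentation for the version in use. The function name `tfidf_vectors` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Smoothed TF-IDF with L2-normalized rows, computed from token lists."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in row])
    return vocab, rows

docs = [["test", "text", "words"], ["another", "test", "text"]]
vocab, rows = tfidf_vectors(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first row, "words" gets a larger weight than "test" or "text" because it occurs in only one of the two documents.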

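1.3 Model Training and Evaluation

Model training uses the extracted features to fit a classification model. Evaluation then measures the model's performance on a held-out test set. This lab shows how to do both with sklearn on the features extracted above.

1.3.1 Model Training

Classifiers from sklearn can be used to train the model. Common choices include naive Bayes and support vector machines; the example below uses multinomial naive Bayes (MultinomialNB) and holds out 20% of the samples for testing.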

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def train_model(features, labels):
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    return model, X_test, y_test

# Example
texts = [
    "This is a positive text",
    "This is another positive text",
    "Another positive text",
    "Another negative text",
    "Yet another negative text",
    "More negative text"
]
labels = [1, 1, 1, 0, 0, 0]  # 3 positive, 3 negative

tfidf_matrix, vectorizer = tf_idf(texts)

model, X_test, y_test = train_model(tfidf_matrix, labels)
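
One caveat with the example above: the TF-IDF vectorizer is fitted on all six texts before the split, so vocabulary and document frequencies from the test documents leak into training. A possible alternative, sketched here under the assumption that the raw texts are split first, is to wrap the vectorizer and the classifier in an sklearn pipeline. The `stratify=labels` argument is an addition of this sketch so that both classes appear in the tiny test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "This is a positive text",
    "This is another positive text",
    "Another positive text",
    "Another negative text",
    "Yet another negative text",
    "More negative text",
]
labels = [1, 1, 1, 0, 0, 0]

# Split the raw texts first so the vectorizer never sees the test documents
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary on the training texts only,
# then trains the classifier on the resulting matrix
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_texts, y_train)

print(pipeline.predict(test_texts))
print(pipeline.score(test_texts, y_test))  # mean accuracy on the test texts
```

The fitted pipeline can then classify new raw strings directly, for example `pipeline.predict(["some new text"])`.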

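1.3.2 Model Evaluation

Model evaluation measures the model's performance on the test set. Common metrics include accuracy, precision, recall, and the F1 score.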

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

probs = model.predict_proba(X_test)

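# Set the decision threshold; 0.3 is used here as an example
threshold = 0.3

# Binary classification: the positive class is 1 and the negative class is 0,
# so column 1 of probs holds the probability of the positive class
predictions = (probs[:, 1] >= threshold).astype(int)

# Compute the metrics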
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='binary')
recall = recall_score(y_test, predictions, average='binary')
f1 = f1_score(y_test, predictions, average='binary')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
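
To show what these four numbers mean, the sketch below computes them from the confusion counts (true/false positives and negatives) using only the standard library. `binary_metrics` is a hypothetical helper, and the label lists are made-up toy data, not results from the model above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}; 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Each ratio falls back to 0.0 when its denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels for illustration only: tp=2, fp=1, fn=1, tn=2
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Lowering the decision threshold, as the 0.3 above does relative to the usual 0.5, turns more predictions into the positive class, which tends to raise recall and lower precision.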
