spaCy NLP in Practice: A spaCy Project Walkthrough

WSSWWWSSW

Last modified 2025-09-06 12:17:52

License: CC 4.0 BY-SA

Tags: natural language processing, artificial intelligence, spaCy, NLP

First published 2025-09-06 12:16:30

Article link: https://blog.csdn.net/WSSWWWSSW/article/details/151252266

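spaCy Project Practice

Learning Objectives

This course walks learners through a hands-on project that combines the spaCy material covered so far to build a named entity recognizer and a sentiment classifier. Learners will see how to clean text data, preprocess it with spaCy, and train and evaluate models.

Course Content

1 spaCy Project Practice

1.1 Text Preprocessing

In natural language processing (NLP), text preprocessing is an important step that directly affects the performance of downstream models. Its purpose is to clean and normalize text so that it is better suited as input to a machine learning model. Common steps include removing punctuation, lowercasing, removing stop words, and stemming or lemmatization (spaCy itself provides lemmatization but no stemmer).

First install the spaCy package used in this exercise: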

%pip install spacy==3.7.5

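You also need the `en_core_web_sm-3.7.1-py3-none-any.whl` model package. `en_core_web_sm` is spaCy's small English pipeline; it handles tokenization, part-of-speech tagging, and other basic NLP tasks. It is distributed as a wheel, so once it is installed with pip it can be loaded and called from spaCy.

Download and install it as follows: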

!wget https://model-community-picture.obs.cn-north-4.myhuaweicloud.com/ascend-zone/notebook_codes/268ff7c84a9d11f0bd04fa163edcddae/en_core_web_sm-3.7.1-py3-none-any.whl --no-check-certificate

%pip install en_core_web_sm-3.7.1-py3-none-any.whl
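Before moving on, it can help to confirm that both installs are visible to the running interpreter. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `is_importable` are illustrative and not part of spaCy.

```python
import importlib.metadata
import importlib.util


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None


def is_importable(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


print("spaCy version:", installed_version("spacy"))
print("en_core_web_sm importable:", is_importable("en_core_web_sm"))
```

If the second line prints False right after a notebook install, restarting the kernel is usually worth trying before reinstalling.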

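1.1.1 Removing Punctuation and Lowercasing

Removing punctuation and lowercasing are the most basic preprocessing steps. They reduce surface variation in the text, which makes its features easier for a model to learn. The function preprocess_text below normalizes text by stripping all ASCII punctuation (the characters in string.punctuation) and converting the result to lowercase; the example then shows its effect on a sample sentence.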

import string

def preprocess_text(text):
    # Remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text

# Example
text = "Hello, World! This is a test."
preprocessed_text = preprocess_text(text)
print("After removing punctuation and lowercasing:")
print(preprocessed_text)
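One limitation of the approach above is that `string.punctuation` contains only ASCII characters, so typographic quotes and apostrophes survive. The sketch below contrasts it with a Unicode-aware variant based on `unicodedata.category`; the function names are illustrative. Note that the Unicode variant removes only characters in the punctuation categories, so symbols such as `$` or `+` (which `string.punctuation` does include) are kept.

```python
import string
import unicodedata


def strip_ascii_punctuation(text):
    """Same approach as preprocess_text: drop ASCII punctuation, lowercase."""
    return text.translate(str.maketrans('', '', string.punctuation)).lower()


def strip_unicode_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    kept = [ch for ch in text if not unicodedata.category(ch).startswith('P')]
    return ''.join(kept).lower()


# Sample with a curly apostrophe and curly double quotes
sample = "It\u2019s \u201cgreat\u201d, really!"
print(strip_ascii_punctuation(sample))
print(strip_unicode_punctuation(sample))
```

Which variant fits depends on the data: for text scraped from the web, the Unicode-aware version is often the safer default.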

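1.1.2 Removing Stop Words

Stop words are words that occur frequently but contribute little to meaning, such as "the", "is", and "and". Removing them can reduce noise and may improve model performance. The code below loads spaCy's small English model en_core_web_sm, defines a function remove_stopwords that drops stop words from a text, and applies it to a sample sentence. Punctuation tokens are not stop words, so the trailing period of the sample sentence remains in the output.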

import spacy

nlp = spacy.load('en_core_web_sm')

def remove_stopwords(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop]
    return ' '.join(tokens)

# Example
text = "This is a test sentence with some stop words."
clean_text = remove_stopwords(text)
print("After removing stop words:")
print(clean_text)
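If punctuation should be dropped in the same pass, the token flags `is_punct` and `is_space` can be combined with `is_stop`. The sketch below uses `spacy.blank("en")` as an assumption to keep it independent of the model download: stop-word and punctuation flags are lexical attributes, so a tokenizer-only pipeline is enough. The function name is illustrative.

```python
import spacy

# A blank English pipeline has only a tokenizer plus the language's lexical
# attributes (is_stop, is_punct), so no trained model is needed here.
nlp_blank = spacy.blank("en")


def remove_stopwords_and_punct(text):
    """Keep tokens that are not stop words, punctuation, or whitespace."""
    doc = nlp_blank(text)
    tokens = [
        token.text
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return " ".join(tokens)


print(remove_stopwords_and_punct("This is a test sentence with some stop words."))
```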

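1.2 Entity Recognition with spaCy

Named Entity Recognition (NER) is an important NLP task that identifies spans of text referring to specific kinds of entities, such as people, places, and organizations. spaCy's trained pipelines include an NER component, so entities can be extracted with a few lines of code.

1.2.1 Loading a Pretrained Model

spaCy offers a range of pretrained models that have been trained on large corpora and can be used for entity recognition directly. The code below again loads the small English model en_core_web_sm, defines a function extract_entities that returns each named entity with its label, and runs it on a sentence containing a company, a location, and a monetary amount.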

import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Example
text = "Apple is looking at buying U.K. startup for $1 billion"
entities = extract_entities(text)
print("Entities found by the pretrained model:")
print(entities)
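The list of `(text, label)` pairs returned by `extract_entities` is often easier to work with once it is grouped by label. A minimal sketch follows; the helper name and the sample pairs are hypothetical and are not captured model output.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Turn [(text, label), ...] into {label: [text, ...]}, keeping order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hypothetical input for illustration only, not captured model output
example_entities = [("Acme Corp", "ORG"), ("Berlin", "GPE"), ("Globex", "ORG")]
print(group_entities_by_label(example_entities))
```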

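1.2.2 Custom Entity Recognition

In some domains the pretrained model will miss entities that matter to you. In that case spaCy's training API can be used to train a custom NER component. The code below creates a blank English pipeline, trains an NER component for 20 iterations on a tiny hand-labelled dataset (person and location entities), and then runs it on the test sentence "How about Paris and Tokyo?", printing any entities it finds together with their labels. With only two training sentences the model cannot generalize reliably, so the test output may be empty or wrong; treat this as a demonstration of the training loop, not as a usable model.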

import random

import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

# Training data: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

def train_ner_model(train_data, n_iter=20):
    # A blank pipeline contains only the tokenizer, so NER is the sole
    # trainable component and nothing else needs to be disabled.
    nlp = spacy.blank('en')
    ner = nlp.add_pipe('ner', last=True)

    # Register every entity label that occurs in the training data
    for _, annotations in train_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # Convert the raw annotations into Example objects
    examples = []
    for text, annots in train_data:
        doc = nlp.make_doc(text)
        examples.append(Example.from_dict(doc, annots))

    # In spaCy v3, initialize() replaces the deprecated begin_training()
    optimizer = nlp.initialize(lambda: examples)

    for itn in range(n_iter):
        random.shuffle(examples)
        batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
        losses = {}
        for batch in batches:
            nlp.update(batch, sgd=optimizer, losses=losses)
        print(f'Epoch {itn+1}/{n_iter} Losses', losses)

    return nlp

# Train the model
nlp = train_ner_model(TRAIN_DATA)

# Test case
test_text = "How about Paris and Tokyo?"
doc = nlp(test_text)
print("\nTest text:", test_text)
print("Recognized entities:")
for ent in doc.ents:
    print(f"- {ent.text} ({ent.label_})")
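Character offsets in `TRAIN_DATA` are easy to get wrong, and spans that do not line up with token boundaries cannot be used as NER annotations. One way to catch this before training is to test each span with `Doc.char_span`, which returns `None` when the offsets cut through a token. The sketch below assumes the same data layout as `TRAIN_DATA`; the function name and the deliberately broken sample are illustrative.

```python
import spacy


def find_misaligned_entities(train_data):
    """Return (text, start, end, label) for spans that do not match tokens."""
    nlp_blank = spacy.blank("en")  # tokenizer only; no trained model required
    problems = []
    for text, annotations in train_data:
        doc = nlp_blank.make_doc(text)
        for start, end, label in annotations.get("entities", []):
            # char_span returns None when the offsets cut through a token
            if doc.char_span(start, end, label=label) is None:
                problems.append((text, start, end, label))
    return problems


sample_data = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 12, "LOC")]}),  # off by one
]
print(find_misaligned_entities(sample_data))
```

Running the same check on the article's `TRAIN_DATA` should report no problems, since its offsets match the token boundaries.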

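1.3 Building and Evaluating a Sentiment Analysis Model

Sentiment analysis is a common NLP application that determines the polarity of a text: positive, negative, or neutral. This section uses spaCy to turn text into features and a conventional machine learning model to classify sentiment.

1.3.1 Preparing the Data

First prepare a dataset for training and testing. A sentiment dataset normally pairs each text with a label. The code below defines 15 positive and 15 negative reviews, interleaves them with their labels (1 for positive, 0 for negative), builds a DataFrame, saves it to a CSV file, and prints the first few rows.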

import pandas as pd

# Example positive and negative reviews
positive_reviews = [
    "This product is absolutely amazing! I love it.",
    "Great quality and very useful. Highly recommended.",
    "Works perfectly! Couldn't be happier with my purchase.",
    "Fantastic service and fast delivery. Will buy again!",
    "Incredible value for the price. Definitely worth it.",
    "This has exceeded my expectations. 5 stars!",
    "So impressed with the performance. A+++",
    "Absolutely wonderful! I use it every day.",
    "The best product I've bought in a long time.",
    "Perfect in every way. I'm very satisfied.",
    "Brilliant design and functionality. Love it!",
    "This product made my life so much easier.",
    "Outstanding quality. I'm a happy customer.",
    "Highly satisfied with the results. Great buy!",
    "Exceptional service and a great product."
]

negative_reviews = [
    "Terrible product. Doesn't work as advertised.",
    "Waste of money. I regret buying this.",
    "Poor quality. Broke after just a few uses.",
    "Very disappointed. Not worth the price.",
    "Awful customer service. Avoid this company.",
    "Doesn't live up to the hype. Save your money.",
    "Extremely dissatisfied. Returning immediately.",
    "Faulty product. Wouldn't recommend to anyone.",
    "Horrible experience. Stay away!",
    "Incredibly frustrating. Does not work properly.",
    "Complete disappointment. Look elsewhere.",
    "This product is a disaster. Avoid at all costs.",
    "Badly designed. Unusable in practice.",
    "Disappointing performance. Not what I expected.",
    "Terrible value. Overpriced and ineffective."
]

# Build interleaved label and text lists
labels = []
texts = []

for i in range(15):  # 30 rows in total (15 positive + 15 negative)
    # Add a positive review
    labels.append(1)
    texts.append(positive_reviews[i])

    # Add a negative review
    labels.append(0)
    texts.append(negative_reviews[i])

# Create the DataFrame
data = pd.DataFrame({
    'text': texts,
    'label': labels
})

# Save to a CSV file
data.to_csv('sentiment_data.csv', index=False)

# Read the data back
data = pd.read_csv('sentiment_data.csv')

# Inspect the first rows
print(data.head())
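Before extracting features, a quick sanity check of class balance and duplicate texts can save debugging time later. The following is a minimal sketch; the helper name `summarize_dataset` and the small stand-in frame are assumptions made for illustration and only mirror the column names of `sentiment_data.csv`.

```python
import pandas as pd


def summarize_dataset(df, text_col="text", label_col="label"):
    """Return per-label counts and the number of duplicated texts."""
    counts = df[label_col].value_counts().to_dict()
    duplicates = int(df[text_col].duplicated().sum())
    return counts, duplicates


# Tiny stand-in frame with the same columns as sentiment_data.csv
demo = pd.DataFrame({
    "text": ["Great buy!", "Terrible product.", "Great buy!", "Not worth it."],
    "label": [1, 0, 1, 0],
})
counts, duplicates = summarize_dataset(demo)
print("label counts:", counts)
print("duplicate texts:", duplicates)
```

Calling `summarize_dataset(data)` on the article's DataFrame would be the natural next step.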

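1.3.2 Feature Extraction

Before training, the text must be converted into numeric feature vectors that a model can consume. The code below again loads en_core_web_sm, defines a function extract_features that returns doc.vector for each text, applies it to the reviews in the DataFrame, and collects the matching sentiment labels. Note that en_core_web_sm does not ship static word vectors: for this model doc.vector falls back to the average of the pipeline's context-sensitive token tensors (doc.tensor). For true pretrained word vectors, use a model that includes them, such as en_core_web_md or en_core_web_lg.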

import spacy

nlp = spacy.load('en_core_web_sm')

def extract_features(texts):
    features = []
    for text in texts:
        doc = nlp(text)
        features.append(doc.vector)
    return features

# Extract features
X = extract_features(data['text'])
y = data['label']
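Two practical refinements are worth sketching here: checking whether a pipeline actually carries a static vector table, and batching texts through `nlp.pipe` instead of calling `nlp` once per text. The helper names below are illustrative. As an assumption to keep the sketch runnable without a model download, it uses `spacy.blank("en")`, which has no vectors at all, so the resulting feature width is 0; with a trained pipeline the same code yields real features.

```python
import numpy as np
import spacy


def has_static_vectors(nlp):
    """True if the pipeline's vocabulary carries a static vector table."""
    return nlp.vocab.vectors.shape[0] > 0


def extract_features_batched(nlp, texts):
    """Run texts through nlp.pipe and stack doc.vector into a 2-D array."""
    vectors = [doc.vector for doc in nlp.pipe(texts)]
    return np.vstack(vectors)


# A blank pipeline stands in for a trained one; it has no vectors, so the
# feature width here is 0.
nlp_demo = spacy.blank("en")
features = extract_features_batched(nlp_demo, ["Great buy!", "Terrible product."])
print("static vectors:", has_static_vectors(nlp_demo))
print("feature matrix shape:", features.shape)
```

Returning a single 2-D array, not a list of arrays, also makes the result directly usable with scikit-learn estimators.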

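1.3.3 Training the Model

Many machine learning models can be used for sentiment analysis, including logistic regression and support vector machines. The code below splits the feature vectors and labels into training and test sets (30% for testing, stratified by label), trains a logistic regression model, predicts on the test set, and prints the accuracy and a classification report. With 30 samples the test set holds only 9 reviews, so the reported metrics are a rough indication at best.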

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification report:\n', classification_report(y_test, y_pred))
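Because a single split of such a small dataset gives a noisy estimate, stratified k-fold cross-validation is a common complement to the hold-out evaluation above. The sketch below uses synthetic data as a stand-in for the article's `X` and `y` (the feature width of 20 and the class shift are arbitrary assumptions), so the printed numbers say nothing about the review dataset itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y): 30 samples, 20 features, two balanced
# classes shifted apart so that they are learnable.
rng = np.random.default_rng(42)
y_demo = np.array([1, 0] * 15)
X_demo = rng.normal(size=(30, 20)) + y_demo[:, None] * 0.5

# Five stratified folds keep the class ratio identical in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

To apply it to the article's data, pass the extracted `X` and `y` in place of `X_demo` and `y_demo`.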

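1.3.4 Applying the Model

Finally, the trained model can be applied to new text. The function predict_sentiment below extracts the spaCy vector for a text, feeds it to the trained logistic regression model, and returns the predicted sentiment. The example prints the prediction for the sample text "I love this product! It's amazing."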

def predict_sentiment(text):
    doc = nlp(text)
    features = [doc.vector]
    prediction = model.predict(features)
    return 'Positive' if prediction[0] == 1 else 'Negative'

# Example
text = "I love this product! It's amazing."
sentiment = predict_sentiment(text)
print("Predicted sentiment:")
print(sentiment)
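A bare label hides how confident the classifier is. Logistic regression exposes class probabilities through `predict_proba`, which can be returned alongside the label. The sketch below is self-contained and therefore trains a throwaway model on synthetic data; `predict_with_confidence`, `demo_model`, and the synthetic arrays are illustrative assumptions, not the article's trained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_with_confidence(model, vector):
    """Return (label, probability) for a single feature vector."""
    probabilities = model.predict_proba([vector])[0]
    best = int(np.argmax(probabilities))
    label = 'Positive' if model.classes_[best] == 1 else 'Negative'
    return label, float(probabilities[best])


# Synthetic stand-in data: 30 samples, 20 features, two balanced classes
rng = np.random.default_rng(0)
y_demo = np.array([1, 0] * 15)
X_demo = rng.normal(size=(30, 20)) + y_demo[:, None] * 2.0
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

label, confidence = predict_with_confidence(demo_model, X_demo[0])
print(label, round(confidence, 3))
```

With the article's objects, the call would be `predict_with_confidence(model, nlp(text).vector)`.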