(7-3-2) Neural Machine Translation (NMT): A Simple NMT-Based Translation System

码农三叔

Last modified: 2024-09-12 08:48:36

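License: CC 4.0 BY-SA

Category: 《NLP算法实战》 (NLP Algorithms in Practice). Tags: machine translation, artificial intelligence, natural language processing, python, algorithms

First published: 2024-03-29 19:58:11

Article link: https://blog.csdn.net/asd343442/article/details/137154828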

7.3.4 A Simple NMT-Based Translation System

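This section walks through building a machine translation system based on an NMT model, covering data preparation, model architecture, training, and evaluation. The project uses an attention mechanism to improve translation quality and provides visualization tools that help explain the model's translation process. It can serve as a starting point for machine translation tasks and can be extended and improved as needed.

Specifically, the project's functions are summarized as follows: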

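Data preparation: Obtain sentence pairs from a dataset of English sentences and their Portuguese translations, then apply basic data cleaning and preprocessing.
Model architecture: Use an encoder-decoder architecture, in which the encoder encodes the input sentence into a fixed-length vector and the decoder decodes that vector into a sentence in the target language.
Vocabulary: Build English and Portuguese vocabularies and create an index mapping for every word.
Model training: Train the model on batches of data over multiple epochs, optimizing the weights to minimize translation error. Teacher forcing is used during training to speed up learning.
Attention mechanism: The model uses an attention mechanism, which lets it focus on different parts of the input sentence while translating and so improves translation quality.
Model evaluation: Provide an evaluation function that takes an input sentence and returns the model's translation together with the attention weight distribution.
Random sample prediction: Provide a function that picks random samples from the validation set, translates them, and generates attention heat maps to help explain the model's translation behavior.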
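
The attention idea in the list above can be illustrated with a minimal, framework-free sketch. It is not the project's model code: dot-product scoring is an assumption made here for brevity, and the names softmax and attend, as well as the toy vectors, are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector).

    Assumption: scores are plain dot products between the decoder
    state and each encoder state; real models may score differently.
    """
    # One score per source position: how well it matches the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three source positions, hidden size 2 (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

The weights are what an attention heat map would plot: one row of such weights per generated target word, one column per source word.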

Example 7-3: Comprehensive project: a translation system (source path: daima\7\NMT-translation.ipynb)

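(1) Install Chart Studio

This project uses the Chart Studio library. Chart Studio is Plotly's online platform for editing and sharing charts; it lets users create, customize, share, and deploy interactive charts and data visualizations. Its goal is to make data visualization easier and to provide tools for exploring, understanding, and communicating data. It integrates closely with Plotly's Python, R, and JavaScript charting libraries, so charts can be embedded in data science and web development projects. Install Chart Studio with the following command: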

pip install chart-studio

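(2) Prepare the dataset file. This project uses por.txt, a dataset of English sentences and their Portuguese translations. Each line of the file contains an English sentence and its Portuguese translation, separated by a tab. The following code recursively walks the dataset directory 'input' and its subdirectories and prints the full path of every file to the console.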

import os
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Output:

input/por.txt

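(3) Open por.txt with UTF-8 encoding, split its content into lines, and display the 10 entries at list indices 5000 through 5009 (the slice lines[5000:5010] excludes index 5010).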

file_path = '../input/por.txt'
# Use a context manager so the file is closed after reading
with open(file_path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')
lines[5000:5010]

Output:

['Will it rain?\tSerá que chove?\tCC-BY 2.0 (France) Attribution: tatoeba.org #8918600 (CK) & #8930552 (JGEN)',
 'Wish me luck.\tDeseje-me sorte.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2254917 (CK) & #872788 (alexmarcelo)',
 "Won't you go?\tVocê não vai?\tCC-BY 2.0 (France) Attribution: tatoeba.org #241051 (CK) & #6212788 (bill)",
 'Write in ink.\tEscreva à tinta.\tCC-BY 2.0 (France) Attribution: tatoeba.org #3258764 (CM) & #7351595 (alexmarcelo)',
 'Write in ink.\tEscreva a tinta.\tCC-BY 2.0 (France) Attribution: tatoeba.org #3258764 (CM) & #7351606 (alexmarcelo)',
 'Write to Tom.\tEscreva para o Tom.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2240357 (CK) & #5985551 (Ricardo14)',
 'Years passed.\tPassaram os anos.\tCC-BY 2.0 (France) Attribution: tatoeba.org #282197 (CK) & #977841 (alexmarcelo)',
 'Years passed.\tAnos se passaram.\tCC-BY 2.0 (France) Attribution: tatoeba.org #282197 (CK) & #2324530 (Matheus)',
 'You amuse me.\tVocê me diverte.\tCC-BY 2.0 (France) Attribution: tatoeba.org #268209 (CM) & #1199960 (alexmarcelo)',
 'You are late.\tVocê está atrasado.\tCC-BY 2.0 (France) Attribution: tatoeba.org #277403 (CK) & #1275547 (alexmarcelo)']
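
As the output shows, each record has three tab-separated fields: the English sentence, the Portuguese sentence, and an attribution string. A minimal sketch of pulling out just the sentence pair might look like the following; split_record is a hypothetical helper name, not part of the original notebook, and the two sample lines are copied from the output above.

```python
def split_record(line):
    """Split one raw line into (english, portuguese), ignoring the attribution."""
    fields = line.split('\t')
    return fields[0], fields[1]

# Two records copied from the sample output above.
sample_lines = [
    'Will it rain?\tSerá que chove?\tCC-BY 2.0 (France) Attribution: tatoeba.org #8918600 (CK) & #8930552 (JGEN)',
    'Wish me luck.\tDeseje-me sorte.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2254917 (CK) & #872788 (alexmarcelo)',
]

pairs = [split_record(line) for line in sample_lines]
for eng, port in pairs:
    print(eng, '->', port)
```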

(4) Print the number of lines read from the text file by the code above, that is, the total number of records in the file.

print("total number of records: ",len(lines))

In the code above, len(lines) returns the length of the list lines, which is the number of lines in the file. print() then displays this count together with the message "total number of records: ". Output:

total number of records:  168903

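(5) Use the string module from the Python standard library to create two text-processing tools, exclude and remove_digits. exclude is the set of ASCII punctuation characters, and remove_digits is a translation table that deletes the digits 0-9. Both are useful for cleaning text, for example stripping punctuation or digits before text analysis or other natural language processing tasks.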

import re      # used by the preprocessing functions below
import string
exclude = set(string.punctuation)
remove_digits = str.maketrans('', '', string.digits)
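
To make the role of these two tools concrete, here is a small sketch of how they could be applied to a string. The helper name strip_punctuation_and_digits is hypothetical and does not appear in the original notebook; note that string.punctuation covers only ASCII punctuation, so accented Portuguese letters are left untouched.

```python
import string

exclude = set(string.punctuation)
remove_digits = str.maketrans('', '', string.digits)

def strip_punctuation_and_digits(text):
    """Hypothetical helper: drop ASCII punctuation, then ASCII digits."""
    no_punct = ''.join(ch for ch in text if ch not in exclude)
    # str.translate deletes every character mapped to None in the table.
    return no_punct.translate(remove_digits)

cleaned = strip_punctuation_and_digits('Room 101, please!')
print(repr(cleaned))
```

Removing the digits leaves two adjacent spaces behind, which is why a later step that collapses repeated spaces is still needed.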

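(6) Define a function named preprocess_eng_sentence that preprocesses English sentences for use in natural language processing tasks such as machine translation or text generation.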

def preprocess_eng_sentence(sent):
    '''Function to preprocess English sentence'''
    sent = re.sub("'", '', sent) # remove apostrophes, if any
    sent = ''.join(ch for ch in sent if ch not in exclude) # drop punctuation
    #sent = re.sub("[२३०८१५७९४६]", "", sent) # remove the digits
    sent = sent.strip()
    sent = re.sub(" +", " ", sent) # remove extra spaces
    sent = '<start> ' + sent + ' <end>' # add <start> and <end> tokens
    return sent

(7) Define a function named preprocess_port_sentence that preprocesses Portuguese sentences.

def preprocess_port_sentence(sent):
    '''Function to preprocess Portuguese sentence'''
    sent = re.sub("'", '', sent) # remove apostrophes, if any
    sent = ''.join(ch for ch in sent if ch not in exclude)
    #sent = re.sub("[२३०८१५७९४६]", "", sent) # remove the digits
    sent = sent.strip()
    sent = re.sub(" +", " ", sent) # remove extra spaces
    sent = '<