import tensorflow as tf
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'
PAD_TOKEN = '[PAD]'
UNKNOWN_TOKEN = '[UNK]'
START_DECODING = '[START]'
STOP_DECODING = '[STOP]'
# vocab_file => vocab.txt, max_size = 30000
# Vocabulary class; unlike reading vocab.txt directly, it adds handling for the special tokens defined above.
class Vocab:
def __init__(self, vocab_file, max_size):
self.word2id = {UNKNOWN_TOKEN: 0, PAD_TOKEN: 1, START_DECODING: 2, STOP_DECODING: 3}
self.id2word = {0: UNKNOWN_TOKEN, 1: PAD_TOKEN, 2: START_DECODING, 3: STOP_DECODING}
self.count = 4
with open(vocab_file, 'r', encoding='utf-8') as f:
            # Read the vocab file line by line; each line is expected to look like "说 0"
for line in f:
pieces = line.split()
if len(pieces) != 2:
                    # Skip malformed lines
print('Warning : incorrectly formatted line in vocabulary file : %s\n' % line)
continue
                w = pieces[0]  # take the word; raise if it is one of the reserved special tokens
if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]:
                    raise Exception("<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn't be in the vocab file, "
                                    "but %s is" % w)
                # Raise on duplicate words
if w in self.word2id:
raise Exception('Duplicated word in vocabulary file: %s' % w)
                # Build the word->id and id->word mappings as dicts
self.word2id[w] = self.count
self.id2word[self.count] = w
self.count += 1
                # Stop reading once max_size words have been loaded
if max_size != 0 and self.count >= max_size:
print("max_size of vocab was specified as %i; we now have %i words. Stopping reading."
% (max_size, self.count))
break
print("Finished constructing vocabulary of %i total words. Last word added: %s" %
(self.count, self.id2word[self.count - 1]))
    # Look up the id for a word; OOV words map to [UNK] (id 0)
def word_to_id(self, word):
if word not in self.word2id:
return self.word2id[UNKNOWN_TOKEN]
return self.word2id[word]
    # Look up the word for an id; raise if the id is not in the vocab
def id_to_word(self, word_id):
if word_id not in self.id2word:
raise ValueError('Id not found in vocab: %d' % word_id)
return self.id2word[word_id]
    # Actual vocabulary size (including the 4 special tokens)
def size(self):
return self.count
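# A minimal usage sketch (hypothetical file name and values; assumes a vocab.txt whose lines look like "说 125"):
#   vocab = Vocab('vocab.txt', max_size=30000)
#   vocab.word_to_id('[PAD]')          # -> 1
#   vocab.word_to_id('不在词表里的词')   # -> 0, i.e. the [UNK] id
#   vocab.id_to_word(2)                # -> '[START]'
#   vocab.size()                       # -> number of words actually loaded, at most max_size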
# Map article words to ids; in-article OOVs get temporary ids vocab.size()+0, vocab.size()+1, ...
# Returns the id list and the list of article OOV words (see the sketch after the function).
def article_to_ids(article_words, vocab):
ids = []
oovs = []
unk_id = vocab.word_to_id(UNKNOWN_TOKEN)
for w in article_words:
i = vocab.word_to_id(w)
if i == unk_id: # If w is OOV
if w not in oovs: # Add to list of OOVs
oovs.append(w)
oov_num = oovs.index(w) # This is 0 for the first article OOV, 1 for the second article OOV...
ids.append(vocab.size() + oov_num) # This is e.g. 50000 for the first article OOV, 50001 for the second...
else:
ids.append(i)
return ids, oovs
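# Worked sketch (illustrative assumptions: vocab.size() == 30000, '新车' has id 57, '贴膜' is OOV):
#   article_to_ids(['新车', '贴膜', '贴膜'], vocab)
#   -> ids  = [57, 30000, 30000]   # the first article OOV gets temporary id vocab.size() + 0
#   -> oovs = ['贴膜']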
# Map abstract words to ids; an OOV that also appears in article_oovs reuses its temporary id,
# while an OOV absent from the article maps to [UNK] (see the sketch after the function).
def abstract_to_ids(abstract_words, vocab, article_oovs):
ids = []
unk_id = vocab.word_to_id(UNKNOWN_TOKEN)
for w in abstract_words:
i = vocab.word_to_id(w)
if i == unk_id: # If w is an OOV word
if w in article_oovs: # If w is an in-article OOV
vocab_idx = vocab.size() + article_oovs.index(w) # Map to its temporary article OOV number
ids.append(vocab_idx)
else: # If w is an out-of-article OOV
ids.append(unk_id) # Map to the UNK token id
else:
ids.append(i)
return ids
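# Worked sketch (same illustrative assumptions as above, plus '保修' being an OOV not present in the article):
#   abstract_to_ids(['新车', '贴膜', '保修'], vocab, ['贴膜'])
#   -> [57, 30000, 0]   # in-article OOV reuses its temporary id, out-of-article OOV falls back to [UNK]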
# Not reviewed in detail yet; currently unreferenced
def output_to_words(id_list, vocab, article_oovs):
words = []
for i in id_list:
try:
w = vocab.id_to_word(i) # might be [UNK]
except ValueError as e: # w is OOV
assert article_oovs is not None, "Error: model produced a word ID that isn't in the vocabulary. " \
"This should not happen in baseline (no pointer-generator) mode"
article_oov_idx = i - vocab.size()
try:
w = article_oovs[article_oov_idx]
            except IndexError:  # i doesn't correspond to an article oov
raise ValueError('Error: model produced word ID %i which corresponds to article OOV %i but this '
'example only has %i article OOVs' % (i, article_oov_idx, len(article_oovs)))
words.append(w)
return words
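# Worked sketch (the inverse direction, same illustrative assumptions):
#   output_to_words([57, 30000, 0], vocab, ['贴膜'])
#   -> ['新车', '贴膜', '[UNK]']   # 30000 == vocab.size() + 0 resolves to the first article OOV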
# Currently unreferenced
def abstract_to_sents(abstract):
"""
Splits abstract text from datafile into list of sentences.
Args:
abstract: string containing <s> and </s> tags for starts and ends of sentences
Returns:
sents: List of sentence strings (no tags)
"""
cur = 0
sents = []
while True:
try:
start_p = abstract.index(SENTENCE_START, cur)
end_p = abstract.index(SENTENCE_END, start_p + 1)
cur = end_p + len(SENTENCE_END)
sents.append(abstract[start_p + len(SENTENCE_START): end_p])
except ValueError as e: # no more sentences
return sents
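# Worked sketch (illustrative input):
#   abstract_to_sents('<s> 你好 车主 </s> <s> 建议 检查 尾灯 </s>')
#   -> [' 你好 车主 ', ' 建议 检查 尾灯 ']   # tags removed; surrounding spaces are kept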
def get_dec_inp_targ_seqs(sequence, max_len, start_id, stop_id):
"""
Given the reference summary as a sequence of tokens, return the input sequence for the decoder,
and the target sequence which we will use to calculate loss. The sequence will be truncated if it is longer
than max_len. The input sequence must start with the start_id and the target sequence must end with the stop_id
(but not if it's been truncated).
Args:
sequence: List of ids (integers)
max_len: integer
start_id: integer
stop_id: integer
Returns:
inp: sequence length <=max_len starting with start_id
target: sequence same length as input, ending with stop_id only if there was no truncation
"""
inp = [start_id] + sequence[:]
target = sequence[:]
    # If inp exceeds max_len, keep start_id plus the first (max_len - 1) tokens of sequence,
    # and the first max_len tokens as target; otherwise append stop_id to target.
if len(inp) > max_len: # truncate
inp = inp[:max_len]
target = target[:max_len] # no end_token
else: # no truncation
target.append(stop_id) # end token
    # Sanity check: inp and target must end up the same length,
    # either both max_len (truncated) or both len(sequence) + 1 (one gained start_id, the other stop_id)
    assert len(inp) == len(target)
return inp, target
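# Worked sketch (assuming start_id=2 and stop_id=3, i.e. the [START]/[STOP] ids defined above):
#   get_dec_inp_targ_seqs([10, 11, 12], max_len=5, start_id=2, stop_id=3)
#   -> inp = [2, 10, 11, 12], target = [10, 11, 12, 3]   # no truncation: target ends with stop_id
#   get_dec_inp_targ_seqs([10, 11, 12], max_len=3, start_id=2, stop_id=3)
#   -> inp = [2, 10, 11],     target = [10, 11, 12]      # truncated: no stop_id appended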
def example_generator(vocab, train_x_path, train_y_path, test_x_path, max_enc_len, max_dec_len, mode, batch_size):
    # Training-data branch
if mode == "train":
        # TextLineDataset builds a line-by-line dataset from the given file path
dataset_train_x = tf.data.TextLineDataset(train_x_path)
dataset_train_y = tf.data.TextLineDataset(train_y_path)
        # Zip the two datasets into (x, y) pairs, e.g. [(x1, y1), (x2, y2), (x3, y3)]
train_dataset = tf.data.Dataset.zip((dataset_train_x, dataset_train_y))
# train_dataset = train_dataset.shuffle(1000, reshuffle_each_iteration=True).repeat()
# i = 0
#print("gen",train_dataset)
for raw_record in train_dataset:
            # Decode the raw record bytes into UTF-8 strings
article = raw_record[0].numpy().decode("utf-8")
#print("article",article)
#article 新车 , 全款 , 买 了 半个 月 , 去 4S店 贴膜 时才 发现 右侧 尾灯 下 (...)。 车主 说 : 恩
abstract = raw_record[1].numpy().decode("utf-8")
#print("abstract",abstract)
            #abstract 你好 , 像 这