Continuing from the previous post.
I ran into an error:
langdetect.lang_detect_exception.LangDetectException: No features in text.
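langdetect could not detect a language for the text.
So let's just skip those records, there's plenty of data anyway (just kidding).
Modify the fetch_pubmed_articles method in the script: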
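# fetch
from langdetect.lang_detect_exception import LangDetectException

err = 0  # running count of records skipped because language detection failed

def fetch_pubmed_articles(ids):
    global err  # err is module-level, so "err += 1" needs this declaration
    ids = ",".join(ids)
    handle = Entrez.efetch(db="pubmed", id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    #print(records)
    set_articles = []
    set_langs = []
    for record in get_set_articles(records):
        #print(record)
        try:
            article, langs = build_article(record)
        except LangDetectException as e:
            # No language could be detected for this record: count it and skip it
            err += 1
            print(f"{e}: {err}")
            continue
        set_articles.append(article)
        set_langs.append(langs)
    #print(len(articles))
    return set_articles, set_langs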
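As a complement to catching the exception, records can also be filtered before detection is attempted. The sketch below is only a heuristic, under the assumption that "No features in text" mostly shows up for empty strings or text made up solely of digits and punctuation. The helper name has_detectable_text is hypothetical and is not part of langdetect or of the original script.

```python
def has_detectable_text(text):
    """Heuristic: True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


# Hypothetical sample inputs: the first three contain no letters at all.
samples = ["", "12345", "?!...", "Hello world", "你好"]
kept = [s for s in samples if has_detectable_text(s)]
skipped = len(samples) - len(kept)
print(kept, skipped)
```

A filter like this does not replace the try/except, since detection may still fail for other reasons; it only reduces how often the exception path is hit.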
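This dataset is paragraph-level, with a bit over 15,000 entries.
Using it later would still require sentence splitting and alignment. If I end up needing it, I'll do that when I have time.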
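For reference, a rough idea of what the sentence-splitting step could look like. This is a naive regex sketch, not what the post actually implements: it assumes sentences end with ., !, ? followed by whitespace, or with the full-width 。!? marks. The function names split_sentences and pair_if_same_count are hypothetical, and the pairing step is only a placeholder for real alignment.

```python
import re

# Split after ASCII terminators followed by whitespace,
# or directly after full-width CJK terminators.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。!?])\s*")


def split_sentences(paragraph):
    """Naively split a paragraph into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENT_BOUNDARY.split(paragraph) if s.strip()]


def pair_if_same_count(src_paragraph, tgt_paragraph):
    """Pair sentences one-to-one only when both sides split into the same count."""
    src = split_sentences(src_paragraph)
    tgt = split_sentences(tgt_paragraph)
    if len(src) != len(tgt):
        return None  # counts differ: this pair needs a real alignment method
    return list(zip(src, tgt))


print(split_sentences("First one. Second one? Third!"))
```

Abbreviations such as "e.g. this" would be split incorrectly by this pattern, so a proper sentence segmenter would be the safer choice for real data.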