
This paper is supported by the National Natural Science Foundation (Nos. 61662040, 61562049).
Chinese-Lao Cross-Language Text Similarity
Computing Based on WordNet
Sizhuo Li (1,2), Lanjiang Zhou (*,1,2), Jianan Zhang (3), Feng Zhou (2), Jianyi Guo (2), Wenjie Huo (1,2)
(1) School of Information Engineering and Automation, Kunming University of Science and Technology,
Kunming 650500, China
(2) The Key Laboratory of Intelligent Information Processing, Kunming University of Science and
Technology, Kunming, Yunnan 650500, China
(3) Information Engineering University, Kunming team of the three schools 650500, China
Abstract. Text similarity calculation is widely used in information retrieval,
question answering systems, plagiarism detection and so on. At present, most
research addresses text similarity within a single language, and research on
cross-language text similarity calculation is rare. Differences between languages
make cross-language text similarity calculation very difficult. In view of this
situation, this paper proposes a WordNet-based method for Chinese-Lao
cross-language text similarity calculation. First, preprocessing and feature
selection are performed on Chinese and Lao texts from the medical domain. Then
the semantic dictionary WordNet is used to convert the Chinese text and the Lao
text into a middle layer language. Finally, the similarity between the Chinese
and Lao texts is computed in the middle layer.
Keywords: WordNet; middle layer language; cross-language text similarity
1. Introduction
Text similarity computing has been widely discussed in the fields of linguistics,
psychology, information theory and so on. Text similarity calculation aims to measure
the degree of correlation between two texts. In recent years, methods of text similarity
computation [1,2,3] within a single language have become increasingly mature, with
algorithm models represented by the Boolean model, the vector space model, the
probability model and so on. However, research on cross-language text similarity is
very rare.
Cross-language text similarity quantifies the similarity between two texts in different
languages, with the aim of making the quantitative results agree as closely as possible
with human judgment. Due to the differences in grammar between Chinese and Lao,
existing methods for calculating the similarity of texts in the same language cannot be
used to calculate the similarity between Chinese and Lao texts. At present, there are
several methods for calculating cross-language text similarity: the method based on
machine translation [4], the method based on a statistical translation model [5], and
the method based on a parallel corpus [6].

WordNet is a semantic dictionary that uses a synonym set (synset) to represent a
concept and is available in multiple language versions. The Chinese WordNet used in
this paper was developed by Southeast University, and the Lao semantic dictionary was
constructed by our laboratory. The synonym set identifiers (synset_id) of the different
WordNet language versions correspond to each other. This paper therefore exploits this
characteristic and proposes a method of Chinese-Lao cross-language text similarity
computing based on WordNet for the medical domain. The method uses WordNet to
convert the Chinese text and the Lao text into a middle layer language and then
computes the text similarity between Chinese and Lao in the middle layer.
2. The process of Chinese-Lao text similarity computing
2.1 Text preprocessing
Although the original text contains all of the text information, current natural
language processing technology cannot fully process it. Therefore, the text needs to
be preprocessed. Because the method in this paper analyzes the semantics of words, it
is necessary to handle certain special words, such as personal names and place names,
by converting them into a specific string. In feature selection, these special words
are ignored to avoid noise interference.
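The masking step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not say how special words are detected, so the gazetteers `PERSON_NAMES` and `PLACE_NAMES`, the placeholder strings `<NAME>` and `<PLACE>`, and both function names are hypothetical.

```python
# Hypothetical gazetteers; a real system would more likely use a
# named-entity recognizer for Chinese and Lao.
PERSON_NAMES = {"Li Ming", "Somchai"}
PLACE_NAMES = {"Kunming", "Vientiane"}

NAME_TOKEN = "<NAME>"    # assumed placeholder string for personal names
PLACE_TOKEN = "<PLACE>"  # assumed placeholder string for place names


def mask_special_words(text, names=PERSON_NAMES, places=PLACE_NAMES):
    """Replace known personal and place names with fixed placeholder strings."""
    # Longest entries first, so a longer name is not partially replaced
    # by a shorter one that it contains.
    for name in sorted(names, key=len, reverse=True):
        text = text.replace(name, NAME_TOKEN)
    for place in sorted(places, key=len, reverse=True):
        text = text.replace(place, PLACE_TOKEN)
    return text


def drop_placeholders(tokens):
    """Ignore the placeholder tokens during feature selection."""
    return [t for t in tokens if t not in {NAME_TOKEN, PLACE_TOKEN}]
```

For example, `mask_special_words("Li Ming was treated in Kunming")` yields a string in which both entities are replaced, and `drop_placeholders` then removes them from the token list before features are chosen.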
2.2 Text feature selection
The purpose of feature selection is to select the feature items that make a real
contribution to similarity computing; the selected feature items should be able to
express the theme of the original text. In this paper, words are extracted as the
features of a text, and each document is treated as a bag of words. After word
segmentation and stop-word removal, the Chinese documents and the Lao documents
each form a feature word set. Document frequency selection is then used to remove
useless words that interfere with the original text. Document frequency (DF) is the
number of texts in the whole text set that contain the feature word t. When DF is
greater than a certain threshold, t is removed, because the higher the DF, the more
texts t appears in, and the less it helps to distinguish between texts. When DF is
less than a certain threshold, t is also removed, because t is either a rare word or
noise.
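A minimal sketch of the document frequency selection described above, assuming the documents are already segmented and stripped of stop words. The function names and the parameters `min_df` and `max_df` are illustrative; the paper does not give its threshold values, and treating the boundaries as inclusive is an assumption.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word t, the number of documents that contain t."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts at most once per word
    return df


def select_features(docs, min_df, max_df):
    """Keep only words whose DF lies in [min_df, max_df] (assumed inclusive)."""
    df = document_frequency(docs)
    keep = {t for t, n in df.items() if min_df <= n <= max_df}
    return [[t for t in tokens if t in keep] for tokens in docs]


docs = [
    ["fever", "cough", "patient"],
    ["fever", "patient", "rash"],
    ["patient", "headache", "fever"],
    ["cough", "patient"],
]
filtered = select_features(docs, min_df=2, max_df=3)
```

In this toy corpus "patient" occurs in every document and is dropped as too common, while "rash" and "headache" occur once each and are dropped as rare.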
2.3 Conversion of language space between Chinese and Lao
This paper uses WordNet to convert the Chinese text and the Lao text into a middle
layer language and then computes the text similarity between Chinese and Lao in the
middle layer. The conversion model is shown in Figure 1.
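As a rough illustration of the idea, the sketch below maps the feature words of each language to shared synset_id values and compares the two resulting bags. Several things here are assumptions and not taken from the paper: the toy lexicons `ZH_SYNSETS` and `LO_SYNSETS` and their identifiers are invented, unknown words are simply skipped, and cosine similarity is a stand-in measure, since this section does not specify how the similarity in the middle layer is computed.

```python
import math
from collections import Counter

# Invented toy lexicons: word -> synset_id. A real system would look these
# up in the Chinese WordNet and the Lao semantic dictionary.
ZH_SYNSETS = {"发烧": "syn-001", "咳嗽": "syn-002", "头痛": "syn-003"}
LO_SYNSETS = {"ໄຂ້": "syn-001", "ໄອ": "syn-002", "ເຈັບທ້ອງ": "syn-004"}


def to_middle_layer(tokens, lexicon):
    """Convert feature words into a bag of synset_ids (the middle layer)."""
    # Assumption: words missing from the lexicon are ignored.
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine(a, b):
    """Cosine similarity between two synset_id bags (stand-in measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


zh_mid = to_middle_layer(["发烧", "咳嗽", "头痛"], ZH_SYNSETS)
lo_mid = to_middle_layer(["ໄຂ້", "ໄອ", "ເຈັບທ້ອງ"], LO_SYNSETS)
similarity = cosine(zh_mid, lo_mid)
```

Because both texts are expressed as synset_id values, the comparison itself is language independent: two of the three concepts in each toy text coincide, so the score falls between 0 and 1.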