Research on Short Text Classification Based on TextCNN
Abstract. The TextCNN model is widely used in text classification tasks. It has become a
comparatively advantageous model due to its small number of parameters, low computational
cost, and fast training speed. However, training a convolutional neural network requires a
large amount of sample data, and in many cases there is not enough data to serve as training
samples. Therefore, this paper proposes a Chinese short text classification model based on
TextCNN that uses back translation for data augmentation to compensate for the lack of
training data. The experimental results show that the proposed model achieves good performance.
1. Introduction
Classification refers to automatically labeling data. In daily life, people divide things into
categories by experience, but it is impossible to manually classify every page on the Internet
according to some set of rules. Efficient computer-based automatic classification has therefore
become an urgent need for Internet applications. A technology similar to classification is
clustering: instead of matching data to a pre-defined set of labels, clustering automatically
groups data into one or more categories through structures implicit in its relationship to other
data. Text classification is an important research direction in the fields of data mining and
machine learning.
Classification is a topic that has been studied in the field of information retrieval for many years.
On the one hand, it aims to improve the effectiveness and efficiency of search in certain
applications; on the other hand, classification is also a classic machine learning technique. In
machine learning, classification is performed over a pre-defined category system using annotated data.
Text Classification (TC), also called Text Categorization or Automatic Text Categorization,
refers to the process by which a computer maps a text to one or more predetermined topic
categories. Text classification also belongs to the field of natural language processing. In this
article, the terms Text and Document are used interchangeably.
2. Related Work
Text classification and clustering technology have a wide range of applications in intelligent
information processing services. For example, most online news portals (such as Sina, Sohu, Tencent,
etc.) publish many news articles every day. Sorting this news manually is time-consuming and
labor-intensive, whereas automatic classification or clustering of the news provides great help for news
classification and follow-up personalized recommendations. The Internet also hosts a
large amount of text data, such as webpages, papers, patents, and e-books. The classification and
clustering of the text content is an important basis for fast browsing and retrieval of this content. In
addition, many natural language analysis tasks, such as opinion mining, spam detection, etc., can also
be regarded as specific applications of text classification or clustering techniques.
With the continuous improvement of machine learning and deep learning methods, approaches
to text classification have gradually shifted from the earlier vector space model (VSM) to
combinations of machine learning and deep learning [1]. Among deep learning models, the
convolutional neural network (CNN) can identify predictive n-grams in a text. Its convolution
structure allows n-grams with similar components to share prediction behavior, even for
specific n-grams that were never seen (unregistered) during training, and each layer of a
hierarchical CNN attends to longer n-grams in the sentence, making the model more sensitive
to non-contiguous n-grams. This can have a significant impact on text classification
performance [2-3].
This article addresses the classification of news headlines. The defining feature of a news
headline is that it summarizes rich information in as concise a language as possible. According
to statistics, 95% of news headlines are no longer than 20 Chinese characters, so existing
research treats headline classification as short text classification [4]. The categories used here
are finance, real estate, stocks, education, technology, society, current affairs, sports, games,
and entertainment.
However, when the training data set is too small, the trained model performs poorly. In this
case, text data augmentation can be used to improve the training effect. Data augmentation
refers to the process of transforming (limited) training data into new data; text data
augmentation applies such transformations to text. In short, data augmentation is used to
expand the scale of the data, as sketched below.
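As a concrete illustration, the back translation described above can be sketched as follows. This is a minimal sketch, not the paper's actual pipeline: the paper does not name its translation system, so the Helsinki-NLP MarianMT checkpoints from the transformers library are an assumption, as is the example headline.

from transformers import MarianMTModel, MarianTokenizer

# Back translation: Chinese -> English -> Chinese yields a paraphrase of each
# headline, doubling the training set. The checkpoint names are assumptions;
# the paper does not specify which translation system it used.
ZH_EN = "Helsinki-NLP/opus-mt-zh-en"
EN_ZH = "Helsinki-NLP/opus-mt-en-zh"

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    return tokenizer.batch_decode(model.generate(**batch),
                                  skip_special_tokens=True)

def back_translate(texts):
    return translate(translate(texts, ZH_EN), EN_ZH)

headlines = ["央行宣布下调存款准备金率"]          # hypothetical example headline
augmented = headlines + back_translate(headlines)  # doubled training set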
3. Model structure
TextCNN applies convolution over the word-vector sequence of a sentence, which gives it a
strong ability to extract shallow text features. The model can identify predictive n-grams in the
task; when it encounters a specific n-gram that was not registered during training, its
convolution structure still allows n-grams with similar elements to share prediction behavior,
and each layer of the hierarchical CNN attends to longer n-grams in the sentence, making the
model more sensitive to non-contiguous n-grams. By adjusting the height of the convolution
kernel, TextCNN can flexibly process word-order information over windows of different sizes,
which improves the model's ability to interpret the text.
Compared with a traditional image CNN, TextCNN makes no changes to the network structure
(and is even simpler). In fact, TextCNN has only one convolutional layer and one max-pooling
layer, and the output is finally fed to a softmax layer for n-way classification, as shown in Fig. 1.
The TextCNN model mainly uses a one-dimensional convolutional layer and a max-over-time
pooling layer. A one-dimensional convolutional layer is equivalent to a two-dimensional
convolutional layer with a height of 1. The one-dimensional cross-correlation operation with
multiple input channels is analogous to its two-dimensional counterpart: on each channel, the
kernel and the corresponding input window undergo a one-dimensional cross-correlation, and
the results across channels are summed to produce the output.
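The following toy example, with sizes chosen purely for illustration, shows this multi-channel one-dimensional cross-correlation written out directly in PyTorch tensor operations.

import torch

# Multi-channel 1D cross-correlation by hand: on each channel the kernel
# slides over the input; products are summed within the window and then
# across channels. All sizes here are illustrative only.
X = torch.tensor([[0., 1., 2., 3., 4., 5., 6.],
                  [1., 2., 3., 4., 5., 6., 7.]])   # 2 input channels, width 7
K = torch.tensor([[1., 2.],
                  [3., 4.]])                        # kernel: 2 channels, width 2
out = torch.stack([(X[:, i:i + 2] * K).sum() for i in range(X.shape[1] - 1)])
print(out)   # tensor([13., 23., 33., 43., 53., 63.]), width 7 - 2 + 1 = 6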
The text matrix is composed of word vectors. The filter kernel sizes are 2, 3, and 4,
respectively. After convolution and pooling, a feature vector is obtained whose dimension
equals the number of kernel sizes multiplied by the number of kernels of each size [10].
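A minimal PyTorch sketch of this structure is given below. The kernel sizes 2, 3, and 4 and the ten output classes follow the text; the remaining hyper-parameters (embedding dimension, number of kernels per size, dropout rate) are assumptions, not values reported in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embedding -> parallel 1D convolutions (kernel sizes 2, 3, 4)
    -> max-over-time pooling -> dropout -> linear classifier."""
    def __init__(self, vocab_size, embed_dim=128, kernel_sizes=(2, 3, 4),
                 kernels_per_size=100, num_classes=10, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, kernels_per_size, k) for k in kernel_sizes)
        self.dropout = nn.Dropout(dropout)
        # feature dimension = number of kernel sizes x kernels per size
        self.fc = nn.Linear(kernels_per_size * len(kernel_sizes), num_classes)

    def forward(self, x):                      # x: (batch, seq_len) word ids
        e = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # max-over-time pooling keeps the strongest response of each kernel
        feats = [F.relu(conv(e)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(feats, dim=1)))  # class logits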
This article uses the n-gram language model to generate sub-words for the words in the text,
which helps solve the representation problem of unregistered and low-frequency words. It can
also capture the word order of words to a certain extent.
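A small sketch of this fastText-style sub-word generation is shown below; the character n-gram range and boundary markers are conventional choices, not settings taken from the paper.

def char_ngrams(word, n_min=2, n_max=3):
    """Character n-gram sub-words with boundary markers, fastText-style.
    The n-gram range is an assumed setting, not one given in the paper."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# An unregistered word still shares sub-words with words seen in training,
# so it can be assigned a usable representation.
print(char_ngrams("北京大学"))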
First, a sentence can be regarded as a sequence of words. Let the length of the sequence be $n$,
let each word be represented by a vector $x_i$, and let the dimension of each word embedding
be $k$. The sentence is then expressed as follows:
$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$ (1)

where $\oplus$ is the concatenation operator. A convolution filter $w \in \mathbb{R}^{hk}$, applied to a window of $h$ words $x_{i:i+h-1}$, produces the feature

$c_i = f(w \cdot x_{i:i+h-1} + b)$ (2)

where $b$ is a bias term and $f$ is a non-linear activation function.
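The short check below computes Eq. (2) directly for one filter; all sizes and the random inputs are chosen purely for illustration.

import torch

# Direct computation of Eq. (2): n = 5 words, k = 4-dimensional embeddings,
# window height h = 3, ReLU as the non-linearity f.
n, k, h = 5, 4, 3
x = torch.randn(n, k)      # sentence matrix x_{1:n} from Eq. (1)
w = torch.randn(h, k)      # convolution filter
b = torch.zeros(())        # bias term
c = torch.stack([torch.relu((w * x[i:i + h]).sum() + b)  # c_i = f(w·x_{i:i+h-1} + b)
                 for i in range(n - h + 1)])
print(c)                   # feature map of length n - h + 1 = 3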
In order to convert the output vector of the pooling layer into the desired prediction, a softmax
layer is added. Dropout and L2 regularization can be used to prevent overfitting.
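A minimal training-step sketch follows, building on the TextCNN class sketched earlier; the hyper-parameter values are assumptions. CrossEntropyLoss applies log-softmax internally, and the optimizer's weight_decay term implements L2 regularization.

import torch

model = TextCNN(vocab_size=50000, num_classes=10)  # class from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)    # weight_decay = L2 penalty
criterion = torch.nn.CrossEntropyLoss()            # includes the softmax step

def train_step(batch_ids, batch_labels):
    """One optimization step; dropout is active because of model.train()."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(batch_ids), batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item()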
4. Evaluation

Tab.1 Classification and labeling statistics for category ci.

                          Labeled ci      Not labeled ci
  Belongs to ci               a                 c
  Does not belong to ci       b                 d
The meanings of the symbols in the table are as follows:
1) a is the number of test-set texts correctly labeled as category ci;
2) b is the number of test-set texts incorrectly labeled as category ci;
3) c is the number of test-set texts incorrectly excluded from category ci;
4) d is the number of test-set texts correctly excluded from category ci.
The recall of the classifier on category ci is defined as:

$\mathrm{recall}_i = \frac{a}{a+c} \times 100\%$ (3)
The precision of the classifier on category ci is defined as:

$\mathrm{precision}_i = \frac{a}{a+b} \times 100\%$ (4)
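These two metrics translate directly into code; the counts in the usage line are hypothetical.

def recall(a, c):
    """Eq. (3): share of texts truly in ci that were labeled ci, in percent."""
    return a / (a + c) * 100

def precision(a, b):
    """Eq. (4): share of texts labeled ci that truly belong to ci, in percent."""
    return a / (a + b) * 100

# Hypothetical counts for one category: a = 90 true positives,
# b = 10 false positives, c = 20 false negatives.
print(recall(90, 20), precision(90, 10))   # 81.8..., 90.0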
5. Conclusion
This article implements text classification based on the convolutional neural network
TextCNN. It uses back translation for data augmentation: the data set is translated into English
and then translated back into Chinese, doubling the number of samples. This simulates the
situation of insufficient training data and shows that data augmentation can expand the training
set, allowing the model to train more effectively and achieve higher accuracy on the test set.
Acknowledgments
This work was supported by the Hunan Provincial Education Department Foundation under
grant No. 18C1311 and by the BIGC Project (Ec202007).
References
[1] Salton G, Wong A and Yang CS 1975 A vector space model for automatic indexing
Communications of the ACM 18(11) pp 613-620 (doi: 10.1145/361219.361220)
[2] Joulin A, Grave E and Bojanowski P et al 2017 Bag of tricks for efficient text classification
Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics (Valencia, Spain) pp 427-431
[3] Xueliang H, Xin L and Yuanping C 2020 A short text classification model based on a mixture
of multiple neural networks Computer System Applications 29(10) pp 9-19
[4] Xiaozheng D, Rui S, Hongyu et al 2018 News headline classification based on multiple
models J. Journal of Chinese Information Processing 32(10) p 69
[5] Xingyu L 2017 PyTorch M First edition Electronic Industry Press
[6] Dingpeng D, Yajian Z, Junhui C et al 2020 Overview of Short Text Classification
Technology Research J. Software 41(02) pp 141-144
[7] Jing W, Lang L and Deqiang W 2018 Research on Chinese short text classification based on
word2vec J. Computer System Applications 7(05) pp 211-217
[8] Jian L and Qian Y 2018 Overview of Convolutional Neural Networks J. Computer Times
317(11) pp 19-23
[9] Kim Y 2014 Convolutional neural networks for sentence classification Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (Doha, Qatar: EMNLP)
pp 1746-1751
[10] Zhijie L, Chaoyang G and Peng S 2020 Study on Short Text Classification of LSTM-TextCNN
Joint Model J. Journal of Xi'an Technological University 40(3) pp 299-304