2019 Eighth International Conference of Educational Innovation through Technology (EITT)

Deep Learning for Dropout Prediction in MOOCs

Di Sun, IDD&E, Syracuse University, Syracuse, USA, dsun02@[Link]
Yueheng Mao, School of Artificial Intelligence, Beijing Normal University, Beijing, China, soloown@[Link]
Junlei Du, Research Center of Distance Education, Beijing Normal University, Beijing, China, [Link]@[Link]
Pengfei Xu*, School of Artificial Intelligence, Beijing Normal University, Beijing, China, xupf@[Link]
Qinhua Zheng, Faculty of Education, Beijing Normal University, Beijing, China, zhengqinhua@[Link]
Hongtao Sun, Center of Info. & Network Technology, Beijing Normal University, Beijing, China, sunhtao@[Link]
Abstract—In recent years, the rapid rise of massive open online courses (MOOCs) has aroused great attention. Dropout prediction, or identifying students at risk of dropping out of a course, is an open problem for MOOC researchers and providers. This paper formulates the dropout prediction problem as predicting how much content in the whole course syllabus can be completed by the student. A dropout rate prediction model based on a recurrent neural network (RNN) with a URL embedding layer is proposed to solve this problem. The results show that the prediction accuracy of the model is significantly higher than that of traditional machine learning models.

Keywords—MOOCs, Dropout Prediction, completion, RNN

I. INTRODUCTION

Massive open online courses (MOOCs) are attracting unprecedented interest and growing rapidly around the world [1]. Enrollments are large and student-teacher ratios are high. By 2018, the number of online students had exceeded 100 million. In 2018, 20 million freshmen enrolled in at least one MOOC course through Coursera, Udacity, and edX. By the end of 2018, more than 900 universities around the world had announced the opening of 114,000 open online courses, adding about 2,000 courses [2].

MOOC courses, however, face the problem of a high number of dropouts and low completion rates. Jordan [3] reported that the average completion rate for MOOCs is around 15 percent. Many researchers are working on understanding why students fail to finish the course. Taylor [4] reported that the percentage of students who passed the mid-term exams was about 6 percent, while the percentage of students who received the certificate was under 5 percent.

Many researchers have tried to use machine learning algorithms to build dropout prediction models that identify students who are likely to drop out of the course. Such a system could help teachers make adjustments during the teaching process to improve learning. For example, by identifying underlying issues, email reminders sent to specific students who are not performing well can increase retention.

Current dropout prediction models are mostly built upon human-engineered features, such as clickstream features, that do not work well. This paper adopts a different approach that builds the prediction model on the clickstream log directly, using a temporal deep learning algorithm. This paper also adopts a new formulation of the dropout prediction problem.

The proposed model makes two contributions to dropout prediction. First, dropout prediction is formulated based on progress through the course content, instead of the presence of learning activity. Second, using a GRU-RNN model with a learning resource representation layer, instead of weekly activity features, can help us obtain a general dropout prediction model based on the resource access log directly.

The rest of this article is organized as follows. Section 2 summarizes several previous works on MOOC dropout prediction. Our proposed dropout formulation and prediction model is presented in Section 3. Section 4 presents our experimental setup and results, while the conclusion is given in Section 5.

II. RELATED WORK

Before we address the proposed model and how it operates, some background information and definitions need to be explained to provide a foundation.

Data: Dropout prediction models can be built on different types of MOOC data. Taking the edX platform as an example, the raw data comes from the following sources: JSON-formatted clickstream logs, MongoDB-based forum and wiki data, MySQL-based student status data, and XML-formatted course calendar data.

Definition of dropout: Researchers have used different definitions of dropout in their work. Some researchers define dropout based on learning behaviors only [5, 6], and some dropout definitions involve whether certification is earned [7]. It is common for dropout to be defined based on several conditions. For example, Halawa, Greene, and Mitchell [8] defined dropout as having watched less than 50 percent of the videos and having shown no learning behavior in the past month.

Features used for prediction: The most commonly used features in dropout prediction models are clickstream features, computed from the clickstream log that contains all interaction events generated from user behaviors [9-14]. These events include play/pause/rewind of lecture videos, answers to quiz questions, reading or replying to a forum post, and more. Other features, such as assignment grades [8], social network analysis [15], and biographical information (e.g., job, age) [16], have also been explored to build prediction models.

Classification methods: Most existing work uses supervised learning algorithms to build discriminative models to predict dropout [9-14]. Generative models such as the Hidden Markov Model (HMM) have also been used, and are more favorable in a dropout intervention setting [17]. Survival analysis techniques have also been applied [16]. Temporal neural network models like Recurrent Neural Networks (RNNs) have also been used, based on weekly clickstream features [14].
2166-0549/19/$31.00 ©2019 IEEE
DOI 10.1109/EITT.2019.00025
Interoperability: Veeramachaneni, Dernoncourt, Taylor, Pardos, and O'Reilly [18] proposed a standard MOOC database named MoocDB to foster feature extraction from MOOCs across different platforms. MoocDB proposed to group student behaviors into four categories: the observing mode, the submitting mode, the collaborating mode, and the feedback mode. Andres et al. [19] proposed the MOOC Replication Framework (MORF) to facilitate the replication of previously published findings across multiple datasets. MORF enables larger-scale analysis of MOOC research questions over multi-MOOC data sets.

III. PROPOSED METHOD

A. Problem Formulation

Before conducting research, the meaning of dropout in a MOOC needs to be defined. However, to the best of the authors' knowledge, there is no standard definition of "dropout" yet. In this paper, instead of using a hard rule for the dropout definition, we formulate the dropout prediction problem as predicting the percentage of course content completed in the whole course syllabus. We argue that the existing diverse definitions of dropout make it difficult to compare algorithms and to migrate models to other platforms. To identify at-risk students, a time-varying threshold could be applied to the predicted completion percentage to filter out students likely to drop out in the near future. For example, in the first week of the course, a threshold of 0.2 could be used, while in the middle of the course, a threshold of 0.7 could be used. However, this application-level usage of the proposed model is not in the scope of our work.

B. GRU-RNN

The proposed dropout prediction model is based on a Recurrent Neural Network (RNN) with GRU units. GRU is a gating mechanism in RNNs, and GRU-RNNs are a well-known variant of traditional RNNs. Compared with Long Short-Term Memory (LSTM), another well-known variant of RNN, GRU has fewer parameters, and thus less computation is required. GRUs have been shown to exhibit better performance on certain tasks.

• The reset gate is defined as:

    r_t = σ(W_r · [h_{t-1}, e_t])

where σ(.) is the sigmoid function, W_r are parameters, h_{t-1} is the hidden state at time t-1, and e_t is the input vector to the GRU layer at time t.

• The update gate is defined as:

    z_t = σ(W_z · [h_{t-1}, e_t])

where W_z are parameters.

• The temporary hidden state is defined as:

    h̃_t = tanh(W_h · [r_t ∗ h_{t-1}, e_t])

where tanh(.) is the activation function and W_h are parameters.

• The hidden state is defined as:

    h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ h̃_t

• The output y_t is:

    y_t = σ(W_o · h_t)

where W_o are parameters and h_t is the hidden state at time t.

Fig. 1. Proposed GRU-RNN structure with an embedding layer.

C. Embedding Layer in the Proposed GRU-RNN Model

The embedding layer idea is borrowed from the word embedding used in natural language processing [12]. In this paper, "URL-Embedding" is performed in the embedding layer of the proposed GRU-RNN model, to help model students' course completion based on their action sequences.

Each URL's embedding vector is updated during the training phase of the overall GRU-RNN architecture in Figure 1. Before the training phase, each URL k is initialized as a two-dimensional vector (t_k, p_k), which represents the average viewing time of the URL and the average position of the URL in the learning process, respectively. They are obtained using the method depicted below.

For each appearance of URL k in student i's sessions, the time interval between its timestamp and the next timestamp in the same session is recorded as one viewing time of URL k. The last URL of a session is not recorded, as it has no next timestamp and its duration is difficult to estimate accurately. The viewing times of URL k are summed for each student to get student i's total viewing time of URL k, which is denoted as T_ik.

Then, T_ik is averaged over all M students to get the average viewing time of each URL, as follows:

    t_k = (1/M) Σ_i T_ik

where t_k represents the average viewing time of URL k, and T_ik represents the total time that student i spends at URL k.

Similarly, for each appearance of URL k in student i's sessions, the normalized position of this appearance in student i's whole learning process, which is concatenated from all sessions of student i, is recorded as one viewing position of the URL. All viewing positions are averaged for each student to get student i's average viewing position of URL k, which is denoted as P_ik.
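To make the two components above concrete, the following is a minimal NumPy sketch of the GRU recurrence from Section III-B and the (t_k, p_k) URL-vector initialization from Section III-C. The event-tuple layout (student_id, session_id, timestamp_seconds, url) and the treatment of students who never visit a URL are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, e_t, W_r, W_z, W_h):
    """One GRU step, following the equations in Section III-B."""
    x = np.concatenate([h_prev, e_t])
    r = sigmoid(W_r @ x)                                         # reset gate
    z = sigmoid(W_z @ x)                                         # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, e_t]))   # temporary state
    return (1 - z) * h_prev + z * h_tilde                        # new hidden state

def init_url_vectors(events):
    """Initialize each URL as (t_k, p_k): average viewing time and position.

    `events` is a list of (student_id, session_id, timestamp_seconds, url);
    this field layout is an assumption for illustration.
    """
    by_student = defaultdict(list)
    for sid, sess, ts, url in sorted(events, key=lambda e: (e[0], e[2])):
        by_student[sid].append((sess, ts, url))

    T = defaultdict(float)   # (student, url) -> total viewing time T_ik
    P = defaultdict(list)    # (student, url) -> normalized viewing positions
    for sid, seq in by_student.items():
        n = len(seq)
        for idx, (sess, ts, url) in enumerate(seq):
            P[(sid, url)].append(idx / max(n - 1, 1))
            # Viewing time = gap to the next event in the same session;
            # the last URL of a session is skipped, as in the paper.
            if idx + 1 < n and seq[idx + 1][0] == sess:
                T[(sid, url)] += seq[idx + 1][1] - ts

    # Average T_ik and P_ik over all M students (averaging over all students,
    # including non-visitors, is one possible reading of the description).
    students = list(by_student)
    M = len(students)
    vectors = {}
    for url in {u for (_, u) in P}:
        t_k = sum(T[(s, url)] for s in students) / M
        p_k = sum(np.mean(P[(s, url)]) for s in students if P[(s, url)]) / M
        vectors[url] = (t_k, p_k)
    return vectors
```

In the full model, these two-dimensional (t_k, p_k) vectors would only serve as the starting point: the embedding layer's weights are then updated jointly with the GRU parameters during training.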
Then, P_ik is averaged over all M students to get the average viewing position of each URL, as follows:

    p_k = (1/M) Σ_i P_ik

where p_k represents the average viewing position of URL k, and P_ik represents the average viewing position of student i on URL k.

IV. EXPERIMENTS AND RESULTS

A. Evaluation Criteria

As mentioned earlier, dropout prediction is now formulated as a regression problem that predicts students' endpoints in the course syllabus. Thus, we choose R-squared (R2) as the evaluation criterion throughout our experiments. It provides a measure of how well future outcomes are likely to be predicted by the model.

B. Baseline Models

The space of machine learning models is large, and we only train and tune three models: Random Forest, Gradient Boosting Decision Tree (GBDT), and XGBoost. They are used as baseline models in our experiments. Twenty-three manually designed features are used for the baseline models. Table 1 provides an incomplete list of the features. It should be noted that the proposed models do not require feature engineering, and thus are easier to implement and migrate to other platforms.

TABLE I. INCOMPLETE LIST OF FEATURES USED IN BASELINE MODELS

Features              Description
online_time           Total online time
session_time_avg      Average session time
interval_time_avg     Average interval between two logins
interval_time_total   Total interval between two logins
video_time            Duration of watching the video
exercises_cnt         Number of times learners do exercises
forums_cnt            Number of forums visited by learners
access_count          Access total count
last_login            Last login time
average_respond_time  Average respond time

C. Experimental Dataset and Settings

As explained, the proposed model works directly on the resource access log. A JSON-formatted student resource access log provided by the XuetangX platform is used to evaluate the proposed model. Each line in the log contains a JSON record of an event generated from the browser or server. To train and test the proposed RNN model, only four fields <student-id, session-id, timestamp, URL> need to be extracted from the log (as shown in Table 2). The dataset has 12847 students in total. The students are divided into two sets: a training set containing 10278 students and a test set containing 2569 students.

TABLE II. EXEMPLAR DATA OF FIELDS EXTRACTED

Student_Id  Session                           Date            URL
9291778     1e2ffffffdf4da8e028fedddaadf50fb  2018/6/5 23:55  /analytic_track/t
9291778     1e2ffffffdf4da8e028fedddaadf51fb  2018/6/5 23:55  /analytic_track/p
9291778     1e2ffffffdf4da8e028fedddaadf52fb  2018/6/5 23:55  pause video
9291778     1e2ffffffdf4da8e028fedddaadf53fb  2018/6/5 23:55  /courses/course-v1/courseware/714db39f86d44d6fa25203f5cb5dac66/y2ag241f13ta6dr124b3er51efdf0f8/
9291778     1e2ffffffdf4da8e028fedddaadf54fb  2018/6/5 23:55  page close

The completion percentage of a student is defined as

    y_i = n_i / N

where y_i is the completion percentage of student i, n_i is the number of sections completed by student i, and N is the total number of sections in the whole course.

Figure 2 depicts the distribution of completion percentages in the dataset from XuetangX. Students' endpoint distribution in the syllabus has two peaks: 0.1 and 1.0. About 1/3 of the students drop out at a completion ratio of 0.1, and about 1/10 of the students have a completion ratio of 1.0, which means they have finished the whole course.

Fig. 2. Completion ratio of the dataset from the XuetangX platform.

D. Model Performance Comparison

One key parameter in the proposed RNN model is max-length, the maximum length of the input sequence. In our experiments, max-lengths of 500 and 1000 are tested. As shown in Figure 3, max-length of 1000 gave better results than max-length of 500, at the cost of requiring more computation time. In the rest of this paper, these two models are denoted as GRU-AS-500 and GRU-AS-1000, respectively.

The other key parameter is w, which denotes the week in which the prediction is made. If w=1, only the access sequence in the first week is fed into the RNN models to make the endpoint prediction.
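The two sequence parameters, max-length and w, amount to a simple preprocessing step on each student's access sequence. The sketch below assumes events are (timestamp_seconds, url_index) pairs and that index 0 is reserved for padding; neither detail is specified in the paper.

```python
# Sketch: prepare one student's access sequence for prediction at week w
# with a fixed input length (the paper's max-length parameter).
# The (timestamp_seconds, url_index) event layout and pad index 0 are
# illustrative assumptions, not the authors' exact preprocessing.

SECONDS_PER_WEEK = 7 * 24 * 3600

def prepare_sequence(events, course_start, w, max_length, pad_index=0):
    """Keep events from the first w weeks, then pad/truncate to max_length."""
    cutoff = course_start + w * SECONDS_PER_WEEK
    seq = [url for ts, url in events if ts < cutoff]
    seq = seq[:max_length]                        # truncate long sequences
    seq += [pad_index] * (max_length - len(seq))  # right-pad short ones
    return seq
```

With w=1 and max_length=500, this would yield the input for a GRU-AS-500-style model; larger w feeds more of the sequence into the network.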
As shown in Figure 3, with increasing w, prediction accuracy increases steadily in both the GRU-AS-500 and GRU-AS-1000 models.

Features for the baseline models are extracted from the full sequences; thus, the baseline models are shown around w=12 in Figure 3. As we can see, the proposed GRU-AS-500 and GRU-AS-1000 outperform the baseline models by a large margin, even in the early weeks.

Fig. 3. The influence of max-length on the experimental results.

V. CONCLUSIONS

MOOC dropout prediction is an interesting research question. This paper formulated dropout prediction as predicting the completion percentage, in terms of students' endpoints, in the whole course syllabus. An RNN model with a learning resource representation layer is proposed to address this problem. Compared with state-of-the-art feature-based models, the proposed model makes better predictions, even in the early weeks.

REFERENCES

[1] D. Shah. (2018). "By the numbers: MOOCs in 2018," retrieved from [Link]
[2] J. Bailey, R. Zhang, B. Rubinstein, and J. He, "Identifying at-risk students in massive open online courses," Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 1749-1755.
[3] K. Jordan, "Initial trends in enrolment and completion of massive open online courses," The International Review of Research in Open and Distributed Learning, vol. 15, no. 1, pp. 133-160, 2014.
[4] C. Taylor. (2014). "Stopout prediction in massive open online courses," retrieved from [Link]DesignOpt/groupWebSite/uploads/Site/[Link]
[5] G. Balakrishnan. (2013). "Predicting student retention in massive open online courses using hidden Markov models," retrieved from [Link]
[6] K. Veeramachaneni, U.-M. O'Reilly, and C. Taylor. (2014). "Towards feature engineering at scale for data from massive open online courses," retrieved from [Link]
[7] J. Whitehill, J. Williams, G. Lopez, C. Coleman, and J. Reich. (2015). "Beyond prediction: First steps toward automatic intervention in MOOC student stopout," retrieved from [Link]er_112.pdf
[8] S. Halawa, D. Greene, and J. Mitchell. (2014). "Dropout prediction in MOOCs using learner activity features," retrieved from [Link]epth_37_1%20(1).pdf
[9] B. Amnueypornsakul, S. Bhat, and P. Chinprutthiwong, "Predicting attrition along the way: The UIUC model," Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs, 2014, pp. 55-59.
[10] M. Kloft, F. Stiehler, Z. Zheng, and N. Pinkwart, "Predicting MOOC dropout over weeks using machine learning methods," Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs, 2014, pp. 60-65.
[11] S. Crossley, "Combining click-stream data with NLP tools to better understand MOOC completion," Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK'16), 2016, pp. 6-14.
[12] S. Nagrecha, J. Z. Dillon, and N. V. Chawla, "MOOC dropout prediction: Lessons learned from making pipelines interpretable," Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 351-359.
[13] J. Liang, C. Li, and L. Zheng, "Machine learning application in MOOCs: Dropout prediction," Proceedings of the 2016 11th International Conference on Computer Science & Education (ICCSE), 2016, pp. 52-57.
[14] M. Fei and D.-Y. Yeung, "Temporal models for predicting student dropout in massive open online courses," IEEE, 2015, pp. 256-263.
[15] S. Jiang, A. Williams, K. Schenke, M. Warschauer, and D. O'Dowd. (2014). "Predicting MOOC performance with week 1 behavior," retrieved from [Link]short%20papers/273_EDM-[Link]
[16] G. Allione and R. M. Stein, "Mass attrition: An analysis of drop out from principles of microeconomics MOOC," The Journal of Economic Education, vol. 47, no. 2, pp. 174-186, 2016.
[17] G. Balakrishnan and D. Coetzee. (2013). "Predicting student retention in massive open online courses using hidden Markov models," retrieved from [Link][Link]
[18] K. Veeramachaneni, F. Dernoncourt, C. Taylor, and Z. Pardos. (2013). "MoocDB: Developing data standards for MOOC data science," retrieved from [Link]DesignOpt/groupWebSite/uploads/Site/[Link]
[19] J. M. L. Andres, R. S. Baker, D. Gasevic, G. Siemens, S. A. Crossley, and S. Joksimovic, "Studying MOOC completion at scale using the MOOC replication framework," Proceedings of the 8th International Conference on Learning Analytics and Knowledge, 2018, pp. 71-78.