Teaching materials for Data Science are found in the folders above.
Find the provisional calendar and further information at study.ruc.dk; direct link for F2025: https://study.ruc.dk/class/view/35410
The overall objective of this course is to introduce the concepts of data science and data visualization, enabling students in the experimental sciences to design, perform, visualize, evaluate, interpret and communicate experiments in which many parameters are measured, including so-called big data (‘omics’) experiments.
Furthermore, the aim is to provide students with the necessary methodological and data analysis skills to be able to evaluate validity and quality of methods and data related to analysis of large datasets.
Detailed description of content\
The course consists of lectures combined with hands-on exercises, and projects where the students can work on their own data or other data from their own field.
No previous programming experience is required, but students will be expected to learn basic programming (R and RStudio) and statistical analysis during this course.
Course material and reading list\
No textbooks are needed; course material will be specified on Moodle.
This course is run in parallel with Good Practices in Experimental Sciences (GxP), and you will learn complementary tools and analysis methods from each course.
- Reading course material and problem solving at home: 38 hours
- Lectures: 8 hours
- Discussion and problem solving in class: 24 hours
- Working on mini-projects and report writing: 65 hours
- Total: 135 hours
Most classes will begin with a short lecture / introduction to a new concept followed by time for discussion and work with programming exercises.
Students will then work in pairs to analyze a new data set based on concepts covered in the introduction. Selected groups will then present their data analysis to the class.
The students will write a report for each data set (usually a PowerPoint presentation or R Markdown document), with emphasis on explaining the analyses used, the implications of the results, and the visualization of the data.
These reports will be turned in at the end of the class, or before the next class, with the names of the group members along with code used for the analysis and visualization.
The topic of the mini-project is to present a visual and statistical analysis of an approved dataset. You will use what you learned during the semester to create an R script that walks step by step through an analysis, and you will present this analysis and script at the last class.
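As a rough illustration of what such a step-by-step script could look like, the sketch below uses `PlantGrowth`, a small dataset that ships with R, as a stand-in for an approved dataset. The dataset, the choice of a one-way ANOVA, and all object names are assumptions made for illustration only; your own mini-project will need analyses suited to your data.

```r
# Step 1: load the data (PlantGrowth is built into R; a stand-in for your own dataset)
data(PlantGrowth)
plant_data <- PlantGrowth

# Step 2: inspect the structure and compute the mean weight per group
str(plant_data)
group_means <- aggregate(weight ~ group, data = plant_data, FUN = mean)
print(group_means)

# Step 3: fit a one-way ANOVA testing whether mean weight differs between groups
anova_fit <- aov(weight ~ group, data = plant_data)
anova_table <- summary(anova_fit)
print(anova_table)

# Step 4: extract the p-value for the group effect from the ANOVA table
p_value <- anova_table[[1]][["Pr(>F)"]][1]

# Step 5: visualize the data (this step only runs if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  growth_plot <- ggplot(plant_data, aes(x = group, y = weight, fill = group)) +
    geom_boxplot() +
    labs(x = "Treatment group", y = "Dried plant weight",
         title = "PlantGrowth: weight by group") +
    theme_minimal()
  print(growth_plot)
}
```

Keeping each step in its own commented block makes the script easier to present and lets a reader rerun a single step without reading the whole file.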
| Date and Time | Location | Notes |
|---|---|---|
| 03/02-2025, 14:15 - 16:00 | 07.1-008 | Course introduction, GitHub, setting up R, the very basics of R |
| 05/02-2025, 12:15 - 16:00 | 11.2-047 | Data arrangement, formatting and ggplot, color and themes |
| 10/02-2025, 14:15 - 16:00 | 22.1-009 | ChatGPT and R, ANOVA |
| 12/02-2025, 12:15 - 16:00 | 22.1-009 | Tables in R, data transformations |
| 19/02-2025, 12:15 - 16:00 | 22.1-009 | Block design, multifactorial design |
| 05/03-2025, 12:15 - 16:00 | 22.1-009 | Repeated measurements and mixed-effects models, correlation and regression |
| 12/03-2025, 12:15 - 16:00 | 22.1-009 | Logistic regression, proportions and enrichment, intro to the mini-project |
| 19/03-2025, 12:15 - 16:00 | 22.1-009 | t-tests and OR/HR/RR, time for mini-project work |
| 26/03-2025, 12:15 - 16:00 | 22.1-009 | Mini-project presentations, good vs bad data visualizations |
| 30/06-2025, 10:00 | | Hand-in of written products (re-examination only) |
After completing the course, the students will be able to:
- describe and explain the concepts of multivariable data processing and visualization
- handle multivariable data using relevant software such as R or other statistical software
- identify and extract relevant parameters from large data sets
- implement appropriate descriptive statistics on high-complexity and big data
- describe and analyze the intrinsic structure of a large multivariable dataset using relevant methods, such as clustering methods, principal component analysis (PCA) or partial least-squares analyses (PLS)
- analyze multivariate data using basic linear models with covariate adjustments, and interpret and discuss these results
- describe simple machine learning algorithms and explain their differences with regard to purpose of use, strengths and weaknesses, as well as use selected machine learning algorithms for tasks such as selecting the variable with the best predictive power, and interpret the results from these
- explain the results from these methods to both lay people and specialists
- be aware of the limitations of the chosen tests
- visualize the results in an informative and rigorous way
- design complex experiments, including ‘omics’ experiments, based on the methodological considerations of the ensuing data analysis
- write documents describing methodological considerations regarding the analysis of big (‘omics’) data
- communicate the knowledge and understanding gained from the course in a precise and scientific way
The course is passed through active, regular attendance and satisfactory participation.
Active participation is defined as: The student must participate in course related activities (e.g. workshops, seminars, field excursions, process study groups, working conferences, supervision groups, feedback sessions).
Regular attendance is defined as:
- The student must be present for a minimum of 75 percent of the lessons.
Satisfactory participation is defined as:
- e.g. oral presentations (individually or in a group), peer reviews, mini-projects, tests, planning of a course session.
Assessment: Pass/Fail.
Form of re-examination\
Students who have not participated satisfactorily must hand in renewed written products.
Students who have attended between 50% and 75% of the lessons must hand in an additional written product.
Active participation is defined as:
- The student must participate actively in lectures, discussion and problem solving classes. Students may be selected to present their report to the class at the end of a lecture. Active participation means students must present if they are called upon.
Regular attendance is defined as:
- The student must be present for a minimum of 75 percent of the lessons.
Satisfactory participation is defined as:
- The student must write and submit reports (usually a PowerPoint presentation or R Markdown document) following every class.
Assessment criteria for satisfactory participation: students will be assessed on their ability to:
- Explain the analyses used
- Account for how the choice of analysis affects the results, the visualization of the data, and the programming code for analysis and visualization
- Communicate the knowledge and understanding gained from the lesson in a precise way within the submitted reports
There will be different options for data in the mini-projects.
- The website Kaggle is a great source for publicly available data, as well as competitions. https://www.kaggle.com/datasets
- GEUS's nationwide drilling database for groundwater, drinking water, raw material, environmental and geotechnical data https://data.geus.dk/geusmap/
- Michigan Fish Contaminant Monitoring Program Sampling Sites and Select Results https://gis-michigan.opendata.arcgis.com/maps/7303c25d46c5468bacd157ea828c4760/about
I ask that you install R and RStudio before our first class. We will try to use RStudio on UCloud; however, there may be technical problems with UCloud, in which case we will fall back on local installations of RStudio for the code walkthrough.
https://posit.co/download/rstudio-desktop/
1: Install R. RStudio requires R 3.3.0+. Choose a version of R that matches your computer’s operating system.
2: Install RStudio. RStudio requires a 64-bit operating system.
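Once R is installed, one way to check the two requirements above from the R console is sketched below. The variable names are illustrative; `getRversion()` and `.Machine$sizeof.pointer` are part of base R. Note the assumption in the second check: it inspects the R build rather than the operating system itself, so a 64-bit result implies a 64-bit operating system, but a 32-bit R build can also run on a 64-bit system.

```r
# Compare the installed R version with the minimum that RStudio requires (3.3.0)
r_version_ok <- getRversion() >= "3.3.0"

# A pointer size of 8 bytes indicates a 64-bit build of R
is_64_bit <- .Machine$sizeof.pointer == 8

cat("R version:", as.character(getRversion()), "\n")
cat("Meets the 3.3.0 minimum:", r_version_ok, "\n")
cat("64-bit build of R:", is_64_bit, "\n")
```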
https://docs.cloud.sdu.dk/
UCloud is designed to be a user-friendly high-performance computing (HPC) platform with a graphical user interface.
A UCloud walkthrough PowerPoint can be found on Moodle.
- R for Data Science https://r4ds.hadley.nz/
- Reproducible Research in R https://r-cubed.rostools.org/
- R Programming for Data Science https://bookdown.org/rdpeng/rprogdatascience/
- Chat with your data using AI http://rtutor.ai/
GitHub is a useful tool for data science, as it can house code, share projects, and enable collaboration. We will walk through creating a GitHub account, and you are highly encouraged to upload your code and data from class exercises.
You should include your GitHub page on your resume/CV to show employers that you are skilled in a programming language and have experience analyzing and visualizing data. This will give you an advantage during your job hunt.
If you continue in research, you can always revisit your GitHub and reuse past code in future research projects.