CSM6720 Assignment

David Hunter

Due: 13:00 Friday, 19th April 2024

1 Assignment
Your work must be entirely your own and any material coming from
external sources must be properly referenced. University regulations
concerning plagiarism, collusion and the use of AI tools apply here.
You are required to perform an exploratory analysis and visualisation of a
dataset. You should use the process of data exploration taught in class and in
Berthold et al. [1]. You should document your process as part of your report.
You will be required to create queries on the dataset and to visualise the
results. You must write your queries in the MongoDB query language; no other
database, query language, programming language or method will be acceptable
for the queries themselves. It is recommended that you use MongoDB Compass
to create and execute your queries, but you may use other MongoDB tools or
Python (using the pymongo package) if you wish.
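For example, a minimal pymongo sketch of connecting to a database and running
a simple find is shown below; the URI, database and collection names are
placeholders, not the ones used in the module.

from pymongo import MongoClient

# Placeholder connection details; substitute your own URI and names.
client = MongoClient('mongodb://localhost:27017')
collection = client['your_database']['your_collection']

# A simple find written in the MongoDB query language, executed through pymongo.
for doc in collection.find({}).limit(5):
    print(doc)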
You may create your visualisations in any tool you wish as long as you can
save the visualisation as an image file (e.g. png). You may use Python scripts
to create the visualisation.

2 The Dataset
Aberystwyth Shipping Records.
The Aberystwyth shipping records are a collection of Excel spreadsheets containing
information about the ships and sailors who visited Aberystwyth in the
19th Century. You have already been provided with the dataset in JSON form
and uploaded it to your database in your practical sessions. The data can be
found on Blackboard in the ‘Unit 3’ folder of ‘Learning Materials’ and is called
abership.json.
The original dataset can be found at: https://github.com/LlGC-NLW/
shippingrecords/releases. You should use the dataset provided in the
abership.json file.

The Customer
The customer is a made-up organisation called Cymdeithas Cadwraeth Leol / Local
History Society. They are interested in the history of ships and sailors who
have visited the port of Aberystwyth.

We are interested in the lives and experiences of sailors using the
port of Aberystwyth. We believe that the hand-written records of
ships kept by the harbour master would be very helpful in illuminating
these past experiences.
We are interested in both who has been visiting and from where.
Therefore, we are asking for two exploratory analyses of the data:
one to retrieve individual stories and one to look at experiences in
the round.
Individual Stories. We want to be able to trace the lives of individual
sailors as they progress through their lives and careers. Do they serve
on the same ship all their working lives or do they change regularly?
Are they promoted, or do they remain at the same grade? We need
you to be able to retrieve all records about an individual sailor and
present them as a narrative.
Who is visiting. Ships and sailors arrive in Aberystwyth from a wide
variety of places. We need you to collate these places. Are some more
popular as sources and destinations than others? Do ships ply the
same route time and time again or do they follow different routes?

3 The Deliverables
The data is fairly messy and contains many missing, poorly formatted and
erroneous elements. Dealing with this messy data forms a substantial part of
the assignment and of your report. You should apply the data exploration and
visualisation techniques taught in seminars.

Part 1. Individual Stories


1. Extract all records about an individual sailor. Your query should be able
to identify a single sailor from the dataset and return all documents related
to that sailor. You should consider how a sailor can be uniquely identified
and what issues you encounter. Document your query and your decision making
in the report. (An illustrative pymongo sketch is given after this list.)
2. Visualise the promotion history of two of the previously identified individual
sailors who were born in Aberystwyth. The choice of visualisation is
yours, but a timeline of some sort would seem appropriate.
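As a starting point only (not a model answer), the sketch below unwinds the
embedded mariner records and keeps those matching a name. The 'mariners' and
'mariners.name' field names mirror the sample query in Section 8.1; the name
value is purely illustrative, and matching on a raw name string alone is
unlikely to identify a sailor uniquely, which is exactly the issue you are
asked to discuss.

from pymongo import MongoClient

# Placeholder connection details; substitute your own URI and names.
client = MongoClient('mongodb://localhost:27017')
collection = client['your_database']['your_collection']

# Unwind the embedded mariner records, then keep those whose name matches.
# 'John Jones' is purely illustrative; a name alone rarely identifies a sailor uniquely.
pipeline = [
    {'$unwind': {'path': '$mariners'}},
    {'$match': {'mariners.name': 'John Jones'}},
]
for record in collection.aggregate(pipeline):
    print(record)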

Part 2. Who is visiting
For each ship, extract the ports the ship has visited and create the following
visualisations:
1. Visualise the proportion of individual records of sailors that were born
in Aberystwyth at each capacity (a single individual may have multiple
capacities over their career).
2. Which ports are sailors joining the ships from? Visualise the number of
sailors sailing from each port (a counting sketch follows this list).
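As with Part 1, the sketch below is illustrative only. The field name
'mariners.place_of_joining' is a hypothetical placeholder used to show the
shape of a counting pipeline; you will need to discover the real field(s)
during your data exploration.

from pymongo import MongoClient

# Placeholder connection details; substitute your own URI and names.
client = MongoClient('mongodb://localhost:27017')
collection = client['your_database']['your_collection']

# 'mariners.place_of_joining' is a hypothetical field name; find the actual field
# during your data exploration before relying on a query like this.
pipeline = [
    {'$unwind': {'path': '$mariners'}},
    {'$group': {'_id': '$mariners.place_of_joining', 'sailors': {'$sum': 1}}},
    {'$sort': {'sailors': -1}},
]
for row in collection.aggregate(pipeline):
    print(row['_id'], row['sailors'])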

4 The Report
At Masters level your report should not only describe your work but also show an
awareness of similar projects and how your solution fits into a wider context. In this
case the context is using databases for historical research. You need to describe
other solutions to similar problems and how your work compares. Finding other
similar works is a matter of your own research. You should find research that
has been published as a peer-reviewed paper. Google Scholar is a good place to
start.

4.1 A Literature Review

An overview of the problem area and how your problem (the shipping records)
fits in. An overview of the technical issues involved (with sources).

4.2 Data Exploration


Describe your process for exploring the data. You should use the process of
data exploration taught in class and in Berthold et al. [1]. By reading your
report the reader should be able to understand the dataset, how it is arranged,
what data it contains, what issues you encountered during the data preparation
phase and how you resolved these issues.
Note. I often get questions such as: “I have found this problem with the
data, how do I solve it?” or “Is this the correct solution?” I cannot answer
this sort of question: how you handle these issues is part of the assessment.
The most appropriate solution is one where you have thought through the
consequences with respect to the information you wish to discover and convey,
i.e. the solution that offers the most reliable, undistorted view of the data.

4.3 Queries
A technical description of your queries with an explanation of the choices
involved. The MongoDB queries themselves should be exported from MongoDB
Compass in text form and included in the appendices. They do not count towards
the word limit.

4.4 Visualisation
Include an appropriate visualisation for each of the questions asked by the
customer. You should justify your choice of visualisation in the report with
respect to the concepts discussed in the visualisation lecture.

4.5 Results, Discussion and Self Reflection


An overall analysis of your project: this should summarise evidence from your
results section and place it in its context. You need to consider the technical
choices you have made and how they could impact the results.
Although it is not normally part of a paper, it is university policy that you
include a reflection section here. Reflection on your work has been shown to
improve your learning in the long term. You should consult the marking schema
and position your own work on it. You should also reflect on how you would do
it differently if you were to repeat the exercise.
Word Limit. 4000 words. Please state your word count on the cover page
of your report. You do not need to include appendices or the bibliography in
the word count.

5 The Code
The focus of this assignment is data exploration and developing appropriate
queries. Your queries should be written using the MongoDB query language
only. You may use any appropriate tool to generate the visualisations (such as
Excel or Python), but the visualisations must be included in the report. You
must also state in the report which tools you used.
There is no need to submit either the shipping record Excel files or your
database (in fact please don’t). Do not include your MongoDB database in
your report.
All queries should be submitted as text files. If you are using MongoDB
Compass you can export your queries using the methods shown in practicals.
Do not use Generative AI to develop any code for this assignment.
Generative AI includes tools like ChatGPT and Co-Pilot. You may use grammar
checking/improvement tools such as Grammarly.

6 Weighting

Literature Review 25%
Data Exploration 20%
Queries 20%
Visualisation 20%
Results, Discussion and Reflection 10%
Overall report quality 5%

Table 1: Marks breakdown

7 Submission
Your report should be submitted on Blackboard before the deadline. Your report
must be in a format that can be read by the Turnitin plagiarism detection tool.
Suitable formats include Microsoft Word files, OpenOffice .odt files and PDF files.

All visualisations should appear in the main text of the document. All MongoDB
queries or Python files should be included in text form in the appendices.
Your completed coursework must be submitted to Blackboard no later than
the deadline at the top of this assignment.

8 Appendix A – Marking Schema


The marking schema for the report can be found in the student handbook (it
is the standard schema). It is copied here for convenience.

Distinction
The submitted work is conducted to an expert level. Data exploration is conducted
to an expert level and fully documented in the report. The data is thoroughly
examined and issues are dealt with in a manner that is logical, robust, and
documented with excellent reasoning. Requested visualisations are complete,
based on demonstrably reliable queries on appropriately organised data. Issues
of reliability are appropriately addressed either through data cleaning (where
appropriate) or through documentation in the report; in either case there is an
excellent discussion of the issues in the report.

Merit
The submitted work is conducted to a high level with few mistakes. Data exploration
is conducted to a high level and well documented in the report. The data
is properly examined and issues are dealt with in a manner that is logical, robust,
and documented with good reasoning. Requested visualisations are complete,
based on reliable queries on appropriately organised data. Issues of reliability
are appropriately addressed either through data cleaning (where appropriate) or
through documentation in the report; in either case there is an appropriate
discussion of the issues in the report.

Pass
The submitted work is largely complete with only a few errors. Data exploration
is largely complete and documented in the report. The data is examined and
issues are dealt with in a manner that is mostly logical, robust, and documented
with reasoning. Requested visualisations are largely complete, with a few
omissions or errors, and are based on queries that generally capture the
requested information. Issues of reliability are addressed to some extent
either through data cleaning (where appropriate) or through documentation in
the report; in either case there is some discussion of the issues in the report.

Fail
The submitted work fails to satisfy the criteria above.

8.1 Using Python


The assignment can be completed without using Python. All of the methods
you will need have been taught in the practical/workshop sessions. However,
you may use Python to create plots if you wish.
If you use Python you will still need to create appropriate MongoDB queries
and do the bulk of the information retrieval using MongoDB queries.
The following code uses Python to perform a MongoDB aggregation query.
As it uses MongoDB's query language, you may submit this query as part of your
assignment. You can also use the ‘result’ variable to create a visualisation.
from pymongo import MongoClient
import matplotlib.pyplot as plt

# Requires the PyMongo package.
# https://api.mongodb.com/python/current

client = MongoClient('Put the database URI here mongodb://...')

result = client['dah56']['dah56'].aggregate([
    {
        '$unwind': {
            'path': '$mariners'
        }
    }, {
        '$project': {
            'name': '$mariners.name',
            'age': '$mariners.age'
        }
    }, {
        '$match': {
            'age': {
                '$exists': True
            }
        }
    }
])

# Extract ages from the query result
ages = [record['age'] for record in result]

# Plot the histogram
plt.hist(ages, bins=10, color='skyblue', edgecolor='black')

# Customize the plot
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Ages from MongoDB Query Result')
plt.grid(True)

# Save the plot to an image file
plt.savefig('my_plot.png')

References
[1] Michael R Berthold, Christian Borgelt, Frank Höppner, and Frank Klawonn.
Guide to intelligent data analysis: how to intelligently make sense of real
data. Springer Science & Business Media, 2010.
