CSM6720 Assignment
David Hunter
1 Assignment
Your work must be entirely your own and any material coming from
external sources must be properly referenced. University regulations
concerning plagiarism, collusion and the use of AI tools apply here.
You are required to perform an exploratory analysis and visualisation of a
dataset. You should use the process of data exploration taught in class and in
Berthold et al. [1]. You should document your process as part of your report.
You will be required to create queries on the dataset and to visualise the
results. You must create queries in the MongoDB query language. No other
database, query language, programming language or method will be acceptable.
It is recommended that you use MongoDB Compass to create and execute your
queries, but you may use other MongoDB tools or Python (using the pymongo
package) if you wish.
You may create your visualisations in any tool you wish as long as you can
save the visualisation as an image file (e.g. png). You may use Python scripts
to create the visualisation.
2 The Dataset
Aberystwyth Shipping Records.
The Aberystwyth shipping records are a collection of Excel spreadsheets
containing information about the ships and sailors who visited Aberystwyth in
the 19th century. You have already been provided with the dataset in JSON form
and uploaded it to your database in your practical sessions. The data can be
found on Blackboard in the ‘Unit 3’ folder of ‘Learning Materials’ and is called
abership.json.
The original dataset can be found at:
https://github.com/LlGC-NLW/shippingrecords/releases. You should use the
dataset provided in the abership.json file.
The Customer
The customer is a made-up organisation called Cymdeithas Cadwraeth Leol/Local
History Society. They are interested in the history of ships and sailors who
have visited the port of Aberystwyth.
3 The deliverables
The data is fairly messy and contains many missing, poorly formatted and
erroneous elements. Dealing with these data quality problems forms a
substantial part of the assignment and the report. You should apply the data
exploration and visualisation techniques taught in seminars.
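One possible first step in exploring a messy field is to count which BSON types it actually holds. The sketch below is illustrative only: the field name `age` is an assumption, and it assumes the field sits at the top level of each document, which may not match the real structure of abership.json.

```python
def field_type_profile_pipeline(field):
    """Build an aggregation pipeline that counts the BSON types stored in `field`.

    `field` is a hypothetical top-level field name; adapt it to the real schema.
    """
    return [
        # $type yields names such as "string", "int" or "null",
        # and "missing" when the field is absent from a document.
        {"$group": {"_id": {"$type": "$" + field}, "count": {"$sum": 1}}},
        # Most common type first.
        {"$sort": {"count": -1}},
    ]


# Example: profile a hypothetical "age" field.
age_profile = field_type_profile_pipeline("age")
```

With pymongo, the resulting list can be passed to `collection.aggregate(...)`. A large "missing" or "string" count for a field that should be numeric suggests that cleaning, or at least documentation of the issue, is needed.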
Part 2. Who is visiting
For each ship extract the ports the ship has visited and create the following
visualisations:
1. Visualise the proportion of individual records of sailors that were born
in Aberystwyth at each capacity (a single individual may have multiple
capacities over their career).
2. Which ports are sailors joining the ships from? Visualise the number of
sailors sailing from each port.
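A proportion of the kind asked for in item 1 can be computed inside the database by grouping on one field and counting matches on another. This is a rough sketch under stated assumptions, not a model answer: the field names `capacity` and `place_of_birth` are hypothetical, it assumes one document per sailor record (nested records would first need `$unwind`), and it uses an exact string match, so spelling variants would have to be cleaned beforehand.

```python
def proportion_by_group_pipeline(group_field, test_field, value):
    """Build a pipeline giving, per group, the share of records where
    `test_field` equals `value`. All field names are hypothetical."""
    return [
        {"$group": {
            "_id": "$" + group_field,
            # Number of records in the group.
            "total": {"$sum": 1},
            # Number of records in the group whose test field equals `value`.
            "matching": {"$sum": {
                "$cond": [{"$eq": ["$" + test_field, value]}, 1, 0]
            }},
        }},
        {"$project": {
            "total": 1,
            "matching": 1,
            # total is at least 1 for every group, so the division is safe.
            "proportion": {"$divide": ["$matching", "$total"]},
        }},
        {"$sort": {"_id": 1}},
    ]


pipeline = proportion_by_group_pipeline("capacity", "place_of_birth", "Aberystwyth")
```

A simple count per port, as in item 2, follows the same pattern using only the `_id` and `total` parts of the `$group` stage.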
4 The Report
At Masters level your report should not only describe your work but also show
an awareness of similar projects and how your solution fits into a wider
context. In this case the context is using databases for historical research.
You need to describe other solutions to similar problems and how your work
compares. Finding similar work is a matter of your own research. You should
find research that has been published as a peer-reviewed paper. Google Scholar
is a good place to start.
4.3 Queries
A technical description of your queries with an explanation of the choices
involved. The MongoDB queries themselves should be exported from MongoDB
Compass in text form and included in the appendices. They do not count towards
the word limit.
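If the queries are written with pymongo rather than Compass, one possible way to obtain a text form for the appendix is to serialise the pipeline as JSON. The helper name below is hypothetical, the `age` field in the example is an assumption, and plain `json` only handles pipelines built from ordinary Python values (no dates or ObjectIds).

```python
import json


def pipeline_to_text(pipeline):
    """Return an aggregation pipeline as indented JSON text for an appendix."""
    return json.dumps(pipeline, indent=2, ensure_ascii=False)


# Example pipeline: keep only documents that have an "age" field.
example = [{"$match": {"age": {"$exists": True}}}]
text = pipeline_to_text(example)
```

The resulting string can be written to a text file or pasted into the report appendix.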
4.4 Visualisation
Include an appropriate visualisation for each of the questions asked by the
customer. You should justify your choice of visualisation in the report with
respect to the concepts discussed in the visualisation lecture.
5 The Code
The focus of this assignment is data exploration and developing appropriate
queries. Your queries should be written using the MongoDB query language
only. You may use any appropriate tool to generate the visualisations (such as
Excel or Python), but the visualisations must be included in the report. You
must also state in the report which tools you used.
There is no need to submit either the shipping record Excel files or your
database (in fact please don’t). Do not include your MongoDB database in
your report.
All queries should be submitted as text files. If you are using MongoDB
Compass you can export your queries using the methods shown in practicals.
Do not use Generative AI to develop any code for this assignment.
Generative AI includes tools like ChatGPT and Copilot. You may use grammar
checking/improvement tools like Grammarly.
6 Weighting
Literature Review 25%
Data Exploration 20%
Queries 20%
Visualisation 20%
Results, Discussion and Reflection 10%
Overall report quality 5%
7 Submission
Your report should be submitted on Blackboard before the deadline. Your report
must be in a format that can be read by the Turnitin plagiarism detection tool.
Suitable formats include Microsoft Word files, OpenOffice .odt files and PDF files.
All visualisations should appear in the main text of the document. All MongoDB
queries or Python files should be included in text form in the appendices.
Your completed coursework must be submitted to Blackboard no later than
the deadline at the top of this assignment.
Distinction
The submitted work is conducted to an expert level. Data exploration is
conducted to an expert level and fully documented in the report. The data is
thoroughly examined and issues dealt with in a manner that is logical, robust,
and documented with excellent reasoning. Requested visualisations are complete,
based on demonstrably reliable queries on appropriately organised data. Issues
of reliability are appropriately addressed either through data cleaning (where
appropriate) or through documentation in the report; in either case there is an
excellent discussion of the issues in the report.
Merit
The submitted work is conducted to a high level with few mistakes. Data
exploration is conducted to a high level and well documented in the report. The
data is properly examined and issues dealt with in a manner that is logical,
robust, and documented with good reasoning. Requested visualisations are
complete, based on reliable queries on appropriately organised data. Issues of
reliability are appropriately addressed either through data cleaning (where
appropriate) or through documentation in the report; in either case there is an
appropriate discussion of the issues in the report.
Pass
The submitted work is largely complete with only a few errors. Data exploration
is largely complete and documented in the report. The data is examined and
issues dealt with in a manner that is mostly logical, robust, and documented
with reasoning. Requested visualisations are largely complete with a few
omissions or errors. Requested visualisations are based on queries that
generally capture the requested information. Issues of reliability are
addressed to some extent either through data cleaning (where appropriate) or
through documentation in the report; in either case there is some discussion of
the issues in the report.
Fail
The submitted work fails to satisfy the criteria above.
# Requires the PyMongo package.
# https://api.mongodb.com/python/current
                '$exists': True
            }
        }
    }
])

# Extract ages from the query result
ages = [record['age'] for record in result]

import matplotlib.pyplot as plt

# Plot a histogram of the ages (default bins)
plt.hist(ages)

# Customize the plot
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Ages from MongoDB Query Result')
plt.grid(True)

# Save the plot as an image file
plt.savefig('my_plot.png')
References
[1] Michael R Berthold, Christian Borgelt, Frank Höppner, and Frank Klawonn.
Guide to intelligent data analysis: how to intelligently make sense of real
data. Springer Science & Business Media, 2010.