Preparing Data For Sharing and Archiving
2016
Recommended Citation
Data Management Services, Zach S. Henderson Library, "Preparing Data for Sharing and Archiving" (2016). Data Management
Services Instructional Materials. 7.
https://digitalcommons.georgiasouthern.edu/lib-promo-dms-instr/7
This presentation is brought to you for free and open access by the Promotional and Instructional Material at Digital Commons@Georgia Southern. It
has been accepted for inclusion in Data Management Services Instructional Materials by an authorized administrator of Digital Commons@Georgia
Southern. For more information, please contact [email protected].
Preparing Data for Sharing
and Archiving
Significant portions of this presentation are adapted from the Cornell University Research Data
Management Services Group website under a Creative Commons Attribution 4.0 International License,
and from ICPSR’s Guide to Social Science Data Preparation and Archiving: Best Practice Throughout
the Data Life Cycle (2012, 5th ed.). Ann Arbor, MI.
http://georgiasouthern.libguides.com/data
Agenda
• Why share and archive data?
• What should I share and archive?
• Data collection, file creation, and management
• Metadata creation
• Protecting subjects
• Copyright and re-use licensing
Why share and archive data?
• Many research funders now require PIs to maximize open
public access to data products.
• Many publishers now require open access to replication data
as a condition of publication.
• It benefits you, your collaborators, and
your research community.
What should I share and archive?
Content + Metadata
• Packaging Information
Explains how the data is organized
• Representation Information
Makes the data understandable; renders the
bit-level content into something meaningful.
• Preservation Information
Provides information to support long-term preservation and use
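As a rough illustration of the three kinds of metadata above, the sketch below records them in a small JSON "sidecar" file that travels with the data. The file names, variable names, and field names are hypothetical and do not follow any particular metadata standard; a repository or discipline may prescribe its own schema.

```python
import json

# Hypothetical metadata record; every name and value here is illustrative.
metadata = {
    # Packaging information: how the data is organized.
    "packaging": {
        "files": ["survey_responses.csv", "codebook.txt"],
        "layout": "one row per respondent",
    },
    # Representation information: what is needed to interpret the bits.
    "representation": {
        "format": "CSV",
        "encoding": "UTF-8",
        "variables": {
            "age": "age in years at time of interview",
            "income": "annual household income in US dollars",
        },
    },
    # Preservation information: what supports long-term preservation and use.
    "preservation": {
        "checksum_algorithm": "SHA-256",
        "license": "CC BY 4.0",
    },
}

# Serialize the record so it can be saved next to the data files.
sidecar_text = json.dumps(metadata, indent=2, sort_keys=True)
print(sidecar_text)
```

Keeping the record in a plain-text format such as JSON means it can be read without special software.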
Data collection, file creation, and management
• Data and file structure
What is the data file going to look like and how will it be organized? What file types
will be used?
• Naming conventions
How will files and variables be named? What conventions will be used?
• Data integrity
How will the data be captured and checked for accuracy and integrity? How will
versions be checked?
• Dataset documentation
When and how will documentation be produced? What standards should be
used/applied to make it usable by others?
• Variable construction
What variables will be constructed following data collection? According to what
standards and how will they be documented?
• Project documentation
How will decisions be documented over the course of the research (e.g., field
procedures, coding decisions, variable construction)?
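One common way to address the data integrity and version questions above is to record a checksum for every file and compare checksums later. The following is a minimal sketch using only the Python standard library; the function names and the choice of SHA-256 are assumptions made for illustration, not a procedure prescribed by this presentation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict, new: dict) -> list:
    """List files added, removed, or modified between two manifests."""
    names = old.keys() | new.keys()
    return sorted(n for n in names if old.get(n) != new.get(n))
```

A manifest built when the files are created can be stored with the project documentation and rebuilt before sharing; any name returned by `changed_files` points to a file that should be reviewed.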
Data collection, file creation, and management
The likelihood of long-term preservation of content and
functionality depends on the characteristics of the file formats used.
Protecting subjects
Identifiers that can expose subjects include:
• Indirect identifiers
Variables that can make unique cases identifiable when combined with
other identifiers.
• Geographic identifiers
Direct geographic identifiers may include specific addresses. Indirect
geographic identifiers may include census tracts, area codes, place of
birth or education, etc.
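To see how indirect identifiers can single out a case, one simple check is to count how many records share each combination of values. The sketch below flags records whose combination is shared by fewer than `k` records; the function name, the threshold `k`, and the sample records are all hypothetical.

```python
from collections import Counter


def risky_records(records, indirect_identifiers, k=2):
    """Return records whose combination of indirect-identifier values
    is shared by fewer than k records in the dataset."""
    def key(record):
        return tuple(record[name] for name in indirect_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Hypothetical sample data: records 1 and 2 share a combination, record 3 is unique.
sample = [
    {"id": 1, "tract": "1101", "birth_year": 1950, "occupation": "teacher"},
    {"id": 2, "tract": "1101", "birth_year": 1950, "occupation": "teacher"},
    {"id": 3, "tract": "1102", "birth_year": 1987, "occupation": "surgeon"},
]

flagged = risky_records(sample, ["tract", "birth_year", "occupation"])
print([r["id"] for r in flagged])  # prints [3]
```

Records flagged this way are candidates for the kinds of treatment applied to identifiers before release.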
Protecting subjects
Common techniques for treating identifiers to protect subjects:
• Removal: Eliminate the variable.
• Top-Coding: Restrict the upper range of a variable.
• Collapsing and/or Combining: Combine values into a summary variable.
• Sampling: Release a random sample of sufficient size to yield reasonable
inferences.
• Swapping: Match unique cases on the indirect identifier, then exchange the
values of key variables between the cases. This retains the analytic utility and
covariate structure of the dataset while protecting subject confidentiality.
• Disturbing: Add random variation or stochastic error to the variable. This
retains the statistical properties between the variable and its covariates,
while preventing the variable from being used to link records.
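Three of the techniques above (top-coding, collapsing, and disturbing) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: values are whole numbers, the ceiling, range width, and noise scale are arbitrary, and Gaussian noise is one choice of random variation among many. Real disclosure review usually needs more careful parameter choices.

```python
import random


def top_code(values, ceiling):
    """Top-coding: replace any value above the ceiling with the ceiling."""
    return [min(v, ceiling) for v in values]


def collapse(values, width):
    """Collapsing: replace each whole-number value with a range label."""
    labels = []
    for v in values:
        low = (v // width) * width
        labels.append(f"{low}-{low + width - 1}")
    return labels


def disturb(values, scale, seed=None):
    """Disturbing: add zero-mean Gaussian noise to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]


# Hypothetical sample values.
incomes = [32000, 58000, 91000, 450000]
ages = [23, 37, 41, 68]

print(top_code(incomes, 150000))  # [32000, 58000, 91000, 150000]
print(collapse(ages, 10))         # ['20-29', '30-39', '40-49', '60-69']
noisy_ages = disturb(ages, scale=2.0, seed=0)
print(noisy_ages)
```

Passing a `seed` makes the disturbed values reproducible for testing; for an actual release the seed should not be published, since it would let others remove the noise.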
Protecting subjects
Alternatives to altering the data:
• Restricted-use datasets
A dataset released only to approved researchers who agree to
abide by rules assuring that subjects’ privacy and confidentiality
are maintained. Researchers are usually given access to the
data for a limited time, at the end of which they must return
or destroy the data.
• Data Enclaves
A physical or virtual environment that allows access, but
prevents researchers from retaining any of the data.
Copyright and re-use licensing
• Unmediated factual data cannot be copyrighted because it
is not possible to copyright facts (e.g., a temperature
reading). However:
– Some data may be protected, such as photographs.
– Organized data (e.g., a database) has a thin layer of copyright
protection because of the researcher’s creative input into creating
it.
• Copyright may govern the use of databases and some kinds of
original data content, but contract law, trademarks, and
other mechanisms are required to regulate factual data.
Copyright and re-use licensing
Creative Commons (http://www.creativecommons.org/) offers a library of
standardized licenses, some of which may be used with data. Creative Commons
recommends only the following three licenses for data sharing:
• CC Zero (“CC0”)
Waive all copyright and database rights, including your right to attribution.
This license effectively places the database and data into the public domain,
and maximizes the likelihood of reuse.
• CC Attribution 4.0 International (“CC BY 4.0”)
Waive all copyright and database rights except the right to attribution. This
license balances your right to be acknowledged with encouraging reuse.
• CC Attribution-ShareAlike 4.0 International (“CC BY-SA 4.0”)
Protect your right to attribution, as well as require that any derivative work be
shared under the same licensing conditions. This license may result in
confusing “license chaining,” and discourage some reuse and citation.
The library recommends using the CC BY 4.0 license in most cases.
Copyright and re-use licensing
Data Ownership at Georgia Southern University:
• Ownership of works produced by Georgia Southern faculty,
students, and non-academic staff is governed by the University
System of Georgia's Policy on the Use of Copyrighted Works in
Education and Research and Georgia Southern
University's Intellectual Property and Technology Transfer Policy.
• The precise answer to who owns your data depends on whether
the project was created as part of sponsored research; the
employment status of the creator; whether the work was
conducted pursuant to a specific direction or assigned duty; and
whether substantial university resources were used in the creation
of the work.
• Consult with the Office of Research and Economic Development.
ICPSR Guide to Social Science Data
Preparation and Archiving
http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
Digital Commons for Data
Comprehensive hosted solution for storing, managing, securing,
and sharing data:
http://digitalcommons.bepress.com/cgi/viewcontent.cgi?article=1000&context=promotional
Archiving and Publication
• Once ready for publication, data objects are re-described,
assigned DOIs if needed, and released to the public website.
• Update versions and file types as needed.
• Host on a university, departmental, or
project-related data structure.
• Backup and archiving are automatic.
• Links are permanent.
• Link to SelectedWorks profile.
Data Management Services
@ Henderson Library
http://georgiasouthern.libguides.com/data