Mercè Crosas, Ph.D.
Chief Data Science and Technology Officer
Institute for Quantitative Social Science (IQSS)
Harvard University
@mercecrosas mercecrosas.com
Data should be Findable, Accessible,
Interoperable, Reusable (FAIR) by machines
Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Nature Scientific Data, 2016;
NIH Data Commons Principles; Joint Declaration of Data Citation Principles (Force11)
“FAIR Principles put specific emphasis on enhancing the
ability of machines to automatically find and use the data,
in addition to supporting its reuse by individuals.”
“Good data management is not a goal in itself, but rather is
the key conduit leading to knowledge discovery and
innovation, and to subsequent data and knowledge integration
and reuse by the community after the data publication
process.”
FAIR Data Principles in Brief
• To be Findable:
๏ (meta)data are assigned a globally
unique and persistent identifier
๏ data are described with rich
metadata
๏ metadata clearly and explicitly
include the identifier of the data it
describes
๏ (meta)data are registered or
indexed in a searchable resource
• To be Accessible:
๏ (meta)data are retrievable by their
identifier using a standardized
communications protocol
๏ the protocol is open, free, and
universally implementable
๏ the protocol allows for an
authentication and authorization
procedure, where necessary
๏ metadata are accessible, even
when the data are no longer
available
• To be Interoperable:
๏ (meta)data use a formal, accessible,
shared, and broadly applicable
language for knowledge
representation.
๏ (meta)data use vocabularies that
follow FAIR principles
๏ (meta)data include qualified
references to other (meta)data
• To be Reusable:
๏ (meta)data are richly described
with a plurality of accurate and
relevant attributes
๏ (meta)data are released with a
clear and accessible data usage
license
๏ (meta)data are associated with
detailed provenance
๏ (meta)data meet domain-relevant
community standards
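A few of the principles above can be checked mechanically. The sketch below is illustrative only: the `DatasetRecord` fields and the `fair_gaps` helper are hypothetical names, not part of any FAIR specification or of Dataverse, and it covers only some Findable and Reusable items.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are assumptions."""
    identifier: str = ""                      # e.g. a DOI
    metadata: dict = field(default_factory=dict)
    indexed: bool = False                     # registered in a searchable resource
    license: str = ""
    provenance: str = ""


def fair_gaps(record):
    """Return a list of the checked FAIR items that the record does not meet."""
    gaps = []
    if not record.identifier:
        gaps.append("F: no persistent identifier")
    if not record.metadata:
        gaps.append("F: no descriptive metadata")
    elif record.metadata.get("identifier") != record.identifier:
        gaps.append("F: metadata does not include the data identifier")
    if not record.indexed:
        gaps.append("F: not indexed in a searchable resource")
    if not record.license:
        gaps.append("R: no data usage license")
    if not record.provenance:
        gaps.append("R: no provenance")
    return gaps


# Made-up example record that passes every check in this sketch.
complete = DatasetRecord(
    identifier="doi:10.1234/EXAMPLE",
    metadata={"identifier": "doi:10.1234/EXAMPLE", "title": "Example"},
    indexed=True,
    license="CC0",
    provenance="collected by the example project",
)
print(fair_gaps(complete))
```

An empty list means only that these particular checks pass; Accessible and Interoperable items depend on protocols and vocabularies that a record-level check like this cannot see.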
We built Dataverse to incentivize data sharing,
with “good data management” in mind
• An open-source platform to share and archive data

• Developed at Harvard’s Institute for Quantitative Social
Science since 2006

• Gives credit and control to data authors & producers

• Builds a community to:

• define new standards and best practices

• foster new research in data sharing and reproducibility

• Has brought data publishing into the hands of data authors
21 installations around the world
Used by researchers from > 500 institutions
60,000 datasets in Harvard Dataverse repository
http://dataverse.org
Dataverse is now a widely used repository platform
Dataverse has a growing, engaged
community of developers and users
• 38 GitHub contributors
• 332 members in the community list
• 23 community calls so far, with 239 participants from 8 countries
• Annual Community Meeting, with 200 attendees
Dataverse implements FAIR Data Principles
• Data Citation with global persistent IDs:
๏ generate DOI automatically
๏ attribution to data authors and repository
๏ registration to DataCite
• Rich Metadata:
๏ citation metadata
๏ domain-specific descriptive metadata
๏ variable and file metadata (extracted automatically)
• Access and usage controls:
๏ open data as default, with CC0 waiver
๏ custom terms of use and licenses, when needed
๏ data can be restricted, but citation & metadata always publicly accessible
• APIs and standards:
๏ SWORD, OAI-PMH, native API to search and get data and metadata
๏ Dublin Core and DDI metadata standards
๏ PROV ontology standard to capture provenance of a dataset (coming soon)
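As an illustration of the OAI-PMH item above, a harvester asks a repository for metadata records with a plain HTTP GET. The sketch below only builds the request URL; the base URL and set name are placeholders (assumptions), while `verb`, `metadataPrefix`, `set`, and the `oai_dc` Dublin Core prefix come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict harvesting to one set defined by the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; a real installation publishes its own OAI-PMH base URL.
print(oai_list_records_url("https://repository.example.org/oai"))
```

Requesting a different `metadataPrefix` is how a harvester would ask for another metadata format, provided the repository advertises it.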
Dataset Landing Page
Files: data, docs, code
Data Citation
Dataset Landing Page
Metadata
Dataset Landing Page
Terms of Use
& Licenses
Dataset Landing Page
Versions
Standard file formats and automatic
metadata extraction allow data exploration
[Figure: tabular data with columns Var1, Var2, Var3, Var4; callout: geospatial variable]
TwoRavens: summary stats & analysis
WorldMap: geospatial exploration
Led by Boston University, the MOC is a collaborative effort
among BU, Harvard, UMass Amherst, MIT, and Northeastern
University, as well as the Massachusetts Green High-
Performance Computing Center (MGHPCC) and Oak Ridge
National Laboratory (ORNL)
Cloud Dataverse:
• is a collaboration
between Massachusetts
Open Cloud (MOC)
and Dataverse
• will allow replication of
data in multiple storage
locations, and access to
cloud computing
Dataverse led by Harvard’s IQSS
[Diagram: Dataverse now, and with Cloud Dataverse]
Labels: Data depositor; Publish dataset; Repository; Metadata; Data files; Data users; download; Data replication; Swift Object Store; Data + metadata; Access object in Swift + compute with Sahara/Hadoop
Cloud Access + Compute
Dataset enabled in
Cloud Dataverse
Future Dataset Landing Page: Access to Compute
http://privacytools.seas.harvard.edu http://datatags.org
DataTags:
• is part of the Harvard
University PrivacyTools
Project
• will be integrated with
Dataverse
http://dataverse.org
A datatag is a set of security features and
access requirements for file handling.
A DataTags repository is one that stores and
shares data files in accordance with
standardized and ordered levels of security and
access requirements.
DataTags Levels
Tag Type  Description           Security Features                             Access Credentials
Blue      Public                Clear storage, Clear transmit                 Open
Green     Controlled public     Clear storage, Clear transmit                 Email- or OAuth Verified Registration
Yellow    Accountable           Clear storage, Encrypted transmit             Password, Registered, Approval, Click-through DUA
Orange    More accountable      Encrypted storage, Encrypted transmit         Password, Registered, Approval, Signed DUA
Red       Fully accountable     Encrypted storage, Encrypted transmit         Two-factor authentication, Approval, Signed DUA
Crimson   Maximally restricted  Multi-encrypted storage, Encrypted transmit   Two-factor authentication, Approval, Signed DUA
DataTags and their respective policies
Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The Datatags System. Technology Science. 2015.
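Because the levels are ordered, they map naturally onto an ordered enumeration. The sketch below is one possible encoding, not the DataTags implementation: the numeric values, the `Policy` tuple, and the `strictest` helper are assumptions made for illustration, while the storage and transmit values are taken from the table above.

```python
from enum import IntEnum
from typing import NamedTuple


class DataTag(IntEnum):
    """Ordered levels; a larger value means a more restrictive tag."""
    BLUE = 1
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6


class Policy(NamedTuple):
    storage: str    # "clear", "encrypted", or "multi-encrypted"
    transmit: str   # "clear" or "encrypted"


# Security features per tag, as listed in the DataTags Levels table.
POLICIES = {
    DataTag.BLUE: Policy("clear", "clear"),
    DataTag.GREEN: Policy("clear", "clear"),
    DataTag.YELLOW: Policy("clear", "encrypted"),
    DataTag.ORANGE: Policy("encrypted", "encrypted"),
    DataTag.RED: Policy("encrypted", "encrypted"),
    DataTag.CRIMSON: Policy("multi-encrypted", "encrypted"),
}


def strictest(tags):
    """Return the most restrictive tag among several candidates."""
    return max(tags)


print(strictest([DataTag.GREEN, DataTag.ORANGE]).name)
print(POLICIES[DataTag.YELLOW])
```

Using an ordered type means "at least as restrictive as" becomes a plain comparison, which is what a repository would need when deciding how to handle mixed content.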
Requirements for a DataTags Repository
1. Supports more than one datatag
2. Each file in the repository must have one and only one datatag
๏ additional requirements cannot weaken the file security
๏ and cannot require the same or more security than a more
restrictive datatag
3. A recipient of a file from the repository must
๏ satisfy the file’s access requirements,
๏ produce sufficient credentials as requested,
๏ and agree to any terms of use required to acquire the file.
4. Provides technological guarantees for requirements 1, 2 and 3.
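Requirements 1 to 3 can be sketched as checks in a toy, in-memory repository. Everything here is hypothetical: the class, the credential names, and the example policies are assumptions for illustration, and a signed or click-through DUA is modeled simply as a terms-of-use flag.

```python
class DataTagsRepository:
    """Toy repository that checks requirements 1 to 3 in memory."""

    def __init__(self, policies):
        # Requirement 1: more than one datatag must be supported.
        if len(policies) < 2:
            raise ValueError("a DataTags repository supports more than one datatag")
        # tag -> (required credentials, whether terms of use must be accepted)
        self._policies = dict(policies)
        self._tags = {}  # file name -> its single datatag

    def deposit(self, file_name, tag):
        # Requirement 2: each file has one and only one datatag.
        if tag not in self._policies:
            raise ValueError(f"unsupported datatag: {tag}")
        if file_name in self._tags:
            raise ValueError(f"{file_name} already has a datatag")
        self._tags[file_name] = tag

    def may_release(self, file_name, presented, agreed_to_terms):
        # Requirement 3: credentials and terms of use must both be satisfied.
        required, terms_required = self._policies[self._tags[file_name]]
        has_credentials = required <= set(presented)
        has_terms = agreed_to_terms or not terms_required
        return has_credentials and has_terms


# Hypothetical policies for two of the levels.
EXAMPLE_POLICIES = {
    "blue": (frozenset(), False),
    "red": (frozenset({"two_factor", "approval"}), True),
}

repo = DataTagsRepository(EXAMPLE_POLICIES)
repo.deposit("survey.tab", "red")
print(repo.may_release("survey.tab", {"two_factor", "approval"}, agreed_to_terms=True))
```

Requirement 4 asks for technological guarantees; in a real system that would mean enforcement in storage and transport, which a sketch like this does not attempt.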
Repositories today do not have an easy,
standard way to support sensitive data
“User Uploads must be void of all identifiable information, such
that re-identification of any subjects from the amalgamation of the
information available from all of the materials (across datasets and
dataverses) uploaded under any one author and/or user should
not be possible.”
Terms of Use
We are making Dataverse a DataTags
Repository to share sensitive data
[Diagram: sensitive data workflow in Dataverse]
Labels: Data File Deposit; DataTags Automated Interview; Review Board Approval; Sensitive Data File; Direct Access; Privacy Preserving Access; PSI Differential Privacy Tool; Two-factor auth, approval, signed DUA
The DataTags automated interview …
… helps you generate a machine-readable
datatag for your data
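What "machine-readable" could mean in practice is sketched below as a JSON record. The field names and the `datatag_to_json` helper are assumptions for illustration only; the actual output format of the DataTags interview is not described here.

```python
import json


def datatag_to_json(tag, storage, transmit, credentials):
    """Serialize a datatag as JSON (hypothetical field names)."""
    record = {
        "tag": tag,
        "storage": storage,
        "transmit": transmit,
        "credentials": sorted(credentials),
    }
    return json.dumps(record, sort_keys=True)


# Values for the Red level, taken from the DataTags Levels table.
red = datatag_to_json(
    "red",
    storage="encrypted",
    transmit="encrypted",
    credentials=["two-factor authentication", "approval", "signed DUA"],
)
print(red)
```

A record in this shape can be stored next to a file and read back by software that decides how the file may be stored, transmitted, and accessed.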
Thanks!
Learn more at http://dataverse.org
@mercecrosas mercecrosas.com

Dataverse, Cloud Dataverse, and DataTags
