Data standards in bioinformatics
Programmatic Access To Biological Databases (Perl)
1 – 4 October 2012
Rafael C. Jimenez
rafael@ebi.ac.uk
Molecular interaction information
[Figure: databases (DB), their interfaces (I) and the user, drawn for the ideal case and for reality]
Molecular Biology resources
13/12/2018
Genomics Databases (non-vertebrate) (17.9%)
Protein sequence databases (12.9%)
Human Genes and Diseases (9.8%)
Structure Databases (9.7%)
Metabolic and Signaling Pathways (9.3%)
Nucleotide Sequence Databases (8.8%)
Human and other Vertebrate Genomes (7.1%)
Plant databases (7.1%)
RNA sequence databases (4.9%)
Microarray and other Gene Expression Databases (4.5%)
Other Molecular Biology Databases (3.3%)
Immunological databases (1.8%)
Organelle databases (1.6%)
Proteomics Resources (1.2%)
Cell biology (0.2%)
1730
http://www.oxfordjournals.org/nar/database/c/
Pathway resources
Pathway resource list
• Protein-Protein Interactions: 132
• Genetic Interaction Networks: 9
• Protein-Compound Interactions: 43
• Metabolic Pathways: 73
• Signaling Pathways: 62
• Gene Regulatory Networks: 51
• Pathway Diagrams: 38
• Protein Sequence Focused: 20
• Other: 14
http://www.pathguide.org/
184 molecular interaction resources
442 pathway resources
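The two totals on this slide can be checked against the per-category counts. The short sketch below assumes that the numbers and the category names pair up in the order they are listed, and that the first three categories are the ones counted as molecular interactions. Both are assumptions about the chart, not statements from Pathguide.

```python
# Per-category counts from the slide. Pairing numbers with categories
# by list order is an assumption.
counts = {
    "Protein-Protein Interactions": 132,
    "Genetic Interaction Networks": 9,
    "Protein-Compound Interactions": 43,
    "Metabolic Pathways": 73,
    "Signaling Pathways": 62,
    "Gene Regulatory Networks": 51,
    "Pathway Diagrams": 38,
    "Protein Sequence Focused": 20,
    "Other": 14,
}

# Assumed grouping: the three interaction categories.
interaction_categories = [
    "Protein-Protein Interactions",
    "Genetic Interaction Networks",
    "Protein-Compound Interactions",
]

molecular_interactions = sum(counts[name] for name in interaction_categories)
all_resources = sum(counts.values())

print(molecular_interactions, all_resources)
```

Under these assumptions the sums reproduce the 184 and 442 quoted on the slide.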
Utility of Bioinformatics
Scientific impact
Too little
bioinformatics
Too many databases
Too diverse interfaces
Tim Hubbard
Data integration
Combining data residing in different sources and providing users with a unified view of these data.
[Figure: databases (DB), interfaces (I) and the user in three scenarios: ideally, compromise, reality]
Data integration problems
Many data resources
• Many to maintain
• Databases change
• New appearing
• Some disappearing*
• Not easy to find them
Different query interfaces
data integration?
Variable results
• Formats
• Schemas
• Controlled vocabularies
• Minimum information guidelines
• Inconsistency
• Mapping problems
• Records with insufficient information
• Redundancy
* Merali Z. et al. Databases in peril. Nature 2005.
Collaboration among data providers
Agreement on data standards and data exchange
• More data coverage
• Less redundancy
• Less inconsistency
• Better data management
Nucleotide sequences: INSDC
• EMBL
• DDBJ
• NCBI
Molecular interactions: IMEx
• IntAct
• InnateDB
• DIP
• MINT
• …
Protein identifications: ProteomeXchange
• PRIDE
• PeptideAtlas
• GPMDB
• Tranche
• …
Data standards
Integration
Access
Exchange
Sharing
Portability
Interoperability
Annotation
Comparison
Verification
Reusability
Representation
Compliance
Consistency
Replication
Web service
API
Submission
Edition
Conversion
Comparison
Visualization
Analysis
Integration
Validation
TOOLS
Data standards
• Data format
• File format
• Format definition (schema)
• Controlled vocabularies (ontologies)
• Guidelines
• Minimum information
• Best practices
• Common identifiers
• Common query interfaces
[Figure: data surrounded by schema, interfaces, guidelines, ontologies, format and identifiers, grouped under definition, representation and access]
Data standards … molecular interactions
• Data format
• File format: XML
• Format definition (schema): PSI-MI XSD
• Controlled vocabularies (ontologies): PSI-MI CV
• Guidelines
• Minimum information: MIMIx
• Best practices
• Common query interfaces: PSICQUIC
• Common identifiers: UniProt
[Figure: molecular interaction data surrounded by these standards, grouped under definition, representation and access]
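To make the idea of a shared format concrete, here is a minimal sketch that reads one line of PSI-MI TAB 2.5 (MITAB), the tab-delimited companion to PSI-MI XML that PSICQUIC services can return. The column keys are labels chosen for this example, the sample line is invented for illustration, and splitting on "|" is a simplification that ignores quoting.

```python
# The 15 columns of MITAB 2.5, in order. Key names are labels chosen
# for this sketch, not official field names.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publications",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidences",
]


def parse_mitab25_line(line):
    """Turn one MITAB 2.5 line into a dict of lists, one list per column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError("expected at least 15 tab-separated columns")
    record = {}
    for name, value in zip(MITAB25_COLUMNS, fields):
        # "-" marks an empty column; "|" separates multiple values.
        record[name] = [] if value == "-" else value.split("|")
    return record


# Invented example line (placeholder accessions and publication).
SAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2012)", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)',
    'psi-mi:"MI:0469"(IntAct)', "intact:EBI-0000000", "-",
])

record = parse_mitab25_line(SAMPLE_LINE)
print(record["id_a"], record["detection_methods"])
```

Because every provider uses the same columns, the same controlled vocabulary terms and the same identifier prefixes, one parser like this can serve results from any compliant resource.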
Registry of standards
http://biosharing.org
Popular data integration approaches
• Data centralization
• Data warehousing
• Dataset integration
• Hyperlinks
• Federated databases
• View integration
• ...
Warehousing vs. Federation
[Figure: data warehousing compared with federated databases. Legend: database, query interface (QI), user. Labels: integration, standardization]
• Data warehousing
• Pull data from several resources into one resource.
• Main features:
• Data centralization
• High maintenance
• Data out of date
• Modifications (schema, format, content, …)
• Federated databases
• Data reside in different sources and are queried through a common query interface.
• Main features:
• Fresh data (original)
• Data redundancy
• Data inconsistency
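The contrast between the two approaches can be sketched with a toy model. All class names, source names and identifiers below are hypothetical. The point is only to show why a warehouse can go out of date while a federation returns fresh but redundant results.

```python
class Source:
    """Stand-in for one remote database (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # identifier -> list of interaction partners

    def query(self, identifier):
        return list(self.records.get(identifier, []))


class Warehouse:
    """Pulls all data into one local store when it is built."""

    def __init__(self, sources):
        self.store = {}
        for source in sources:
            for identifier, partners in source.records.items():
                # Merging into a set removes redundancy across sources.
                self.store.setdefault(identifier, set()).update(partners)

    def query(self, identifier):
        return sorted(self.store.get(identifier, set()))


class Federation:
    """Forwards each query to every source through a common interface."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, identifier):
        hits = []
        for source in self.sources:
            for partner in source.query(identifier):
                hits.append((source.name, partner))
        return hits


db_a = Source("A", {"P1": ["P2"]})
db_b = Source("B", {"P1": ["P2", "P3"]})
warehouse = Warehouse([db_a, db_b])
federation = Federation([db_a, db_b])

# Source A is updated after the warehouse was loaded.
db_a.records["P1"].append("P4")

warehouse_hits = warehouse.query("P1")    # merged, but misses the update
federation_hits = federation.query("P1")  # fresh, but P2 is reported twice
print(warehouse_hits)
print(federation_hits)
```

In this sketch the warehouse would need a reload to see the new record, which is the maintenance cost listed above, while the federation leaves de-duplication and conflict resolution to the caller.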
Data integration
[Figure: heterogeneous data sources (A, B, C) holding the same data types; labels: 1, 2, leverage]
Tools
• Formats
• DAS, PSI-MI, mzML, BioPAX, SBML, GFF3, CellML, …
• Registry (~125): BioSharing
• Controlled vocabularies
• Gene Ontology, Sequence Ontology, Pathway Ontology, Molecular Interaction, …
• Registries (~200 ontologies): BioPortal, OLS
• Minimum information guidelines
• e.g. MIAME, MIAPE, MIMIx, MIRIAM, …
• Registry (~35 guidelines): MIBBI
• ID Mapping services
• PICR, DAVID, CRONOS, BridgeDb, BioMart, DAS, …
• API
• Ensembl API, UniProt API, BioMart API, …
• Web Services
• ClustalW, ArrayExpress, BLAST, EMBOSS, …
• Registries (~2000 services): BioCatalogue, DAS registry, …
• Workflow management systems
• Taverna, Pegasys, Galaxy, …
Standard formats/schemas
[Figure: coverage of BioPAX, PSI-MI, SBML and CellML. Axes run from low to high detail for interaction networks (molecular and genetic interactions), regulatory pathways and metabolic pathways (biochemical reactions, small molecules, rate formulas). BioPAX and PSI-MI appear as database exchange formats, SBML and CellML as simulation model exchange formats. Credit: Anatoly Sorokin]
Formats, ontologies and guidelines
http://biosharing.org
Controlled vocabularies
• Ontology browser: http://www.ebi.ac.uk/ontology-lookup
Ontology Lookup Service
Minimum information guidelines
• PSI: Proteomics Standards Initiative
• Work group of the Human Proteome Organization
• Defines community standards for data in proteomics
• … facilitating data comparison, exchange and verification
Minimum information guidelines
• MIAPE: The Minimum Information About a Proteomics Experiment
• Data and metadata from proteomics experiments
• Data: results
• Metadata: data about the data
• Where the samples came from
• How the analyses were performed
Minimum information guidelines
MIMIx
• MIAPE document guideline for molecular interactions
• 1. Manuscript information
• 2. Experiment
• 3. Interaction
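A minimum information guideline can be treated as a checklist that a record either satisfies or not. The sketch below uses the three MIMIx sections named above. The individual item names inside each section are hypothetical placeholders, not the wording of the guideline itself.

```python
# Section names follow the slide. The items inside each section are
# hypothetical placeholders for illustration.
REQUIRED_ITEMS = {
    "manuscript": ["publication_id", "contact"],
    "experiment": ["host_organism", "detection_method"],
    "interaction": ["participants", "interaction_type"],
}


def missing_items(record):
    """Return 'section.item' names that are absent or empty in the record."""
    missing = []
    for section, items in REQUIRED_ITEMS.items():
        provided = record.get(section, {})
        for item in items:
            if not provided.get(item):
                missing.append(f"{section}.{item}")
    return missing


# Invented example record with one item left out.
example = {
    "manuscript": {
        "publication_id": "pubmed:00000000",
        "contact": "someone@example.org",
    },
    "experiment": {"detection_method": "two hybrid"},
    "interaction": {
        "participants": ["protein A", "protein B"],
        "interaction_type": "physical association",
    },
}

report = missing_items(example)
print(report)
```

A submission tool could run a check of this kind before accepting a record, which is one way guidelines reduce the number of records with insufficient information.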
ID Mapping services
Logical xref
(hyperlinked)
Inactive xref
Secondary
Identifier
Active xref
(hyperlinked)
Richard Cote
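An identifier mapping service can be pictured as a lookup table of cross-references, each carrying a status. The sketch below borrows the "active" and "inactive" labels from the slide. Reading "inactive" as "no longer current in the target database" is an assumption, and all database names and accessions are invented.

```python
from collections import namedtuple

# One cross-reference: target database, accession there, and its status.
Xref = namedtuple("Xref", ["database", "accession", "status"])

# Hypothetical mapping table with invented identifiers.
MAPPINGS = {
    "ACC_0001": [
        Xref("DB_ONE", "ONE_12", "active"),
        Xref("DB_ONE", "ONE_07", "inactive"),
        Xref("DB_TWO", "TWO_99", "active"),
    ],
}


def map_identifier(accession, target_database, include_inactive=False):
    """Return accessions in target_database that map to the given accession."""
    results = []
    for xref in MAPPINGS.get(accession, []):
        if xref.database != target_database:
            continue
        if xref.status == "inactive" and not include_inactive:
            continue
        results.append(xref.accession)
    return results


current_only = map_identifier("ACC_0001", "DB_ONE")
with_history = map_identifier("ACC_0001", "DB_ONE", include_inactive=True)
print(current_only, with_history)
```

Keeping inactive cross-references available, rather than deleting them, lets older datasets that cite retired identifiers still be mapped.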
Web services!
• REST
• SOAP
http://www.ebi.ac.uk/Tools/picr/
Protein Identifier Cross-Reference Service
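The difference between the two web service styles shows up in how a request is expressed. The sketch below builds the same lookup once as a REST-style URL and once as a SOAP 1.1 envelope. The endpoint, the operation name and the parameter names are hypothetical and do not describe the actual PICR interface.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/mapping"  # hypothetical endpoint
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace


def rest_request_url(accession, database):
    """REST style: the request is a URL with query parameters."""
    query = urlencode({"accession": accession, "database": database})
    return f"{BASE_URL}/lookup?{query}"


def soap_request_body(accession, database):
    """SOAP style: the request is an XML envelope sent in an HTTP POST."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, "lookup")  # hypothetical operation name
    ET.SubElement(operation, "accession").text = accession
    ET.SubElement(operation, "database").text = database
    return ET.tostring(envelope, encoding="unicode")


rest_url = rest_request_url("P12345", "SWISSPROT")
soap_body = soap_request_body("P12345", "SWISSPROT")
print(rest_url)
print(soap_body)
```

The REST request can be pasted into a browser or fetched with any HTTP client, while the SOAP request usually relies on a client generated from the service description.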
Web services
Web services
Workflow management systems
Taverna
Workflow management systems
Examples from myExperiment
OLS
PICR
BioMart and microarray analysis
ChEBI
Thank you!
Proteomics Services Team
