Data for AI Models, The Past, The Present, The Future
John P. Overington
jpo@md.catapult.org.uk
© 2019 Medicines Discovery Catapult. All rights reserved.
“Public data is the worst form of training data for AI except for all those other forms that have been tried from time to time”
Winston Churchill, 2016
National facility connecting the UK
community to accelerate innovative
drug discovery
• Independent not-for-profit organisation
• Part of the U.K.’s Catapult network
• Helping to deliver the U.K.’s Industrial Strategy
• Funded by Innovate U.K., part of UK Research
and Innovation, reporting to the Department
for Business, Energy & Industrial Strategy
• Focus on SME and translational academic
sector support
MDC - Medicines Discovery Catapult
ChEMBL, SureChEMBL & UniChem
• Originally developed 2003 at
Inpharmatica
• Spun out to public domain
• The world’s largest primary public
database of medicinal chemistry data
• ~2.3 million compounds
• ~11,000 targets
• ~15 million bioactivities
• Truly Open Data - CC-BY-SA license
• API, MyChEMBL VM, RDF, full tables
download….
• Basis of vast majority of AI innovation
in compound design/optimisation
Gaulton et al (2012) Nucleic Acids Research Database Issue. 40 D1100-1107
ChEMBL – www.ebi.ac.uk/chembl
Compound
Assay
Ki=4.5 nM
>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSY
EEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRS
RYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEG
SSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGD
EEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEAD
CGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVL
TAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLK
KPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVC
KDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFY
THVFRLKKWIQKVIDQFGE
ED2=230 nM
Inhibition of
human Thrombin
PTT (partial
thromboplastin
time)
ChEMBL
• Public chemistry patent resource
• Donated by Digital Science –
SureChem commercial product
• Automatically extracted chemical
structures from full-text patents
• >18 million chemical structures
• Updated daily
• Full chemistry data download
SureChEMBL– www.surechembl.org
Papadatos et al (2016) Nucl. Acids Res Database Issue D1220-1228
UniChem – www.ebi.ac.uk/unichem
• Simple chemical
integration service
• >144 million structures
from ~30 sources
• URI/resource ID/Standard
InChI based lookups
• Available chemicals,
PubChem, ZINC, real
time, private
• Chemical structure ‘Time
Machine’
Chambers et al (2013) J. Cheminf. DOI:10.1186/1758-2946-5-3
Personal Perspectives on ChEMBL
• Things that worked well
• Single, major visionary funder – Wellcome Trust
• Focus on data content/backend not GUI
• Clear License – CC-BY-SA - same license as Wikipedia content
• Private/secure services
• Opportunism – SureChEMBL
• Open Data in ChEMBL re-invigorated cheminformatics research
• Things that didn’t work so well
• Community curation attempts – armchair critics
• Publisher interactions – except Royal Society of Chemistry
• I would do things very differently now
The Reproducibility Crisis!
Begley & Ellis (2012) Nature DOI:10.1038/483531a & Prinz et al (2011) NRDD DOI:10.1038/nrd3439-c1
Errors in ChEMBL
Enhanced data model for ChEMBL can appear as ‘errors’: e.g. complexes, receptor sets, model organisms
“The more complex the parameter, the more frequent the errors”
Tiikkainen et al (2013) JCIM DOI:10.1021/ci400099q
Errors in SureChEMBL
Senger et al (2015) J Cheminf DOI:10.1186/s13321-015-0097-z
Inter-species Assay Variability
Same compound, same end-point for rat and human orthologs (n = 2,781)
[Figure: scatter plot of measured potencies (pKi human vs pKi rat) and distribution of potency differences, diff(human, rat), as normalised density]
Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
Inter-lab Assay Variability
Same compound, same species, different publication (n = 3,000)
[Figure: scatter plot of measured potencies (pKi assay 1 vs pKi assay 2) and distribution of potency differences, diff(assay1, assay2), as normalised density]
Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
Inter-species vs Inter-lab Variability
[Figure: overlaid density distributions of pKi_i − pKi_j for inter-laboratory and inter-species comparisons]
Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
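A minimal sketch of how potency differences of the kind plotted on the variability slides could be computed. The record layout (compound, source, Ki in nM), the function names and the toy values are assumptions made for illustration; they are not taken from Krüger & Overington (2012).

```python
import math
from itertools import combinations
from statistics import mean, stdev


def to_pki(ki_nm):
    """Convert a Ki given in nM to pKi = -log10(Ki in mol/L)."""
    return 9.0 - math.log10(ki_nm)


def pairwise_differences(measurements):
    """Return pKi differences for the same compound measured by different sources.

    measurements: iterable of (compound_id, source_id, ki_nm) tuples.
    A "source" could be a publication (inter-lab) or a species (inter-species).
    """
    by_compound = {}
    for compound, source, ki_nm in measurements:
        by_compound.setdefault(compound, []).append((source, to_pki(ki_nm)))

    diffs = []
    for values in by_compound.values():
        for (src_a, pki_a), (src_b, pki_b) in combinations(values, 2):
            if src_a != src_b:  # only compare independent measurements
                diffs.append(pki_a - pki_b)
    return diffs


# Hypothetical toy data, not real ChEMBL records.
toy = [
    ("cpd1", "paper_A", 4.5),
    ("cpd1", "paper_B", 45.0),
    ("cpd2", "paper_A", 100.0),
    ("cpd2", "paper_C", 10.0),
    ("cpd3", "paper_B", 1000.0),  # measured once, so it contributes no pair
]
diffs = pairwise_differences(toy)
print(f"pairs={len(diffs)} mean={mean(diffs):.2f} sd={stdev(diffs):.2f}")
```

Running the same function once with publications as sources and once with species as sources would give the two distributions that the comparison slide overlays.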
Garnett et al (2012) Nature DOI:10.1038/nature11005 & Barretina et al (2012) Nature DOI:10.1038/nature11003
Large-Scale Cell-line Screening Data
Inconsistent Cell-line Screening Data
Haibe-Kains et al (2013) Nature DOI:10.1038/nature12831 (see also Stransky et al (2015) Nature DOI:10.1038/nature15736)
Primary Data – Batches and Replicates
http://www.wexlerwallace.com/wp-content/uploads/2012/04/Southeast-Laborers-Health-v-Pfizer.pdf
Incorrect Chemical Structures
Bosutinib Voxtalisib
http://cen.acs.org/articles/90/web/2012/05/Bosutinib-Buyer-Beware.html & Overington & Wennerberg, unpublished
Variance – From Simple to Complex
Biochemical assay → Cell-based screen → Functional assay → Animal disease model → Human clinical trial
[Figure: inter-study variance and number of assay variables across these assay types, ranging from steady state to time dependent]
The Present
MDC Collaborating With The Sector
DeepADMET
• DeepADMET – InnovateUK grant
• Optibrium Ltd.
• Intellegens Ltd.
• Medicines Discovery Catapult
• MDC engineering software pipeline to
supply ‘SAR data on demand’
• Flexible wrt document source
• Fast and responsive
• Significantly boost public/internal data
• Deliver provenanced activity ‘vectors’
• Develop broader range of robust
ADMET models using deep learning
Document gathering → NLP / NER → Data Extraction & Heuristics → SAR vectors
DeepADMET – Data Structure
[Figure: two panels, each relating compounds to assay conditions]
• Primary (preferred): measured in the same assay
• Secondary: compiled from literature review, databases
The Future
https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/
Neural Networks
Assays in Drug Discovery
Biochemical assays (proteins) → Cell-based assays (cell lines) → Functional assays (tissues & organs) → In vivo assays (animal models) → Human studies (humans)
Ancient: “Human clinical trial”
• Error prone, serendipitous discoveries
• Traditional medicines: aspirin, quinine, …
Assays in Drug Discovery
1910s: Animal in vivo assays
• Faster, safer, cheaper
• … but less predictive
Assays in Drug Discovery
1920s: Ex vivo assays
• Higher throughput, cheaper
• Mechanistic insights
• … but less predictive
Assays in Drug Discovery
1950s: Cell-based assays
• Higher throughput, cheaper
• Mechanistic insights
• … but less predictive
Assays in Drug Discovery
1970s: Biochemical assays
• Higher throughput
• Mechanistic insights
• Recombinant DNA technology
• … but less predictive
Example Assay Path: Anti-inflammatory Drugs
Prostaglandin G/H synthase 2 → LPS-stimulated THP-1 cells → LPS-stimulated human whole blood → carrageenan-injected rat → acute gout patient
AssayNet – Building the Network
• Finding Assays
• Text-mining across papers, patents, vendor catalogues
• Indexing of Assays
• specialist dictionaries - techniques, equipment, genes, end-points, ….
• Classification of assays
• Efficacy/ADMET & biochemical, cell-based, organoid, tissue, ….
• Similarity of Assays
• how ‘similar’ are two assays?
• Chaining of Assays
• constructing the directed graph
• Learning thresholds
• Identification of ‘triggers’ from chained, directed assay pairs
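The "Similarity of Assays" and "Chaining of Assays" steps could be sketched as below. This is only one possible reading: the Jaccard measure on shared compounds, the stage ordering, the `chain_assays` helper and the toy assay records are all assumptions for illustration and are not described in the slides.

```python
# Assumed ordering of assay classes, from simple to complex.
STAGE_ORDER = ["biochemical", "cell-based", "functional", "in vivo", "human"]


def jaccard(a, b):
    """Jaccard similarity of two compound sets (one possible assay similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def chain_assays(assays, min_shared=1):
    """Build directed edges from earlier-stage to later-stage assays.

    assays: dict mapping assay_id -> (stage, set of compound ids).
    Returns a list of (source, target, n_shared_compounds, jaccard) tuples.
    """
    rank = {stage: i for i, stage in enumerate(STAGE_ORDER)}
    edges = []
    for src, (src_stage, src_cpds) in assays.items():
        for dst, (dst_stage, dst_cpds) in assays.items():
            if rank[src_stage] < rank[dst_stage]:
                shared = src_cpds & dst_cpds
                if len(shared) >= min_shared:
                    edges.append((src, dst, len(shared), jaccard(src_cpds, dst_cpds)))
    return edges


# Hypothetical toy records loosely modelled on the anti-inflammatory example path.
toy_assays = {
    "cox2_enzyme": ("biochemical", {"c1", "c2", "c3", "c4"}),
    "thp1_cells": ("cell-based", {"c2", "c3", "c4"}),
    "rat_paw_oedema": ("in vivo", {"c3", "c4"}),
    "unrelated_enzyme": ("biochemical", {"c9"}),
}
edges = chain_assays(toy_assays)
for src, dst, n_shared, sim in edges:
    print(f"{src} -> {dst}: shared={n_shared} jaccard={sim:.2f}")
```

Edges point only from a simpler assay class to a more complex one, so the result is acyclic by construction; an assay with no shared compounds stays disconnected.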
Learning Decision Thresholds
Assay 1 → Assay 2
• Decision Thresholds
• What activity threshold in Assay 1 makes it worth measuring in Assay 2?
• Learn from statistical distributions
• Probably artefactually thresholded at integral pIC50 thresholds – e.g. 1mM (cf P-value distributions)
[Figure: distribution of activity values (pIC50 vs number of compounds) of compounds in Assay 1, with the compounds selected for screening in Assay 2 shown for a sharp cutoff and for a sampled cutoff]
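One way to "learn from statistical distributions" is to look at the fraction of Assay 1 compounds that went on to Assay 2 in each pIC50 bin. The sketch below is an assumption about how that might be done; the function names, the 50% rule and the toy values are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def progression_by_bin(assay1, progressed, bin_width=1.0):
    """Fraction of compounds per pIC50 bin that were also measured in Assay 2.

    assay1: dict mapping compound id -> pIC50 in Assay 1.
    progressed: set of compound ids that also appear in Assay 2.
    Returns a sorted list of (bin_lower_edge, fraction_progressed).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for compound, pic50 in assay1.items():
        bin_index = int(pic50 // bin_width)
        totals[bin_index] += 1
        if compound in progressed:
            hits[bin_index] += 1
    return [(b * bin_width, hits[b] / totals[b]) for b in sorted(totals)]


def estimate_threshold(fractions, min_fraction=0.5):
    """Lowest bin edge at which at least min_fraction of compounds progressed."""
    for lower_edge, fraction in fractions:
        if fraction >= min_fraction:
            return lower_edge
    return None


# Hypothetical toy data: Assay 1 potencies and the subset taken into Assay 2.
assay1 = {"c1": 4.2, "c2": 5.1, "c3": 5.8, "c4": 6.0, "c5": 6.3, "c6": 7.4, "c7": 8.1}
progressed = {"c4", "c5", "c6", "c7"}

fractions = progression_by_bin(assay1, progressed)
threshold = estimate_threshold(fractions)
print(fractions, threshold)
```

With a sharp cutoff the per-bin fraction jumps from 0 to 1 at a single bin; with a sampled cutoff it rises gradually, and the choice of `min_fraction` then matters.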
Bayesian Networks
Bioassay data - ChEMBL Database
Target: >Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLA
PQQARSLLQRVRRANTFLEEVRKGNLEREC
VEETCSYEEAFEALESSTATDVFWAKYTAC
ETARTPRDKLAACLEGNCAEGLGTNYRGHV
Compound → Target: IC50 4.5 nM
Compound → Assay: APTT 11 min
• Data manually extracted by a team of
curators from published pharmacology
and drug discovery literature (e.g.
Journal of Medicinal Chemistry)
• ChEMBL has transformed many aspects
of cheminformatics research
− Target prediction
− Large-scale QSAR
− Matched Molecular Pairs
− …
• ChEMBL is the foundation data source of
almost all published AI compound
design research
ChEMBL as a Graph
[Figure: bipartite graph in which compounds (1–6) are linked to assays (a–h) by “has activity in” edges, shown with its two projections, the assay-assay network and the compound-compound network]
Zwierzyna & Overington (in preparation)
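The two projections named on this slide can be sketched in a few lines. The `project` helper and the toy edge list are assumptions for illustration; the edge list is not the graph drawn in the figure, and weighting by the number of shared neighbours is just one common choice.

```python
from collections import defaultdict
from itertools import combinations


def project(edges, on="assay"):
    """Project a compound-assay bipartite graph onto one node type.

    edges: iterable of (compound, assay) "has activity in" pairs.
    on="assay" links assays that share a compound;
    on="compound" links compounds that share an assay.
    Returns a dict mapping (node_u, node_v) -> number of shared neighbours.
    """
    neighbours = defaultdict(set)
    for compound, assay in edges:
        if on == "assay":
            neighbours[compound].add(assay)
        else:
            neighbours[assay].add(compound)

    weights = defaultdict(int)
    for nodes in neighbours.values():
        for u, v in combinations(sorted(nodes), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical toy edges: compound ids are numbers, assay ids are letters.
toy_edges = [(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "b")]

assay_network = project(toy_edges, on="assay")
compound_network = project(toy_edges, on="compound")
print(assay_network)
print(compound_network)
```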
Assay Network: Binding Assay Data (Subset)
A subset of the assay network (~6,000 nodes)
constructed using protein-binding assay data
from ChEMBL
Zwierzyna & Overington (in preparation)
Assay Network: Preclinical Assay Data
PPAR binding assay
DPP-4 binding assay
in vivo assay
cell-based assay
Zwierzyna & Overington (in preparation)
• Fragment of the assay network with a
subset of bioassays testing antidiabetic
compounds
• Assays involving closely related
biological targets are clustered
together, e.g. assays involving various
peroxisome proliferator-activated
receptors in the green cluster
• Antidiabetic compounds with different
mechanisms of action (e.g. DPP-4
inhibitors and PPAR agonists) are often
tested in the same animal model (such
as Zucker diabetic rat) → in vivo
assays link distinct clusters
Animal Models: Assay Descriptions
CHEMBL893931:
“Inhibition of carrageenan-induced paw oedema
in Sprague-Dawley rat at 5.16 mg/kg, sc after 3 hrs.”
Animal Models: Assay Descriptions
CHEMBL893931:
“Inhibition of carrageenan-induced paw oedema
in Sprague-Dawley rat at 5.16 mg/kg, sc after 3 hrs.”
• Induced model: carrageenan-induced
• Phenotype: paw oedema
• Genetic strain: Sprague-Dawley rat
• Dosage: 5.16 mg/kg
• Administration route: sc
• Timing: after 3 hrs
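As a rough illustration of pulling such fields out of a free-text assay description, here is a rule-based sketch using regular expressions. The patterns, the short route and strain lists and the field names are assumptions chosen to fit this one example; a real pipeline would need curated dictionaries, and phenotype extraction is deliberately left out because it does not reduce to a simple pattern.

```python
import re

DESCRIPTION = (
    "Inhibition of carrageenan-induced paw oedema in "
    "Sprague-Dawley rat at 5.16 mg/kg, sc after 3 hrs."
)

# Hypothetical patterns; each capture group holds part of the extracted value.
PATTERNS = {
    "induced_model": re.compile(r"\b([\w-]+)-induced\b"),
    "strain": re.compile(r"\b(Sprague-Dawley|Wistar)\b"),
    "dosage": re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg/kg|ug/kg|g/kg)"),
    "route": re.compile(r"\b(sc|po|ip|iv|im)\b"),
    "timing": re.compile(r"\bafter\s+(\d+(?:\.\d+)?)\s*(hrs?|mins?|days?)\b"),
}


def extract_fields(text):
    """Return a dict of field name -> matched text for every pattern that hits."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = " ".join(match.groups())
    return found


fields = extract_fields(DESCRIPTION)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Fields that do not match are simply absent from the result, which keeps missing information distinguishable from an empty value.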
Information Extraction From Assay Descriptions
CHEMBL1799193: “Antiallodynic activity in Wistar albino rat chronic constriction injury-induced neuropathic pain model assessed as attenuation of mechanical allodynia.”
[Figure, panels A–E: the description is processed in successive steps]
• Part-of-speech tags: JJ NN IN NNP NN NN JJ NN JJ JJ NN NN VBN IN NN IN JJ NN
• Phrase chunks: NP PP NP VP PP NP PP NP, combined into a sentence (S)
• Content words retained: Antiallodynic activity, Wistar albino rat, chronic constriction injury-induced neuropathic pain model, assessed, attenuation, mechanical allodynia
• Entity labels: Experiment, Phenotype, Strain
• Word vectors: [9.11,8.73,9.19,...] [-0.17,-0.57,0.01,...] [8.95,3.39,-5.22,...] [9.08,8.02,8.09,...] [9.11,8.73,9.19,...] [9.56,9.14,2.10,...] [9.10,8.72,9.18,...]
Legend: S = Sentence, NP = Noun Phrase, VP = Verb Phrase, PP = Prepositional Phrase; JJ = Adjective, NN = Noun, VBN = Verb
Zwierzyna & Overington (in preparation)
PCA of Word2Vec Assay Descriptions
Each assay description: average over its word vectors. Data points projected from a 200-dimensional
space to 2D using PCA
Zwierzyna & Overington, unpublished
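The averaging and projection described on this slide can be sketched with NumPy alone. The random vectors below are a stand-in for a trained Word2Vec model, and the helper names and the third description are hypothetical; only the 200-dimensional size and the "average, then PCA to 2D" recipe come from the slide.

```python
import numpy as np


def embed_description(tokens, word_vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def pca_2d(matrix):
    """Project rows onto the first two principal components via SVD."""
    centred = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


descriptions = [
    "Inhibition of carrageenan-induced paw oedema in Sprague-Dawley rat",
    "Antiallodynic activity in Wistar albino rat neuropathic pain model",
    "Inhibition of human thrombin",  # hypothetical third description
]

# Stand-in for trained embeddings: one random 200-dimensional vector per word.
rng = np.random.default_rng(0)
vocabulary = sorted({tok for d in descriptions for tok in d.lower().split()})
word_vectors = {tok: rng.normal(size=200) for tok in vocabulary}

matrix = np.vstack(
    [embed_description(d.lower().split(), word_vectors) for d in descriptions]
)
coords = pca_2d(matrix)
print(coords.shape)
```

With real embeddings, descriptions that use related vocabulary would be expected to land near each other in the 2D plot; with the random stand-in vectors here the positions carry no meaning.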
Word2vec Embedding of Assays
ChEMBL assays of known drugs annotated with different ATC codes (~15k of ~94k)
Legend: L01 (antineoplastic), M01 (anti-inflammatory), N03 (antiepileptic), A10 (antidiabetic), C02 (antihypertensive), N02 (analgesic)
Zwierzyna & Overington, unpublished
AssayNet – Translational Path From Lab To Clinic
Compound: Biochemical assay → Cell-based screen → Functional assay → Animal disease model → Human clinical trial
• Build assay networks from literature/patent co-occurrence
• Link to animal models and genetics
• Understand target engagement/pharmacodynamics through development
• Directed graph of all assays from targets to clinical trials
Acknowledgements
Bissan Al-Lazikani Aroon Hingorani,
Juan Pablo-Casas
Marc Marti-Renom
Francesco Martinez
Magda Zwierzyna
Mark Davies
Krister Wennerberg
Mark Warren, Gemma Holliday, Andrew Pannifer
Richard Seacome, James Welsh, Matthew Hodsgkiss
Charles Bury, Kepa Brurusco-Goni, Daiel James, Adam Poulston,
Matt Cockayne, Baydr Earls, Herve Barjat, Dave Allen, James Peach
Nathan Dedman, George Papadatos,
Grace Mugumbate, Anna Gaulton,
Prudence Mutowo, Louisa Bellis,
Anne Hersey, Jon Chambers,
Michal Nowotka, Anneli Karlsson,
Ines Smit, Francis Atkinson,
Paula Magarinos, Felix Kruger, Rita Santos
