Using Machine Learning to Investigate Gene Set
Associations in Schizophrenia
Tim Vivian-Griffiths
Cardiff University
December 8, 2015
Tim V-G (Cardiff Uni) PhD project December 8, 2015 1 / 26
Schizophrenia
Brief Background
Symptoms [1]
Positive Symptoms
Impaired perceptions of reality
Delusions and hallucinations in all senses - auditory most common
Negative Symptoms
Blunting and impairment of affective and cognitive abilities
Reduced social skills and motivation
Prevalence and Heritability [1, 2]
Lifetime risk of ∼ 0.7%
Heritability estimated from twin studies at ∼ 0.8
[1] Tandon et al., 2009
[2] Sullivan et al., 2003
Attempts to uncover Genetic Etiology
Genome Wide Association Studies (GWAS)
No clear genetic biomarkers found
GWAS - Study method to find common variants of small effect
Findings from the Psychiatric Genomics Consortium 2 study (PGC-2) [1]
[1] Ripke et al., 2014
Risk Profile Scoring
Creation of Polygenic Score
Risk score per Single Nucleotide Polymorphism variant (SNP)
Calculated from differing rates of variants in cases and controls
Odds-Ratio and p-value calculated per variant
Linkage Disequilibrium
Sections of the genome not inherited independently
Measured by squared correlation of variants across samples - r²
Clumping procedure using PLINK
p1: Max p-value of main or index variant - 0.05
r2: Max r² value with the index variant - 0.1
kb: Distance in kilobases from the index variant within which variants are clumped - 500
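The clumping step can be sketched as a greedy loop. This is only an illustration of the idea, not PLINK's implementation (PLINK exposes it through `--clump` with `--clump-p1`, `--clump-r2` and `--clump-kb`). The function name `clump` and its array inputs are hypothetical, and variants are assumed to lie on a single chromosome with no constant genotype columns.

```python
import numpy as np

def clump(pvals, positions, genotypes, p1=0.05, r2_max=0.1, kb=500):
    """Greedy LD clumping sketch; returns the indices of the index SNPs.

    pvals:     association p-value per SNP
    positions: base-pair position per SNP (single chromosome assumed)
    genotypes: samples x SNPs matrix of allele counts (0, 1, 2)
    """
    pvals = np.asarray(pvals, dtype=float)
    positions = np.asarray(positions)
    genotypes = np.asarray(genotypes, dtype=float)

    available = np.ones(len(pvals), dtype=bool)
    index_snps = []
    # Visit SNPs from most to least significant.
    for i in np.argsort(pvals):
        if not available[i] or pvals[i] > p1:
            continue
        index_snps.append(int(i))
        available[i] = False
        # Candidates: still-unassigned SNPs within the kb window.
        near = available & (np.abs(positions - positions[i]) <= kb * 1000)
        for j in np.flatnonzero(near):
            r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
            if r ** 2 > r2_max:
                available[j] = False  # absorbed into the clump of SNP i
    return index_snps
```

Each surviving index SNP then stands in for its whole clump when the score is built.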
Calculation of Polygenic Score
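Number of copies of the variant allele at each index SNP noted (0, 1, or 2)
Natural log of Odds Ratio calculated per index SNP
e.g. $\ln(1/2) = -\ln 2$, so an Odds Ratio below 1 gives a negative weight
This number used to weight the SNP count
The weighted counts are summed to a single score per sample, which is entered into a Logistic Regression Model (variation of linear regression to give values 0 - 1)
$\frac{1}{1 + e^{-x}}$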
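The score calculation can be sketched in a few lines. The names `polygenic_score` and `logistic` are hypothetical, and `logistic` only shows the link function from the slide; a fitted logistic regression would also estimate an intercept and a slope for the score.

```python
import math

def polygenic_score(allele_counts, odds_ratios):
    """Sum of allele counts (0, 1 or 2) weighted by ln(odds ratio)."""
    return sum(count * math.log(odds)
               for count, odds in zip(allele_counts, odds_ratios))

def logistic(x):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# One sample, three index SNPs with odds ratios 2, 1/2 and 1.
score = polygenic_score([0, 1, 2], [2.0, 0.5, 1.0])
print(score, logistic(score))
```

In this example only the second SNP contributes: the first has a count of zero and the third has an odds ratio of 1, whose log is zero.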
Aim of Current Study
Polygenic Score
Pros of Polygenic Score:
Simple model - single score per sample
Less prone to finding erroneous signal in noise
Cons:
Cannot locate regions of interest in the genome
Collecting into 134 Gene Sets
1. Find all Genic SNPs
2. Carry out clump routine
3. Group by Gene Set
4. Repeat with Border 35kb upstream - 10kb downstream [1]
[1] O’Dushlaine et al, 2015
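The grouping steps can be sketched as below, assuming simple dictionaries for SNP positions, gene coordinates and gene-set membership. All names and data structures are hypothetical. Treating "upstream" as strand-dependent is also an assumption, since the slides do not say how strand was handled.

```python
from collections import defaultdict

def gene_window(start, end, strand, upstream=35_000, downstream=10_000):
    """Return the (low, high) interval of a gene plus its border."""
    if strand == "+":
        return start - upstream, end + downstream
    # On the reverse strand, upstream lies at higher coordinates.
    return start - downstream, end + upstream

def group_snps_by_gene_set(snps, genes, gene_sets,
                           upstream=35_000, downstream=10_000):
    """snps: {snp_id: (chrom, pos)}
    genes: {gene: (chrom, start, end, strand)}
    gene_sets: {set_name: iterable of gene names}
    Returns {set_name: sorted list of SNP ids inside any member gene}."""
    grouped = defaultdict(set)
    for set_name, members in gene_sets.items():
        for gene in members:
            if gene not in genes:
                continue  # gene without coordinates is skipped
            chrom, start, end, strand = genes[gene]
            low, high = gene_window(start, end, strand, upstream, downstream)
            for snp_id, (snp_chrom, pos) in snps.items():
                if snp_chrom == chrom and low <= pos <= high:
                    grouped[set_name].add(snp_id)
    return {name: sorted(ids) for name, ids in grouped.items()}
```

Calling it with `upstream=0, downstream=0` gives the genic-only grouping. The defaults give the bordered version.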
Comparison of Different Algorithms
Polygenic Score - Multiple Logistic Regression - Linear Support Vector Machine [2]
Features of Algorithms
Multiple Regression and SVMs can use multiple features
Possible to examine how important each feature is
Performance compared with Polygenic score
Datasets used
CLOZUK [1] Study - 3446 Cases
Control data taken from WTCCC - 4285 Controls
Aim is NOT to build a predictive model
But predictive metrics are used to measure performance
[1] Hamshere et al, 2013
[2] Cortes, Vapnik, 1995
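The predictive metric reported in the results is the ROC score. Assuming this means the area under the ROC curve, it is the probability that a randomly chosen case gets a higher model score than a randomly chosen control. A minimal sketch, with the hypothetical name `roc_auc`:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparison.

    labels: 1 for cases, 0 for controls
    scores: model output per sample (higher means more case-like)
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # A tie counts as half a win.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to chance, so scores in the range 0.55 to 0.70 indicate modest separation of cases and controls.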
Linear Support Vector Machine
Data with clear separation. Using examples adapted from
http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
Clear decision boundary and margins
Ambiguous data point
Using C value of 100
Using C value of 1
In this case, the feature on the x axis contributes more to the outcome than the feature on the y axis
Equation of Model
$\alpha + \beta_1 x_1 + \beta_2 x_2$
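The effect of C and the model equation can be illustrated with scikit-learn's `SVC` using a linear kernel. The data points are invented for illustration and are not those of the talk; the last point plays the role of the ambiguous data point. The helper name `fit_linear_svm` is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def fit_linear_svm(X, y, C):
    """Fit a linear SVM; larger C penalises margin violations more."""
    return SVC(kernel="linear", C=C).fit(X, y)

# Two well-separated groups plus one ambiguous point (last row).
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
              [1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
              [-0.5, -0.2]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

for C in (100, 1):
    clf = fit_linear_svm(X, y, C)
    beta1, beta2 = clf.coef_[0]   # weights on x1 and x2
    alpha = clf.intercept_[0]     # constant term
    print(f"C={C}: alpha={alpha:.3f}, beta1={beta1:.3f}, beta2={beta2:.3f}")
```

The sign of alpha + beta1 * x1 + beta2 * x2 gives the predicted class, and the relative size of beta1 and beta2 shows which feature drives the decision.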
Model Building Procedure
Models cannot be tested on the same data points that were used to build them
Always hold some data out for assessment (25%)
10 shuffles carried out to find the best C value from [0.1, 1, 10]
Performance assessed on 250 shuffles of the data
An independent test set should be used for predictive modelling
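One way to sketch this procedure uses scikit-learn's `ShuffleSplit`, `GridSearchCV` and `cross_val_score`. The function name `select_and_assess` and the synthetic data are hypothetical, and the slides do not state how selection and assessment shuffles were related, so running both on the same data here is an assumption.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

def select_and_assess(X, y, n_select=10, n_assess=250, seed=0):
    """Pick C on n_select shuffles, then score on n_assess shuffles.

    Every shuffle holds out 25% of the samples for assessment.
    """
    select_cv = ShuffleSplit(n_splits=n_select, test_size=0.25,
                             random_state=seed)
    search = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]},
                          scoring="roc_auc", cv=select_cv)
    search.fit(X, y)
    best_c = search.best_params_["C"]

    assess_cv = ShuffleSplit(n_splits=n_assess, test_size=0.25,
                             random_state=seed + 1)
    scores = cross_val_score(LinearSVC(C=best_c), X, y,
                             scoring="roc_auc", cv=assess_cv)
    return best_c, scores

# Synthetic stand-in data: 200 samples, 3 features, signal in the first.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
best_c, scores = select_and_assess(X_demo, y_demo)
print(best_c, scores.mean())
```

The spread of the 250 scores, not only their mean, is what allows algorithms to be compared statistically.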
Comparison of algorithms performance
Using all gene sets - 134 in total (C always 0.1)
[Figure: ROC score (y axis, ticks 0.55 to 0.70) for Polygenic Score, Multiple Regression and Linear SVM; Border: Genic Only, PGC Border]
Gene Set Coefficients
Genic regions only
[Figure: coefficients (ticks −0.1 to 0.3) per gene set; panels: Linear SVM, Multiple Regression]
Gene Sets
FMRP_targets
abnormal_behavior
Presynapse
abnormal_cerebral_cortex_morphology
abnormal_nervous_system_morphology
abnormal_dendrite_morphology
abnormal_circadian_rhythm
abnormal_synaptic_depression
abnormal_nervous_system_development
PSD_(human_core)
5HT_2C
abnormal_sexual_interaction
abnormal_brain_development
abnormal_astrocyte_morphology
abnormal_excitatory_postsynaptic_potential
Gene Set Coefficients
PGC Border regions
[Figure: coefficients (ticks −0.1 to 0.3) per gene set; panels: Linear SVM, Multiple Regression]
Gene Sets
FMRP_targets
Presynapse
abnormal_behavior
abnormal_nervous_system_morphology
abnormal_circadian_rhythm
abnormal_cerebral_cortex_morphology
PSD_(human_core)
abnormal_synaptic_depression
abnormal_dendrite_morphology
5HT_2C
abnormal_nervous_system_development
abnormal_diencephalon_morphology
abnormal_brain_development
abnormal_excitatory_postsynaptic_potential
abnormal_excitatory_postsynaptic_currents
Recursive Feature Elimination Procedure
Gene Sets Identified as carrying the most signal for the Algorithms
Gene Boundary Region | Linear SVM | Multiple Regression
Genic Region Only
FMRP Targets | FMRP Targets
Abnormal Behaviour | Abnormal Behaviour
Abnormal Nervous System Morphology
PGC Border
FMRP Targets | FMRP Targets
Abnormal Behaviour | Presynapse
Abnormal Hippocampus Morphology | Abnormal Behaviour
Abnormal Temporal Lobe Morphology | Abnormal Nervous System Morphology
FMRP Targets and Abnormal Behaviour feature in all cases
Algorithms re-run with only these two gene sets featured
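Recursive feature elimination repeatedly fits a model, drops the feature with the smallest coefficient magnitude and refits. A minimal sketch uses scikit-learn's `RFECV` with a logistic regression. The slides do not say whether a cross-validated or fixed-size variant was used, so `RFECV` is an assumption, and the function name, set names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

def surviving_gene_sets(X, y, names, cv=5):
    """Return the names of the features kept by cross-validated RFE.

    X:     samples x gene-set score matrix
    names: one label per column of X
    """
    selector = RFECV(LogisticRegression(max_iter=1000), step=1,
                     cv=cv, scoring="roc_auc")
    selector.fit(X, y)
    return [name for name, keep in zip(names, selector.support_) if keep]

# Synthetic stand-in: 6 "gene sets", only the first two carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1]
          + 0.3 * rng.normal(size=400) > 0).astype(int)
names = [f"set_{i}" for i in range(6)]
kept = surviving_gene_sets(X_demo, y_demo, names)
print(kept)
```

Because elimination relies on coefficient size, the per-set scores should be on comparable scales before fitting.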
Comparison Results
All sets vs. RFE surviving sets for Genic and PGC Border regions
[Figure: ROC score (ticks 0.55 to 0.70) for Polygenic Score, Multiple Regression and Linear SVM; panels: Genic Only, PGC Border; Gene Sets: All Sets, RFE Sets]
[Figure: ROC score (ticks 0.55 to 0.70) for Polygenic Score, Multiple Regression and Linear SVM; panels: All Sets, RFE Sets; Border: Genic Only, PGC Border]
All Gene Sets - Repeated Measures ANOVA
Algorithm: F = 62.9, $p < 2 \times 10^{-16}$
Border: F = 31.2, $p < 2.7 \times 10^{-8}$
Interaction: F = 0.101, p = 0.904
RFE Sets - Repeated Measures ANOVA
Algorithm: F = 6.1, p < 0.003
Border: F = 22.3, $p < 2.4 \times 10^{-6}$
Interaction: F = 0.074, p = 0.929
Comparison of Index SNP numbers in Genes
FMRP genes (505/585) vs non-FMRP genes (1431/1989)
[Figure: density of index SNP count per gene (x axis 0 to 30, y axis 0.0 to 0.5); panels: Genic Only, PGC Window; Gene Type: FMRP Targets, Not FMRP Targets]
Results of Gene Permutations
Uneven Index SNPs: Genic (807.48/1543), PGC Window (991.72/1812)
[Figure: histograms of ROC scores (x axis 0.56 to 0.60) by count (y axis 0 to 20); panels: Genic Only, PGC Border]
Results of SNP Permutations
Same number of Index SNPs
[Figure: histograms of ROC scores (x axis 0.57 to 0.61) by count (y axis 0 to 15); panels: Genic Only, PGC Border]
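The permutation comparisons follow a general pattern: score many random sets of the same size and see where the observed set falls in that distribution. A minimal sketch, in which the function name, `score_fn` and the pool are hypothetical placeholders; in the talk's setting `score_fn` would build the model for a gene or SNP set and return its ROC score.

```python
import random

def permutation_p_value(observed, score_fn, pool, set_size,
                        n_perm=1000, seed=0):
    """Empirical p-value of an observed score against random sets.

    observed: score of the set of interest (e.g. its ROC score)
    score_fn: callable taking a list of items and returning a score
    pool:     items to draw random sets from (genes or SNPs)
    set_size: number of items per random set
    """
    rng = random.Random(seed)
    null_scores = [score_fn(rng.sample(pool, set_size))
                   for _ in range(n_perm)]
    exceed = sum(score >= observed for score in null_scores)
    # Add one to both counts so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1), null_scores
```

Matching on the number of genes alone can leave the number of index SNPs uneven, which is why a second permutation matched on index SNP count is also useful.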
Conclusions
FMRP Genes seem to carry signal
The FMRP genes clearly outperform other permutations of same size
Effect also seen in SNP permutations
Machine Learning finds the important features
Machine Learning does not yet improve on Polygenic Scoring
Both multi-feature algorithms capable of finding important features
Future Directions
Use a larger, more recent dataset
Use kernel methods to examine interactions
Look at annotation of variants
Look for gender difference in cases/controls
Acknowledgements
Prof. Michael Owen
Dr. Andrew Pocklington
Dr. Valentina Escott-Price
Dr. Andreas Artemiou
scikit-learn Developers
Pandas Developers
ggplot2 Developers