SFst

Guobo Chen

Fst estimation between cohorts and geographic location inference
Fst is commonly used as a measure of genetic differentiation. This function calculates Fst from GWAS summary statistics.

Master command
SFst

The summary statistic files (meta files) should have these columns: "SNP", "EAF" (reference allele frequency), "SE" (standard error of EAF), "A1", "A2", "CHR", "BP", and "P". The keywords are case-insensitive, but should be specified in the order listed (the columns themselves can appear in any order in the summary statistic files). "A1" is the reference allele, and "A2" is the other allele. Other columns, such as reference allele frequency and standard error of allele frequency, can also be included.

SNP CHR BP A1 A2 OR/Beta SE P RAF RAF_SE
snp1 1 100 G T 1.05 0.03 0.03 0.10 0.35
snp2 2 200 T A 0.95 0.033 0.12 0.3 0.03

The program automatically eliminates ambiguous loci, such as A/T and G/C loci. In the example, the second row, which has ambiguous alleles (T/A), will be eliminated.
Of note, all the summary statistic files should have the same column names, but the column order can differ between files.
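
As an illustration of the filter described above, the following minimal Python sketch flags strand-ambiguous loci. It is not taken from GEAR's source; the helper name `is_ambiguous` and the two example rows (copied from the table above) are used only for illustration.

```python
# Hypothetical helper, not part of GEAR: flags strand-ambiguous allele pairs.
# A/T and G/C pairs are their own complements, so the strand cannot be
# inferred from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("GC")}


def is_ambiguous(a1: str, a2: str) -> bool:
    """Return True if the two alleles form an A/T or G/C pair (any order, any case)."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


# (SNP, A1, A2) taken from the example rows above.
rows = [("snp1", "G", "T"), ("snp2", "T", "A")]
kept = [snp for snp, a1, a2 in rows if not is_ambiguous(a1, a2)]
print(kept)  # ['snp1']
```

Here snp2 (T/A) is dropped and snp1 (G/T) is kept, matching the behaviour described for the example table.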

Options

--meta-batch <arg>
Specify the batch file, in which each line names one meta file.
It looks like:
gwas1.txt
gwas2.txt
...

--qt-size <arg>
Specify the file in which each line contains the sample size for the file at the corresponding row in meta-batch.
It looks like:
100
200
...

--cc-size <arg>
For case-control studies, each line has two elements, the number of cases and the number of controls for each corresponding file.
It looks like:
200 300
1000 800
...

--me <arg>
Specifies the number of markers that should be sampled for calculating LambdaMeta or Fst. By default, it samples 30000 markers.

--key <args>
Although the summary statistic files may have all the required columns, their names may differ. For example, a file may use "markerID" for "SNP", "effect" for "beta", "SE" for "SE", "Ref_Allele" for "A1", "Other_Allele" for "A2", "Chromosome" for "CHR", "POS" for "BP", "Pval" for "P", "RAF" for "RAF", and "freq_se" for "RAF_SE". Then this option should be used as
"--key markerID effect SE Ref_Allele Other_Allele Chromosome POS Pval"
Pval will be used to calculate the genomic inflation factor for each cohort.

--top <arg>
This option tells the program that only the top X files listed in --meta-batch will be compared to all files. For example, if there are 10 summary statistic files in --meta-batch and "--top 1" is used, it only calculates LambdaMeta (or Fst) between the first file and the other files.
In practice, to calculate Fst only between each cohort and the 1KG European samples, put the summary statistic file for 1KG first in --meta-batch and use the "--top 1" option.

--chr <arg>
Specify the chromosome for analysis.

--verbose
If this option is switched on, the detailed Fst results for each selected SNP will be saved into "*.fst.gz" for each pair of cohorts.

--no-weight
If this option is switched on, the program always assumes equal sample sizes for each pair of cohorts.
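
To give a feel for what sample-size weighting can change for a pair of cohorts, here is a minimal Python sketch of a textbook two-population Fst, computed as the variance of the allele frequency between cohorts divided by p(1 - p) at the mean frequency. This is an assumption made for illustration only: this page does not state which estimator GEAR uses, and the function `pairwise_fst` and its parameters are hypothetical.

```python
# Textbook two-population Fst from allele frequencies. This is an illustrative
# assumption; GEAR's actual estimator is not documented here and may differ.
def pairwise_fst(p1, p2, n1=1.0, n2=1.0, weight=True):
    """Return Var(p) / (p_bar * (1 - p_bar)) for one SNP in two cohorts.

    p1, p2 : reference allele frequencies in the two cohorts
    n1, n2 : sample sizes, used as weights when weight=True
    weight : if False, both cohorts get equal weight regardless of n1 and n2
    """
    if not weight:
        n1 = n2 = 1.0
    w1 = n1 / (n1 + n2)
    w2 = 1.0 - w1
    p_bar = w1 * p1 + w2 * p2            # (weighted) mean allele frequency
    denom = p_bar * (1.0 - p_bar)
    if denom == 0.0:
        return 0.0                       # monomorphic in both cohorts
    var_p = w1 * (p1 - p_bar) ** 2 + w2 * (p2 - p_bar) ** 2
    return var_p / denom


# Frequencies 0.1 and 0.3 in cohorts of 100 and 300 samples.
print(pairwise_fst(0.1, 0.3, 100, 300))                # about 0.04
print(pairwise_fst(0.1, 0.3, 100, 300, weight=False))  # about 0.0625
```

With weighting, the larger cohort pulls the mean frequency towards its own value, which lowers the estimate in this example; with equal weights the result depends only on the two frequencies.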

Examples

java -jar gear.jar sfst --meta-batch metalist.txt --qt-size qt-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP P --out test
java -jar gear.jar sfst --meta-batch metalist.txt --cc-size cc-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP P --me 50000 --out test

java -jar gear.jar sfst --meta-batch metalist.txt --qt-size qt-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP P --no-weight --out test
java -jar gear.jar sfst --meta-batch metalist.txt --cc-size cc-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP P --me 50000 --no-weight --out test
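
The input files named in the commands above can be prepared by hand or with a short script. The sketch below writes a batch file and a matching sample-size file, and checks that both have the same number of lines, since each sample-size line must correspond to the meta file on the same row. The helper `write_inputs` is hypothetical and not part of GEAR; the file names and sample sizes are the ones used as examples on this page.

```python
from pathlib import Path


def write_inputs(meta_files, sizes,
                 batch_path="metalist.txt", size_path="qt-sample-size.txt"):
    """Write the --meta-batch file and the matching sample-size file.

    sizes holds one entry per meta file: a single number for --qt-size,
    or a (cases, controls) pair for --cc-size.
    """
    if len(meta_files) != len(sizes):
        raise ValueError("need exactly one sample-size line per meta file")
    size_lines = []
    for entry in sizes:
        parts = entry if isinstance(entry, (tuple, list)) else (entry,)
        size_lines.append(" ".join(str(x) for x in parts))
    Path(batch_path).write_text("\n".join(meta_files) + "\n")
    Path(size_path).write_text("\n".join(size_lines) + "\n")


# Two quantitative-trait cohorts with 100 and 200 samples.
write_inputs(["gwas1.txt", "gwas2.txt"], [100, 200])
```

For case-control studies, pass pairs such as `[(200, 300), (1000, 800)]` and a different `size_path`, for example `cc-sample-size.txt`.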

Given the estimated Fst, the geographic location of each cohort can be inferred using the "fpc" subcommand.

1000 Genomes reference samples can be found at 1000 Reference samples.
Examples

java -jar gear.jar fpc --fst test.fst --out test
java -jar gear.jar fpc --fst test.fst --ref 9 3 1 --out test

test.fst is the fst matrix calculated from sfst. --ref specifies the three reference populations in test.fst. By default, the first three cohorts in test.fst will be set as the reference populations.

The output is test.fpc, which has two columns representing the coordinates of the inferred geographic location of each cohort.
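
The two-column output can be loaded for plotting or further analysis with a few lines of Python. The sketch below assumes that test.fpc is whitespace-delimited with no header line and one cohort per row; this page does not confirm those details, so adjust the parsing if the actual file differs. The function `parse_fpc` is hypothetical, and the values in the usage line are placeholders, not real GEAR output.

```python
def parse_fpc(text):
    """Parse two-column coordinate text into a list of (x, y) tuples.

    Assumes whitespace-delimited columns and no header line (an assumption,
    not confirmed by the documentation). Lines with fewer than two fields
    are skipped.
    """
    coords = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        coords.append((float(fields[0]), float(fields[1])))
    return coords


# Placeholder text standing in for the contents of test.fpc.
example = "0.0 0.0\n1.0 0.0\n0.5 0.8\n"
print(parse_fpc(example))
```

To read an actual output file, pass its contents instead, for example `parse_fpc(open("test.fpc").read())`.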