YOKOFAKUN: pipeline

Showing posts with label pipeline. Show all posts

05 December 2014

Divide-and-conquer in a #Makefile : recursivity and #parallelism.

This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of 'N' VCFs files. You want to create a list of:

common SNPs in vcf1 and vcf2
common SNPs in vcf3 and the previous list
common SNPs in vcf4 and the previous list
(...)
common SNPs in vcfN and the previous list

Yes, I know I can do this using:grep -v '^#' f.vcf|cut -f 1,2,4,5 | sort | uniq

Using a linear Makefile it could look like:

list2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list3: vcf3 list2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list4: vcf4 list3
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list5: vcf5 list4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
(...)

We can speed-up the workflow using the parallel option of make -j (number-of-parallel-jobs) and using a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently in parallel.

list1_2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list3_4: vcf3 vcf4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list1_4: list1_2 list3_4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

By using the internal 'Make' functions $(eval), $(call), $(shell) we can define a recursive method "recursive". This method takes two arguments which are the 0-based indexes of a VCF in the list of VCFs. Here is the Makefile:
Running the makefile:

$ make 
gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1
gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2
LC_ALL=C comm -12 target0_1 target1_2 > target0_2
gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3
gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4
gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5
LC_ALL=C comm -12 target3_4 target4_5 > target3_5
LC_ALL=C comm -12 target2_3 target3_5 > target2_5
LC_ALL=C comm -12 target0_2 target2_5 > target0_5
gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6
gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7
LC_ALL=C comm -12 target5_6 target6_7 > target5_7
gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8
gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9
gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10
LC_ALL=C comm -12 target8_9 target9_10 > target8_10
LC_ALL=C comm -12 target7_8 target8_10 > target7_10
LC_ALL=C comm -12 target5_7 target7_10 > target5_10
LC_ALL=C comm -12 target0_5 target5_10 > target0_10
gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11
gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12
LC_ALL=C comm -12 target10_11 target11_12 > target10_12
gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13
gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14
gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15
LC_ALL=C comm -12 target13_14 target14_15 > target13_15
LC_ALL=C comm -12 target12_13 target13_15 > target12_15
LC_ALL=C comm -12 target10_12 target12_15 > target10_15
gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16
gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17
LC_ALL=C comm -12 target15_16 target16_17 > target15_17
gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18
gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19
gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20
LC_ALL=C comm -12 target18_19 target19_20 > target18_20
LC_ALL=C comm -12 target17_18 target18_20 > target17_20
LC_ALL=C comm -12 target15_17 target17_20 > target15_20
LC_ALL=C comm -12 target10_15 target15_20 > target10_20
LC_ALL=C comm -12 target0_10 target10_20 > target0_20

and here is the generated workflow (drawn with make2graph ).
:

That's it
Pierre.

30 January 2014

Parallelizing #RStats using #make

In the current post, I'll show how to use R as the main SHELL of GNU-Make instead of using a classical linux shell like 'bash'. Why would you do this ?

awesomeness
Make-based workflow management
Make-based execution with --jobs. GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the '-j' or '--jobs' option tells make to execute many recipes simultaneously.

The following recipe has been tested with GNU-Make 4.0 and I'm not sure it would world with '<=3.81'.

The only problem is that R doesn't accept a multiline-argument on the command line (see http://stackoverflow.com/questions/21442674) so I created a wrapper 'mockR' that save the argument '-e "code"' into a file and pipe it into R:

(Edit1: A comment from madscientist : Re your script; you can save yourself some wear-and-tear on your disk and avoid the need for temp files and cleanup by just piping the input directly: echo "$R" | R --vanilla --no-readline --quiet . Just a thought. ")

(Edit2: the exit value of 'R' should also be returned by 'mockR'.)

This file is set as executable:

$ chmod u+x ./mockR

In the makefile, we tell 'make' to use 'mockR' instead of '/usr/bin/sh':

SHELL = ./mockR

The R code will be passed to 'mockR' using the argument '-e "code"'

.SHELLFLAGS= -e

We also set 'ONESHELL': "If .ONESHELL is mentioned as a target, then when a target is built all lines of the recipe will be given to a single invocation of the shell rather than each line being invoked separately"

.ONESHELL:

Example 1

We download the table 'knownGene' from the UCSC and we plot a pdf file 'countExons=f(txStart)'. Please, note that the targets are created using some R statements, NOT bash statements:

Now Invoke make

Example 2

Using a the eval and the call function we can make the previous 'Makefile' applicable for all the chromosomes:

Now Invoke make USING TRHEE PARALLEL JOBS

You can now watch the final pdf files:

That's it,
Pierre

30 October 2013

GNU Make: saving the versions of the tools using 'order-only-prerequisites' : my notebook

Rule 3 of "Ten Simple Rules for Reproducible Computational Research". is
:

Archive the Exact Versions of All External Programs Used

.
I work with Makefile-based workflows: how can I save the version of each software used when I invoke 'make', whatever the target is ? A naive solution is to add a dependency to each target. For example, the following makefile takes a simple SAM file, convert it to BAM, sort and index. For each target, I've added a dependency named "dump_params" that append the version of samtools to a file "config.txt".

But that solution doesn't work because make re-builds all targets even if the top target already exists.

$ make

date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam


$ make

date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

The solution I got via Stackoverflow is to use a order-only-prerequisites: "Order-only prerequisites can be specified by placing a pipe symbol (|) in the prerequisites list: any prerequisites to the left of the pipe symbol are normal; any prerequisites to the right are order-only... (...) Note that if you declare the same file to be both a normal and an order-only prerequisite, the normal prerequisite takes precedence (...)". The makefile with the 'order-only-prerequisites' is now:

And that works ! the final target is generated only once, but the file 'config.txt' is always generated.

$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt

$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt

That's it,
Pierre

Update :another solution

Citing MadScientist's answer on stackoverflow : Another option is to use immediately expanded shell functions, like:

__dummy := $(shell echo "Makefile was run." >> config.txt)

Since it's immediately expanded the shell script will be invoked once, as the makefile is read in. There's no need to define a dump_params target or include it as a prerequisite. This is more old-school, but has the advantage that it will run for every invocation of make, without having to go through and ensure every target has the proper order-only prerequisite defined.

31 August 2012

The 500th post: Generating a pipeline of analysis (Makefile) for NGS with xslt.

This is my 500th post ! Happy birthday my blog ! (I know, I know, I have been unfaithful since the beginning of the year).

In my previous post, I've shown how a the .INTERMEDIATE keyword can be used in a Makefile to reduce the number of temporary files generated in a classical NGS pipeline. 'Make' also has an option '-j' to specify the number of threads: some independent tasks (such as aligning two distinct samples) can be processed in parallel .

At this point, I should cite this tweet from Sean Davis:

@yokofakun You might also take a look at distmake if you are on a cluster.Ruffus (python) is an option more bioinfo oriented than make
— Sean Davis (@seandavis12) August 30, 2012

... and this even-more-recent tweet from Justin H. Johnson:

Snakemake - A scalable bioinformatics workflow engine. 1.usa.gov/NqPI9l #Bioinformatics
— Justin H. Johnson (@BioInfo) August 31, 2012

Here, I will show how XSLT can be used to generate a whole Makefile to process the output of Illumina/Casava after the FASTQs files have been generated using configureBclToFastq.pl using --fastq-cluster-count.

A Xml descriptor generated by Illumina Casava looks like this:

<DemultiplexConfig>
  <Software/>
  <FlowcellInfo ID="C0KTCACXX" Operator="JohnDoe" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane1" Desc="Clusters with unmatched barcodes for lane 1" Ref="unknown" />
    </Lane>
    <Lane Number="2">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane2" Desc="Clusters with unmatched barcodes for lane 2" Ref="unknown" />
    </Lane>
  </FlowcellInfo>
</DemultiplexConfig>

This document contains the required informations to find the fastqs and to generate a Makefile. I wrote the following XSLT stylesheet:

https://github.com/lindenb/xslt-sandbox/(..)/demultiplexcfg2make.xsl

This stylesheet transforms the DemultiplexConfig.xml file to the following Makefile:

#
#files must have been generated using CASAVA/configureBclToFastq.pl with --fastq-cluster-count 0
#path to bwa
override BWA=/path/to/bwa-0.6.1/bwa
BWASAMPEFLAG= -a 600
BWAALNFLAG= -t 4
#path to samtools
SAMTOOLS=/path/to/samtools
#path to BEDTOOLS
BEDTOOLS=/path/to/bedtools/bin
#output directory
OUTDIR=2012831.OUTPUT.idp0
INPUTDIR=.
REF=/path/to/human_g1k_v37.fasta
JAVA=java

PICARDLIB=/path/to/picard-tools
TABIX=/path/to/tabix
GATK=/path/to/GenomeAnalysisTK.jar
GATKFLAGS=
VCFDBSNP=/path/to/ncbi/snp/00-All.vcf.gz

#rule how to create a bai from a bam:
%.bam.bai: %.bam
 #indexing BAM $<
 $(SAMTOOLS) index $<

#creates the default output directory
$(OUTDIR):
 mkdir -p $@

#indexes the reference for bwa
$(REF).bwt $(REF).ann $(REF).amb $(REF).pac $(REF).sa : $(REF)
 $(BWA) index -a bwtsw $<

#indexes the reference for samtools
$(REF).fai: $(REF)
 $(SAMTOOLS) faidx $<

#creates sequence dict for picard
$(REF).dict: $(REF)
 $(JAVA) -jar $(PICARDLIB)/CreateSequenceDictionary.jar REFERENCE=$(REF) OUTPUT=$<



#intermediate see http://plindenbaum.blogspot.fr/2012/08/next-generation-sequencing-gnu-make-and.html
.INTERMEDIATE :  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam


#main target : generate the VCFs
all : vcfs


vcfs:  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz


$(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Father/Father.bam $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Father' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Father/Father.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Father/Father_CTTGTA_L001'

$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp22528 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp22528.sam \
  $(OUTDIR)/idp22528_1.sai \
  $(OUTDIR)/idp22528_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp22528_1.sai $(OUTDIR)/idp22528_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528.sam
 #remove the sam
 rm -f $(OUTDIR)/idp22528.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp22528_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp22528_sorted.bam \
  OUTPUT=$(OUTDIR)/idp22528_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp22528_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp22528_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp31344 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp31344.sam \
  $(OUTDIR)/idp31344_1.sai \
  $(OUTDIR)/idp31344_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp31344_1.sai $(OUTDIR)/idp31344_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344.sam
 #remove the sam
 rm -f $(OUTDIR)/idp31344.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp31344_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp31344_sorted.bam \
  OUTPUT=$(OUTDIR)/idp31344_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp31344_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp31344_dup.bam $@




#creates a VCF for 'Father'
$(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Mother' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Mother/Mother_GCCAAT_L001'

$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp15352 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp15352.sam \
  $(OUTDIR)/idp15352_1.sai \
  $(OUTDIR)/idp15352_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp15352_1.sai $(OUTDIR)/idp15352_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352.sam
 #remove the sam
 rm -f $(OUTDIR)/idp15352.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp15352_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp15352_sorted.bam \
  OUTPUT=$(OUTDIR)/idp15352_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp15352_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp15352_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idm22432 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idm22432.sam \
  $(OUTDIR)/idm22432_1.sai \
  $(OUTDIR)/idm22432_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idm22432_1.sai $(OUTDIR)/idm22432_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432.sam
 #remove the sam
 rm -f $(OUTDIR)/idm22432.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idm22432_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idm22432_sorted.bam \
  OUTPUT=$(OUTDIR)/idm22432_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idm22432_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idm22432_dup.bam $@




#creates a VCF for 'Mother'
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Daughter' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Daughter/Daughter_TGACCA_L001'

$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp5176 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp5176.sam \
  $(OUTDIR)/idp5176_1.sai \
  $(OUTDIR)/idp5176_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp5176_1.sai $(OUTDIR)/idp5176_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176.sam
 #remove the sam
 rm -f $(OUTDIR)/idp5176.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp5176_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp5176_sorted.bam \
  OUTPUT=$(OUTDIR)/idp5176_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp5176_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp5176_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp211744 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp211744.sam \
  $(OUTDIR)/idp211744_1.sai \
  $(OUTDIR)/idp211744_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp211744_1.sai $(OUTDIR)/idp211744_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744.sam
 #remove the sam
 rm -f $(OUTDIR)/idp211744.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp211744_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp211744_sorted.bam \
  OUTPUT=$(OUTDIR)/idp211744_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp211744_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp211744_dup.bam $@




#creates a VCF for 'Daughter'
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz

(Of course, this workflow would satisfy my needs, may be not yours)

Et voila !

That's it !

Pierre

YOKOFAKUN