Showing posts with label pipeline. Show all posts
Showing posts with label pipeline. Show all posts

05 December 2014

Divide-and-conquer in a #Makefile : recursivity and #parallelism.

This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of 'N' VCFs files. You want to create a list of:

  • common SNPs in vcf1 and vcf2
  • common SNPs in vcf3 and the previous list
  • common SNPs in vcf4 and the previous list
  • (...)
  • common SNPs in vcfN and the previous list
Yes, I know I can do this using:grep -v '^#' f.vcf|cut -f 1,2,4,5 | sort | uniq

Using a linear Makefile it could look like:

list2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list3: vcf3 list2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list4: vcf4 list3
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list5: vcf5 list4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
(...)

We can speed-up the workflow using the parallel option of make -j (number-of-parallel-jobs) and using a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently in parallel.

list1_2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list3_4: vcf3 vcf4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list1_4: list1_2 list3_4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

By using the internal 'Make' functions $(eval), $(call), $(shell) we can define a recursive method "recursive". This method takes two arguments which are the 0-based indexes of a VCF in the list of VCFs. Here is the Makefile:
Running the makefile:
$ make 
gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1
gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2
LC_ALL=C comm -12 target0_1 target1_2 > target0_2
gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3
gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4
gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5
LC_ALL=C comm -12 target3_4 target4_5 > target3_5
LC_ALL=C comm -12 target2_3 target3_5 > target2_5
LC_ALL=C comm -12 target0_2 target2_5 > target0_5
gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6
gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7
LC_ALL=C comm -12 target5_6 target6_7 > target5_7
gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8
gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9
gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10
LC_ALL=C comm -12 target8_9 target9_10 > target8_10
LC_ALL=C comm -12 target7_8 target8_10 > target7_10
LC_ALL=C comm -12 target5_7 target7_10 > target5_10
LC_ALL=C comm -12 target0_5 target5_10 > target0_10
gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11
gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12
LC_ALL=C comm -12 target10_11 target11_12 > target10_12
gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13
gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14
gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15
LC_ALL=C comm -12 target13_14 target14_15 > target13_15
LC_ALL=C comm -12 target12_13 target13_15 > target12_15
LC_ALL=C comm -12 target10_12 target12_15 > target10_15
gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16
gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17
LC_ALL=C comm -12 target15_16 target16_17 > target15_17
gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18
gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19
gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20
LC_ALL=C comm -12 target18_19 target19_20 > target18_20
LC_ALL=C comm -12 target17_18 target18_20 > target17_20
LC_ALL=C comm -12 target15_17 target17_20 > target15_20
LC_ALL=C comm -12 target10_15 target15_20 > target10_20
LC_ALL=C comm -12 target0_10 target10_20 > target0_20
and here is the generated workflow (drawn with make2graph ).
:
That's it
Pierre.

30 January 2014

Parallelizing #RStats using #make

In the current post, I'll show how to use R as the main SHELL of GNU-Make instead of using a classical linux shell like 'bash'. Why would you do this ?

  • awesomeness
  • Make-based workflow management
  • Make-based execution with --jobs. GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the '-j' or '--jobs' option tells make to execute many recipes simultaneously.
The following recipe has been tested with GNU-Make 4.0 and I'm not sure it would world with '<=3.81'.

The only problem is that R doesn't accept a multiline-argument on the command line (see http://stackoverflow.com/questions/21442674) so I created a wrapper 'mockR' that save the argument '-e "code"' into a file and pipe it into R:

(Edit1: A comment from madscientist : Re your script; you can save yourself some wear-and-tear on your disk and avoid the need for temp files and cleanup by just piping the input directly: echo "$R" | R --vanilla --no-readline --quiet . Just a thought. ")

(Edit2: the exit value of 'R' should also be returned by 'mockR'.)

This file is set as executable:
$ chmod u+x ./mockR
In the makefile, we tell 'make' to use 'mockR' instead of '/usr/bin/sh':
SHELL = ./mockR
The R code will be passed to 'mockR' using the argument '-e "code"'
.SHELLFLAGS= -e
We also set 'ONESHELL': "If .ONESHELL is mentioned as a target, then when a target is built all lines of the recipe will be given to a single invocation of the shell rather than each line being invoked separately"
.ONESHELL:

Example 1

We download the table 'knownGene' from the UCSC and we plot a pdf file 'countExons=f(txStart)'. Please, note that the targets are created using some R statements, NOT bash statements:

Now Invoke make


Example 2

Using a the eval and the call function we can make the previous 'Makefile' applicable for all the chromosomes:

Now Invoke make USING TRHEE PARALLEL JOBS





You can now watch the final pdf files:




That's it,
Pierre

30 October 2013

GNU Make: saving the versions of the tools using 'order-only-prerequisites' : my notebook

Rule 3 of "Ten Simple Rules for Reproducible Computational Research". is
:

Archive the Exact Versions of All External Programs Used
.
I work with Makefile-based workflows: how can I save the version of each software used when I invoke 'make', whatever the target is ? A naive solution is to add a dependency to each target. For example, the following makefile takes a simple SAM file, convert it to BAM, sort and index. For each target, I've added a dependency named "dump_params" that append the version of samtools to a file "config.txt".

But that solution doesn't work because make re-builds all targets even if the top target already exists.
$ make

date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam


$ make

date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

The solution I got via Stackoverflow is to use a order-only-prerequisites: "Order-only prerequisites can be specified by placing a pipe symbol (|) in the prerequisites list: any prerequisites to the left of the pipe symbol are normal; any prerequisites to the right are order-only... (...) Note that if you declare the same file to be both a normal and an order-only prerequisite, the normal prerequisite takes precedence (...)". The makefile with the 'order-only-prerequisites' is now:




And that works ! the final target is generated only once, but the file 'config.txt' is always generated.
$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam < unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt

$ make
date << config.txt && \
 echo -n "Samtools " << config.txt && \
 samtools  2<&1 | grep Version << config.txt
That's it,
Pierre

Update :another solution

Citing MadScientist's answer on stackoverflow : Another option is to use immediately expanded shell functions, like:
__dummy := $(shell echo "Makefile was run." >> config.txt)
Since it's immediately expanded the shell script will be invoked once, as the makefile is read in. There's no need to define a dump_params target or include it as a prerequisite. This is more old-school, but has the advantage that it will run for every invocation of make, without having to go through and ensure every target has the proper order-only prerequisite defined.




31 August 2012

The 500th post: Generating a pipeline of analysis (Makefile) for NGS with xslt.

This is my 500th post ! Happy birthday my blog ! (I know, I know, I have been unfaithful since the beginning of the year).

In my previous post, I've shown how a the .INTERMEDIATE keyword can be used in a Makefile to reduce the number of temporary files generated in a classical NGS pipeline. 'Make' also has an option '-j' to specify the number of threads: some independent tasks (such as aligning two distinct samples) can be processed in parallel .

At this point, I should cite this tweet from Sean Davis:



... and this even-more-recent tweet from Justin H. Johnson:




Here, I will show how XSLT can be used to generate a whole Makefile to process the output of Illumina/Casava after the FASTQs files have been generated using configureBclToFastq.pl using --fastq-cluster-count.

A Xml descriptor generated by Illumina Casava looks like this:
<DemultiplexConfig>
  <Software/>
  <FlowcellInfo ID="C0KTCACXX" Operator="JohnDoe" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane1" Desc="Clusters with unmatched barcodes for lane 1" Ref="unknown" />
    </Lane>
    <Lane Number="2">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane2" Desc="Clusters with unmatched barcodes for lane 2" Ref="unknown" />
    </Lane>
  </FlowcellInfo>
</DemultiplexConfig>
This document contains the required informations to find the fastqs and to generate a Makefile. I wrote the following XSLT stylesheet:


This stylesheet transforms the DemultiplexConfig.xml file to the following Makefile:

#
#files must have been generated using CASAVA/configureBclToFastq.pl with --fastq-cluster-count 0
#path to bwa
override BWA=/path/to/bwa-0.6.1/bwa
BWASAMPEFLAG= -a 600
BWAALNFLAG= -t 4
#path to samtools
SAMTOOLS=/path/to/samtools
#path to BEDTOOLS
BEDTOOLS=/path/to/bedtools/bin
#output directory
OUTDIR=2012831.OUTPUT.idp0
INPUTDIR=.
REF=/path/to/human_g1k_v37.fasta
JAVA=java

PICARDLIB=/path/to/picard-tools
TABIX=/path/to/tabix
GATK=/path/to/GenomeAnalysisTK.jar
GATKFLAGS=
VCFDBSNP=/path/to/ncbi/snp/00-All.vcf.gz

#rule how to create a bai from a bam:
%.bam.bai: %.bam
 #indexing BAM $<
 $(SAMTOOLS) index $<

#creates the default output directory
$(OUTDIR):
 mkdir -p $@

#indexes the reference for bwa
$(REF).bwt $(REF).ann $(REF).amb $(REF).pac $(REF).sa : $(REF)
 $(BWA) index -a bwtsw $<

#indexes the reference for samtools
$(REF).fai: $(REF)
 $(SAMTOOLS) faidx $<

#creates sequence dict for picard
$(REF).dict: $(REF)
 $(JAVA) -jar $(PICARDLIB)/CreateSequenceDictionary.jar REFERENCE=$(REF) OUTPUT=$<



#intermediate see http://plindenbaum.blogspot.fr/2012/08/next-generation-sequencing-gnu-make-and.html
.INTERMEDIATE :  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam


#main target : generate the VCFs
all : vcfs


vcfs:  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz


$(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Father/Father.bam $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Father' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Father/Father.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Father/Father_CTTGTA_L001'

$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp22528 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp22528.sam \
  $(OUTDIR)/idp22528_1.sai \
  $(OUTDIR)/idp22528_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp22528_1.sai $(OUTDIR)/idp22528_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528.sam
 #remove the sam
 rm -f $(OUTDIR)/idp22528.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp22528_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp22528_sorted.bam \
  OUTPUT=$(OUTDIR)/idp22528_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp22528_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp22528_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp31344 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp31344.sam \
  $(OUTDIR)/idp31344_1.sai \
  $(OUTDIR)/idp31344_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp31344_1.sai $(OUTDIR)/idp31344_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344.sam
 #remove the sam
 rm -f $(OUTDIR)/idp31344.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp31344_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp31344_sorted.bam \
  OUTPUT=$(OUTDIR)/idp31344_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp31344_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp31344_dup.bam $@




#creates a VCF for 'Father'
$(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Mother' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Mother/Mother_GCCAAT_L001'

$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp15352 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp15352.sam \
  $(OUTDIR)/idp15352_1.sai \
  $(OUTDIR)/idp15352_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp15352_1.sai $(OUTDIR)/idp15352_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352.sam
 #remove the sam
 rm -f $(OUTDIR)/idp15352.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp15352_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp15352_sorted.bam \
  OUTPUT=$(OUTDIR)/idp15352_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp15352_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp15352_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idm22432 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idm22432.sam \
  $(OUTDIR)/idm22432_1.sai \
  $(OUTDIR)/idm22432_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idm22432_1.sai $(OUTDIR)/idm22432_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432.sam
 #remove the sam
 rm -f $(OUTDIR)/idm22432.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idm22432_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idm22432_sorted.bam \
  OUTPUT=$(OUTDIR)/idm22432_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idm22432_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idm22432_dup.bam $@




#creates a VCF for 'Mother'
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Daughter' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Daughter/Daughter_TGACCA_L001'

$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp5176 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp5176.sam \
  $(OUTDIR)/idp5176_1.sai \
  $(OUTDIR)/idp5176_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp5176_1.sai $(OUTDIR)/idp5176_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176.sam
 #remove the sam
 rm -f $(OUTDIR)/idp5176.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp5176_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp5176_sorted.bam \
  OUTPUT=$(OUTDIR)/idp5176_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp5176_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp5176_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp211744 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp211744.sam \
  $(OUTDIR)/idp211744_1.sai \
  $(OUTDIR)/idp211744_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp211744_1.sai $(OUTDIR)/idp211744_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744.sam
 #remove the sam
 rm -f $(OUTDIR)/idp211744.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp211744_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp211744_sorted.bam \
  OUTPUT=$(OUTDIR)/idp211744_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp211744_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp211744_dup.bam $@




#creates a VCF for 'Daughter'
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz
(Of course, this workflow would satisfy my needs, may be not yours)

Et voila !


That's it !


Pierre