Eli Lilly / September 14-15, 2011

20-Line Lifesavers:
Coding simple solutions in the GATK
Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo
Genome Sequencing and Analysis, Broad Institute

Research Informatics Group
Eli Lilly and Company
Genome Analysis Toolkit (GATK)
ˈjē-ˌnōm(,)ə-ˈna-lə-səs(,)ˈtül(,)kit

Noun

1.  A suite of tools for working with medical resequencing projects
    (e.g. 1,000 Genomes, The Cancer Genome Atlas)

2.  A structured software library that makes writing efficient
    analysis tools using next-generation sequencing data easy

Most users think of the toolkit merely as a set of tools that implement our ideas…
… but the GATK’s real power is in how easy it makes it to instantiate your ideas.

This is what we will discuss today.
Some tasks are made difficult by the wrong tools

“These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?”

“Convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex BAM. Lines of code: 500.”

“All day!”

With all apologies to Randall Munroe and XKCD
That same task, written in the GATK (20 lines of code)
package org.broadinstitute.sting.gatk.walkers.examples;

import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;

public class FixReadNames extends ReadWalker<Integer, Integer> {
    // The GATK instantiates this writer; the -o argument selects the output BAM.
    @Output
    SAMFileWriter out;

    @Override
    public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
        // Prepend the read group's platform unit so names stay unique after merging.
        read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
        out.addAlignment(read);

        return null;
    }

    // No aggregate result is needed, so the reduce step does nothing.
    @Override
    public Integer reduceInit() { return null; }

    @Override
    public Integer reduce(Integer value, Integer sum) { return null; }
}
That same task, written in the GATK
(code that’s not filled in for you by the IDE: 5 lines)
Most of the code is boilerplate, and the IDE can fill it in for you. The amount of code you have to write manually is actually very small.
Those tasks are simple when using the right tools…

“These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?”

“Write a GATK ReadWalker that modifies the read name and writes it out again. Spend rest of time looking at lolcats. Lines of code: 5.”

“Um, all day...”

With all apologies to Randall Munroe and XKCD
…though whether you’ll tell people that is up to you.

“Hehe, I can haz cheezburger INDEED.”

With all apologies to Randall Munroe and XKCD
We’re going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code
Now we’ll go through a bunch of programs and learn to write new GATK tools by example
•  We’ll set up the environment and look at five tutorial programs:
  –  HelloRead: A simple walker that prints read information from a BAM
  –  FixReadNames: Modify read names and emit results to a new BAM file
  –  HelloVariant: A simple walker that prints variant information from a VCF
  –  ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
  –  FindExclusiveVariants: Create a new VCF of variants exclusive to a sample

•  Finished and commented versions are in the codebase at:
  –  java/src/org/broadinstitute/sting/gatk/walkers/tutorial/

•  How these tutorials work:
  –  A numbered icon (e.g. 3) marks each step in a tutorial.
  –  The code that you should write at each step is in the IntelliJ window.
  –  Text in boxes gives additional information on each step, emphasizes some information, and may clarify the command or code that you should write.
Setting up for GATK development
See our wiki resources

•  http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ

•  http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA
Mechanics of a GATK “walker”
(a program that “walks” along a dataset in a prescribed way)
ReadWalker: “walks” over reads and allows a computation to be performed on each one

ReadWalker: process one read at a time

[Diagram: reads (1) through (5) aligned beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation visits the reads in that order.]

Example use cases:
1.  Setting an extra metadata tag in a read
2.  Searching for mouse contaminant reads and excluding them
3.  Finding or realigning indels

Some example GATK programs: CycleQualityWalker, TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.
LocusWalker: “walks” over genomic positions and allows a computation to be performed at each one

LocusWalker: process a single-base genomic position at a time

[Diagram: computation order (1)(2)(3)(4)(5) … runs base by base along the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, with the reads aligned beneath it.]

Example use cases:
1.  Variant calling
2.  Depth of coverage calculations
3.  Computing properties of regions (GC content, read error rates)

Note: reads are required for locus walkers. RefWalkers are a similar type of walker that examines each genomic locus but does not require reads.
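
To make the locus-by-locus idea concrete, here is a minimal sketch of a depth-of-coverage calculation written in plain Java. It does not use any GATK classes: AlignedRead, depthPerLocus, and the coordinates are hypothetical stand-ins chosen for illustration, and the real engine hands each locus to your walker rather than looping like this.

```java
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only: plain Java, no GATK classes involved. */
final class LocusDepthSketch {

    /** Hypothetical stand-in for an aligned read (1-based, inclusive coordinates). */
    static final class AlignedRead {
        final String name;
        final int start;
        final int end;

        AlignedRead(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns an array where element i is the number of reads covering position regionStart + i. */
    static int[] depthPerLocus(List<AlignedRead> reads, int regionStart, int regionEnd) {
        int[] depth = new int[regionEnd - regionStart + 1];
        // Visit one position at a time, in order, as a locus-style traversal would.
        for (int pos = regionStart; pos <= regionEnd; pos++) {
            for (AlignedRead read : reads) {
                if (read.start <= pos && pos <= read.end) {
                    depth[pos - regionStart]++;
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        List<AlignedRead> reads = Arrays.asList(
                new AlignedRead("r1", 1, 4),
                new AlignedRead("r2", 3, 6),
                new AlignedRead("r3", 5, 8));
        int[] depth = depthPerLocus(reads, 1, 8);
        for (int i = 0; i < depth.length; i++) {
            System.out.println((i + 1) + "\t" + depth[i]);
        }
    }
}
```

The nested loop is deliberately naive; it only illustrates that the unit of work is a position, not a read.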
RodWalker: “walks” over positions in a file and allows a computation to be performed at each one

RodWalker: process one genomic position from a file (e.g. VCF) at a time

[Diagram: variants for SampleA, SampleB, and SampleC marked (*) beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation visits variant positions (1) through (4) in order.]

Example use cases:
1.  Variant filtering
2.  Computing metrics on variants
3.  Refining variant calls by enforcing additional constraints

Some example GATK programs: VariantEval, PhaseByTransmission, VariantAnnotator, VariantRecalibrator, SelectVariants, etc.
Writing your first GATK walkers
Example 1: Hello, Read!

Step 1. Right-click on “walkers” and select New->Package.

Step 2. Type “examples” as the package name.

Step 3. Click “OK”.

Step 4. Right-click on “examples” and select New->Java class. Enter the name “HelloRead”. A file declaring the class and proper package name is created for you.

Step 5. Add the following text to the class declaration:

    extends ReadWalker<Integer, Integer> {

This tells the GATK that you are creating a program that iterates over all of the reads in a BAM file, one at a time. The “import” statement at the top will be added by the IDE.

Step 6. IntelliJ can detect which methods you need to implement to get your program working. Make sure your cursor is on the class declaration and press Alt-Enter to open the contextual action menu. Select “Implement Methods”.

Step 7. Select all of the methods (usually they’ll already be selected, so you won’t need to do anything).

Step 8. Click “OK”.

The three methods, map(), reduceInit(), and reduce(), are now implemented with placeholder code.
Step 9. Declare a PrintStream and mark it with the @Output annotation. This tells the GATK that we’re going to channel our output through this object. Don’t worry about instantiating it; the GATK will do that automatically.

Step 10. In your map() method, add a line of code that prints “Hello” and the name of the read:

    out.println("Hello, " + read.getReadName());

Or just type read. and then press Ctrl-Space. IntelliJ will show you a window of all the methods you can call, and you can select one from the list.

Step 11. When you’re done, click the disk icon (or press Ctrl-S) to save your work.
Step 12. Back in the terminal window, change to your gatk-lilly directory and type:

    ant dist

This will compile the GATK-Lilly codebase, including your new walker. It’ll take about a minute to compile.
Step 13. Run your code by entering the following command:

    java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
      | less

Every walker must be provided with a reference FASTA file.

Your code is now running and saying “Hello” to every read in the file!
Step 14. Let’s add some information to the output. Add the line:

    out.println("Hello, " + read.getReadName() +
                " at " + read.getReferenceName() +
                ":" + read.getAlignmentStart());

This will print out the read name, the contig name, and the starting position of the read’s alignment.
Step 15. Compile and run with a single command:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
      | less -S

(The && instructs the shell to proceed only if the previous command was successful. If the compilation fails, HelloRead will not be run.)

The updated command is running and showing us the alignment position in addition to the read name!
Step 16. You can run on just a specific region by supplying the -L argument, and redirect the output to a separate file with the -o argument:

    java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
      -L chr21:9411000-9411200 \
      -o test.txt

No additional code is required on your part to enable this.

The resultant file contains reads from chr21:9,411,000-9,411,200 only.
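
For intuition about what the -L argument saves you from writing, the following plain-Java sketch parses an interval string of the form used above and tests whether a read overlaps it. This is not the GATK's implementation: IntervalSketch, parse, and overlaps are hypothetical names, and the sketch assumes that a read counts as "in the region" when any part of it overlaps the interval.

```java
/** Conceptual sketch only: a hand-rolled interval filter, not the GATK's own code. */
final class IntervalSketch {
    final String contig;
    final int start; // 1-based, inclusive
    final int end;   // 1-based, inclusive

    IntervalSketch(String contig, int start, int end) {
        this.contig = contig;
        this.start = start;
        this.end = end;
    }

    /** Parses text of the form "contig:start-end", e.g. "chr21:9411000-9411200". */
    static IntervalSketch parse(String text) {
        int colon = text.lastIndexOf(':');
        int dash = text.indexOf('-', colon);
        String contig = text.substring(0, colon);
        int start = Integer.parseInt(text.substring(colon + 1, dash));
        int end = Integer.parseInt(text.substring(dash + 1));
        return new IntervalSketch(contig, start, end);
    }

    /** True when the read shares at least one base with this interval (assumed semantics). */
    boolean overlaps(String readContig, int readStart, int readEnd) {
        return contig.equals(readContig) && readStart <= end && readEnd >= start;
    }

    public static void main(String[] args) {
        IntervalSketch region = parse("chr21:9411000-9411200");
        System.out.println(region.overlaps("chr21", 9410950, 9411050)); // straddles the start
        System.out.println(region.overlaps("chr21", 9411201, 9411300)); // begins past the end
    }
}
```

With the engine handling this filtering, the walker's map() method stays unchanged whether it runs on a whole BAM or a 200-base window.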
Example 2: Fix read names

Let’s use what we’ve learned to write a program that changes read names, as discussed earlier in this tutorial.

Step 1. Create a new example program called “FixReadNames”.

Step 2. Make FixReadNames a ReadWalker.

Step 3. This time, we’ll emit a BAM file by directing the output to a SAMFileWriter object instead of a PrintStream.

Step 4. Change the read name, tacking on the platform unit information.

Step 5. Add the alignment to the output stream.
Step 6. Compile and run your code:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T FixReadNames \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
      -L chr21:9411000-9411200 \
      -o test.bam

Step 7. Run the following command to see your results:

    samtools view test.bam | less -S

All of the read names now have the platform unit prepended to them!
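
Why prepending the platform unit resolves the collisions can be shown without any BAM files. The sketch below uses plain strings in place of reads; fixReadName mirrors the string concatenation in the walker's map() method, while the lane contents and the platform unit values "PU_A" and "PU_B" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Conceptual sketch only: strings stand in for reads; platform units are hypothetical. */
final class ReadNameCollisionSketch {

    /** Same concatenation as the walker: platform unit, a dot, then the original name. */
    static String fixReadName(String platformUnit, String readName) {
        return platformUnit + "." + readName;
    }

    /** Number of distinct names; fewer than names.size() means some names collide. */
    static int distinctNames(List<String> names) {
        return new HashSet<>(names).size();
    }

    public static void main(String[] args) {
        // Two lanes that both number their reads 1, 2, 3.
        List<String> laneA = Arrays.asList("1", "2", "3");
        List<String> laneB = Arrays.asList("1", "2", "3");

        List<String> merged = new ArrayList<>(laneA);
        merged.addAll(laneB);
        System.out.println("before: " + distinctNames(merged) + " distinct of " + merged.size());

        List<String> fixed = new ArrayList<>();
        for (String name : laneA) {
            fixed.add(fixReadName("PU_A", name));
        }
        for (String name : laneB) {
            fixed.add(fixReadName("PU_B", name));
        }
        System.out.println("after: " + distinctNames(fixed) + " distinct of " + fixed.size());
    }
}
```

The fix only works if each input BAM carries a different platform unit in its read group header, which is an assumption of this approach rather than something the walker checks.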
Example 3: Hello, Variant!

This will be a larger example, introducing variant processing, map-reduce calculations, and the onTraversalDone() method. All code required is listed here.

Step 1. We’ve created a new program called “HelloVariant”.

Step 2. This program extends RodWalker<Integer, Integer>.

Step 3. It declares a PrintStream.

Step 4. In the map() function, we’ll loop over lines in a VCF file and print metadata from each record.

Step 5. Return 1. This will get passed to reduce() later.

Step 6. reduceInit() gets called before the first reduce() call. By returning 0, we initialize the record counter.

Step 7. All of the return values from map() get passed to reduce(), one at a time. Here, we add value to sum, effectively counting all the calls to map().

Step 8. The onTraversalDone() method runs after the computation is complete. Here, we print the total number of map() calls made.
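
The order in which these four methods are called can be sketched in plain Java as follows. MiniWalker and traverse are hypothetical names invented for this illustration; they imitate the shape of the calls described in the steps above and are not the GATK's actual engine or interfaces.

```java
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only: a toy traversal loop, not the GATK engine. */
final class TraversalSketch {

    /** Hypothetical walker contract with the four methods named in the steps above. */
    interface MiniWalker<T, M, R> {
        M map(T record);
        R reduceInit();
        R reduce(M value, R sum);
        void onTraversalDone(R result);
    }

    /** Calls reduceInit() once, then map() and reduce() per record, then onTraversalDone(). */
    static <T, M, R> R traverse(MiniWalker<T, M, R> walker, List<T> records) {
        R sum = walker.reduceInit();
        for (T record : records) {
            sum = walker.reduce(walker.map(record), sum);
        }
        walker.onTraversalDone(sum);
        return sum;
    }

    /** Counts records the way the example does: map() returns 1, reduce() adds. */
    static final class CountingWalker implements MiniWalker<String, Integer, Integer> {
        @Override
        public Integer map(String record) {
            System.out.println("Hello, " + record);
            return 1;
        }

        @Override
        public Integer reduceInit() {
            return 0;
        }

        @Override
        public Integer reduce(Integer value, Integer sum) {
            return value + sum;
        }

        @Override
        public void onTraversalDone(Integer result) {
            System.out.println("Processed " + result + " records");
        }
    }

    public static void main(String[] args) {
        traverse(new CountingWalker(), Arrays.asList("chr21:100", "chr21:200", "chr21:300"));
    }
}
```

Keeping the running total in the value threaded through reduce(), rather than in a field, is what lets the same walker code stay free of bookkeeping.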
Step 9. Compile and run the HelloVariant walker, but this time, rather than specifying a BAM file with the -I argument, we’ll attach a VCF file:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T HelloVariant \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf

The program prints out the reference allele, alternate allele, and locus for each VCF record, and finally prints out the number of records processed!
Example 4: Compute depth of coverage from a VCF file

Let’s continue exploring variant processing by taking a closer look at the VariantContext object, the programmatic representation of a VCF record.

This program will compute a depth of coverage histogram using VCF metadata rather than a BAM file.

Step 1. Create a new program called

    ComputeCoverageFromVCF

of type

    RodWalker<Integer, Integer>

with the usual

    @Output
    PrintStream out

declaration.

Step 2. Add a command-line argument with the following code:

    @Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false)
    public String SAMPLE;

This adds the command-line argument --sample (aka -sn) and stores the supplied value in the String variable SAMPLE.

We’ll use this to allow the user to specify whether they want to get coverage for a specific sample or all of the samples (by specifying no sample at all).

Step 3. Declare a map to store the coverage counts.

    private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

A TreeMap is a sorted map: unlike a hashtable, it returns its keys in sorted order.
Step 4. Loop over the variants. For each one, we’ll print the coverage observed. We also make sure that we get the coverage for the sample requested (if the user specified a sample name to the --sample argument), or for all samples (if the user specified no sample name at all).

For every coverage level we observe, we increment the appropriate entry in the histogram object.

Step 5. In the onTraversalDone() method, we’ll loop over every coverage level in the histogram and output the depth and the number of times we observed that depth.
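
The histogram logic of steps 3 to 5 can be sketched in plain Java as below. The sketch assumes each record supplies one depth value per sample, represented here as a simple map from sample name to depth; how the tutorial's walker actually reads depth from a VariantContext is not shown in these slides, so the data layout, the histogram helper, and the depth values are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Conceptual sketch only: maps of sample -> depth stand in for VCF records. */
final class CoverageHistogramSketch {

    /**
     * Builds depth -> count. A null sample means "use every sample",
     * mirroring an optional --sample argument that was not supplied.
     */
    static TreeMap<Integer, Integer> histogram(List<Map<String, Integer>> records, String sample) {
        TreeMap<Integer, Integer> histogram = new TreeMap<>();
        for (Map<String, Integer> depthBySample : records) {
            for (Map.Entry<String, Integer> entry : depthBySample.entrySet()) {
                if (sample == null || sample.equals(entry.getKey())) {
                    // Start at 1 for a new depth, otherwise add 1 to the existing count.
                    histogram.merge(entry.getValue(), 1, Integer::sum);
                }
            }
        }
        return histogram;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("113N", 10);
        first.put("114T", 12);
        Map<String, Integer> second = new HashMap<>();
        second.put("113N", 10);
        second.put("114T", 7);

        // TreeMap iteration is sorted by key, so depths print in ascending order.
        for (Map.Entry<Integer, Integer> entry : histogram(Arrays.asList(first, second), null).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because the TreeMap keeps keys ordered, the output loop needs no explicit sort, which is the reason for choosing it over a HashMap here.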
Step 6. Compile and run:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T ComputeCoverageFromVCF \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
      -o histogram.txt

Two columns of information are printed: the first is the coverage level, and the second is the number of times that coverage level was observed.
Example 5: Find variants unique to a single sample

For our last example, we’ll write a simple program that can take an input VCF and write a new VCF containing only variants that are exclusive to one sample.

We’ll also introduce the initialize() method, which can be used to prepare the environment for the computation.

Step 1. Create a new RodWalker called FindExclusiveVariants that has a command-line argument called “sample” (aka “sn”) of type String.

Add an output stream, but rather than making it of type PrintStream, make it of type VCFWriter. We’ll use this to output a new VCF file based on the input VCF.

Step 2. The initialize() method is called first, before any of the map() or reduce() calls are made. It is useful for preparing the environment, writing headers, setting up variables, etc.

Here, we’ll write a VCF header to the output stream. While we’re free to add or remove header lines and samples, we’ll just copy the input file’s header to the output file.

Step 3. Loop over each record in the VCF, and each Genotype object contained within the VariantContext object. Check the genotypes of each sample and, if only our sample of interest is variant, output the record to the new VCF file.
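
The "only our sample of interest is variant" test from step 3 can be sketched in plain Java like this. A map from sample name to a boolean stands in for the Genotype objects; isExclusiveTo is a hypothetical helper, and how no-calls or filtered genotypes should be treated is left out of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Conceptual sketch only: booleans stand in for "this sample's genotype is variant". */
final class ExclusiveVariantSketch {

    /** True when the named sample is variant and every other sample is not. */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> isVariantBySample) {
        for (Map.Entry<String, Boolean> entry : isVariantBySample.entrySet()) {
            boolean shouldBeVariant = entry.getKey().equals(sample);
            if (entry.getValue() != shouldBeVariant) {
                return false;
            }
        }
        // Guard against a sample name that does not appear in the record at all.
        return isVariantBySample.containsKey(sample);
    }

    public static void main(String[] args) {
        Map<String, Boolean> record = new HashMap<>();
        record.put("113N", true);
        record.put("114T", false);
        System.out.println(isExclusiveTo("113N", record));
        System.out.println(isExclusiveTo("114T", record));
    }
}
```

In the walker, a record that passes this check would be handed to the VCFWriter, and all other records would simply be skipped.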
Step 4. Compile and run:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T FindExclusiveVariants \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
      -sn 113N \
      -o 113.exclusive.vcf

Step 5. After the program completes, look at the output.

Step 6. You can scroll left and right with the arrow keys, but let’s clean up the output to make it easier to read. Supply this command instead:

    grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S

Observe how the third sample is variant and the other three samples are not. Our program is selecting only the variants that are exclusive to 113N!
Conclusions
•  From the five example programs, we have learned how to:
  –  configure IntelliJ for GATK development
  –  create a new ReadWalker or RodWalker
  –  declare output streams (PrintStream, SAMFileWriter, VCFWriter)
  –  access and modify metadata in reads
  –  access variants, samples, and metadata from a VCF file
  –  declare command-line arguments
  –  prepare for computations with the initialize() method
  –  finish computations with the onTraversalDone() method
  –  compile and run new GATK programs

•  This tutorial is more than enough to get started with writing new and useful GATK programs
  –  Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants walkers are fully realized programs, ready to be used for real work.
  –  You now have enough information to write your own somatic variant finder.
Additional resources
•  For more information on developing in the GATK and Java, see
  –  http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
  –  http://download.oracle.com/javase/tutorial/java/index.html

•  Explore the GATK Git repository at
  –  https://github.com/broadgsa
  –  https://github.com/signup/free (to add your own code, sign up for a free account)

•  To learn Git, the codebase’s version control system, see
  –  http://gitref.org/
  –  http://git-scm.com/course/svn.html (for those already familiar with SVN)

•  Read our papers on the GATK framework and tools
  –  http://genome.cshlp.org/content/20/9/1297.long
  –  http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html

•  For more guidance, feel free to look at other programs in the GATK
  –  Every program is a tutorial!
