
Informatics for High-Throughput Sequencing Data 2016

Module 7 lab


CBW HT-seq Module 7 - Galaxy lab

This lab was created by Florence Cavalli

Table of contents

Introduction

  1. Load the data
  2. Create single interval
  3. Check the quality
  4. Convert the FASTQ quality format
  5. Trim the reads and remove adapter sequences with Trimmomatic
  6. Align the reads with BWA-MEM
  7. Sort the sam/bam
  8. Convert bam to sam
  9. Indel realignment
  10. FixMates
  11. Mark duplicates
  12. Base recalibration
  13. Extract Metrics
  14. Call SNPs
  15. Filter the variants
  16. Annotate the variants
    The “History options” menu
    Workflow

Introduction

In this practical we will use Galaxy to perform the analysis you ran in Module 2 and Module 5. We are therefore using the dataset from patient NA12878 (reads extracted from region chr1:17704860-18004860).

We will use the Galaxy website: https://usegalaxy.org/
Start by logging in to the website with your username.

The parameters that need to be set or changed are specified in the text below (and in the images); keep the rest of the parameters at their default values.

** indicates the tool to use for each step. You need to select it from the tool panel in the left column.


1) Load the data

** Get Data/Upload File

“Paste/Fetch Data” box:

  • NA12878_CBW_chr1_R1.fastq.gz
    http://cbw##.dyndns.info/HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz

  • NA12878_CBW_chr1_R2.fastq.gz
    http://cbw##.dyndns.info/HTSeq_module2/raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz

  • hg19_chr1.fa
    http://cbw##.dyndns.info/Module7/hg19_chr1.fa

  • dbSNP_135_chr1.vcf.gz
    http://cbw##.dyndns.info/HTSeq_module2/reference/dbSNP_135_chr1.vcf.gz

Replace ## with your student ID.


When the data has loaded and/or a task has finished running, the resulting file is shown in green. Grey indicates that the task is queued, yellow that the task is being processed, and red that an error occurred.
For example, after loading the data you should see the following in the history column on the right.

File check:


NOTE: The numbers and paths of the files may vary with usage.

NOTE: The icons on the right of each file allow you to view the data (the “eye” icon), edit its attributes, or delete the file.

2) Create single interval

We create an interval corresponding to the region of interest (we extracted reads from this region for this example dataset) to be used later on by different tools.

** Text manipulation/Create Single Interval

  • Chromosome: chr1
  • Start position: 17704860
  • End position: 18004860
  • Name: chr1 interval

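For reference, outside Galaxy this step simply amounts to writing a one-line BED-style interval file; a minimal sketch (the file name is arbitrary):

    # Write the region of interest as a single BED-style interval
    # (BED coordinates are 0-based, half-open; adjust if your tool expects 1-based)
    printf 'chr1\t17704860\t18004860\tchr1_interval\n' > chr1_interval.bed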

File check:


All the following steps are detailed in the Module 2 practical.

3) Check the quality

BVAtools is not available on the Galaxy website.
You can use ** NGS: QC and manipulation/FastQC Read Quality reports instead.
Since BVAtools is not available we will not run it here, but checking read quality is a very important step of the analysis, so don’t skip it!
You can use FastQC at several points in your analysis, on FASTQ, BAM or SAM files.
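If you prefer the command line, the FastQC equivalent is a one-liner (a sketch, assuming fastqc is installed and on your PATH):

    # Generate an HTML quality report for each FASTQ file
    fastqc NA12878_CBW_chr1_R1.fastq.gz NA12878_CBW_chr1_R2.fastq.gz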

4) Convert the FASTQ quality format

** NGS: QC and manipulation/FASTQ Groomer convert between various FASTQ quality formats

  • File to groom: NA12878_CBW_chr1_R1.fastq


Run FASTQ Groomer on the NA12878_CBW_chr1_R2.fastq file as well.

File check:


5) Trim the reads and remove adapter sequences with Trimmomatic

** NGS: QC and manipulation/Trimmomatic

  • Input FASTQ file: FASTQ Groomer results for the xxx_R1.fastq and xxx_R2.fastq files
  • Perform initial ILLUMINACLIP step: Yes
    – Adapter sequence to use: TruSeq3 (paired-end, for MiSeq and HiSeq)
    – Maximum mismatch count which will still allow a full match to be performed: 2
    – How accurate the match between the two ‘adapter ligated’ reads must be for PE palindrome read alignment: 30
    – How accurate the match between any adapter etc. sequence must be against a read: 15
  • Select Trimmomatic operation to perform: insert Trimmomatic operations
    – TRAILING: 20
    – MINLEN: 32


This creates 4 files: paired and unpaired reads for both R1 and R2.

Use the paired results for the alignment.
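For reference, the equivalent Trimmomatic command line looks roughly like this (a sketch; the jar path, adapter file location and output names are placeholders, not what Galaxy runs internally):

    # Paired-end mode; ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome threshold>:<simple threshold>
    java -jar trimmomatic.jar PE -phred33 \
      NA12878_CBW_chr1_R1.fastq.gz NA12878_CBW_chr1_R2.fastq.gz \
      R1.paired.fastq.gz R1.unpaired.fastq.gz \
      R2.paired.fastq.gz R2.unpaired.fastq.gz \
      ILLUMINACLIP:TruSeq3-PE.fa:2:30:15 TRAILING:20 MINLEN:32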

File check:


6) Align the reads with BWA-MEM

** NGS: Mapping/Map with BWA-MEM

  • Will you select a reference genome from your history or use a built-in index? : Use a genome from history or build index
  • Use the following dataset as the reference sequence: hg19_chr1.fa
  • Select first set of reads: Trimmomatic on FASTQ Groomer (R1 paired)
  • Select second set of reads: Trimmomatic on FASTQ Groomer (R2 paired)
  • Set the read groups information: Set read group SAM/BAM specification,
    with the following info:
    – Read group identifier (ID): NA12878
    – Read group sample name (SM): NA12878
    – Platform/technology used to produce the reads (PL): ILLUMINA
    – Library name (LB): NA12878
    – Sequencing center that produced the read (CN): Broad Institute
    – Platform unit (PU): runNA12878_1
  • Select analysis mode: 3. Full list of options
  • Set input/output options: Set
  • Mark shorter split hits of a chimeric alignment in the FLAG field as ‘secondary alignment’ instead of ‘supplementary alignment’: Yes (-M; for Picard<1.96 compatibility)


*** This step takes some time to run.
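Outside Galaxy, this alignment corresponds approximately to the commands below (a sketch; output names are placeholders and the reference must be indexed first):

    # Index the reference (only needed once)
    bwa index hg19_chr1.fa

    # Align the paired, trimmed reads; -M marks shorter split hits as secondary,
    # and -R sets the read group fields listed above
    bwa mem -M \
      -R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA\tLB:NA12878\tCN:Broad Institute\tPU:runNA12878_1' \
      hg19_chr1.fa R1.paired.fastq.gz R2.paired.fastq.gz > NA12878.sam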

File check:


7) Sort the sam/bam

** NGS: Picard/SortSam sort SAM/BAM dataset

  • Select SAM/BAM dataset or dataset collection: Map with BWA-MEM result file

  • Sort order: coordinate

  • Select validation stringency: Silent

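The equivalent Picard command is roughly (a sketch; the picard.jar path and file names are placeholders):

    # Coordinate-sort the alignments and index the resulting BAM
    java -jar picard.jar SortSam \
      INPUT=NA12878.sam OUTPUT=NA12878.sorted.bam \
      SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true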

File check:


8) Convert bam to sam file (optional)

We will convert the BAM file to a SAM file to be able to look at it.
You can run this step at different points in your analysis.

** NGS: SAMtools/BAM-to-SAM

  • BAM File to Convert: sorted BAM file

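Outside Galaxy this is a samtools one-liner (a sketch; file names are placeholders):

    # -h keeps the header so the resulting SAM file is complete
    samtools view -h NA12878.sorted.bam > NA12878.sorted.sam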

File check:


9) Indel realignment

** NGS GATK Tools/RealignerTargetCreator

  • Choose the source for the reference list: History
  • Bam file: sorted Bam file
  • Using reference file: hg19_chr1.fa
  • Basic or Advanced GATK options: Advanced
    – Operate on Genomic intervals – Genomic intervals: the chr1 interval created in step 2


** NGS GATK Tools/IndelRealigner

  • Choose the source for the reference list: History
  • Bam file: sorted Bam file
  • Using reference file: hg19_chr1.fa
  • Restrict realignment to provided intervals: RealignerTargetCreator result file

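These two GATK steps correspond roughly to the following commands (a sketch using GATK 3-era syntax; the jar path and file names are placeholders, and the reference needs accompanying .fai and .dict files):

    # 1) Find intervals that would benefit from realignment
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
      -R hg19_chr1.fa -I NA12878.sorted.bam \
      -L chr1:17704860-18004860 -o realign_targets.intervals

    # 2) Realign reads around those intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
      -R hg19_chr1.fa -I NA12878.sorted.bam \
      -targetIntervals realign_targets.intervals -o NA12878.realigned.bam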

File check:


10) FixMates

** NGS: Picard/FixMateInformation

  • Select SAM/BAM dataset or dataset collection: Indel Realigner result bam
  • Select validation stringency: Silent

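Equivalent Picard command (a sketch; paths and file names are placeholders):

    # Recompute mate information so pair flags match the realigned positions
    java -jar picard.jar FixMateInformation \
      INPUT=NA12878.realigned.bam OUTPUT=NA12878.fixmate.bam \
      VALIDATION_STRINGENCY=SILENT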

File check:


11) Mark duplicates

** NGS: Picard/MarkDuplicates

  • Select SAM/BAM dataset or dataset collection: FixMateInformation result
  • Select validation stringency: Silent

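Equivalent Picard command (a sketch; paths and file names are placeholders):

    # Flag PCR/optical duplicates and write a duplication metrics file
    java -jar picard.jar MarkDuplicates \
      INPUT=NA12878.fixmate.bam OUTPUT=NA12878.markdup.bam \
      METRICS_FILE=NA12878.markdup_metrics.txt VALIDATION_STRINGENCY=SILENT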

File check:


Have a look at the “MarkDuplicates metrics” file.

12) Base Recalibration

** NGS GATK Tools/BaseRecalibrator is not available!

So we use “Count Covariates” and “Table Recalibration” instead. Together, these two steps are the equivalent of BaseRecalibrator, which is present in newer versions of GATK.

** NGS: GATK Tools/Count Covariates on BAM files

  • Choose the source for the reference list: History
  • Bam file: marked duplicates file
  • Using reference genome: hg19_chr1.fa
  • Covariates to be used in the recalibration: Select all, then unselect “Tile covariate”


** NGS: GATK Tools/Table Recalibration on BAM files

  • Covariates table recalibration file: Count Covariates result file
  • Choose the source for the reference list: History
  • Bam file: marked duplicates file
  • Using reference genome: hg19_chr1.fa

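In newer GATK versions the same two steps are performed with BaseRecalibrator followed by PrintReads, roughly as follows (a sketch using GATK 3-era syntax; paths and file names are placeholders):

    # 1) Model systematic base-quality errors, masking known dbSNP sites
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
      -R hg19_chr1.fa -I NA12878.markdup.bam \
      -knownSites dbSNP_135_chr1.vcf.gz -o recal.table

    # 2) Write a new BAM with recalibrated base qualities
    java -jar GenomeAnalysisTK.jar -T PrintReads \
      -R hg19_chr1.fa -I NA12878.markdup.bam \
      -BQSR recal.table -o NA12878.recal.bam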

File check:


13) Extract Metrics

** NGS GATK Tools/Depth of Coverage on BAM files

  • Choose the source for the reference list: History
  • Bam file: Table recalibration result bam file
  • Using reference genome: hg19_chr1.fa
  • Partition type for depth of coverage: select sample and readgroup
  • Summary coverage threshold
    – for summary file outputs, report the % of bases covered to >= this number: 10, 25, 50 and 100 (insert 4 thresholds)
  • Basic or Advanced GATK options: Advanced
    – Operate on Genomic intervals – Genomic intervals: the chr1 interval created in step 2
  • Basic or Advanced Analysis options: Advanced
    – “Omit the output of the depth of coverage at each base”: Yes
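The command-line equivalent looks roughly like this (a sketch using GATK 3-era syntax; paths and the output prefix are placeholders):

    # Coverage summary over the region; -ct adds a summary coverage threshold,
    # -pt sets the partition levels (sample and readgroup)
    java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
      -R hg19_chr1.fa -I NA12878.recal.bam \
      -L chr1:17704860-18004860 \
      -pt sample -pt readgroup \
      -ct 10 -ct 25 -ct 50 -ct 100 \
      --omitDepthOutputAtEachBase -o NA12878.coverage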


** NGS: Picard/Collect Alignment Summary Metrics writes a file containing summary alignment metrics

  • SAM/BAM dataset: Table recalibration result bam file
  • Choose the source for the reference list: History
  • Using reference genome: hg19_chr1.fa
  • The level(s) at which to accumulate metrics set to: Read group, Sample
  • Select validation stringency: Silent
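Equivalent Picard command (a sketch; paths and file names are placeholders):

    # Summary alignment metrics, accumulated per read group and per sample
    java -jar picard.jar CollectAlignmentSummaryMetrics \
      REFERENCE_SEQUENCE=hg19_chr1.fa INPUT=NA12878.recal.bam \
      OUTPUT=NA12878.alignment_metrics.txt \
      METRIC_ACCUMULATION_LEVEL=READ_GROUP METRIC_ACCUMULATION_LEVEL=SAMPLE \
      VALIDATION_STRINGENCY=SILENT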


File check:


View the “Collect Alignment Summary Metrics” output.

** NGS: Picard/CollectInsertSizeMetrics plots distribution of insert sizes

  • SAM/BAM dataset: Table recalibration result bam file
  • Choose the source for the reference list: History
  • Using reference genome: hg19_chr1.fa
  • The level(s) at which to accumulate metrics set to: Read group
  • Select validation stringency: Silent
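Equivalent Picard command (a sketch; paths and file names are placeholders):

    # Insert-size metrics plus a PDF histogram of the distribution
    java -jar picard.jar CollectInsertSizeMetrics \
      INPUT=NA12878.recal.bam OUTPUT=NA12878.insert_metrics.txt \
      HISTOGRAM_FILE=NA12878.insert_histogram.pdf \
      METRIC_ACCUMULATION_LEVEL=READ_GROUP VALIDATION_STRINGENCY=SILENT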


File check:


View the “CollectInsertSizeMetrics” output and the PDF file.

Variant calling and annotation from Module 5

All the following steps are detailed in the Module 5 practical.

To continue, you can use the aligned, sorted, marked-duplicates and quality-recalibrated files that you just created, or download the one you used in Module 5 from the server (http://cbw##.dyndns.info/module5/NA12878.bwa.sort.rmdup.realign.bam, ## being your student ID).

14) Call SNPs

** NGS GATK Tools/Unified Genotyper SNP and indel caller

  • Choose the source for the reference list: History
  • BAM file: Table recalibration result bam file
  • Using reference genome: hg19_chr1.fa
  • The minimum phred-scaled confidence threshold at which variants not at ‘trigger’ track sites should be emitted (and filtered if less than the calling threshold): 10
  • Basic or Advanced GATK options: Advanced
    – Operate on Genomic intervals – Genomic intervals: the chr1 interval created in step 2
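Command-line equivalent, roughly (a sketch using GATK 3-era syntax; -glm BOTH calls SNPs and indels together; paths and file names are placeholders):

    # Emit variants with phred-scaled confidence >= 10 within the region
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
      -R hg19_chr1.fa -I NA12878.recal.bam \
      -L chr1:17704860-18004860 \
      -glm BOTH -stand_emit_conf 10 -o NA12878.vcf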


File check:


Have a look at the VCF file.

15) Filter the variants

Typically, variant callers will only perform a minimal amount of filtering when presenting variant calls. In the case of GATK, we are actively removing any variant with a score below 10, and any variant with a score below 30 is labeled with the filter “LowQual”.

To perform more rigorous filtering, another program must be used. In our case, we will use the VariantFiltration tool in GATK.

NOTE: The best practice when using GATK is to use the VariantRecalibrator. In our data set, we had too few variants to accurately use the variant recalibrator and therefore we used the VariantFiltration tool instead.

** NGS GATK Tools/Variant Filtration on VCF files

  • Choose the source for the reference list: History
  • Variant file to annotate: Unified genotyper results file
  • Using reference genome: hg19_chr1.fa
  • Variant filter

Set the 3 following filters:

– Filter expression: QD < 2.0, Filter name: QDFilter
– Filter expression: FS > 200.0, Filter name: FSFilter
– Filter expression: MQ < 40.0, Filter name: MQFilter
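Command-line equivalent, roughly (a sketch; paths and file names are placeholders):

    # Label (not remove) variants that fail each expression
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
      -R hg19_chr1.fa -V NA12878.vcf \
      --filterExpression "QD < 2.0" --filterName QDFilter \
      --filterExpression "FS > 200.0" --filterName FSFilter \
      --filterExpression "MQ < 40.0" --filterName MQFilter \
      -o NA12878.filtered.vcf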


File check:


You can look at the output VCF file, which now contains the “filter” annotations.

16) Annotate the variants

with snpEff

** NGS: Variant Analysis/SnpEff Variant effect and annotation

  • Sequence changes: Variant Filtration result
  • Filter output
    – select “Do not show INTERGENIC changes”
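Command-line equivalent, roughly (a sketch; assumes the hg19 SnpEff database has already been downloaded; the jar path and file names are placeholders):

    # Annotate variant effects, skipping intergenic changes;
    # -stats also writes an HTML summary page
    java -jar snpEff.jar -no-intergenic -stats snpEff_summary.html \
      hg19 NA12878.filtered.vcf > NA12878.snpeff.vcf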


File check:


You can download the SnpEff stats HTML page and the VCF file.

with GATK VariantAnnotator

** NGS GATK Tools/VariantAnnotator

  • Choose the source for the reference list: History
  • Variant file to annotate: SnpEff vcf file
  • Using reference file: hg19_chr1.fa
  • Provide a dbSNP reference-ordered data file: set dbSNP
    – ROD file: dbSNP_135_chr1.vcf.gz
  • Basic or Advanced GATK options: Advanced
    – Operate on Genomic intervals – Genomic intervals: the chr1 interval created in step 2
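Command-line equivalent, roughly (a sketch using GATK 3-era syntax; paths and file names are placeholders):

    # Add dbSNP IDs (and other annotations) to the SnpEff-annotated calls
    java -jar GenomeAnalysisTK.jar -T VariantAnnotator \
      -R hg19_chr1.fa -V NA12878.snpeff.vcf \
      --dbsnp dbSNP_135_chr1.vcf.gz \
      -L chr1:17704860-18004860 -o NA12878.annotated.vcf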


File check:


Look at or download your filtered and annotated variant VCF files.

The “History options” menu

You can:

  • See saved histories
  • Extract workflow
  • Save your workflow
  • Share your workflow


Workflow

Use the Workflow tab to:

  • Edit your workflow
  • Rename input and output files
  • Run your workflow on other samples


For your information (we will not do it as part of this practical), you can run a tool and/or a given workflow on several samples at once. You will need to “Build List of Dataset Pairs” and then run the tools on the “Dataset collection”.


  • Select your input samples


  • Create your pairs
  • Run a tool on your dataset collection


Next, you could…

  • Create a new history, upload your input samples, and run your workflow on them
  • Continue using the tools on Galaxy
  • Continue exploring/analysing/visualizing your data and results!

There are lots of tutorials, mailing lists and videos to help you (such as https://vimeo.com/galaxyproject).
We hope that you enjoyed using Galaxy!
