Read Alignment
Key Learning Outcomes¶
After completing this practical the trainee should be able to:
-
Perform a simple NGS data alignment task, with Bowtie2, against one interested reference data
-
Interpret and manipulate the mapping output using SAMtools
-
Visualise the alignment via a standard genome browser, e.g. IGV browser
Resources You’ll be Using¶
Tools Used¶
Bowtie2:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Samtools:
http://broadinstitute.github.io/picard
BEDTools:
http://code.google.com/p/bedtools/
UCSC tools:
http://hgdownload.cse.ucsc.edu/admin/exe/
IGV genome browser:
http://www.broadinstitute.org/igv/
Useful Links¶
SAM Specification:
http://samtools.sourceforge.net/SAM1.pdf
Explain SAM Flags:
https://broadinstitute.github.io/picard/explain-flags.html
Sources of Data¶
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-11431
Author Information¶
Primary Author(s): Myrto Kostadima kostadim@ebi.ac.uk
Contributor(s): Xi (Sean) Li sean.li@csiro.au
Introduction¶
The goal of this hands-on session is to perform an unspliced alignment for a small subset of raw reads. We will align raw sequencing data to the mouse genome using Bowtie2 and then we will manipulate the SAM output in order to visualize the alignment on the IGV browser.
Prepare the Environment¶
We will use one data set in this practical, which can be found in the
ChIP-seq
directory on your desktop.
Open the Terminal.
First, go to the right folder, where the data are stored.
1 |
|
The .fastq
file that we will align is called Oct4.fastq
. This file
is based on Oct4 ChIP-seq data published by Chen et al. (2008). For
the sake of time, we will align these reads to a single mouse
chromosome.
Alignment¶
You already know that there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. Here we will try Bowtie2, a widely used ultrafast, memory efficient short read aligner.
Bowtie2 has a number of parameters in order to perform the alignment. To view them all type
1 |
|
Bowtie2 uses indexed genome for the alignment in order to keep its
memory footprint small. Because of time constraints we will build the
index only for one chromosome of the mouse genome. For this we need the
chromosome sequence in FASTA format. This is stored in a file named
mm10
, under the subdirectory bowtie_index
.
The indexed chromosome is generated using the command:
STOP
DO NOT run this command. This has already been run for you.
** bowtie2-build bowtie_index/mm10.fa bowtie_index/mm10 **
This command will output 6 files that constitute the index. These files
that have the prefix mm10
are stored in the bowtie_index
subdirectory. To view if they files have been successfully created type:
1 |
|
Now that the genome is indexed we can move on to the actual alignment.
The first argument for bowtie2
is the basename of the index for the
genome to be searched; in our case this is mm10
. We also want to make
sure that the output is in SAM format using the -S
parameter. The last
argument is the name of the FASTQ file.
Align the Oct4 reads using Bowtie2:
1 |
|
The above command outputs the alignment in SAM format and stores them in
the file Oct4.sam
.
In general before you run Bowtie2, you have to know what quality encoding your FASTQ files are in. The available FASTQ encodings for bowtie are:
–phred33-quals : Input qualities are Phred+33 (default).
–phred64-quals
: Input qualities are Phred+64 (same as –solexa1.3-quals
).
–solexa-quals : Input qualities are from GA Pipeline ver. < 1.3.
–solexa1.3-quals : Input qualities are from GA Pipeline ver. >= 1.3.
–integer-quals : Qualities are given as space-separated integers (not ASCII).
The FASTQ files we are working with are Sanger encoded (Phred+33), which is the default for Bowtie2.
Bowtie2 will take 2-3 minutes to align the file. This is fast compared to other aligners which sacrifice some speed to obtain higher sensitivity.
Look at the top 10 lines of the SAM file using head (record lines are wrapped). Then try the second command, note use arrow navigation and to exit type ’q’.
1 2 |
|
Question
Can you distinguish between the header of the SAM format and the actual alignments?
Answer
Answer
The header line starts with the letter ‘@’, i.e.:
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:195471971
@PG ID:Bowtie2 PN:bowtie2 VN:2.2.4 CL:“/tools/bowtie2/bowtie2-default/bowtie2-align-s –wrapper basic-0 -x bowtie_index/mm10 -q Oct4.fastq”
While, the actual alignments start with read id, i.e.:
SRR002012.45 0 etc
SRR002012.48 16 chr1 etc
Question
What kind of information does the header provide?
Answer
Answer
-
@HD: Header line; VN: Format version; SO: the sort order of alignments.
-
@SQ: Reference sequence information; SN: reference sequence name; LN: reference sequence length.
-
@PG: Read group information; ID: Read group identifier; VN: Program version; CL: the command line that produces the alignment.
Question
To which chromosome are the reads mapped?
Answer
Answer
Chromosome 1.
Manipulate SAM output¶
SAM files are rather big and when dealing with a high volume of NGS data, storage space can become an issue. As we have already seen, we can convert SAM to BAM files (their binary equivalent that are not human readable) that occupy much less space.
Convert SAM to BAM using samtools view
and store the output in the
file Oct4.bam
. You have to instruct samtools view
that the input is
in SAM format (-S
), the output should be in BAM format (-b
) and that
you want the output to be stored in the file specified by the -o
option:
1 |
|
Compute summary stats for the Flag values associated with the alignments using:
1 |
|
Visualize alignments in IGV¶
IGV is a stand-alone genome browser. Please check their website (http://www.broadinstitute.org/igv/) for all the formats that IGV can display. For our visualization purposes we will use the BAM and bigWig formats.
When uploading a BAM file into the genome browser, the browser will look
for the index of the BAM file in the same folder where the BAM files is.
The index file should have the same name as the BAM file and the suffix
.bai
. Finally, to create the index of a BAM file you need to make sure
that the file is sorted according to chromosomal coordinates.
Sort alignments according to chromosomal position and store the result
in the file with the prefix Oct4.sorted
:
1 |
|
Index the sorted file.
1 |
|
The indexing will create a file called Oct4.sorted.bam.bai
. Note that
you don’t have to specify the name of the index file when running
samtools index
, it simply appends a .bai
suffix to the input BAM
file.
Another way to visualize the alignments is to convert the BAM file into a bigWig file. The bigWig format is for display of dense, continuous data and the data will be displayed as a graph. The resulting bigWig files are in an indexed binary format.
The BAM to bigWig conversion takes place in two steps. Firstly, we
convert the BAM file into a bedgraph, called Oct4.bedgraph
, using the
tool genomeCoverageBed
from BEDTools. Then we convert the bedgraph
into a bigWig binary file called Oct4.bw
, using bedGraphToBigWig
from the UCSC tools:
1 2 |
|
Both of the commands above take as input a file called
mouse.mm10.genome
that is stored under the subdirectory
bowtie_index
. These genome files are tab-delimited and describe the
size of the chromosomes for the organism of interest. When using the
UCSC Genome Browser, Ensembl, or Galaxy, you typically indicate which
species/genome build you are working with. The way you do this for
BEDTools is to create a “genome” file, which simply lists the names of
the chromosomes (or scaffolds, etc.) and their size (in basepairs).
BEDTools includes pre-defined genome files for human and mouse in the
genomes
subdirectory included in the BEDTools distribution.
Now we will load the data into the IGV browser for visualization. In
order to launch IGV double click on the IGV 2.3
icon on your Desktop.
Ignore any warnings and when it opens you have to load the genome of
interest.
On the top left of your screen choose from the drop down menu
Mouse (mm10)
. If it doesn’t appear in list, click More ..
, type
mm10
in the Filter section, choose the mouse genome and press OK. Then
in order to load the desire files go to:
1 |
|
On the pop up window navigate to Desktop -> chipseq folder and select
the file Oct4.sorted.bam
.
Repeat these steps in order to load Oct4.bw
as well.
Select chr1
from the drop down menu on the top left. Right click on
the name of Oct4.bw
and choose Maximum under the Windowing Function.
Right click again and select Autoscale.
In order to see the aligned reads of the BAM file, you need to zoom in
to a specific region. For example, look for gene Lemd1
in the search
box.
Question
What is the main difference between the visualization of BAM and bigWig files?
Answer
Answer
The actual alignment of reads that stack to a particular region can be displayed using the information stored in a BAM format. The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph.
Using the +
button on the top right, zoom in to see more of the
details of the alignments.
Question
What do you think the different colors mean?
Answer
Answer
The different color represents four nucleotides, e.g. blue is Cytidine (C), red is Thymidine (T).
Practice Makes Perfect!¶
In the chipseq folder you will find the file gfp.fastq
. Follow the
above described analysis, from the bowtie2 alignment step, for this
dataset as well. You will need these files for the ChIP-Seq module.