Copy Number Variation
Key Learning Outcomes¶
After completing this practical the trainee should be able to:
-
Understand and perform a simple copy number variation analysis on NGS data
-
Become familiar with Sequenza
-
Understand the CNV inference process as an interplay between depth of sequencing, cellularity and B-allele frequency
-
Visualize CNV events by manual inspection
Resources You’ll be Using¶
Tools Used¶
Sequenza:
http://www.cbs.dtu.dk/biotools/sequenza/
IGV:
http://www.broadinstitute.org/igv/
Sources of Data¶
Raw data download:
http://sra.dnanexus.com/studies/ERP001071
Data publication:
http://www.ncbi.nlm.nih.gov/pubmed/22194472
Author Information¶
Primary Author(s):
Velimir Gayevskiy, Garvan Institute v.gayevskiy@garvan.org.au
Sonika Tyagi, AGRF sonika.tyagi@agrf.org.au
Contributor(s):
Introduction¶
The goal of this hands-on session is to perform a copy number variation analysis (CNV) on a normal/tumour pair of alignment files (BAMs) produced by the mapping of Illumina short read sequencing data.
To ensure reasonable analysis times, we will perform the analysis on a
heavily subsetted pair of BAM files. These files contain just the first
60Mb of chromosome 5 but contain several examples of inferred copy
number events to enable interpretation and visualisation of the copy
number variation that is present in entire cancer genomes. Sequenza
is
the tool we will use to perform this analysis. It consists of two
Python
pre-processing steps followed by a third step in R
to infer the
depth ratio, cellularity, ploidy and to plot the results for
interpretation.
In the second part of the tutorial we will also be using IGV
to
visualise and manually inspect the copy number variation we inferred in
the first part for validation purposes. This section will also include a
discussion on the importance of good quality data by highlighting the
inadequacies of the workshop dataset and the implications this has on
analysis results.
Prepare the Environment¶
We will use a dataset derived from whole genome sequencing of a 33-yr-old lung adenocarcinoma patient, who is a never-smoker and has no familial cancer history.
The data files are contained in the subdirectory called data
and are
the following:
normal.chr5.60Mb.bam
andnormal.chr5.60Mb.bam.bai
tumour.chr5.60Mb.bam
andtumour.chr5.60Mb.bam.bai
These files are based on subsetting the whole genomes derived from blood
and liver metastases
to the first 60Mb of chromosome 5. This will allow
our analyses to run in a sufficient time during the workshop, but it’s
worth being aware that we are analysing just 1.9% of the genome which
will highlight the length of time and resources required to perform
cancer genomics on full genomes!
Open the Terminal and go to the cnv
working directory:
cd /home/trainee/cnv/
All commands entered into the terminal for this tutorial should be from
within the cnv
directory.
Check that the data
directory contains the above-mentioned files by
typing:
ls -l data
Sequenza CNV Analysis¶
Sequenza is run in three steps. The first pre-processing step is run on
the final normal and tumour mapped data (BAM files) in order to walk the
genome in a pileup format (automatically generated by samtools
). This
first step finds high quality sites in the genomes and extracts their
depth and genotype in the normal genome and calculates the variant
alleles and allele frequencies in the tumour genome. The second step is
to perform a binning on these sites to save space and analysis time in
the third step. Finally, the third step is run in R
to normalise the depth
ratio between the normal/tumour genomes, infer cellularity and ploidy
and graphically output results for interpretation.
Step 1: Pre-processing – Walking the Genome¶
pypy software/sequenza/sequenza-utils.py bam2seqz \
-n data/normal.chr5.60Mb.bam \
-t data/tumour.chr5.60Mb.bam \
--fasta assets/human_g1k_v37.fasta \
-gc assets/human_g1k_v37.gc50Base.txt.gz \
-C 5:1-60000000 | gzip > stage1.seqz.gz
Hint
Press tab after typing a few characters of a directory of filename to auto-complete the rest. This makes entering long file names very quick.
Explanation of parameters:
-n: the normal BAM
-t: the tumour BAM
--fasta: the reference genome used for mapping (b37 here)
-gc: GC content as windows through the genome (pre-generated and downloadable from the Sequenza website)
-C: specifies the genomic location to process
There will not be any indication that it is running once you launch the
command, to make sure it is running open a new Terminal tab with
Shift + Control + T
(or from the menu with File then Open Tab) and type
the command top
. You should see the top line being the command ’pypy’ with
a % CPU usage of 98/99%. Press q
to quit out of this process view and go
back to the tab running Sequenza
. If everything is running correctly, it
will take approximately 40 minutes to run. Go have a coffee!
Once the command is done you will be returned to the terminal prompt.
Make sure the output file is the correct size by typing ls -lh
from the
Terminal window that you ran Sequenza
from, there should be a file
called stage1.seqz.gz
of the size ~395M.
You can look at the first few lines of the output in the file
stage1.seqz.gz
with:
zcat stage1.seqz.gz | head -n 20
This output has one line for each position in the BAMs and includes information on the position, depths, allele frequencies, zygosity, GC in the location.
Step 2: Perform Binning¶
The binning step takes the rows of genomic positions and compresses them down to 1 row for every 200 rows previously. This massively reduces the file size and processing time in the third step.
pypy software/sequenza/sequenza-utils.py seqz-binning \
-w 200 \
-s stage1.seqz.gz | gzip > stage2.seqz.gz
Explanation of parameters:
-w: the window size (typically 50 for exomes, 200 for genomes)
-s: the large seqz file generated in the first step
This step should take approximately 4 minutes to complete.
Step 3: Running Sequenza Algorithms and Plotting Results¶
We will now perform the CNV analysis and output the results using the R
part of Sequenza.
Open the R
terminal:
R
You should now see the R
prompt identified with “>”.
Run the Sequenza R
commands:
library("sequenza")
setwd("/home/trainee/cnv")
data.file <- "stage2.seqz.gz"
seqzdata <- sequenza.extract(data.file)
CP.example <- sequenza.fit(seqzdata)
sequenza.results(sequenza.extract = seqzdata, cp.table = CP.example, sample.id = "CanGenWorkshop", out.dir="sequenza_results")
Quit R
:
q()
Then enter n
at the “Save workspace image” prompt.
If every command ran successfully, you will now have a sequenza_results
folder containing 13 files (type ls -l sequenza_results/
).
Sequenza Analysis Results and Visualisation¶
One of the first and most important estimates that Sequenza provides is
the tumour cellularity (the estimated percentage of tumour cells in the
tumour genome). This estimate is based on the B allele frequency and
depth ratio through the genome and is an important metric to know for
interpretation of Sequenza results and for other analyses. Lets look at
the cellularity estimate for our analysis by opening
CanGenWorkshop_model_fit.pdf
with the command:
evince sequenza_results/CanGenWorkshop_model_fit.pdf
The cellularity estimate is at the top along with the average ploidy
estimate and the standard deviation of the B allele frequency. We can
see that the cellularity has been estimated at 24% which is fairly low
and we will see why this is bad in the next section on CNV
visualisation. The ploidy value of 2.1 indicates this piece of the
genome is not hugely amplified or deleted and the BAF standard deviation
indicates there are no significant long losses of heterozygosity.
Close the PDF window to resume the Terminal prompt.
Let’s now look at the CNV inferences through our genomic block. Open the genome copy number visualisation file with:
evince sequenza_results/CanGenWorkshop_genome_view.pdf
This file contains three “pages” of copy number events through the
entire genomic block.
- The first page shows copy numbers of the A (red) and B (blue) alleles,
- The second page shows overall copy number changes, and
- The third page shows the B allele frequency and depth ratio through genomic block.
Looking at the overall copy number changes, we see that our block is at a copy number of 2 with a small duplication to copy number 4 about ⅓ of the way through the block and another just after halfway through the block. There is also a reduction in copy number to 1 copy about ⅘ of the way through the block. The gap that you see just before this reduction in copy number is the chromosomal centromere - an area that is notoriously difficult to sequence so always ends up in a gap with short read data.
You can see how this is a very easy to read output and lets you
immediately see the frequency and severity of copy number events through
your genome. Let’s compare the small genomic block we ran with the same
output from the entire genome which has been pre-computed for you. This
is located in the pre_generated/results_whole_genome
folder and
contains the same 13 output files as for the small genomic block. As
before, let’s look at the cellularity estimate with:
evince pre_generated/results_whole_genome/CanGenWorkshop_model_fit.pdf
It now looks like it’s even worse at just 16%! A change is to be
expected as we were only analysing 1.9% of the genome. Let’s now look at
the whole genome copy number profile with:
evince pre_generated/results_whole_genome/CanGenWorkshop_genome_view.pdf
You can see that there are a number of copy number events across the
genome and our genomic block (the first 60Mb of chromosome 5) is
inferred as mostly copy number 4 followed by a reduction to copy number
2, rather than 2 to 1 as we saw in the output we generated. The reason
for this is that Sequenza
uses the genome-wide depth ratio and BAF in
order to estimate copy number, if you ask it to analyse a small block
mostly at copy number 4 with a small reduction to copy number 2, the
most likely scenario in lieu of more data is that this is a copy number
2 block with a reduction to 1. It’s important to carefully examine the
cellularity, ploidy and BAF estimates of your sample along with the
plots of model fit (CanGenWorkshop_model_fit.pdf
) and
cellularity/ploidy contours (CanGenWorkshop_CP_contours.pdf
) in order
to decide if you believe Sequenza’s inference of the copy numbers. Have
a look at these for yourself if you want to get a better idea of how
Sequenza
makes its inferences and conclusions.
CNV Visualisation/Confirmation in IGV¶
Let’s see if we can visualise one of the CNV events where copy number
increased significantly. We’ll focus on the copy number 4 event seen at
about ⅓ of the way through the CanGenWorkshop_genome_view.pdf
output
we generated. First, we need to find the coordinates that have been
predicted for this event. Have a look at the CanGenWorkshop_segments.txt
file in the results folder to view all predicted CNV events with:
less sequenza_results/CanGenWorkshop_segments.txt
There is only one at a copy number of 4 (CNt column) and it starts at
21,051,700 to 21,522,065 which is 470kb and corresponds to the small block
we see in the genome view PDF.
Quit out of viewing the segments file by pressing q
.
We will now open IGV
and see if we can observe the predicted increase in
copy number within these genomic coordinates.
/home/trainee/snv/Applications/igv/igv.sh
IGV will take 30 seconds or so to open so just be patient.
For a duplication of this size, we will not be able to easily observe it
just by looking at the raw read alignments. In order to see it we will
generate two tiled data files (TDFs) within IGV which contain the
average read depth for a given window size through the genome. This
means that we can aggregate the average read depth over relatively large
chunks of the genome and compare these values between the normal and
tumour genomes.
To begin, select the genome Human (b37)
, go to Tools
then Run igvtools...
in the IGV
menubar. Specify the normal bam file (under cnv
then data
) as the
input file and change the window size to 100,000 (one hundred thousand).
Then press the Run
button and IGV will make the TDF file. This takes
about 5 minutes. Repeat this for the tumour genome.
After you have both TDF files, go to File
and Load from file...
in
the menubar and select the BAM and TDF files to open. Once you have
opened them, they will appear as tracks along with the BAM tracks we
loaded initially. Navigate to the genomic coordinates of our event
(5:21,051,700-21,522,065) by typing it in the coordinate box at the top.
Mouse over the two blue tracks to get the average depth values for the
100,000 bp windows. What you should see is that the liverMets sample has
3-6X more coverage than the Blood sample for the four windows that cover
this region.
This may seem a bit underwhelming, after all, wasn’t the increase of the region to a copy number of 4, i.e. we expect a doubling of reads in the tumour? To explain why we are only seeing such a small coverage increase, we need to turn to our good friend mathematics!
Imagine we have two 30X genomes for the normal and tumour samples and the tumour is at 100% purity. If there is a copy number increase to 4 in the tumour from 2 in the normal, the duplicated segment should indeed have twice as many reads as the same segment in the normal genome. Now, lets imagine the tumour genome was only at a purity of 50% (i.e. it contains 50% normal cells and 50% tumour cells). Now, half of the duplicated “tumour genome” segment will be at a copy number of 2 and half will be at 4. What does this mean when we sequence them as a mixture? The resulting average copy number of the block will be . Now what if we only have 16% tumour cells in our “tumour genome”? This will be . You can see how sequencing a low cellularity tumour at a low depth makes it much harder to infer copy number variations!
Returning to our genomes at hand, when we previously looked at the cellularity estimate of this tumour we saw it was 20% from the small block we ran or 16% from the whole genome. Thus, the read depth increase of just 3-6X (about 10-20% more reads) in this segment is not surprising. A low cellularity tumour greatly reduces our power to infer copy number events as relatively small changes in depth can occur by chance in the genome and these can be mis-identified as copy number changes. As well as this, it reduces our power for other analyses since we must also remember that a tumour can itself contain multiple clones which have to share just 16% of reads.
It is possible to sequence through a low-cellularity sample when, for example, there is no way to take another sample (as is the case of most biopsies). “Sequencing through” means to simply sequence the tumour at a much higher coverage, usually 90-120X. This will mean that there will be a net increase in reads supplying evidence for copy number events and variants and in aggregate these will still retain power to infer these events when using tools that look at the whole genome like Sequenza does.
References¶
- F. Favero, T. Joshi, A. M. Marquard, N. J. Birkbak, M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. “Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data”. Annals of Oncology, 2015, vol. 26, issue 1, 64-70.