Copy Number Variation

Key Learning Outcomes¶

After completing this practical the trainee should be able to:

Understand and perform a simple copy number variation analysis on NGS data
Become familiar with Sequenza
Understand the CNV inference process as an interplay between depth of sequencing, cellularity and B-allele frequency
Visualize CNV events by manual inspection

Resources You’ll be Using¶

Tools Used¶

Sequenza:
http://www.cbs.dtu.dk/biotools/sequenza/

IGV:
http://www.broadinstitute.org/igv/

Sources of Data¶

Raw data download:
http://sra.dnanexus.com/studies/ERP001071

Data publication:
http://www.ncbi.nlm.nih.gov/pubmed/22194472

Author Information¶

Primary Author(s):
Velimir Gayevskiy, Garvan Institute v.gayevskiy@garvan.org.au
Sonika Tyagi, AGRF sonika.tyagi@agrf.org.au

Contributor(s):

Introduction¶

The goal of this hands-on session is to perform a copy number variation analysis (CNV) on a normal/tumour pair of alignment files (BAMs) produced by the mapping of Illumina short read sequencing data.

To ensure reasonable analysis times, we will perform the analysis on a heavily subsetted pair of BAM files. These files contain just the first 60Mb of chromosome 5 but contain several examples of inferred copy number events to enable interpretation and visualisation of the copy number variation that is present in entire cancer genomes. Sequenza is the tool we will use to perform this analysis. It consists of two Python pre-processing steps followed by a third step in R to infer the depth ratio, cellularity, ploidy and to plot the results for interpretation.

In the second part of the tutorial we will also be using IGV to visualise and manually inspect the copy number variation we inferred in the first part for validation purposes. This section will also include a discussion on the importance of good quality data by highlighting the inadequacies of the workshop dataset and the implications this has on analysis results.

Prepare the Environment¶

We will use a dataset derived from whole genome sequencing of a 33-yr-old lung adenocarcinoma patient, who is a never-smoker and has no familial cancer history.

The data files are contained in the subdirectory called data and are the following:

normal.chr5.60Mb.bam and normal.chr5.60Mb.bam.bai

tumour.chr5.60Mb.bam and tumour.chr5.60Mb.bam.bai

These files are based on subsetting the whole genomes derived from blood and liver metastases to the first 60Mb of chromosome 5. This will allow our analyses to run in a sufficient time during the workshop, but it’s worth being aware that we are analysing just 1.9% of the genome which will highlight the length of time and resources required to perform cancer genomics on full genomes!

Open the Terminal and go to the cnv working directory:

cd /home/trainee/cnv/

All commands entered into the terminal for this tutorial should be from within the cnv directory.

Check that the data directory contains the above-mentioned files by typing:

ls -l data

Sequenza CNV Analysis¶

Sequenza is run in three steps. The first pre-processing step is run on the final normal and tumour mapped data (BAM files) in order to walk the genome in a pileup format (automatically generated by samtools). This first step finds high quality sites in the genomes and extracts their depth and genotype in the normal genome and calculates the variant alleles and allele frequencies in the tumour genome. The second step is to perform a binning on these sites to save space and analysis time in the third step. Finally, the third step is run in R to normalise the depth ratio between the normal/tumour genomes, infer cellularity and ploidy and graphically output results for interpretation.

Step 1: Pre-processing – Walking the Genome¶

pypy software/sequenza/sequenza-utils.py bam2seqz \
-n data/normal.chr5.60Mb.bam \
-t data/tumour.chr5.60Mb.bam \
--fasta assets/human_g1k_v37.fasta \
-gc assets/human_g1k_v37.gc50Base.txt.gz \
-C 5:1-60000000 | gzip > stage1.seqz.gz

Hint

Press tab after typing a few characters of a directory of filename to auto-complete the rest. This makes entering long file names very quick.

Explanation of parameters:

-n: the normal BAM
-t: the tumour BAM
--fasta: the reference genome used for mapping (b37 here)
-gc: GC content as windows through the genome (pre-generated and downloadable from the Sequenza website)
-C: specifies the genomic location to process

There will not be any indication that it is running once you launch the command, to make sure it is running open a new Terminal tab with Shift + Control + T (or from the menu with File then Open Tab) and type the command top. You should see the top line being the command ’pypy’ with a % CPU usage of 98/99%. Press q to quit out of this process view and go back to the tab running Sequenza. If everything is running correctly, it will take approximately 40 minutes to run. Go have a coffee!

Once the command is done you will be returned to the terminal prompt. Make sure the output file is the correct size by typing ls -lh from the Terminal window that you ran Sequenza from, there should be a file called stage1.seqz.gz of the size ~395M.

You can look at the first few lines of the output in the file stage1.seqz.gz with:

zcat stage1.seqz.gz | head -n 20

This output has one line for each position in the BAMs and includes information on the position, depths, allele frequencies, zygosity, GC in the location.

Step 2: Perform Binning¶

The binning step takes the rows of genomic positions and compresses them down to 1 row for every 200 rows previously. This massively reduces the file size and processing time in the third step.

pypy software/sequenza/sequenza-utils.py seqz-binning \
-w 200 \
-s stage1.seqz.gz | gzip > stage2.seqz.gz

Explanation of parameters:

-w: the window size (typically 50 for exomes, 200 for genomes)
-s: the large seqz file generated in the first step

This step should take approximately 4 minutes to complete.

Step 3: Running Sequenza Algorithms and Plotting Results¶

We will now perform the CNV analysis and output the results using the R part of Sequenza.

Open the R terminal:

You should now see the R prompt identified with “>”.

Run the Sequenza R commands:

library("sequenza")
setwd("/home/trainee/cnv")
data.file <- "stage2.seqz.gz"
seqzdata <- sequenza.extract(data.file)
CP.example <- sequenza.fit(seqzdata)
sequenza.results(sequenza.extract = seqzdata, cp.table = CP.example, sample.id = "CanGenWorkshop", out.dir="sequenza_results")

Quit R:

q()

Then enter n at the “Save workspace image” prompt.

If every command ran successfully, you will now have a sequenza_results folder containing 13 files (type ls -l sequenza_results/).

Sequenza Analysis Results and Visualisation¶

One of the first and most important estimates that Sequenza provides is the tumour cellularity (the estimated percentage of tumour cells in the tumour genome). This estimate is based on the B allele frequency and depth ratio through the genome and is an important metric to know for interpretation of Sequenza results and for other analyses. Lets look at the cellularity estimate for our analysis by opening CanGenWorkshop_model_fit.pdf with the command:

evince sequenza_results/CanGenWorkshop_model_fit.pdf

The cellularity estimate is at the top along with the average ploidy estimate and the standard deviation of the B allele frequency. We can see that the cellularity has been estimated at 24% which is fairly low and we will see why this is bad in the next section on CNV visualisation. The ploidy value of 2.1 indicates this piece of the genome is not hugely amplified or deleted and the BAF standard deviation indicates there are no significant long losses of heterozygosity.

Close the PDF window to resume the Terminal prompt.

Let’s now look at the CNV inferences through our genomic block. Open the genome copy number visualisation file with:

evince sequenza_results/CanGenWorkshop_genome_view.pdf

This file contains three “pages” of copy number events through the entire genomic block.

The first page shows copy numbers of the A (red) and B (blue) alleles,
The second page shows overall copy number changes, and
The third page shows the B allele frequency and depth ratio through genomic block.

Looking at the overall copy number changes, we see that our block is at a copy number of 2 with a small duplication to copy number 4 about ⅓ of the way through the block and another just after halfway through the block. There is also a reduction in copy number to 1 copy about ⅘ of the way through the block. The gap that you see just before this reduction in copy number is the chromosomal centromere - an area that is notoriously difficult to sequence so always ends up in a gap with short read data.

You can see how this is a very easy to read output and lets you immediately see the frequency and severity of copy number events through your genome. Let’s compare the small genomic block we ran with the same output from the entire genome which has been pre-computed for you. This is located in the pre_generated/results_whole_genome folder and contains the same 13 output files as for the small genomic block. As before, let’s look at the cellularity estimate with:

evince pre_generated/results_whole_genome/CanGenWorkshop_model_fit.pdf

It now looks like it’s even worse at just 16%! A change is to be expected as we were only analysing 1.9% of the genome. Let’s now look at the whole genome copy number profile with:

evince pre_generated/results_whole_genome/CanGenWorkshop_genome_view.pdf

You can see that there are a number of copy number events across the genome and our genomic block (the first 60Mb of chromosome 5) is inferred as mostly copy number 4 followed by a reduction to copy number 2, rather than 2 to 1 as we saw in the output we generated. The reason for this is that Sequenza uses the genome-wide depth ratio and BAF in order to estimate copy number, if you ask it to analyse a small block mostly at copy number 4 with a small reduction to copy number 2, the most likely scenario in lieu of more data is that this is a copy number 2 block with a reduction to 1. It’s important to carefully examine the cellularity, ploidy and BAF estimates of your sample along with the plots of model fit (CanGenWorkshop_model_fit.pdf) and cellularity/ploidy contours (CanGenWorkshop_CP_contours.pdf) in order to decide if you believe Sequenza’s inference of the copy numbers. Have a look at these for yourself if you want to get a better idea of how Sequenza makes its inferences and conclusions.

CNV Visualisation/Confirmation in IGV¶

Let’s see if we can visualise one of the CNV events where copy number increased significantly. We’ll focus on the copy number 4 event seen at about ⅓ of the way through the CanGenWorkshop_genome_view.pdf output we generated. First, we need to find the coordinates that have been predicted for this event. Have a look at the CanGenWorkshop_segments.txt file in the results folder to view all predicted CNV events with:

less sequenza_results/CanGenWorkshop_segments.txt

There is only one at a copy number of 4 (CNt column) and it starts at 21,051,700 to 21,522,065 which is 470kb and corresponds to the small block we see in the genome view PDF.

Quit out of viewing the segments file by pressing q.

We will now open IGV and see if we can observe the predicted increase in copy number within these genomic coordinates.

/home/trainee/snv/Applications/igv/igv.sh

IGV will take 30 seconds or so to open so just be patient.

For a duplication of this size, we will not be able to easily observe it just by looking at the raw read alignments. In order to see it we will generate two tiled data files (TDFs) within IGV which contain the average read depth for a given window size through the genome. This means that we can aggregate the average read depth over relatively large chunks of the genome and compare these values between the normal and tumour genomes.

To begin, select the genome Human (b37), go to Tools then Run igvtools... in the IGV menubar. Specify the normal bam file (under cnv then data) as the input file and change the window size to 100,000 (one hundred thousand). Then press the Run button and IGV will make the TDF file. This takes about 5 minutes. Repeat this for the tumour genome.

After you have both TDF files, go to File and Load from file... in the menubar and select the BAM and TDF files to open. Once you have opened them, they will appear as tracks along with the BAM tracks we loaded initially. Navigate to the genomic coordinates of our event (5:21,051,700-21,522,065) by typing it in the coordinate box at the top. Mouse over the two blue tracks to get the average depth values for the 100,000 bp windows. What you should see is that the liverMets sample has 3-6X more coverage than the Blood sample for the four windows that cover this region.

This may seem a bit underwhelming, after all, wasn’t the increase of the region to a copy number of 4, i.e. we expect a doubling of reads in the tumour? To explain why we are only seeing such a small coverage increase, we need to turn to our good friend mathematics!

Imagine we have two 30X genomes for the normal and tumour samples and the tumour is at 100% purity. If there is a copy number increase to 4 in the tumour from 2 in the normal, the duplicated segment should indeed have twice as many reads as the same segment in the normal genome. Now, lets imagine the tumour genome was only at a purity of 50% (i.e. it contains 50% normal cells and 50% tumour cells). Now, half of the duplicated “tumour genome” segment will be at a copy number of 2 and half will be at 4. What does this mean when we sequence them as a mixture? The resulting average copy number of the block will be $(0.5*2)+(0.5*4) = 3$ . Now what if we only have 16% tumour cells in our “tumour genome”? This will be $(0.84*2)+(0.16*4) = 2.32$ . You can see how sequencing a low cellularity tumour at a low depth makes it much harder to infer copy number variations!

Returning to our genomes at hand, when we previously looked at the cellularity estimate of this tumour we saw it was 20% from the small block we ran or 16% from the whole genome. Thus, the read depth increase of just 3-6X (about 10-20% more reads) in this segment is not surprising. A low cellularity tumour greatly reduces our power to infer copy number events as relatively small changes in depth can occur by chance in the genome and these can be mis-identified as copy number changes. As well as this, it reduces our power for other analyses since we must also remember that a tumour can itself contain multiple clones which have to share just 16% of reads.

It is possible to sequence through a low-cellularity sample when, for example, there is no way to take another sample (as is the case of most biopsies). “Sequencing through” means to simply sequence the tumour at a much higher coverage, usually 90-120X. This will mean that there will be a net increase in reads supplying evidence for copy number events and variants and in aggregate these will still retain power to infer these events when using tools that look at the whole genome like Sequenza does.

References¶

F. Favero, T. Joshi, A. M. Marquard, N. J. Birkbak, M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. “Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data”. Annals of Oncology, 2015, vol. 26, issue 1, 64-70.