Pacbio reads: assembly with command line tools¶

This tutorial demonstrates how to use long PacBio sequence reads to assemble a bacterial genome, including correcting the assembly with short Illumina reads.

Resources¶

Tools (and versions) used in this tutorial include:

canu 1.5 (requires java 1.8)
infoseq and sizeseq (part of EMBOSS) 6.6.0.0
circlator 1.5.1
bwa 0.7.15
samtools 1.3.1
makeblastdb and blastn (part of blast) 2.4.0+
pilon 1.20
spades 3.10.1

Learning objectives¶

At the end of this tutorial, be able to:

Assemble and circularise a bacterial genome from PacBio sequence data.
Recover small plasmids missed by long read sequencing, using Illumina data
Explore the effect of polishing assembled sequences with a different data set.

Overview¶

Simplified version of workflow:

workflow

Question

How do long- and short-read assembly methods differ?

Answer

Short reads are usually assembled using De Bruijn graphs. With long reads, there is a move back towards simpler overlap-layout-consensus methods.

Question

Where can we find out the what the approximate genome size should be for the species being assembled?

Answer

Go to https://www.ncbi.nlm.nih.gov/genome/ - enter species name - click on Genome Assembly and Annotation report - sort table by clicking on the column header Size (Mb) - look at range of sizes in this column.

Question

Where could you view the output filename.gfa and what would it show?

Answer

This is the assembly graph. You can view it using the tool “Bandage”, https://rrwick.github.io/Bandage/, to see how the contigs are connected (including ambiguities).

Get data¶

cd /home/trainee/long_reads/workshop_files
ls -l

The files we need are:

pacbio.fastq.gz - the PacBio reads
illumina_R1.fastq.gz - the Illumina forward reads
illumina_R2.fastq.gz - the Illumina reverse reads

Pre-computed data¶

Some of these tools will take too long to run in this workshop. For these tools, we have pre-computed the output files. In this workshop, we will still enter in the commands and set the tool running, but will sometimes then stop the run and move on to pre-computed output files.

In your directory, along with the PacBio and Illumina files, you may also see folders of pre-computed data.

Sample information¶

The sample used in this tutorial is a gram-positive bacteria called Staphylococcus aureus (sample number 25747). This particular sample is from a strain that is resistant to the antibiotic methicillin (a type of penicillin). It is also called MRSA: methicillin-resistant Staphylococcus aureus. It was isolated from (human) blood and caused bacteraemia, an infection of the bloodstream.

Assemble ¶

We will use the assembly software called Canu, https://github.com/marbl/canu.
Type in:

canu -p canu -d canu_outdir_NGS genomeSize=2.8m -pacbio-raw pacbio.fastq.gz

the first canu tells the program to run
-p canu names prefix for output files (“canu”)
-d canu_outdir_NGS names output directory
genomeSize only has to be approximate.
- e.g. Staphylococcus aureus, 2.8m
- e.g. Streptococcus pyogenes, 1.8m
Canu will correct, trim and assemble the reads.
Various output will be displayed on the screen.
As we don’t have time for Canu to complete, stop the run by typing Ctrl-C. We will look at pre-computed data in the folder canu_outdir.

Canu output¶

Move into canu_outdir and ls to see the output files.

cd canu_outdir
ls -l

The canu.contigs.fasta are the assembled sequences.
The canu.unassembled.fasta are the reads that could not be assembled.
The canu.correctedReads.fasta.gz are the corrected Pacbio reads that were used in the assembly.
The canu.contigs.gfa is the graph of the assembly.
Display summary information about the contigs: (infoseq is a tool from EMBOSS)

infoseq canu.contigs.fasta

Question

How long is the assembled contig ?

Answer

tig00000001 2851805 This looks like a chromosome of approximately 2.8 million bases.

This matches what we would expect for this sample. For other data, Canu may not be able to join all the reads into one contig, so there may be several contigs in the output. Also, the sample may contain some plasmids and these may be found full or partially by Canu as additional contigs.

Try it later

Change Canu parameters if required: If the assembly is poor with many contigs, re-run Canu with extra sensitivity parameters; e.g.

canu -p prefix -d outdir corMhapSensitivity=high corMinCoverage=0 genomeSize=2.8m -pacbio-raw pacbio.fastq.gz

Trim and circularise¶

Run Circlator¶

Circlator (https://github.com/sanger-pathogens/circlator) identifies and trims overhangs (on chromosomes and plasmids) and orients the start position at an appropriate gene (e.g. dnaA). It takes in the assembled contigs from Canu, as well as the corrected reads prepared by Canu.

Overhangs are shown in blue:

circlator Adapted from Figure 1. Hunt et al. Genome Biology 2015

Run Circlator:

```text
cd /home/trainee/long_reads/workshop_files
circlator all --threads 4 --verbose canu_outdir/canu.contigs.fasta canu_outdir/canu.correctedReads.fasta.gz circlator_outdir_NGS    
```

`--threads` is the number of cores <!-- change this to an appropriate number-->    
`--verbose` prints progress information to the screen    
`canu_outdir/canu.contigs.fasta` is the file path to the input Canu assembly    
`canu_outdir/canu.correctedReads.fasta.gz` is the file path to the corrected Pacbio reads - note, fastA not fastQ    
`circlator_outdir_NGS` is the name of the output directory.

Stop the run by typing Ctrl-C. We will look at pre-computed data in the folder circlator_outdir.

Circlator output¶

Move into the circlator_outdir directory and ls to list files.

cd /home/trainee/long_reads/workshop_files/circlator_outdir
ls -ltr

Questions¶

Question

Were all the contigs circularised? Why/why not?

Hint

less 04.merge.circularise.log

Type “q” to exit.

Answer

Yes, the contig was circularised (last column). In this example, the contig could be circularized because it contained the entire sequence, with overhangs that were trimmed.

Question

Where were the contigs oriented (which gene)?

Hint

less 06.fixstart.log

Type “q” to exit.

Answer

Look in the “gene_name” column.
The contig has been oriented at tr|A0A090N2A8|A0A090N2A8_STAAU, which is another name for dnaA. This is typically used as the start of bacterial chromosome sequences.

Question

What are the trimmed contig sizes?

Hint

infoseq 06.fixstart.fasta

Answer

tig00000001 2823331 (28564 bases trimmed) -This trimmed part is the overlap.

Question

Circlator can set the start of the sequence at a particular gene. Which gene does it use? Is this appropriate for all contigs?

Answer

Circlator uses dnaA for the chromosomal contig. For other contigs, it uses a centrally-located gene. However, ideally, plasmids would be oriented on a gene such as a rep gene. It is possible to provide a file to Circlator to do this.

Re-name the contigs file:

The trimmed contigs are in the file called 06.fixstart.fasta.
Re-name it contig1.fasta:

cp 06.fixstart.fasta contig1.fasta

Open this file in a text editor (e.g. nano: nano contig1.fasta) and change the header to “>chromosome”.

Move the file back into the main folder (mv contig1.fasta ../).

Tips

If all the contigs have not circularised with Circlator, an option is to change the --b2r_length_cutoff setting to approximately 2X the average read depth.

Find smaller plasmids¶

Pacbio reads are long, and may have been longer than small plasmids. We will look for any small plasmids using the Illumina reads.

This section involves several steps:

Use the Canu+Circlator output of a trimmed assembly contig.
Map all the Illumina reads against this PacBio-assembled contig.
Extract any reads that didn’t map and assemble them together: this could be a plasmid, or part of a plasmid.
Look for overhang: if found, trim.

Align Illumina reads to the PacBio contig¶

Index the contigs file:

bwa index contig1.fasta

Align Illumina reads using using bwa mem:
```
bwa mem -t 4 contig1.fasta illumina_R1.fastq.gz illumina_R2.fastq.gz | samtools sort > aln_NGS.bam     
```
- bwa mem is the alignment tool
- -t 4 is the number of cores
- contig1.fasta is the input assembly file
- illumina_R1.fastq.gz illumina_R2.fastq.gz are the Illumina reads
- | samtools sort pipes the output to samtools to sort
- > aln_NGS.bam sends the alignment to the file aln_NGS.bam

Stop the run by typing Ctrl-C. We will use the pre-computed file called aln.bam.

Extract unmapped Illumina reads¶

Index the alignment file:

samtools index aln.bam

Extract the fastq files from the bam alignment - those reads that were unmapped to the Pacbio alignment - and save them in various “unmapped” files:

samtools fastq -f 4 -1 unmapped.R1.fastq -2 unmapped.R2.fastq -s unmapped.RS.fastq aln.bam

fastq is a command that coverts a .bam file into fastq format
-f 4 : only output unmapped reads
-1 : put R1 reads into a file called unmapped.R1.fastq
-2 : put R2 reads into a file called unmapped.R2.fastq
-s : put singleton reads into a file called unmapped.RS.fastq
aln.bam : input alignment file

We now have three files of the unampped reads: unmapped.R1.fastq, unmapped.R2.fastq, unmapped.RS.fastq.

Assemble the unmapped reads¶

Assemble with Spades (http://cab.spbu.ru/software/spades/):
```
spades.py -1 unmapped.R1.fastq -2 unmapped.R2.fastq -s unmapped.RS.fastq --careful --cov-cutoff auto -o spades_assembly_NGS
```
- -1 is input file forward
- -2 is input file reverse
- -s is unpaired
- --careful minimizes mismatches and short indels
- --cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
- -o is the output directory

Stop the run by typing Ctrl-C. We will use the pre-computed file in the folder spades_assembly.

Move into the output directory (spades_assembly) and look at the contigs:

infoseq contigs.fasta

78 contigs were assembled, with the max length of 2250 (the first contig).
All other nodes are < 650kb so we will disregard as they are unlikely to be plasmids.
We will extract the first sequence (NODE_1):

samtools faidx contigs.fasta

samtools faidx contigs.fasta NODE_1_length_2550_cov_496.613 > contig2.fasta

This is now saved as contig2.fasta
Open in nano and change header to “>plasmid”.

Trim the plasmid¶

To trim any overhang on this plasmid, we will blast the start of contig2 against itself.

Take the start of the contig:

head -n 10 contig2.fasta > contig2.fa.head

We want to see if it matches the end (overhang).
Format the assembly file for blast:

makeblastdb -in contig2.fasta -dbtype nucl

Blast the start of the assembly (.head file) against all of the assembly:

blastn -query contig2.fa.head -db contig2.fasta -evalue 1e-3 -dust no -out contig2.bls

Look at contig2.bls to see hits:

less contig2.bls

The first hit is at start, as expected.
The second hit is at 2474 all the way to the end - 2550.
This is the overhang.
Trim to position 2473.
Type ‘q’ to exit.
Index the plasmid.fa file:

samtools faidx contig2.fasta

Trim

samtools faidx contig2.fasta plasmid:1-2473 > plasmid.fa.trimmed

plasmid is the name of the contig, and we want the sequence from 1-2473.
Open this file in nano (nano plasmid.fa.trimmed) and change the header to “>plasmid”, save.
(Use the side scroll bar to see the top of the file.)
We now have a trimmed plasmid.
Move file back into main folder:

cp plasmid.fa.trimmed ../

Move into the main folder.

Plasmid contig orientation¶

The bacterial chromosome was oriented at the gene dnaA. Plasmids are often oriented at the replication gene, but this is highly variable and there is no established convention. Here we will orient the plasmid at a gene found by Prodigal, in Circlator:

circlator fixstart plasmid.fa.trimmed plasmid_fixstart

fixstart is an option in Circlator just to orient a sequence.
plasmid.fa.trimmed is our small plasmid.
plasmid_fixstart is the prefix for the output files.

View the output:

less plasmid_fixstart.log

The plasmid has been oriented at a gene predicted by Prodigal, and the break-point is at position 1200.
Change the file name:

cp plasmid_fixstart.fasta contig2.fasta

Collect contigs¶

cat contig1.fasta contig2.fasta > genome.fasta

See the contigs and sizes:

infoseq genome.fasta

chromosome: 2823331
plasmid: 2473

Questions¶

Question

Why is this section so complicated?

Answer

Finding small plasmids is difficult for many reasons! This paper has a nice summary: On the (im)possibility to reconstruct plasmids from whole genome short-read sequencing data. doi: https://doi.org/10.1101/086744

Question

Why can PacBio sequencing miss small plasmids?

Answer

Library prep size selection

Question

We extract unmapped Illumina reads and assemble these to find small plasmids. What could they be missing?

Answer

Repeats that have mapped to the PacBio assembly.

Question

How do you find a plasmid in a Bandage graph?

Answer

It is probably circular, matches the size of a known plasmid, and has a rep gene.

Question

Are there easier ways to find plasmids?

Answer

Possibly. One option is the program called Unicycler which may automate many of these steps. https://github.com/rrwick/Unicycler

Correct¶

We will correct the Pacbio assembly with Illumina reads, using the tool Pilon (https://github.com/broadinstitute/pilon/wiki).

Make an alignment file¶

Align the Illumina reads (R1 and R2) to the draft PacBio assembly, e.g. genome.fasta:

bwa index genome.fasta    
bwa mem -t 4 genome.fasta illumina_R1.fastq.gz illumina_R2.fastq.gz | samtools sort > aln_illumina_pacbio_NGS.bam

-t is the number of cores

Stop the run by typing Ctrl-C. We will use the pre-computed file called aln_illumina_pacbio.bam. - We will use the pre-computed file called aln_illumina_pacbio.bam.

Index the files:

samtools index aln_illumina_pacbio.bam
samtools faidx genome.fasta

Now we have an alignment file to use in Pilon: aln_illumina_pacbio.bam

Run Pilon¶

Pilon is a software tool which can be used to:

Automatically improve draft assemblies
Find variation among strains, including large event detection
```
pilon --genome genome.fasta --frags aln_illumina_pacbio.bam --output pilon1_NGS --fix all --mindepth 0.5 --changes --verbose --threads 4
```
- --genome is the name of the input assembly to be corrected
- --frags is the alignment of the reads against the assembly
- --output is the name of the output prefix
- --fix is an option for types of corrections
- --mindepth gives a minimum read depth to use
- --changes produces an output file of the changes made
- --verbose prints information to the screen during the run
- --threads: the number of cores

Stop the run by typing Ctrl-C. We will use the pre-computed files called with the prefixes pilon1. - We will use the pre-computed files called with the prefixes pilon1._

Look at the changes file:

less pilon1.changes

Example:

pilon

Look at the details of the fasta file:

infoseq pilon1.fasta

chromosome - 2823340 (net +9 bases)
plasmid - 2473 (no change)

Option:

If there are many changes, run Pilon again, using the pilon1.fasta file as the input assembly, and the Illumina reads to correct.

Genome output¶

Change the file name:

cp pilon1.fasta assembly.fasta

We now have the corrected genome assembly of Staphylococcus aureus in .fasta format, containing a chromosome and a small plasmid.

Questions¶

Question

Why don’t we correct earlier in the assembly process?

Answer

We need to circularise the contigs and trim overhangs first.

Question

Why can we use some reads (Illumina) to correct other reads (PacBio) ?

Answer

Illumina reads have higher accuracy

Question

Could we just use PacBio reads to assemble the genome?

Answer

Yes, if accuracy adequate.

Short-read assembly: a comparison¶

So far, we have assembled the long PacBio reads into one contig (the chromosome) and found an additional plasmid in the Illumina short reads.

If we only had Illumina reads, we could also assemble these using the tool Spades.

You can try this here, or try it later on your own data.

Get data¶

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

This is from Sample 25747.

Assemble¶

Run Spades:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is input file of forward reads
-2 is input file of reverse reads
--careful minimizes mismatches and short indels
--cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
-o is the output directory

Results¶

Move into the output directory and look at the contigs:

infoseq contigs.fasta

Questions¶

How many contigs were found by Spades?

many

How does this compare to the number of contigs found by assembling the long read data with Canu?

many more.

Does it matter that an assembly is in many contigs?

Yes
broken genes => missing/incorrect annotations
less information about structure: e.g. number of plasmids
No
Many or all genes may still be annotated
Gene location is useful (e.g. chromosome, plasmid1) but not always essential (e.g. presence/absence of particular resistance genes)

How can we get more information about the assembly from Spades?

Look at the assembly graph assembly_graph.fastg, e.g. in the program Bandage. This shows how contigs are related, albeit with ambiguity in some places.

Comparative genomics¶

We will compare the genomes assembled by:

pacbio pre-pilon
pacbio post pilon
illumina only

Do the annotations differ¶

Run prokka on the three - how do they differ

Align the short-read and long-read assemblies¶

Mauve: align the illumina contigs to the polished pacbio contigs.

Further analyses¶

Annotate genomes, e.g. with Prokka, https://github.com/tseemann/prokka
Comparative genomics, e.g. with Roary, https://sanger-pathogens.github.io/Roary/

Links¶

Canu manual and gitub repository
Circlator article and github repository
Pilon article and github repository
Notes on finishing and evaluating assemblies.