Scaffolding
Key Learning Outcomes¶
After completing this module the trainee should be able to:
-
Run an assembly using long mate pair (LMP) reads.
-
Compare results of LMP assemblies with those from the short reads.
-
Run scaffolding and assess the assembly statistics.
Resources You’ll be Using¶
Tools Used¶
Kmer Analysis Tool kit:
https://github.com/TGAC/KAT
Nextclip:
https://github.com/richardmleggett/nextclip
Abyss:
http://www.bcgsc.ca/platform/bioinfo/software/abyss
Soap Denovo:
http://soap.genomics.org.cn/soapdenovo.html
SOAPec:
http://soap.genomics.org.cn/about.html
Author Information¶
Primary Author(s):
Bernardo Clavijo, TGAC Bernardo.Clavijo@tgac.ac.uk
Contributor(s):
Sonika Tyagi, AGRF sonika.tyagi@agrf.org.au
Zhiliang Chan zhiliang@unsw.edu.au
Scaffolding with long mate pair libraries: Chalara scaffolding using LMP¶
Scaffold is made of two or more contigs joined together using the read pair information. In the absence of a high-quality reference genome, new genome assemblies are often evaluated on the basis of the number of scaffolds and contigs required to represent the genome. Other criteria such as the alignment of reads back to assemblies, N50, and the length of contigs and scaffolds relative to the size of the genome can also be used to assess the quality of assemblies. In this exercise we will be using a long mate pair data to perform assembly and scaffolding and we will focus on how using LMP reads can affect the assemblies. Most part of this tutorial has been precomputed and made available to you in interest of time. We will spend more time exploring and discussing the results.
Pair-end assembly using Chalara dataset¶
Before doing a scaffolding, we need to build an assembly using the pair-end reads first.
Let’s go to the correct location where the files are:
1 |
|
STOP
You DO NOT need to run this command. This has already been done for you.
1 |
|
Let’s look at the stats by doing:
1 |
|
Table 1: Statistics of Chalara assembly by ABySS using k=27.
n | n:500 | L50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
394789 | 11681 | 2094 | 500 | 2880 | 6254 | 12303 | 7936 | 44414 | 44.07e6 | cha1-unitigs.fa |
394199 | 11673 | 2097 | 500 | 2887 | 6255 | 12303 | 7937 | 44414 | 44.11e6 | cha1-contigs.fa |
394161 | 11647 | 2095 | 500 | 2898 | 6269 | 12303 | 7944 | 44414 | 44.12e6 | cha1-scaffolds.fa |
1 |
|
STOP
This is also a pre-computed example for you:
1 |
|
Again, we need to look at the statistics of our new assembly using k=61:
1 |
|
It should be looking like this:
Table 2: Statistics of Chalara assembly by ABySS using k=61.
n | n:500 | L50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
130547 | 12596 | 1770 | 500 | 3352 | 8379 | 16380 | 10518 | 54300 | 50.75e6 | cha2-unitigs.fa |
130210 | 12575 | 1771 | 500 | 3363 | 8382 | 16380 | 10525 | 54300 | 50.78e6 | cha2-contigs.fa |
130182 | 12555 | 1769 | 500 | 3377 | 8394 | 16380 | 10534 | 54300 | 50.78e6 | cha2-scaffolds.fa |
Question
The assembly contiguity is worse than Fusarium, why?
Answer
- Genome characteristics.
- Data
- Coverage
- Error distributions
- Read sizes
- Fragment sizes
- Kmer spectra
Chalara scaffolding using long mate-pair reads (LMP)¶
Let’s put some LMP in there.
1 |
|
STOP
This is also a pre-computed example for you:
1 |
|
less chalmp1-stats.tab
Table 3: Statistics of Chalara assembly by ABySS using k=27.
n | n:500 | L50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
130548 | 12596 | 1770 | 500 | 3352 | 8379 | 16380 | 10518 | 54300 | 50.75e6 | chalmp1-unitigs.fa |
130211 | 12575 | 1771 | 500 | 3363 | 8382 | 16380 | 10525 | 54300 | 50.78e6 | chalmp1-contigs.fa |
130148 | 12545 | 1769 | 500 | 3380 | 8400 | 16380 | 10535 | 54300 | 50.78e6 | chalmp1-scaffolds.fa |
So… what happened? Let’s check the data by:
- Kmer spectra
- Fragment sizes
Any hints on the protocol? And a not-so-obvious property:
1 |
|
Do you remember LMP required pre-processing? Let’s try with processed LMP.
Prior task (already made) preprocess the LMP with nextclip
.
1 |
|
STOP
This is also a pre-computed example for you:
1 |
|
less chalmpproc1-stats.tab
Table 4: Statistics of Chalara assembly by ABySS using k=61 with LMP.
n | n:500 | L50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
130548 | 12596 | 1770 | 500 | 3352 | 8379 | 16380 | 10518 | 54300 | 50.75e6 | chalmpproc1-unitigs.fa |
130211 | 12575 | 1771 | 500 | 3363 | 8382 | 16380 | 10525 | 54300 | 50.78e6 | chalmpproc1-contigs.fa |
122061 | 6679 | 167 | 500 | 18609 | 87510 | 187171 | 106178 | 397967 | 51.15e6 | chalmpproc1-scaffolds.fa |
That’s much better!
Chalara: beyond first pass¶
Do you remember these?
Genome characteristics. Data:
- Coverage
- Error distributions
- Read sizes
- Fragment sizes
- Kmer spectra
Let’s think a bit and try to improve the assembly.
Question
If you look at the pre-processed LMP, do you notice anything peculiar?
Testing the inclusion of heavily pre-processed LMP coverage into the DBG
1 |
|
STOP
This is also a pre-computed example for you:
1 |
|
less chalmp2-stats.tab
Let’s look at the stats.
Table 5: Statistics of improved Chalara assembly by ABySS using k=61 with LMP.
n | n:500 | L50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
128306 | 10150 | 1182 | 500 | 4954 | 12585 | 24777 | 15693 | 68684 | 51.03e6 | chaproclmp4-unitigs.fa |
118939 | 5772 | 362 | 500 | 15717 | 41870 | 82033 | 52819 | 252066 | 52.24e6 | chaproclmp4-contigs.fa |
116139 | 4014 | 141 | 500 | 41145 | 109273 | 224986 | 133450 | 460979 | 52.56e6 | chaproclmp4-scaffolds.fa |
References¶
-
Robert Ekblom* and Jochen B. W. Wolf. “A field guide to whole-genome sequencing, assembly and annotation”. Evolutionary Applications Special Issue: Evolutionary Conservation Volume 7, Issue 9, pages 1026 – 1042, November 2014
-
De novo genome assembly: what every biologist should know. Nature Methods 9, 333 – 337 (2012) doi:10.1038/nmeth.1935