Introduction to Command Line

Key Learning Outcomes

After completing this practical, the trainee should be able to:

Use the command line environment on a Linux operating system.
Run some basic Linux system and file operation commands.
Navigate the structure of biological data files and manipulate them.

Resources

Tools

Basic Linux system commands on an Ubuntu OS.
Basic file operation commands

Links

Author Information

Primary Author(s):
Matt Field matt.field@anu.edu.au

Shell Exercise

Let’s try out your new shell skills on some real data.

The file 1000gp.vcf is a small sample (1%) of a very large text file containing human genetics data. Specifically, it describes genetic variation in three African individuals sequenced as part of the 1000 Genomes Project. The ‘vcf’ extension tells us that the file is in a specific text format, namely Variant Call Format. The file starts with a number of comment lines (they start with ‘#’ or ‘##’), followed by a large number of data lines.

This VCF file lists the differences between the three African individuals and a standard ‘individual’ called the reference (actually based upon a few different people). Each line in the file corresponds to a difference. The line tells us the position of the difference (chromosome and position), the genetic sequence in the reference, and the corresponding sequence in each of the three Africans. Before we start processing the file, let’s get a high-level view of the file that we’re about to work with.

Open the Terminal and go to the directory where the data are stored:

cd /home/trainee/cli
ls
pwd
ls -lh 1000gp.vcf
wc -l 1000gp.vcf

Question

What is the file size (in kilobytes), and how many lines are in the file?

Hint

man ls, man wc

Answer

3.6M as reported by ls -lh, i.e. about 3.6 megabytes (roughly 3,600 kilobytes)

45034 lines

Because this file is so large, you will almost always want to pipe (‘|’) the output of any command to less (a simple text viewer; type q to exit) or head (which prints the first 10 lines by default) so that you don’t accidentally print 45,000 lines to the screen.
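As a small illustration of how head limits what reaches the screen, the sketch below builds a stand-in file of 45,000 numbered lines with seq (it is not the real VCF data, and the temporary file name is generated by mktemp) and previews only the first three lines. The same idea applies when head sits at the end of a pipe, as in some_command | head.

```shell
# Build a stand-in "large" file: 45,000 numbered lines (not real VCF data).
big=$(mktemp)
seq 1 45000 > "$big"

# Count every line in the file.
total=$(wc -l < "$big" | tr -d ' ')

# Preview only the first 3 lines instead of printing all of them.
preview=$(head -n 3 "$big")
echo "$preview"

# Confirm how many lines the preview actually contains.
shown=$(printf '%s\n' "$preview" | wc -l | tr -d ' ')
echo "file has $total lines, preview shows $shown"

rm -f "$big"
```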

Let’s start by printing the first 5 lines to see what it looks like.

head -5 1000gp.vcf

That isn’t very interesting; it’s just some of the comment lines at the beginning of the file (they all start with #)!

Print the first 20 lines to see more of the file.

head -20 1000gp.vcf

Okay, so now we can see the basic structure of the file: a few comment lines that start with ‘#’ or ‘##’, followed by many data lines that are pretty hard to read. Each data line contains the same number of fields, and all fields are separated by TABs. The first four fields are:

the chromosome (which volume the difference is in)
the position (which character in the volume the difference starts at)
the ID of the difference
the sequence in the reference human(s)

The rest of the columns tell us, in a rather complex way, a bunch of additional information about that position, including: the predicted sequence for each of the three Africans and how confident the scientists are that these sequences are correct.
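To make the column layout concrete, here is a minimal sketch using a made-up three-line file in the same TAB-separated layout. The positions, IDs and the fifth column are invented for illustration and are not taken from 1000gp.vcf; awk is used only as one convenient way to count fields and is an assumption of this example, not a tool the exercise requires.

```shell
# Hypothetical data lines: chromosome, position, ID, reference sequence,
# plus one invented extra column standing in for "the rest of the columns".
toy=$(mktemp)
printf '1\t1105324\trs0001\tA\tG\n'  > "$toy"
printf '2\t2207811\trs0002\tC\tT\n' >> "$toy"
printf 'X\t530112\trs0003\tG\tA\n'  >> "$toy"

# Print the number of TAB-separated fields on each line.
# Every line should report the same number.
field_counts=$(awk -F'\t' '{ print NF }' "$toy")
echo "$field_counts"

# Show chromosome and position (fields 1 and 2) for the second line.
second=$(awk -F'\t' 'NR == 2 { print $1 ":" $2 }' "$toy")
echo "second difference is at $second"

rm -f "$toy"
```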

To start analyzing the actual data, we have to remove the header.

Question

How can we print the first 10 non-header lines (those that don’t start with a ’#’)?

Hint

man grep (remember to use pipes ‘|’)

Answer

grep -v "^#" 1000gp.vcf | head

This is an advanced section.

Question

How many lines of data are in the file (rather than counting the number of header lines and subtracting, try just counting the number of data lines)?

Answer

grep -v "^#" 1000gp.vcf | wc -l

(should print 45024)
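One way to sanity-check a count like this is to confirm that header lines plus data lines add up to the total. The sketch below does so on a tiny made-up file (two comment lines and three data lines, all invented); it relies on grep -c, which prints the number of matching lines instead of the lines themselves.

```shell
# Tiny made-up file: 2 comment lines followed by 3 data lines.
toy=$(mktemp)
printf '##fileformat=VCFv4.1\n'    > "$toy"
printf '#CHROM\tPOS\tID\tREF\n'   >> "$toy"
printf '1\t1105324\trs0001\tA\n'  >> "$toy"
printf '2\t2207811\trs0002\tC\n'  >> "$toy"
printf 'X\t530112\trs0003\tG\n'   >> "$toy"

# Count lines that start with '#', and lines that do not.
header=$(grep -c '^#' "$toy")
data=$(grep -v -c '^#' "$toy")

# Count all lines in the file.
total=$(wc -l < "$toy" | tr -d ' ')

echo "header=$header data=$data total=$total"

rm -f "$toy"
```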

Where these differences are located can be important. If all the differences between two encyclopedias were in just the first volume, that would be interesting. The first field of each data line is the name of the chromosome that the difference occurs on (which volume we’re on).

Question

Print the first 10 chromosomes, one per line.

Hint

man cut (remember to remove header lines first)

Answer

grep -v "^#" 1000gp.vcf | cut -f 1 | head

As you should have observed, the first 10 lines are on numbered chromosomes. Every normal cell in your body has 23 pairs of chromosomes, 22 pairs of ‘autosomal’ chromosomes (these are numbered 1-22) and a pair of sex chromosomes (two Xs if you’re female, an X and a Y if you’re male).

Let’s look at which chromosomes these variations are on.

Question

Print a list of the chromosomes that are in the file (each chromosome name should only be printed once, so you should only print 23 lines).

Hint

Remove all duplicates from your previous answer (man sort)

Answer

grep -v "^#" 1000gp.vcf | cut -f 1 | sort -u

Rather than using sort to print unique results, a common pipeline is to first sort and then pipe to another UNIX command, uniq. The uniq command collapses repeated lines into one, and it provides more flexibility than using sort by itself. Keep in mind that uniq only compares adjacent lines, so if the input isn’t sorted, duplicates that are not next to each other will not be removed.
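The effect of sorting first can be seen on a few made-up lines. In this sketch the input values are invented; the only point is the difference between running uniq on unsorted and on sorted input.

```shell
# Made-up input with duplicates that are not all next to each other.
# Without sorting, uniq only merges the two adjacent 2s at the end.
unsorted=$(printf '1\n2\n1\n2\n2\n' | uniq)
echo "uniq alone:"
echo "$unsorted"

# Sorting first places identical lines next to each other,
# so uniq can reduce them to one line each.
sorted=$(printf '1\n2\n1\n2\n2\n' | sort | uniq)
echo "sort then uniq:"
echo "$sorted"
```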

Question

Using sort and uniq, print the number of times each chromosome occurs in the file.

Hint

man uniq

Answer

grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c

Question

Add to your previous solution to list the chromosomes from most frequently observed to least frequently observed.

Hint

Make sure you’re sorting in descending order. By default, sort sorts in ascending order.

Answer

grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -n -r

This is great, but biologists might also like to see the chromosomes ordered by their number (not dictionary order), since different chromosomes have different attributes and this ordering allows them to find a specific chromosome more easily.

Question

Sort the previous output by chromosome number.

Hint

A lot of the power of sort comes from the fact that you can specify which fields to sort on, and the order in which to sort them. In this case you only need to sort on one field.

Answer

grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -k 2n
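To see why the n modifier matters, the sketch below sorts a few made-up "count chromosome" pairs, shaped like the output of uniq -c, first in dictionary order and then numerically on the second field. The counts are invented. Note that a non-numeric name such as X has no numeric value, so with common sort implementations it is expected to be listed before the numbered chromosomes.

```shell
# Made-up "count chromosome" pairs in the shape that uniq -c produces.
toy=$(mktemp)
printf '  30 10\n  25 2\n  12 X\n  40 1\n' > "$toy"

# Dictionary order on the second field: "10" sorts before "2".
dict_order=$(sort -k 2 "$toy" | awk '{ print $2 }')
echo "dictionary order:"
echo "$dict_order"

# Numeric order on the second field: 1, 2, 10.
# X is not a number, so it is treated as 0 and comes first.
num_order=$(sort -k 2n "$toy" | awk '{ print $2 }')
echo "numeric order:"
echo "$num_order"

rm -f "$toy"
```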