How to run a BLAST search¶
Step 1: Download data¶
First, we need some data! For this tutorial, we're using mouse and zebrafish RefSeq protein data sets from NCBI. We'll put them in a temporary file system that's large enough to run data analysis in:
sudo chmod a+rwxt /mnt cd /mnt
chmod command stands for change mode and it is used to define or change permissions or modes on files and limit access to only those who are allowed access. It's the same as using your mouse to right-click a file or folder and selecting the permission tabs and defining who can access the resource.
Now, we'll use
curl to download the files from an online OSF repository to the AWS remote instance. These are copies of the files from the NCBI FTP site.
You can Copy/Paste multiple commands at a time, and they will execute in order.
curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download curl -o zebrafish.1.protein.faa.gz -L https://osf.io/68mgf/download
curl is a command line tool that allows you to transfer data from or to a remote server. The lowercase
-o allows you to specifiy the name of the saved file, while the uppercase
-O saves the file with its original filename. The
-L is the location of the file.
There should now be two protein files in the current directory:
lscommand to list the files in the directory. The
-lflag adds file permissions, ownership, size, and creation time information. The
-hflag converts the file sizes into human-readable units and the
-tflag sorts the files starting with the most recently created file.
Your output should look similar, though the download dates will be the current date and time when you downloaded the files:
total 30M -rw-rw-r-- 1 ubuntu ubuntu 14M Oct 2 20:30 zebrafish.1.protein.faa.gz -rw-rw-r-- 1 ubuntu ubuntu 12M Oct 2 20:28 mouse.1.protein.faa.gz
These files are FASTA protein files (that's what the ".faa"
means) that are compressed with
gzip (that's what the ".gz" means).
Let's look at the first few sequences in the file:
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another -- it's ubiquitous!
It's a text file, containing sequence records. Each record
starts with a
>, and then contains one or more lines of sequence text. The information associated with each
sequence is in the header line: accession number, gene, and species.
Step 2: Subset data for demo¶
Let's take those first two sequences and save them to a file. We'll
do this using output redirection with
>, which says "take
all the output from the preceding command and put it into this new file."
Select the first 11 lines of the file and save as a new file:
head -n 11 mouse.1.protein.faa > mm-first.faa
So now, for example, you can do
cat mm-first.faa to see the contents of
that file (or
ubuntu@ip-172-31-17-217:~$ cat mm-first.faa >YP_220550.1 NADH dehydrogenase subunit 1 (mitochondrion) [Mus musculus domesticus] MFFINILTLLVPILIAMAFLTLVERKILGYMQLRKGPNIVGPYGILQPFADAMKLFMKEPMRPLTTSMSLFIIAPTLSLT LALSLWVPLPMPHPLINLNLGILFILATSSLSVYSILWSGWASNSKYSLFGALRAVAQTISYEVTMAIILLSVLLMNGSY SLQTLITTQEHMWLLLPAWPMAMMWFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFMAEYTNIILMNALTTI IFLGPLYYINLPELYSTNFMMEALLLSSTFLWIRASYPRFRYDQLMHLLWKNFLPLTLALCMWHISLPIFTAGVPPYM >YP_220551.1 NADH dehydrogenase subunit 2 (mitochondrion) [Mus musculus domesticus] MNPITLAIIYFTIFLGPVITMSSTNLMLMWVGLEFSLLAIIPMLINKKNPRSTEAATKYFVTQATASMIILLAIVLNYKQ LGTWMFQQQTNGLILNMTLMALSMKLGLAPFHFWLPEVTQGIPLHMGLILLTWQKIAPLSILIQIYPLLNSTIILMLAIT SIFMGAWGGLNQTQMRKIMAYSSIAHMGWMLAILPYNPSLTLLNLMIYIILTAPMFMALMLNNSMTINSISLLWNKTPAM LTMISLMLLSLGGLPPLTGFLPKWIIITELMKNNCLIMATLMAMMALLNLFFYTRLIYSTSLTMFPTNNNSKMMTHQTKT KPNLMFSTLAIMSTMTLPLAPQLIT
if you try
less mm-first.faa, exit back to the terminal by pressing the Q key in your keyboard.
Step 3: Create blast database¶
Now let's BLAST these two mouse sequences against the entire zebrafish
protein data set. First, we need to tell BLAST that the zebrafish
sequences are (a) a database, and (b) a protein database. That's done by calling
makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Building a new DB, current time: 10/02/2020 20:38:17 New DB name: /home/ubuntu/zebrafish.1.protein.faa New DB title: zebrafish.1.protein.faa Sequence type: Protein Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 53088 sequences in 1.69959 seconds.
Step 4: Run a BLAST search¶
We'll use the
blastp command to conduct a protein sequence search comparing the mouse sequences (
-query) with the zebrafish database (
blastp -query mm-first.faa -db zebrafish.1.protein.faa
This should run pretty quickly, but you're going to get a lot of output!!
To save it to a file instead of watching it go past on the screen,
ask BLAST to save the output to a file that we'll name
blastp -query mm-first.faa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt
Now you can page through this file by typing:
Use the Up and Down keys to move up/down, and Q to exit the paging mode.
Let's run some more sequences (this search will take a little longer to run).
head -n command, we'll select the first 498 lines of the mouse protein FASTA file.
Since these FASTA files are divided over five lines per sequence, this subset will include 96 sequences.
head -n 498 mouse.1.protein.faa > mm-second.faa blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
Check the output file:
We can verify the number of records by searching for all of the lines that include the
grep '>' mm-second.faa | wc -l
grep is a powerful tool for finding patterns and words on Unix and Linux Systems.
wc -l tells
wc to count the number of lines that have the matching pattern
>. The command should tell you there are 96 lines in this file.
Why did it take longer to BLAST
mm-second.faa has 96 sequences in comparison to
mm-first.faa which only had 2 sequences.
Last, but not least, let's generate a more machine-readable version of that last file. There are several BLAST output formats. The blast output format 6 is the most commonly used.
blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6
The first few lines look like this:
YP_220550.1 NP_059331.1 69.010 313 97 0 4 316 10 322 1.24e-150 426 YP_220551.1 NP_059332.1 44.509 346 188 3 1 344 1 344 8.62e-92 279 YP_220551.1 NP_059341.1 24.540 163 112 3 112 263 231 393 5.15e-06 49.7
For this particular example, the blast search results show that the sequences we compared are quite different. The top match from the mouse data was only 69% similar to the zebrafish reference database. Of course, this makes sense since mouse and zebrafish species are very divergent!
Remember to download any files you want to save and terminate your instance when you have completed your tasks with AWS. See the AWS tutorial for help.
BLAST searches can be conducted between DNA and protein sequences, including translating DNA to protein or protein to DNA. For this tutorial, we used the
blastp command to compare protein sequences. Other commands include
blastn for comparing DNA sequences and
tblastn for comparing translated sequences. This flexibility and ease of use has made BLAST a very popular tool for quickly comparing sequences. Command line BLAST searches can also be scaled up to compare sequences for multiple samples to large databases. For more information about BLAST, check out the NCBI webpage.