The Basic Local Alignment Search Tool (BLAST) finds regions of similarity between sequences. The program compares nucleotide or protein sequences and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
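There are several types of BLAST searches. NCBI's WebBLAST offers four main search types: Nucleotide BLAST (blastn, nucleotide query against a nucleotide database), Protein BLAST (blastp, protein query against a protein database), blastx (translated nucleotide query against a protein database), and tblastn (protein query against a translated nucleotide database).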
Standalone and API BLAST options, as well as pre-populated specialized searches, are also available from the BLAST homepage.
Object: Starting with a sequence, identify the protein or gene and the source.
Example: From the following sequence (available at http://tinyurl.com/blastp-sequence, or copy the sequence below), identify the most probable protein and organism:
MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE
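The sequence above is printed in blocks of ten residues separated by spaces. The BLAST search box generally accepts this layout, but scripts and other tools usually expect plain FASTA. The following is a minimal sketch of that cleanup; the function name `to_fasta` and the header `unknown_protein` are illustrative choices, not part of any NCBI tool.

```python
import re
import textwrap

# The query sequence exactly as printed above, in blocks of ten residues.
RAW = """
MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE
"""


def to_fasta(raw, header="query", width=60):
    """Strip whitespace and digits, then wrap the residues as a FASTA record."""
    residues = re.sub(r"[\s\d]", "", raw).upper()
    # textwrap.wrap splits the unbroken string into chunks of `width` characters.
    body = "\n".join(textwrap.wrap(residues, width))
    return f">{header}\n{body}"


fasta = to_fasta(RAW, header="unknown_protein")
sequence = "".join(fasta.splitlines()[1:])
print(fasta)
print(len(sequence), "residues")
```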
Querying a sequence
Protein and gene sequence comparisons are done with BLAST (Basic Local Alignment Search Tool).
To access BLAST, go to Resources > Sequence Analysis > BLAST.
Because the query is a protein sequence, select Protein BLAST from the BLAST menu.
Enter the query sequence in the search box, provide a job title, choose a database to query, and click BLAST.
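The same search can be submitted through the BLAST URL API instead of the web form. The sketch below only builds the request and extracts the request ID (RID) from a reply; the helper names (`build_put_body`, `parse_rid`, `submit`) are hypothetical, the database `nr` is an assumed choice, and the sample reply in the usage lines is made up. Check NCBI's current usage guidelines before sending real requests.

```python
import re
import urllib.parse
import urllib.request

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_body(query, program="blastp", database="nr"):
    """Return the URL-encoded body of a CMD=Put (submit search) request."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    return urllib.parse.urlencode(params).encode("ascii")


def parse_rid(response_text):
    """Extract the request ID from the 'RID = ...' line of a Put response."""
    match = re.search(r"RID = (\S+)", response_text)
    return match.group(1) if match else None


def submit(query):
    """Send the search to NCBI and return its RID (requires network access)."""
    request = urllib.request.Request(BLAST_URL, data=build_put_body(query))
    with urllib.request.urlopen(request) as response:
        return parse_rid(response.read().decode("utf-8", errors="replace"))


# Offline demonstration: build the request body and parse a made-up reply.
body = build_put_body("MSKRKAPQETLNGGITDMLT")
sample_reply = "    RID = EXAMPLE123\n    RTOE = 25\n"
print(body.decode("ascii"))
print(parse_rid(sample_reply))
```

Once a search has finished, its results are retrieved with a second request that uses CMD=Get together with the RID.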
Viewing your results
Under the Alignments tab, next to Alignment view, select Pairwise with dots for identities.
View the Descriptions tab to see a list of significant alignments.
Clicking on a protein name displays the pairwise sequence alignment and links to additional information about the protein and its associated gene (if available).
In the Pairwise with dots for identities display, any amino acid in the subject sequence that differs from the query sequence is shown in red.
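The dots-for-identities view can be reproduced for any pair of already aligned sequences. This is a small sketch of the idea, not BLAST's own rendering code; the subject string is an invented variant of the first ten query residues.

```python
def dots_for_identities(query, subject):
    """Return the subject line with residues identical to the query shown as dots."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "." if q == s and q != "-" else s
        for q, s in zip(query, subject)
    )


query = "MSKRKAPQET"    # first ten residues of the example query
subject = "MSKRKAPHEA"  # invented subject with two differences
rendered = dots_for_identities(query, subject)
print("Query ", query)
print("Sbjct ", rendered)
```

Only the two differing residues remain visible in the subject line, which is what makes mismatches easy to spot in a long alignment.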
Saving your results
To save your search queries and settings, click on the Save Search link, then log in to My NCBI using the Sign in or Register link at the upper right. Once you do this, your search strategies should appear in the Saved Search Strategies tab.
Object: Starting with two or more sequences, compare them and find the differences.
Example: In the NCBI database Nucleotide, enter the following search:
human[orgn] AND mitochondrion[ti]
This will search for nucleic acid sequences from humans with the word "mitochondrion" in the title. Mitochondrial DNA is often used in evolutionary comparisons because it is inherited only through the maternal lineage and does not recombine, so differences between lineages accumulate through mutation alone.
Limit the results to NCBI Reference Sequences by selecting the RefSeq limit under Source databases in the left-hand Filter menu. These are high-quality sequences that have been curated and annotated by NCBI staff.
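The same query can be expressed as an E-utilities ESearch request. The sketch below only constructs the URL. It assumes that the search term `srcdb_refseq[prop]` approximates the RefSeq filter in the web interface, which may not return exactly the same records; `build_esearch_url` is a hypothetical helper name.

```python
import urllib.parse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", refseq_only=False):
    """Build an ESearch URL for the given Entrez query string."""
    if refseq_only:
        # Assumption: this property term stands in for the RefSeq filter.
        term = f"({term}) AND srcdb_refseq[prop]"
    return ESEARCH_URL + "?" + urllib.parse.urlencode({"db": db, "term": term})


url = build_esearch_url("human[orgn] AND mitochondrion[ti]", refseq_only=True)
print(url)
```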
There are three Reference Sequences for the mitochondrial genome in humans: one for modern humans (Homo sapiens), one for Neanderthals (Homo sapiens neanderthalensis), and one for Denisovans (Homo sp. Altai).
In the right-hand discovery menu under Analyze these sequences click Run BLAST.
This will open BLASTn, Nucleotide BLAST, and automatically add the accession numbers of these Reference Sequences into the Query Sequence box.
To compare sequences, check the box next to Align two or more sequences under the Query Sequence box. To BLAST the modern human mitochondrial genome sequence (NC_012920.1) against the subject sequences of Neanderthal (NC_011137.1) and Denisovan (NC_013993.1), move the latter two accession numbers from the Query Sequence box into the Subject Sequence box using copy and paste.
Click BLAST, leaving the other settings at their default options.
You should see two results, each comparing the query sequence (modern human) to one of the subject sequences, Neanderthal or Denisovan. Note the percent identity reported for each: the query sequence is 99% identical to the Neanderthal sequence and 98% identical to the Denisovan sequence.
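The percentages in the results are identities: the number of aligned positions where query and subject match, divided by the alignment length. A minimal sketch of that calculation for two already aligned strings follows; the short sequences are invented for illustration and are not taken from the mitochondrial genomes.

```python
def percent_identity(query, subject):
    """Percent of alignment columns in which both sequences carry the same base."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    if not query:
        raise ValueError("alignment is empty")
    # Gap columns count toward the alignment length but never as matches.
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * matches / len(query)


identity = percent_identity("ACGTACGTAC", "ACGTACGAAC")
print(f"{identity:.1f}% identity")
```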
To see how the sequences differ and what the biological significance might be:
Click on the name of the first result (Homo sapiens neanderthalensis). You should see a base-by-base comparison of the two sequences in two lines. The top line is the query sequence (modern human). In the second line, representing the subject sequence (ancient human), bases where the subject sequence is identical to the query sequence are replaced by dots, and bases where the subject sequence differs from the query sequence appear in red.
Scroll down to the first coding sequence (CDS). CDS regions are displayed in four lines: the first line shows the amino acid translation of the query sequence (modern human), and the second line shows the query sequence itself. The third line is the subject sequence (ancient human), and the fourth line shows its amino acid translation.
Note that there are two additional amino acids, M (methionine) and P (proline), at the beginning of the protein sequence in modern humans compared to Neanderthal. This is due to the substitution of T (thymine) at position 3308 in the modern human sequence for C (cytosine) in the analogous position in the Neanderthal sequence.
Note as well that the substitution of A (adenine) at position 3334 in the modern human sequence for G (guanine) in the Neanderthal sequence results in an amino acid difference in the protein sequences. In the modern human protein sequence an I (isoleucine) replaces a V (valine) present in the Neanderthal protein sequence.
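To see how a single base change can alter the protein, it helps to translate the affected codons with the vertebrate mitochondrial genetic code (NCBI translation table 2), which differs from the standard code in a few codons. The sketch below builds that table and translates two codon pairs. The codons themselves (ATA/ACA and ATT/GTT) are illustrative assumptions chosen to match the amino acids named above; read the real codons from the alignment before drawing conclusions.

```python
BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order, vertebrate mitochondrial code
# (table 2): ATA codes for M, TGA for W, and AGA/AGG are stop codons.
AMINO_ACIDS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_codon(codon):
    """Translate one DNA or RNA codon using the vertebrate mitochondrial code."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


# Assumed example codons: (modern human, Neanderthal).
for human, neanderthal in [("ATA", "ACA"), ("ATT", "GTT")]:
    print(human, translate_codon(human), "vs", neanderthal, translate_codon(neanderthal))
```

Under these assumptions, a T-to-C change in the second position replaces a methionine codon with a threonine codon, and an A-to-G change in the first position replaces isoleucine with valine.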
To investigate the biological significance of this change, go to the Amino Acid Explorer. In the left-hand menu, use the Compare tool to see what effects a change from V to I might have. Look at both the text and graphics comparisons. Does this seem to be a conservative mutation (that is, one that results in little or no change in protein structure or function) or a non-conservative mutation (that is, one that results in a significant change in protein structure or function)?
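As a rough first pass before consulting the Amino Acid Explorer, a substitution can be checked against a coarse grouping of amino acids by side-chain properties. The grouping below is a simplified textbook-style classification assumed for illustration; it is not the scoring used by the Amino Acid Explorer or by BLAST, and real effects depend on where the residue sits in the protein.

```python
# Assumed, simplified physicochemical groups (one-letter amino acid codes).
GROUPS = {
    "aliphatic": set("AVLIM"),
    "aromatic": set("FWY"),
    "polar uncharged": set("STNQ"),
    "positively charged": set("KRH"),
    "negatively charged": set("DE"),
    "special": set("CGP"),
}


def group_of(residue):
    """Return the name of the group that contains the residue."""
    for name, members in GROUPS.items():
        if residue.upper() in members:
            return name
    raise ValueError(f"unknown amino acid: {residue!r}")


def is_conservative(original, replacement):
    """True when both residues fall in the same coarse property group."""
    return group_of(original) == group_of(replacement)


print("V to I:", group_of("V"), "to", group_of("I"), "conservative:", is_conservative("V", "I"))
```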
Now scroll down to the Denisovan result and look at positions 3308 and 3334 in the query sequence. Are there any differences in the Denisovan sequence at these positions?