Creating a Custom Database Using Standalone NCBI BLAST+

Basic Local Alignment Search Tool (BLAST) is a collection of programs, implemented in C++ and based on a heuristic algorithm, for comparing DNA, RNA, and protein sequences. The standalone command-line interface (CLI) version of BLAST is named BLAST+. The latest version of NCBI BLAST+ can be downloaded from the NCBI FTP server (ftp://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST). This is a simple tutorial for creating a custom database, accessing the database, and performing a sequence search using BLAST+.

1. Creating a Custom Database

A nucleotide (nucl) or protein (prot) database can be created using the -dbtype parameter of the makeblastdb program. makeblastdb is a command-line utility from NCBI's BLAST+ suite that builds a searchable database from a sequence file (such as FASTA), producing indexed files that let BLAST programs find matches to query sequences quickly. Its main options are -in for the input file, -dbtype for the database type (nucl or prot), -title for an optional human-readable title (quote it if it contains spaces), and -out for the database name. The program generates several files that share the specified database name and differ in extension (e.g., .phr, .pin, .psq for protein; .nhr, .nin, .nsq for nucleotide). Larger databases may need sufficient virtual memory; otherwise, the build could fail.
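As a rough illustration of how these options fit together, the sketch below assembles a makeblastdb argument list in Python and runs it only when the program and the input file are available. The helper name makeblastdb_args and the title "Insulin proteins" are illustrative assumptions; the flags themselves are the ones described in this tutorial.

```python
import os
import shutil
import subprocess


def makeblastdb_args(fasta, name, dbtype="prot", title=None, parse_seqids=False):
    """Assemble a makeblastdb argument list (hypothetical helper, not part of BLAST+)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    args = ["makeblastdb", "-in", fasta, "-out", name, "-dbtype", dbtype]
    if title is not None:
        # A list element is passed as one argument, so a title with spaces needs no shell quoting.
        args += ["-title", title]
    if parse_seqids:
        args.append("-parse_seqids")
    return args


cmd = makeblastdb_args("DB.fasta", "DB", title="Insulin proteins", parse_seqids=True)
print(" ".join(cmd))

# Only attempt the build when makeblastdb is on PATH and the input file exists.
if shutil.which("makeblastdb") and os.path.exists("DB.fasta"):
    subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string sidesteps the quoting that a title containing spaces would otherwise need in a shell.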

We can create two types of database using the commands below:

Non-indexed Database: ./makeblastdb -in DBX.fasta -out DBX -dbtype prot

Building a new DB, current time: 12/04/2020 10:10:06
New DB name:   C:\NCBI\blast-2.6.0+\bin\DBX
New DB title:  DBX.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.0041614 seconds.

Indexed Database: ./makeblastdb -in DB.fasta -out DB -dbtype prot -parse_seqids

Building a new DB, current time: 12/03/2020 17:29:38
New DB name:   C:\NCBI\blast-2.11.0+\bin\DB
New DB title:  DB.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.0277056 seconds.

The sequence files DB.fasta and DBX.fasta are given at the end of this page.

2. List of Records in the Database

The entries in a database can be listed using the -entry all parameter of the blastdbcmd program. The commands to list the sequence identifiers assigned in the non-indexed and indexed databases are below:

Non-indexed Database: ./blastdbcmd -db DBX -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"

OID: 0 GI: N/A ACC: BL_ORD_ID:0 IDENTIFIER: gnl|BL_ORD_ID|0
OID: 1 GI: N/A ACC: BL_ORD_ID:1 IDENTIFIER: gnl|BL_ORD_ID|1
OID: 2 GI: N/A ACC: BL_ORD_ID:2 IDENTIFIER: gnl|BL_ORD_ID|2
OID: 3 GI: N/A ACC: BL_ORD_ID:3 IDENTIFIER: gnl|BL_ORD_ID|3
OID: 4 GI: N/A ACC: BL_ORD_ID:4 IDENTIFIER: gnl|BL_ORD_ID|4
OID: 5 GI: N/A ACC: BL_ORD_ID:5 IDENTIFIER: gnl|BL_ORD_ID|5
OID: 6 GI: N/A ACC: BL_ORD_ID:6 IDENTIFIER: gnl|BL_ORD_ID|6
OID: 7 GI: N/A ACC: BL_ORD_ID:7 IDENTIFIER: gnl|BL_ORD_ID|7
OID: 8 GI: N/A ACC: BL_ORD_ID:8 IDENTIFIER: gnl|BL_ORD_ID|8
OID: 9 GI: N/A ACC: BL_ORD_ID:9 IDENTIFIER: gnl|BL_ORD_ID|9
OID: 10 GI: N/A ACC: BL_ORD_ID:10 IDENTIFIER: gnl|BL_ORD_ID|10
OID: 11 GI: N/A ACC: BL_ORD_ID:11 IDENTIFIER: gnl|BL_ORD_ID|11
OID: 12 GI: N/A ACC: BL_ORD_ID:12 IDENTIFIER: gnl|BL_ORD_ID|12
OID: 13 GI: N/A ACC: BL_ORD_ID:13 IDENTIFIER: gnl|BL_ORD_ID|13
OID: 14 GI: N/A ACC: BL_ORD_ID:14 IDENTIFIER: gnl|BL_ORD_ID|14
OID: 15 GI: N/A ACC: BL_ORD_ID:15 IDENTIFIER: gnl|BL_ORD_ID|15
OID: 16 GI: N/A ACC: BL_ORD_ID:16 IDENTIFIER: gnl|BL_ORD_ID|16
OID: 17 GI: N/A ACC: BL_ORD_ID:17 IDENTIFIER: gnl|BL_ORD_ID|17
OID: 18 GI: N/A ACC: BL_ORD_ID:18 IDENTIFIER: gnl|BL_ORD_ID|18
OID: 19 GI: N/A ACC: BL_ORD_ID:19 IDENTIFIER: gnl|BL_ORD_ID|19

Indexed Database: ./blastdbcmd -db DB -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"

OID: 0 GI: N/A ACC: Sequence1 IDENTIFIER: lcl|Sequence1
OID: 1 GI: N/A ACC: Sequence2 IDENTIFIER: lcl|Sequence2
OID: 2 GI: N/A ACC: Sequence3 IDENTIFIER: lcl|Sequence3
OID: 3 GI: N/A ACC: Sequence4 IDENTIFIER: lcl|Sequence4
OID: 4 GI: N/A ACC: Sequence5 IDENTIFIER: lcl|Sequence5
OID: 5 GI: N/A ACC: Sequence6 IDENTIFIER: lcl|Sequence6
OID: 6 GI: N/A ACC: Sequence7 IDENTIFIER: lcl|Sequence7
OID: 7 GI: N/A ACC: Sequence8 IDENTIFIER: lcl|Sequence8
OID: 8 GI: N/A ACC: Sequence9 IDENTIFIER: lcl|Sequence9
OID: 9 GI: N/A ACC: Sequence10 IDENTIFIER: lcl|Sequence10
OID: 10 GI: N/A ACC: Sequence11 IDENTIFIER: lcl|Sequence11
OID: 11 GI: N/A ACC: Sequence12 IDENTIFIER: lcl|Sequence12
OID: 12 GI: N/A ACC: Sequence13 IDENTIFIER: lcl|Sequence13
OID: 13 GI: N/A ACC: Sequence14 IDENTIFIER: lcl|Sequence14
OID: 14 GI: N/A ACC: Sequence15 IDENTIFIER: lcl|Sequence15
OID: 15 GI: N/A ACC: Sequence16 IDENTIFIER: lcl|Sequence16
OID: 16 GI: N/A ACC: Sequence17 IDENTIFIER: lcl|Sequence17
OID: 17 GI: N/A ACC: Sequence18 IDENTIFIER: lcl|Sequence18
OID: 18 GI: N/A ACC: Sequence19 IDENTIFIER: lcl|Sequence19
OID: 19 GI: N/A ACC: Sequence20 IDENTIFIER: lcl|Sequence20

The General (gnl) identifiers carrying a BLAST Ordinal Identifier (BL_ORD_ID) belong to the non-indexed database, whereas the Local (lcl) identifiers belong to the indexed database.
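To work with these listings programmatically, the lines can be split back into their fields. The following is a minimal sketch that assumes exactly the -outfmt string used above ("OID: %o GI: %g ACC: %a IDENTIFIER: %i"); the function name parse_listing_line and the ordinal_only key are hypothetical names chosen for illustration.

```python
import re

# Matches lines produced with -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
LISTING_RE = re.compile(r"OID: (\d+) GI: (\S+) ACC: (\S+) IDENTIFIER: (\S+)")


def parse_listing_line(line):
    """Split one blastdbcmd listing line into a dict of its fields."""
    match = LISTING_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized listing line: {line!r}")
    oid, gi, acc, identifier = match.groups()
    return {
        "oid": int(oid),
        "gi": gi,
        "acc": acc,
        "identifier": identifier,
        # True when the identifier is a BLAST-assigned ordinal ID (gnl|BL_ORD_ID|N)
        "ordinal_only": identifier.startswith("gnl|BL_ORD_ID|"),
    }


# Two lines copied from the listings above
for line in (
    "OID: 1 GI: N/A ACC: BL_ORD_ID:1 IDENTIFIER: gnl|BL_ORD_ID|1",
    "OID: 0 GI: N/A ACC: Sequence1 IDENTIFIER: lcl|Sequence1",
):
    print(parse_listing_line(line))
```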

3. Searching Sequence from the Database

A sequence in the database can be accessed by its entry ID using the -entry parameter of the blastdbcmd program. The commands to access an entry in the non-indexed and indexed databases are below:

Non-indexed Database: ./blastdbcmd -db DBX -entry 'gnl|BL_ORD_ID|1'

>AAA40590.1 insulin [Octodon degus]
MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLEAGGLQPSALE
MILQKRGIVDQCCNNICTFNQLQNYCNVP

The latest BLAST+ does not permit access to the first entry (index number ‘0’) of the non-indexed database, because it treats ‘1’ as the starting index number. Moreover, it does not recognize entries of a non-indexed database; I used BLAST+ version 2.6.0 to construct the non-indexed database.

./blastdbcmd -db DBX -entry 'gnl|BL_ORD_ID|0'
Error: [blastdbcmd] CObject_id::GetId(): Invalid choice selection: NCBI-General::Object-id.str

Indexed Database: ./blastdbcmd -db DB -entry Sequence1

>Sequence1 NP_001191615.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRFLAKYMVKRDT
ENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISRTGRSNSGHAQLEDNFS

4. Retrieving Sequence from the Database

A sequence can be retrieved from the database by its entry ID using the -entry parameter of the blastdbcmd program and written to a file with the -out parameter. The commands to save a sequence to a file from the non-indexed and indexed databases are below:

Non-indexed Database: ./blastdbcmd -db DBX -entry 'gnl|BL_ORD_ID|1' -out Sequence2.fasta

Indexed Database: ./blastdbcmd -db DB -entry Sequence1 -out Sequence1.fasta

5. Performing Pairwise Alignment

Pairwise sequence alignment can be performed by passing the query either as an input file (in FASTA format) through the -query parameter, or as a raw sequence on the command line (not supported in older versions of BLAST+). The commands to perform pairwise alignment are below:

./blastp -db DB -query sequence.fasta

OR

echo ALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE | ./blastp -db DB

The sequence alignment output is below:

BLASTP 2.11.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: DB.fasta
           20 sequences; 2,178 total letters



Query=
Length=59
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

Sequence3 KAB1251309.1 Insulin [Camelus dromedarius]                  120     3e-41
Sequence16 AAA59172.1 insulin [Homo sapiens]                          91.3    6e-30
Sequence10 AAA19033.1 insulin [Oryctolagus cuniculus]                 89.7    3e-29
Sequence4 NP_001035835.1 insulin, isoform 2 precursor [Homo sapiens]  85.1    2e-26
Sequence8 AAB60625.1 insulin [Ovis aries]                             82.0    2e-26
Sequence20 ELK28555.1 Insulin [Myotis davidii]                        77.0    1e-23
Sequence7 pir||INHY insulin - hamster                                 65.5    3e-20
Sequence13 pir||INEL insulin - elephant                               64.7    6e-20
Sequence15 pir||INTK insulin - turkey (tentative sequence)            63.5    1e-19
Sequence12 pir||INOS insulin - ostrich                                63.5    1e-19
Sequence11 pir||INMKSQ insulin - common squirrel monkey               60.1    3e-18
Sequence2 AAA40590.1 insulin [Octodon degus]                          60.8    6e-18
Sequence6 pir||INCD insulin - cod (Gadus sp.)                         55.8    2e-16
Sequence5 NP_571131.1 insulin preproprotein [Danio rerio]             53.9    3e-15
Sequence9 XP_014388588.1 PREDICTED: insulin [Myotis brandtii]         48.9    1e-12
Sequence19 BAS32722.1 insulin, partial [Varanus exanthematicus]       38.5    2e-09
Sequence17 QBX89050.1 insulin, partial [Nephrops norvegicus]          19.6    0.042


>Sequence3 KAB1251309.1 Insulin [Camelus dromedarius]
Length=110

 Score = 120 bits (300),  Expect = 3e-41, Method: Compositional matrix adjust.
 Identities = 59/59 (100%), Positives = 59/59 (100%), Gaps = 0/59 (0%)

Query  1   ALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
           ALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE
Sbjct  9   ALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  67


>Sequence16 AAA59172.1 insulin [Homo sapiens]
Length=110

 Score = 91.3 bits (225),  Expect = 6e-30, Method: Compositional matrix adjust.
 Identities = 42/50 (84%), Positives = 42/50 (84%), Gaps = 0/50 (0%)

Query  10  APTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
            P PA AF NQHLCGSHLVEALYLVCGERGFFYTPK RRE ED QVG VE
Sbjct  18  GPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVE  67


>Sequence10 AAA19033.1 insulin [Oryctolagus cuniculus]
Length=110

 Score = 89.7 bits (221),  Expect = 3e-29, Method: Compositional matrix adjust.
 Identities = 40/47 (85%), Positives = 43/47 (91%), Gaps = 0/47 (0%)

Query  13  PARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
           PA+AF NQHLCGSHLVEALYLVCGERGFFYTPK+RREVE+ QVG  E
Sbjct  21  PAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRREVEELQVGQAE  67


>Sequence4 NP_001035835.1 insulin, isoform 2 precursor [Homo sapiens]
Length=200

 Score = 85.1 bits (209),  Expect = 2e-26, Method: Compositional matrix adjust.
 Identities = 38/44 (86%), Positives = 38/44 (86%), Gaps = 0/44 (0%)

Query  11  PTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQ  54
           P PA AF NQHLCGSHLVEALYLVCGERGFFYTPK RRE ED Q
Sbjct  19  PDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ  62


>Sequence8 AAB60625.1 insulin [Ovis aries]
Length=105

 Score = 82.0 bits (201),  Expect = 2e-26, Method: Compositional matrix adjust.
 Identities = 38/43 (88%), Positives = 39/43 (91%), Gaps = 0/43 (0%)

Query  16  AFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGV  58
           AF NQHLCGSHLVEALYLVCGERGFFYTPKARREVE  QVG +
Sbjct  24  AFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEGPQVGAL  66


>Sequence20 ELK28555.1 Insulin [Myotis davidii]
Length=168

 Score = 77.0 bits (188),  Expect = 1e-23, Method: Compositional matrix adjust.
 Identities = 34/40 (85%), Positives = 36/40 (90%), Gaps = 0/40 (0%)

Query  15  RAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQ  54
           +AF NQHLCGSHLVEALYLVCGERGFFYTPK RRE+ D Q
Sbjct  23  QAFVNQHLCGSHLVEALYLVCGERGFFYTPKDRRELPDPQ  62


 Score = 44.3 bits (103),  Expect = 5e-11, Method: Compositional matrix adjust.
 Identities = 24/52 (46%), Positives = 32/52 (62%), Gaps = 2/52 (4%)

Query  8    LGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
            L +  P+++  +QHLCG  LV AL + CG+RG FY P A  E +D Q   VE
Sbjct  80   LASVDPSQS-QDQHLCGDELVNALTITCGDRG-FYNPMAPLEQDDLQEEEVE  129


>Sequence7 pir||INHY insulin - hamster
Length=51

 Score = 65.5 bits (158),  Expect = 3e-20, Method: Compositional matrix adjust.
 Identities = 28/30 (93%), Positives = 29/30 (97%), Gaps = 0/30 (0%)

Query  17  FANQHLCGSHLVEALYLVCGERGFFYTPKA  46
           F NQHLCGSHLVEALYLVCGERGFFYTPK+
Sbjct  1   FVNQHLCGSHLVEALYLVCGERGFFYTPKS  30


>Sequence13 pir||INEL insulin - elephant
Length=51

 Score = 64.7 bits (156),  Expect = 6e-20, Method: Compositional matrix adjust.
 Identities = 28/30 (93%), Positives = 28/30 (93%), Gaps = 0/30 (0%)

Query  17  FANQHLCGSHLVEALYLVCGERGFFYTPKA  46
           F NQHLCGSHLVEALYLVCGERGFFYTPK
Sbjct  1   FVNQHLCGSHLVEALYLVCGERGFFYTPKT  30


>Sequence15 pir||INTK insulin - turkey (tentative sequence)
Length=51

 Score = 63.5 bits (153),  Expect = 1e-19, Method: Compositional matrix adjust.
 Identities = 28/29 (97%), Positives = 29/29 (100%), Gaps = 0/29 (0%)

Query  18  ANQHLCGSHLVEALYLVCGERGFFYTPKA  46
           ANQHLCGSHLVEALYLVCGERGFFY+PKA
Sbjct  2   ANQHLCGSHLVEALYLVCGERGFFYSPKA  30


>Sequence12 pir||INOS insulin - ostrich
Length=51

 Score = 63.5 bits (153),  Expect = 1e-19, Method: Compositional matrix adjust.
 Identities = 28/29 (97%), Positives = 29/29 (100%), Gaps = 0/29 (0%)

Query  18  ANQHLCGSHLVEALYLVCGERGFFYTPKA  46
           ANQHLCGSHLVEALYLVCGERGFFY+PKA
Sbjct  2   ANQHLCGSHLVEALYLVCGERGFFYSPKA  30


>Sequence11 pir||INMKSQ insulin - common squirrel monkey
Length=51

 Score = 60.1 bits (144),  Expect = 3e-18, Method: Compositional matrix adjust.
 Identities = 26/30 (87%), Positives = 26/30 (87%), Gaps = 0/30 (0%)

Query  17  FANQHLCGSHLVEALYLVCGERGFFYTPKA  46
           F NQHLCG HLVEALYLVCGERGFFY PK
Sbjct  1   FVNQHLCGPHLVEALYLVCGERGFFYAPKT  30


>Sequence2 AAA40590.1 insulin [Octodon degus]
Length=109

 Score = 60.8 bits (146),  Expect = 6e-18, Method: Compositional matrix adjust.
 Identities = 28/49 (57%), Positives = 35/49 (71%), Gaps = 1/49 (2%)

Query  11  PTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
           P   +A+++QHLCGS+LVEALY+ CG  G FY P  RRE+ED QV   E
Sbjct  19  PNSVQAYSSQHLCGSNLVEALYMTCGRSG-FYRPHDRRELEDLQVEQAE  66


>Sequence6 pir||INCD insulin - cod (Gadus sp.)
Length=51

 Score = 55.8 bits (133),  Expect = 2e-16, Method: Compositional matrix adjust.
 Identities = 23/27 (85%), Positives = 25/27 (93%), Gaps = 0/27 (0%)

Query  20  QHLCGSHLVEALYLVCGERGFFYTPKA  46
           QHLCGSHLV+ALYLVCG+RGFFY PK
Sbjct  5   QHLCGSHLVDALYLVCGDRGFFYNPKG  31


>Sequence5 NP_571131.1 insulin preproprotein [Danio rerio]
Length=108

 Score = 53.9 bits (128),  Expect = 3e-15, Method: Compositional matrix adjust.
 Identities = 25/32 (78%), Positives = 27/32 (84%), Gaps = 2/32 (6%)

Query  20  QHLCGSHLVEALYLVCGERGFFYTPKARREVE  51
           QHLCGSHLV+ALYLVCG  GFFY PK  R+VE
Sbjct  27  QHLCGSHLVDALYLVCGPTGFFYNPK--RDVE  56


>Sequence9 XP_014388588.1 PREDICTED: insulin [Myotis brandtii]
Length=183

 Score = 48.9 bits (115),  Expect = 1e-12, Method: Compositional matrix adjust.
 Identities = 26/51 (51%), Positives = 34/51 (67%), Gaps = 1/51 (2%)

Query  9   GAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
            APTPA+AF  +HLC   L E L ++CG++G F  PKA RE+ D Q G V+
Sbjct  17  WAPTPAQAFYFEHLCDEDLAEMLTIICGDQG-FRNPKATRELPDPQEGEVD  66


 Score = 46.6 bits (109),  Expect = 7e-12, Method: Compositional matrix adjust.
 Identities = 22/40 (55%), Positives = 28/40 (70%), Gaps = 1/40 (3%)

Query  20   QHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE  59
            Q LCG  LV+ L +VCG+RG FY+P A RE+ D Q G V+
Sbjct  106  QRLCGEDLVDTLTMVCGDRG-FYSPTALRELPDPQEGEVD  144


>Sequence19 BAS32722.1 insulin, partial [Varanus exanthematicus]
Length=88

 Score = 38.5 bits (88),  Expect = 2e-09, Method: Compositional matrix adjust.
 Identities = 22/49 (45%), Positives = 29/49 (59%), Gaps = 2/49 (4%)

Query  3   LALLALGAPTPARAFA--NQHLCGSHLVEALYLVCGERGFFYTPKARRE  49
           L LLA+ APT   A +  ++HLCGS LVEAL   CG+ G +   K   +
Sbjct  1   LVLLAVLAPTAIYATSENDEHLCGSALVEALVSACGKEGIYSFTKRNEQ  49


>Sequence17 QBX89050.1 insulin, partial [Nephrops norvegicus]
Length=178

 Score = 19.6 bits (39),  Expect = 0.042, Method: Compositional matrix adjust.
 Identities = 9/25 (36%), Positives = 12/25 (48%), Gaps = 2/25 (8%)

Query  20  QHLCGSHLVEALYLVCGERGFFYTP  44
           + LCG  L   L  VC  +G +  P
Sbjct  25  RRLCGWRLANKLNRVC--KGVYNNP  47


 Score = 13.1 bits (22),  Expect = 9.6, Method: Compositional matrix adjust.
 Identities = 6/19 (32%), Positives = 7/19 (37%), Gaps = 0/19 (0%)

Query  32   YLVCGERGFFYTPKARREV  50
            YL   +R    TP    E
Sbjct  90   YLTFSQRASEDTPSEENEA  108



Lambda      K        H        a         alpha
   0.324    0.139    0.423    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 57052


  Database: DB.fasta
    Posted date:  Dec 3, 2020  5:29 PM
  Number of letters in database: 2,178
  Number of sequences in database:  20



Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40
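For BLAST+ versions that do not accept a raw sequence, one workaround is to wrap the sequence into FASTA text first and pass it through -query. The sketch below is one way to do that in Python; the function name to_fasta, the default label "query", and the 80-character line width are assumptions made for illustration.

```python
def to_fasta(sequence, seq_id="query", width=80):
    """Wrap a raw sequence into FASTA text (seq_id is an arbitrary placeholder label)."""
    # Drop any whitespace inside the raw sequence and normalize to upper case.
    sequence = "".join(sequence.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{seq_id}\n" + "\n".join(lines) + "\n"


# The raw query used in the echo command above
raw = "ALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVE"
print(to_fasta(raw), end="")
```

Saving the printed text as sequence.fasta gives a file that can be used with the -query form of the command shown earlier in this section.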

6. Storing Pairwise Alignment Result

The output of a pairwise alignment can be saved to the local disk using the command below:

./blastp -db DB -query sequence.fasta -outfmt 0 -out output.html -html

The available sequence alignment output formats (-outfmt) are:

0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
17 = Sequence Alignment/Map (SAM), and
18 = Organism Report
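The tabular format (6) is convenient for further processing in a script. The following is a minimal parsing sketch that assumes the default twelve columns (qaccver, saccver, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the column set may differ if a custom -outfmt string is used, so check blastp -help for your version. The example line is hand-constructed from the first pairwise alignment shown above, not captured BLAST output, and the query label Query_1 is an assumption.

```python
STD_FIELDS = (
    "qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular_line(line):
    """Convert one tab-separated -outfmt 6 line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(STD_FIELDS):
        raise ValueError(f"expected {len(STD_FIELDS)} columns, got {len(values)}")
    hit = {}
    for name, value in zip(STD_FIELDS, values):
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Hand-built line mirroring the Sequence3 alignment (illustrative, not real output)
example = "Query_1\tSequence3\t100.000\t59\t0\t0\t1\t59\t9\t67\t3.00e-41\t120"
hit = parse_tabular_line(example)
print(hit["saccver"], hit["pident"], hit["sstart"], hit["send"], hit["evalue"])
```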

Sequence Files used for Database Creation

The multi-sequence FASTA file (DB.fasta) is given below:

>Sequence1 NP_001191615.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRF
LAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISR
TGRSNSGHAQLEDNFS
>Sequence2 AAA40590.1 insulin [Octodon degus]
MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLE
AGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP
>Sequence3 KAB1251309.1 Insulin [Camelus dromedarius]
MALWTRLLALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVELGG
GPGAGGLQPLGPEGRPQKRGIVEQCCASVCSLYQLENYCN
>Sequence4 NP_001035835.1 insulin, isoform 2 precursor [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQASALSLSS
STSTWPEGLDATARAPPALVVTANIGQAGGSSSRQFRQRALGTSDSPVLFIHCPGAAGTAQGLEYRGRRV
TTELVWEEVDSSPQPQGSESLPAQPPAQPAPQPEPQQAREPSPEVSCCGLWPRRPQRSQN
>Sequence5 NP_571131.1 insulin preproprotein [Danio rerio]
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQET
EVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN
>Sequence6 pir||INCD insulin - cod (Gadus sp.)
MAPPQHLCGSHLVDALYLVCGDRGFFYNPKGIVDQCCHRPCDIFDLQNYCN
>Sequence7 pir||INHY insulin - hamster
FVNQHLCGSHLVEALYLVCGERGFFYTPKSGIVDQCCTSICSLYQLENYCN
>Sequence8 AAB60625.1 insulin [Ovis aries]
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEGPQVGALELAG
GPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN
>Sequence9 XP_014388588.1 PREDICTED: insulin [Myotis brandtii]
MALWTRLLPLLALLALWAPTPAQAFYFEHLCDEDLAEMLTIICGDQGFRNPKATRELPDPQEGEVDMGAG
GQKALTLEQLLQNSDIPARLLALWAPAPAPAQSGEQRLCGEDLVDTLTMVCGDRGFYSPTALRELPDPQE
GEVDMGAGGQKALTLEQLLQNSDIVDMCCNNFCSFYQLEYYCN
>Sequence10 AAA19033.1 insulin [Oryctolagus cuniculus]
MASLAALLPLLALLVLCRLDPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRREVEELQVGQAELGG
GPGAGGLQPSALELALQKRGIVEQCCTSICSLYQLENYCN
>Sequence11 pir||INMKSQ insulin - common squirrel monkey
FVNQHLCGPHLVEALYLVCGERGFFYAPKTGVVDQCCTSICSLYQLQNYCN
>Sequence12 pir||INOS insulin - ostrich
AANQHLCGSHLVEALYLVCGERGFFYSPKAGIVEQCCHNTCSLYQLENYCN
>Sequence13 pir||INEL insulin - elephant
FVNQHLCGSHLVEALYLVCGERGFFYTPKTGIVEQCCTGVCSLYQLENYCN
>Sequence14 AAF80383.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRF
LAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISR
TGRSNSGHAQLEDNFS
>Sequence15 pir||INTK insulin - turkey (tentative sequence)
AANQHLCGSHLVEALYLVCGERGFFYSPKAGIVEQCCHNTCSLYQLENYCN
>Sequence16 AAA59172.1 insulin [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
>Sequence17 QBX89050.1 insulin, partial [Nephrops norvegicus]
VVVVVVGSSRASRRTYPTSEEEPRRRLCGWRLANKLNRVCKGVYNNPGSTGNYLFYRSRRDGESEPGLPP
EEYLDLLPDPEEERGLRHHYLTFSQRASEDTPSEENEAPGSFFGSLSPQDSPHQSAVQEDEASSVQFPFL
TEEEASQMVRVRPRSKRGLSAECCRKVCTVSELVGYCY
>Sequence18 ACQ91106.1 insulin, partial [Haliotis corrugata]
DLHVIISNLCSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRI
FELAQYCRLPDHFFSRISRTG
>Sequence19 BAS32722.1 insulin, partial [Varanus exanthematicus]
LVLLAVLAPTAIYATSENDEHLCGSALVEALVSACGKEGIYSFTKRNEQSLGHGLLDNEVPFHLGKRGIV
EDCCENICPWSVLQSYCR
>Sequence20 ELK28555.1 Insulin [Myotis davidii]
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKDRRELPDPQGESSPLTP
RSHPKGTGYLASVDPSQSQDQHLCGDELVNALTITCGDRGFYNPMAPLEQDDLQEEEVEMDEGGLQALTL
EGLLQKRGIVEECCTNVCSLYQLERYCN

The multi-sequence FASTA file (DBX.fasta) is given below:

>NP_001191615.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRF
LAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISR
TGRSNSGHAQLEDNFS
>AAA40590.1 insulin [Octodon degus]
MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLE
AGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP
>KAB1251309.1 Insulin [Camelus dromedarius]
MALWTRLLALLALLALGAPTPARAFANQHLCGSHLVEALYLVCGERGFFYTPKARREVEDTQVGGVELGG
GPGAGGLQPLGPEGRPQKRGIVEQCCASVCSLYQLENYCN
>NP_001035835.1 insulin, isoform 2 precursor [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQASALSLSS
STSTWPEGLDATARAPPALVVTANIGQAGGSSSRQFRQRALGTSDSPVLFIHCPGAAGTAQGLEYRGRRV
TTELVWEEVDSSPQPQGSESLPAQPPAQPAPQPEPQQAREPSPEVSCCGLWPRRPQRSQN
>NP_571131.1 insulin preproprotein [Danio rerio]
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQET
EVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN
>pir||INCD insulin - cod (Gadus sp.)
MAPPQHLCGSHLVDALYLVCGDRGFFYNPKGIVDQCCHRPCDIFDLQNYCN
>pir||INHY insulin - hamster
FVNQHLCGSHLVEALYLVCGERGFFYTPKSGIVDQCCTSICSLYQLENYCN
>AAB60625.1 insulin [Ovis aries]
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEGPQVGALELAG
GPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN
>XP_014388588.1 PREDICTED: insulin [Myotis brandtii]
MALWTRLLPLLALLALWAPTPAQAFYFEHLCDEDLAEMLTIICGDQGFRNPKATRELPDPQEGEVDMGAG
GQKALTLEQLLQNSDIPARLLALWAPAPAPAQSGEQRLCGEDLVDTLTMVCGDRGFYSPTALRELPDPQE
GEVDMGAGGQKALTLEQLLQNSDIVDMCCNNFCSFYQLEYYCN
>AAA19033.1 insulin [Oryctolagus cuniculus]
MASLAALLPLLALLVLCRLDPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRREVEELQVGQAELGG
GPGAGGLQPSALELALQKRGIVEQCCTSICSLYQLENYCN
>pir||INMKSQ insulin - common squirrel monkey
FVNQHLCGPHLVEALYLVCGERGFFYAPKTGVVDQCCTSICSLYQLQNYCN
>pir||INOS insulin - ostrich
AANQHLCGSHLVEALYLVCGERGFFYSPKAGIVEQCCHNTCSLYQLENYCN
>pir||INEL insulin - elephant
FVNQHLCGSHLVEALYLVCGERGFFYTPKTGIVEQCCTGVCSLYQLENYCN
>AAF80383.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRF
LAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISR
TGRSNSGHAQLEDNFS
>pir||INTK insulin - turkey (tentative sequence)
AANQHLCGSHLVEALYLVCGERGFFYSPKAGIVEQCCHNTCSLYQLENYCN
>AAA59172.1 insulin [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
>QBX89050.1 insulin, partial [Nephrops norvegicus]
VVVVVVGSSRASRRTYPTSEEEPRRRLCGWRLANKLNRVCKGVYNNPGSTGNYLFYRSRRDGESEPGLPP
EEYLDLLPDPEEERGLRHHYLTFSQRASEDTPSEENEAPGSFFGSLSPQDSPHQSAVQEDEASSVQFPFL
TEEEASQMVRVRPRSKRGLSAECCRKVCTVSELVGYCY
>ACQ91106.1 insulin, partial [Haliotis corrugata]
DLHVIISNLCSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRI
FELAQYCRLPDHFFSRISRTG
>BAS32722.1 insulin, partial [Varanus exanthematicus]
LVLLAVLAPTAIYATSENDEHLCGSALVEALVSACGKEGIYSFTKRNEQSLGHGLLDNEVPFHLGKRGIV
EDCCENICPWSVLQSYCR
>ELK28555.1 Insulin [Myotis davidii]
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKDRRELPDPQGESSPLTP
RSHPKGTGYLASVDPSQSQDQHLCGDELVNALTITCGDRGFYNPMAPLEQDDLQEEEVEMDEGGLQALTL
EGLLQKRGIVEECCTNVCSLYQLERYCN
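Comparing the two listings, DB.fasta appears to be DBX.fasta with a SequenceN label prepended to each header line. The sketch below shows one way such labels could be added in Python; the function name add_local_ids is hypothetical, and this is an assumption about how the file could be prepared rather than a record of how it was made.

```python
def add_local_ids(fasta_text, prefix="Sequence"):
    """Prepend prefix+N to every FASTA header line, numbering records from 1."""
    out = []
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            count += 1
            out.append(f">{prefix}{count} {line[1:]}")
        else:
            out.append(line)  # sequence lines are left unchanged
    return "\n".join(out) + "\n"


# First two records of DBX.fasta from the listing above
sample = """\
>NP_001191615.1 insulin precursor [Aplysia californica]
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNLCSSLGGNRRF
LAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCRIFELAQYCRLPDHFFSRISR
TGRSNSGHAQLEDNFS
>AAA40590.1 insulin [Octodon degus]
MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLE
AGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP
"""
print(add_local_ids(sample), end="")
```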
