Indexing databases for GPMAW
Decide which database to download, bearing in mind that databases differ in their level of annotation and completeness, and that some contain so many copies of nearly identical sequences that sorting through them may not be worth the hassle.
Download the database using either your favorite browser or a dedicated FTP utility.
Create a main directory for all databases and a sub-directory for each database, so that each directory holds a single database. This makes the databases much easier to maintain. When GPMAW is installed (by default in C:\gpmaw\), a directory is created for databases (C:\gpmaw\database\), but you are free to locate them elsewhere. I generally use a separate partition or hard disk. You can also place them on a shared server, but if you use the database for Digest mass search, you will experience a performance hit.
Most databases are downloaded as compressed files, usually compressed with the gzip utility (recognizable by the .gz extension). They have to be decompressed before use. Several Windows zip utilities can decompress gzipped files; otherwise you can download the free gzip program.
Once the database has been decompressed (often resulting in a threefold increase in file size), you can continue with indexing.
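If no zip utility is at hand, the decompression step can also be scripted. The following is a minimal sketch using Python's standard gzip module; the function name decompress_gz and the file name in the usage comment are only illustrations and are not part of GPMAW or DBindex.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(gz_path, out_path=None):
    """Decompress a .gz file and return the path of the decompressed copy."""
    gz_path = Path(gz_path)
    # Default output name: the input name with its trailing .gz removed.
    out_path = Path(out_path) if out_path else gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so a multi-gigabyte database does not fill memory.
        shutil.copyfileobj(src, dst)
    return out_path


# Hypothetical usage (the path is an example only):
# decompress_gz(r"C:\gpmaw\database\sprot\uniprot_sprot.dat.gz")
```

The compressed file is left in place, so make sure the disk has room for both copies.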
Open DBindex. On the start page, select the ‘Convert’ button. As most databases are stored on Unix-based servers, the file will be in VMS format and has to be converted to DOS format before use.
Select the “VMS to DOS” button. In the File open dialog, navigate to the database (you may have to change the ‘file type’ to see the file). Upon selection, DBindex will convert the file (technically: it writes #13#10 characters at each line end instead of just #10).
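For readers who want to see what this conversion amounts to, here is a minimal sketch in Python. It assumes the downloaded file uses Unix line ends (a single line feed, character 10) and rewrites each of them as carriage return plus line feed (characters 13 and 10). The function name to_dos_line_endings is hypothetical, and this is not how DBindex itself is implemented.

```python
def to_dos_line_endings(src_path, dst_path):
    """Copy src_path to dst_path with every line end written as CR LF."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating a binary file yields one line at a time, split on LF,
        # so the whole database is never held in memory.
        for line in src:
            had_line_end = line.endswith(b"\n")
            # Strip any existing CR/LF so files already in DOS format
            # pass through unchanged instead of getting doubled CRs.
            body = line.rstrip(b"\r\n")
            dst.write(body + (b"\r\n" if had_line_end else b""))
```

Files that use a bare carriage return as line end are not handled by this sketch.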
If you have downloaded one of the large non-redundant databases (e.g. NCBInr), some name lines may be longer than 250 characters, which makes them unreadable for GPMAW. This can be corrected by converting the database using the “Non-standard to standard FastA” routine. In the box below, you can set the new maximum length of the name line.
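To illustrate the idea of limiting the name line length, the sketch below simply cuts every name line (a line starting with ">") to a maximum number of characters and leaves sequence lines alone. Plain truncation is an assumption made for this example; the text above does not say how the DBindex routine shortens the lines, and the function name shorten_fasta_headers is hypothetical.

```python
def shorten_fasta_headers(lines, max_len=250):
    """Yield FastA lines with name lines cut to at most max_len characters."""
    for line in lines:
        text = line.rstrip("\r\n")
        # Only name lines (starting with '>') are shortened;
        # sequence lines are passed through untouched.
        if text.startswith(">") and len(text) > max_len:
            text = text[:max_len]
        yield text
```

The generator accepts any iterable of text lines, for example an open file, and yields lines without line ends, so the caller decides which line end to write.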
If you have downloaded a database in Swiss-Prot format (TrEMBL, IPI, Swiss-Prot) you have to generate a FastA version of the file by running the “Database to FastA” routine.
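The kind of conversion involved can be sketched as follows. A Swiss-Prot style flat file marks each line with a two-letter code (ID, AC, DE, SQ and so on) and ends each entry with "//". The sketch collects the entry name, the first accession number, the description and the sequence, and writes them as a FastA record. The name line layout used here (accession, entry name, description) and the function name swissprot_to_fasta are assumptions for illustration; the layout that DBindex produces may differ.

```python
def swissprot_to_fasta(lines, width=60):
    """Yield FastA lines for each entry of a Swiss-Prot style flat file."""
    name, acc, desc, seq, in_seq = "", "", [], [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        code = line[:2]
        if code == "//":
            # End of entry: emit the record, then reset for the next one.
            residues = "".join(seq)
            if residues:
                yield f">{acc} {name} {' '.join(desc)}".rstrip()
                for i in range(0, len(residues), width):
                    yield residues[i:i + width]
            name, acc, desc, seq, in_seq = "", "", [], [], False
        elif in_seq:
            # Sequence lines are indented and grouped in blocks of ten.
            seq.append(line.replace(" ", ""))
        elif code == "ID":
            name = line[5:].split()[0]
        elif code == "AC" and not acc:
            # Keep only the first (primary) accession number.
            acc = line[5:].split(";")[0].strip()
        elif code == "DE":
            desc.append(line[5:].strip())
        elif code == "SQ":
            in_seq = True
```

All other line types (references, comments, features) are ignored by this sketch.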
You now have a FastA formatted database in DOS format. You can use this file for:
Indexing: On the “Indexing” page of DBindex you can create text indices for the database. These indices are used when searching for sequences using the GPMAW command “File | Open FastA sequence”.
Digest mass search: This function also requires indices, which are either created using the GPMAW command “Setup | Make digest database” or generated on-the-fly the first time the database is used for digest mass searching.
BLAST: Indices have to be generated. In GPMAW, open ‘System setup’ and, on the ‘BLAST’ page, select the ‘Format’ command.
The downloaded database will usually have the extension .DAT or .SEQ.
A FastA database generated from a Swiss-Prot formatted file will have the extension .SEQ. A .IDX file containing indices into the main (non-FastA) database will be generated as well. In addition, a .FC1 file of the same name will be generated. This is a text file that can be read in Notepad.
When text indexing a file, the following files will be generated:
.FAC a text file containing statistics of the database and indexing.
.ACC indices for accession numbers. Binary, do not read.
.NDX index file for the database. Binary.
.TRG target index file. Binary.
max_rec.txt a text file containing the largest record found in the database.
BLAST indexing will generate three index files (.psq, .pin and .phr) in addition to a formatdb.log text file containing statistics. The files will have the same name as the database.
Digest mass search indexing will generate three files for each enzyme used, typically with a name that combines the enzyme and the mass file (e.g. trypacet). The extensions are .NAM, .INF, and .DA2.