Protein databases and GPMAW
GPMAW uses databases for a variety of purposes. In most cases you will need to download the database from the Internet. The scientific community has made most of the databases freely available, so the only drawback is the enormous size of some of them.
Format. The databases are (of course) available in different formats. The most common is the FastA format, which is found in a couple of variants. In all of the variants, a database record begins with a name line that starts with a ‘>’ sign, usually followed by one or more accession numbers, the protein name and the species. The sequence follows on the subsequent lines in one-letter code, usually formatted with 60 characters per line.
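The record layout described above can be illustrated with a short Python sketch. This is not code from GPMAW or DBindex; the function names read_fasta and format_fasta and the two example records (accession numbers, names and sequences) are made up for illustration only.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical records, invented for this example.
EXAMPLE = (
    ">P00001 Example protein one - Homo sapiens\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">P00002 Example protein two - Mus musculus\n"
    "MADEEKLPPG\n"
)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name_line, sequence) pairs; the leading '>' is removed."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new name line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence line of the current record
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record with the sequence wrapped at `width` characters."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + rows) + "\n"


if __name__ == "__main__":
    for name, seq in read_fasta(EXAMPLE.splitlines()):
        print(name, len(seq))
```

The reader joins the wrapped sequence lines back into one string, so the line width used in the file does not matter when reading.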
Another popular format is the Swiss-Prot (or EMBL) format, where each sequence record contains much additional information. For a detailed description see here.
Finally, many records are obtained in GenBank (Entrez/NCBI) format. This format is similar in information content to the Swiss-Prot format and is easier for humans to read, but it is more difficult for computer programs to parse. More information here.
GPMAW can read individual records in most formats, but before it can read a whole database, the database has to be indexed by the utility program DBindex (freely available from Lighthouse data, download it here).
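To show why an index makes large databases practical, here is a minimal Python sketch of one common indexing technique: recording the byte offset of every name line so that a single record can be read without scanning the whole file. The index format actually used by DBindex is not described here, so this is only a generic illustration; build_offset_index and fetch_record are hypothetical names.

```python
import io
from typing import BinaryIO, Dict


def build_offset_index(handle: BinaryIO) -> Dict[str, int]:
    """Map the first word of each name line to its byte offset."""
    index: Dict[str, int] = {}
    while True:
        offset = handle.tell()       # position before reading the line
        line = handle.readline()
        if not line:
            break                    # end of file
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii", "replace")] = offset
    return index


def fetch_record(handle: BinaryIO, index: Dict[str, int], key: str) -> str:
    """Jump to the stored offset and read up to the next name line."""
    handle.seek(index[key])
    lines = [handle.readline()]      # the name line itself
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii", "replace")


if __name__ == "__main__":
    # In-memory stand-in for a database file opened in binary mode.
    data = io.BytesIO(b">A1 first\nMKT\nAYI\n>B2 second\nGGG\n")
    idx = build_offset_index(data)
    print(fetch_record(data, idx, "B2"))
```

The file is read in binary mode so that the stored offsets are exact byte positions that can be passed to seek().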
DBindex can handle databases in FastA and Swiss-Prot format. A Swiss-Prot database has to be converted into FastA before indexing; nevertheless, when GPMAW retrieves individual records, it will retrieve the fully annotated sequence.
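As a rough illustration of what a Swiss-Prot to FastA conversion involves, the Python sketch below reads the two-letter line codes of the Swiss-Prot flat-file format (AC for accession numbers, DE for the description, SQ for the start of the sequence, // for the end of a record) and emits a name line plus sequence. It is not the conversion used by DBindex; the function name swissprot_to_fasta and the example entry are invented, and real entries contain many more line types, which this sketch simply ignores.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical, heavily shortened entry, invented for this example.
EXAMPLE_ENTRY = [
    "ID   TEST_HUMAN              Reviewed;          20 AA.",
    "AC   P00001; Q00001;",
    "DE   Example protein.",
    "OS   Homo sapiens (Human).",
    "SQ   SEQUENCE   20 AA;  2200 MW;  0000000000000000 CRC64;",
    "     MKTAYIAKQR QISFVKSHFS",
    "//",
]


def swissprot_to_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name_line, sequence) pairs from Swiss-Prot flat-file lines."""
    accession, description, seq_parts, in_seq = "", [], [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        code = line[:2]
        if code == "//":
            # End of record: emit it and reset the state.
            if accession:
                name = (accession + " " + " ".join(description)).strip()
                yield name, "".join(seq_parts)
            accession, description, seq_parts, in_seq = "", [], [], False
        elif in_seq:
            # Sequence lines are indented and grouped in blocks of ten.
            seq_parts.append(line.replace(" ", ""))
        elif code == "AC" and not accession:
            # Keep only the first (primary) accession number.
            accession = line[5:].split(";")[0].strip()
        elif code == "DE":
            description.append(line[5:].strip())
        elif code == "SQ":
            in_seq = True


if __name__ == "__main__":
    for name, seq in swissprot_to_fasta(EXAMPLE_ENTRY):
        print(">" + name)
        print(seq)
```

A conversion like this discards the annotation lines, so the original Swiss-Prot file would still be needed to retrieve a fully annotated record.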
NOTE: DBindex has been replaced by a new download/formatting utility, DBGet, which enables you to download, format and index the big databases or species-specific parts of them. For more information check here.
How are databases used?
Retrieval of sequence records.
Digest mass searches.
BLAST homology searches.
General description of files generated and how they are stored.
Table of useful databases.
Some databases and how to handle them in GPMAW/DBindex:
Swiss-Prot - The best annotated database. Reference for most other databases. TrEMBL is handled in an identical manner and when added to Swiss-Prot makes a good complete non-redundant database.
Note: The Swiss-Prot database has been replaced by (incorporated into) the UniProt database.
IPI human, mouse, rat - “Complete” protein database for human, mouse and rat respectively. Although in Swiss-Prot format, it is only partially annotated.
Note: The IPI databases have been deprecated; the new download utility will replace them.
NCBInr - The “complete” set of protein sequences collected from most existing protein databases. However, this is not non-redundant