Central Providers of biological/medical databases
The internet offers a great deal of information, but it is not easy to tell which resource is the most authoritative source of high-quality information and software on a specific topic. In bioinformatics there are two major entry points, each the hub of a virtual research community organized around databases, software, and data formats. This section introduces these two resources.
NCBI (National Center for Biotechnology Information)
In 1988 the National Library of Medicine and the National Institutes of Health in the United States began building a national resource for molecular biology information, the NCBI (National Center for Biotechnology Information). This organization develops databases and computer programs for genomic data, guides research in computational biology, and runs a central server for published biomedical information. The mission of the NCBI is to build a better understanding of the molecular processes affecting human health and disease.
EBI (European Bioinformatics Institute)
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that is part of the European Molecular Biology Laboratory (EMBL). The EBI is a center for research in bioinformatics; it also provides tools and databases and defines data formats for genomic data.
Databases/Software
Three public databases are introduced: GenBank (NCBI), DDBJ, and the EMBL database (EBI). These publicly available databases store nucleotide and protein sequence data. In addition, we will discuss other categories of bioinformatics databases. We will take a look at the Ensembl database, which provides comprehensive information and annotation for the genomes of higher eukaryotes. After that we will inspect a database for protein sequence data.
- PubMed (bibliographic database)
- OMIM (human genes and genetic disorders)
- Taxonomy Browser (taxonomy database)
- GenBank (nucleotide/protein databases)
- Ensembl (selected eukaryotic genomes)
- NCBI:Entrez (life science search engine)
- EBI@SRS (sequence retrieval system)
- UniProt/SwissProt (protein databases)
Accession Numbers and Feature Tables
An essential feature of DNA and protein sequence records is that they are tagged with accession numbers. An accession number is a string of about four to ten digits and/or letters that is associated with a molecular sequence record. These accession numbers are often found at the beginning of a data entry (line code AC). Other important fields in a sequence record are the unique ID field and the DE (description) line. The ID field describes the data entry in a more or less structured, human-readable way (it often includes a species abbreviation, the sequence length, etc.). Every data record characterises particular biological features. Each database uses a defined vocabulary to describe current knowledge about the record, such as the biological function, expression, or interactions of the entry. These features are handled differently from database to database. For example, EMBL, GenBank, and DDBJ each use their own feature description.
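To make the role of the ID, AC, and DE fields more concrete, the following minimal sketch reads them from a record in an EMBL-like flat-file layout, where each line starts with a two-letter line code. The record itself, including the accession numbers ZZ999999 and ZZ999998, is invented for illustration, and the helper names `parse_header_fields` and `accession_numbers` are hypothetical. Real database entries contain many more line types and should be read with an established parser.

```python
from collections import defaultdict

# Invented record in an EMBL-like layout; accessions and text are made up.
RECORD = """\
ID   ZZ999999; SV 1; linear; mRNA; STD; HUM; 12 BP.
AC   ZZ999999; ZZ999998;
DE   Hypothetical example entry, first description line
DE   continued on a second line.
SQ   Sequence 12 BP;
     acgtacgtac gt                                                            12
//
"""


def parse_header_fields(record_text):
    """Collect the content of each two-letter line code into a list per code."""
    fields = defaultdict(list)
    for line in record_text.splitlines():
        code, content = line[:2], line[5:].strip()
        # Skip sequence lines (blank code) and the '//' record terminator.
        if code.strip() and code != "//":
            fields[code].append(content)
    return fields


def accession_numbers(fields):
    """Split the AC lines on ';' and return the accession numbers in order."""
    accessions = []
    for line in fields.get("AC", []):
        accessions.extend(part.strip() for part in line.split(";") if part.strip())
    return accessions


fields = parse_header_fields(RECORD)
accessions = accession_numbers(fields)
description = " ".join(fields["DE"])
print(accessions[0], "|", description)
```

In this sketch the first accession number is treated as the primary one and any further entries on the AC line as secondary; that ordering is an assumption about the layout, not a rule stated in this section.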
Data formats
In bioinformatics, different file formats are used to exchange information between sequence databases. Nearly every database lets the user retrieve sequence data in several formats. These data formats can be used as input for sequence analysis software. Some of the most prominent formats are described below.
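As an illustration of how such a format serves as input for analysis software, here is a minimal sketch of a reader for FASTA-style text. This section does not name FASTA, so it is used here only as an assumed example of a widely used sequence format. The function name `parse_fasta` and the two sample sequences are invented for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented sample data, not taken from any database.
SAMPLE = """\
>example1 invented nucleotide sequence
ACGTACGT
ACGT
>example2 invented protein sequence
MKTAYIAK
"""

records = parse_fasta(SAMPLE)
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

The sketch joins wrapped sequence lines into a single string per record, which is the form most analysis programs expect to work with.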
Please direct questions and comments to Martin Haubrock.