Human Prostate Gene DataBase

  PGDB Documentation
 
  1. Definition of PGDB
  2. Motivations of PGDB
  3. Gene Inclusion Criteria
  4. Data Sources of PGDB
  5. Process of Construction
  6. Analysis of Expression
  7. Data Structure and Format
  8. Web Interface

Definition of PGDB
PGDB stands for Prostate Gene DataBase. It is a curated, integrated database of genes related to the prostate and prostatic diseases.

Motivations of PGDB
Biomedical literature is growing explosively, and the MEDLINE database is a primary repository for it. The prostate is a male sex gland and a common site of urological disorders. Prostatic diseases, including prostate cancer, benign prostatic hypertrophy, infection and inflammation, affect millions of men worldwide. Genetic factors, in combination with other factors such as environment and diet, play critical roles in both the physiological and pathological processes of the prostate. A large number of genetic and molecular events have been documented in the literature and are represented by thousands of records in MEDLINE.

A fundamental limitation of MEDLINE and similar resources is that the information they contain is not represented in a structured format, so retrieval is neither effective nor precise. For example, a typical question scientists may ask when searching MEDLINE is: “What genes have been found mutated in human prostate cancer?” To answer it, they may search MEDLINE with the query "prostate cancer" AND mutation AND human, which returned 714 records as of July 24, 2002. Fewer than half of these records are relevant to the question, and many are redundant.

Another problem hindering efficient retrieval of gene-related information from literature databases is the non-standard gene names used by scientists. For example, the CDKN2A gene, commonly known as p16, has appeared in the literature under many aliases, including ARF, P16, CMM2, INK4, MTS1, TP16, CDK4I, CDKN2, INK4A, p14ARF and p16INK4. Querying MEDLINE with any single one of these names will miss relevant records.

To address these problems, PGDB was constructed to: 1) catalog gene-related facts about the prostate and prostatic diseases that have accumulated in the literature to date and will accumulate in years to come; 2) store the information in a structured format for fast and easy access; 3) annotate it to deliver value-added information.

Gene Inclusion Criteria
Two general categories of genes are currently included in PGDB. The first category comprises genes that have been documented in the literature as involved in one of the following molecular events in normal or diseased prostate: gene mutation, amplification, methylation, gross deletion, polymorphism, and over-expression. The second category comprises genes specifically expressed in the prostate. Evidence for this category comes from the SAGEmap and UniGene databases hosted by the National Center for Biotechnology Information (NCBI). For EST expression, a UniGene cluster must have at least 2 member ESTs, all of which were derived from prostate libraries; for SAGE expression, a gene must have a tag count of more than 1, with all tags derived from prostate libraries. Most genes in this category are UniGene clusters of ESTs.
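
The two prostate-specificity rules above can be written down as simple predicates. The following Python sketch is illustrative only: the function names and the input shapes (a list of tissue labels for the ESTs of a cluster, and a tissue-to-count mapping for SAGE tags) are assumptions made for this example and do not describe PGDB's actual programs.

```python
from typing import Dict, List


def est_cluster_is_prostate_specific(est_tissues: List[str]) -> bool:
    """EST rule: at least 2 member ESTs, all from prostate libraries.

    est_tissues holds one tissue label per member EST of a UniGene
    cluster (hypothetical input shape).
    """
    return len(est_tissues) >= 2 and all(t == "prostate" for t in est_tissues)


def sage_gene_is_prostate_specific(tag_counts_by_tissue: Dict[str, int]) -> bool:
    """SAGE rule: tag count of more than 1, all from prostate libraries.

    tag_counts_by_tissue maps a tissue label to the number of tags
    observed for the gene in that tissue (hypothetical input shape).
    """
    total = sum(tag_counts_by_tissue.values())
    prostate = tag_counts_by_tissue.get("prostate", 0)
    return total > 1 and prostate == total
```

For example, a cluster with ESTs from "prostate" and "uterus" fails the EST rule, and a gene with a single prostate tag fails the SAGE rule because its count is not greater than 1.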

Data Sources of PGDB
PGDB uses data from the following databases. 

  • MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

  • UniGene at http://www.ncbi.nlm.nih.gov/UniGene/

  • SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/

  • dbSNP at http://www.ncbi.nlm.nih.gov/SNP/

  • LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/

  • Gene Ontology at http://www.geneontology.org

  • NCBI’s Gene Expression Omnibus (GEO), a gene expression and hybridization array data repository, at http://www.ncbi.nlm.nih.gov/geo/

 

Process of Construction
The construction of PGDB is a multi-stage process.
  1. Data retrieval: MEDLINE citation abstracts were retrieved using the Entrez query tool. A typical query consists of three key words: “prostate”, “human” and the word for the event. For example, the query for the event of gene mutation in prostate was (("prostate"[MeSH Terms] OR prostate[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word])). MEDLINE records that are of review type or lack abstracts were excluded. Genes involved in over-expression events were retrieved from the OMIM database using “prostate” as the query word. Data from other databases were retrieved through either FTP or HTTP.
  2. Data extraction: MEDLINE abstracts from each query were carefully read by two scientists to identify true relationships between a gene and the prostate and to extract the gene name, the type of molecular and genetic event, and the type of prostatic disease. A list of genes was thus generated for further annotation.
  3. Data annotation: Annotation was performed automatically using programs written in Perl. Information from the other databases, such as alias names, a summary of gene function, gene ontology and SNPs, was extracted and added to each gene on the list.
  4. Expression analysis: To provide relative expression levels in all tissues for each gene, expression data were analyzed as described under Analysis of Expression.
  5. File generation: PGDB is stored and maintained in a single denormalized flat file, from which front-end web pages are further generated automatically for display.
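
The query pattern quoted in the data retrieval step can be generated per event. The Python sketch below only builds the query string; it reproduces the mutation example given above and, like that example, contains no term for “human”. The function name and parameters are hypothetical, and single-word event terms are assumed.

```python
def build_event_query(event: str, organ: str = "prostate") -> str:
    """Build an Entrez-style query for one molecular event.

    Hypothetical helper: each word is searched both as a MeSH term and
    as a text word, following the example query in the documentation.
    """
    def term(word: str) -> str:
        return f'("{word}"[MeSH Terms] OR {word}[Text Word])'

    return f"({term(organ)} AND {term(event)})"


if __name__ == "__main__":
    # Print one query per single-word event, for illustration.
    for event in ("mutation", "amplification", "methylation", "polymorphism"):
        print(build_event_query(event))
```

Calling build_event_query("mutation") yields the same string as the example query in step 1.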

Analysis of Expression
For each gene collected in PGDB, expression levels were analyzed using both SAGE and EST data and pooled by tissue type. For EST-derived expression, the number of ESTs for each gene in each library was first normalized to the number of ESTs per million and then pooled by tissue to obtain the average expression level in each tissue. For SAGE-derived expression, only tag-to-gene mappings defined as reliable by the SAGEmap database were used. For each gene, the tag frequency in each library was likewise normalized to the number of tags per million. Special measures were taken to deal with multiple tag assignments. If one SAGE tag was mapped to n genes, the tag frequency for each gene in each library was divided by n. If one gene had more than one tag mapped to it, the tag frequency for the gene was the sum of the frequencies of all its tags.
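
The SAGE normalization and the handling of multiple tag assignments can be illustrated with a short Python sketch. It is a minimal illustration under assumptions, not PGDB's own Perl code: the function name, the dictionary inputs and the decision to skip tags without a mapping are all made up for this example.

```python
from collections import defaultdict
from typing import Dict, List


def gene_cpm_in_library(
    tag_counts: Dict[str, int],
    tag_to_genes: Dict[str, List[str]],
    library_total: int,
) -> Dict[str, float]:
    """Return {gene: tags per million} for a single SAGE library.

    tag_counts:    raw count of each tag in the library
    tag_to_genes:  reliable tag-to-gene mappings (assumed input)
    library_total: total number of tags sequenced in the library
    """
    cpm: Dict[str, float] = defaultdict(float)
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, [])
        if not genes:
            continue  # tag has no reliable mapping; ignored in this sketch
        per_million = count * 1_000_000 / library_total
        share = per_million / len(genes)  # tag mapped to n genes: divide by n
        for gene in genes:
            cpm[gene] += share  # gene with several tags: sum over its tags
    return dict(cpm)


if __name__ == "__main__":
    counts = {"AAA": 10, "CCC": 4}
    mapping = {"AAA": ["G1", "G2"], "CCC": ["G1"]}
    print(gene_cpm_in_library(counts, mapping, library_total=100_000))
```

In the usage example, tag AAA contributes 100 tags per million split between G1 and G2, and tag CCC adds 40 to G1 alone.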

Interpretation of Expression
For each gene in PGDB, SAGE and/or EST expression data are given. An example is provided below; the letters a to d refer to the notes that follow it.

EST (11 ESTs [a], 6 libraries [b])

Tissue                    Breadth [c]                    Average CPM [d]
muscle                    1 of 82 libraries (1.22%)      8.54
prostate                  2 of 305 libraries (0.66%)     3.96
uncharacterized tissue    2 of 1507 libraries (0.13%)    20.33
uterus                    1 of 229 libraries (0.44%)     276.32

SAGE (2858419 tags [a], 53 libraries [b])

Tissue                    Breadth [c]                    Average CPM [d]
ovary                     3 of 10 libraries (30.00%)     45.85
pancreas                  4 of 8 libraries (50.00%)      46.08
prostate                  9 of 13 libraries (69.23%)     50.82
skin                      1 of 7 libraries (14.29%)      112.98
stomach                   2 of 4 libraries (50.00%)      71.46

 

a. Total ESTs or SAGE tags: the total number of ESTs or SAGE tags representing this gene in all libraries from all tissues.

b. Total libraries: the total number of libraries expressing this gene.

c. Breadth: the percentage of libraries in a tissue pool that express this gene.

d. Tag count per million (CPM): the number of tags from a library that map to the gene is first normalized to a count per million and then averaged over the libraries expressing this gene.
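
Notes c and d can be illustrated with a small calculation. The Python sketch below assumes that the per-library CPM of a gene is already known for every library in a tissue pool, with 0 for libraries that do not express the gene. The function name and input shape are hypothetical.

```python
from typing import List, Tuple


def summarize_tissue(library_cpms: List[float]) -> Tuple[int, int, float, float]:
    """Summarize one gene in one tissue pool.

    library_cpms holds the gene's CPM in every library of the pool
    (0 where the gene is not expressed). Returns the number of
    expressing libraries, the total number of libraries, the breadth
    in percent, and the CPM averaged over expressing libraries only.
    """
    if not library_cpms:
        raise ValueError("tissue pool contains no libraries")
    expressing = [v for v in library_cpms if v > 0]
    breadth = 100.0 * len(expressing) / len(library_cpms)
    average = sum(expressing) / len(expressing) if expressing else 0.0
    return len(expressing), len(library_cpms), breadth, average


if __name__ == "__main__":
    # 82 libraries, one of which expresses the gene at 8.54 CPM,
    # mirroring the "muscle" row of the EST example.
    n_expr, n_total, breadth, average = summarize_tissue([8.54] + [0.0] * 81)
    print(f"{n_expr} of {n_total} libraries ({breadth:.2f}%), average CPM {average:.2f}")
```

Because the average is taken over expressing libraries only, a tissue with a single strongly expressing library can show a high CPM together with a low breadth, as in the uterus row of the EST example.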

 

Data Structure and Format
PGDB is distributed and maintained in a single flat file. The fields of each entry are explained below.

  • Name: Official gene name as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official name is available, the interim name from LocusLink is used.

  • Symbol (Optional): Official gene symbol as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official symbol is available, the interim symbol from LocusLink is used.

  • Aliases (Optional): Other names and symbols used for the gene, from LocusLink.

  • Gene Products: The name of the product of this transcript.

  • Category: The types of molecular or genetic event and disease in which the gene is involved, or the type of expression derived from analysis of EST and SAGE expression data.

  • UniGene (Optional): UniGene ID for the gene.

  • Reference Sequences: mRNA or protein sequences from RefSeq.

  • OMIM and SNP (Optional): OMIM ID for the gene and a link to NCBI SNPs for the locus.

  • Locus (Optional): LocusLink ID, chromosome and cytoband for the gene, with links to the UCSC and Ensembl genome databases.

  • Summary (Optional): A summary description of the gene, its products, its significance, and mutant phenotypes, from LocusLink.

  • Gene Ontology (Optional): Gene Ontology terms for the gene, from LocusLink and Gene Ontology.

  • Expression: Expression information derived from analysis of EST and SAGE data and pooled by tissue type. Details are given under Analysis of Expression.

  • Evidence (Optional): Supporting references from PubMed, listed by type of molecular event and disease and sorted by year of publication. “Other key publications related to this gene” links to a list of publications related to the gene (from LocusLink).
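
As an aid to reading the field list, the following Python sketch models one entry as a record in which the fields marked Optional above may be absent. It does not describe the actual flat-file syntax, which is not specified here; the class name, attribute names and types are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PgdbEntry:
    """Hypothetical in-memory view of one PGDB entry."""

    # Fields without an "Optional" marker in the field list.
    name: str
    gene_products: str
    category: List[str]
    reference_sequences: List[str]
    expression: str
    # Fields marked "Optional" in the field list default to empty.
    symbol: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    unigene: Optional[str] = None
    omim_and_snp: Optional[str] = None
    locus: Optional[str] = None
    summary: Optional[str] = None
    gene_ontology: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


if __name__ == "__main__":
    # Placeholder values only; not taken from a real PGDB record.
    entry = PgdbEntry(
        name="example gene",
        gene_products="example product",
        category=["mutation"],
        reference_sequences=[],
        expression="",
    )
    print(entry.symbol, entry.aliases)
```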

 

Web Interface

  • Search: PGDB uses the free search engine ht://dig, available from http://www.htdig.org. Searchable fields include gene name and symbol, aliases, UniGene ID, OMIM ID, and LocusLink ID.

  • Browse: PGDB content can be browsed by molecular event and by disease.