PGDB: Documentation

Human Prostate Gene DataBase

PGDB Documentation

Definition of PGDB
Motivations of PGDB
Gene Inclusion Criteria
Data Sources of PGDB
Process of Construction
Analysis of Expression
Data Structure and Format
Web Interface

Definition of PGDB
PGDB stands for Prostate Gene DataBase. It is a curated and integrated database of genes related to the prostate and prostatic diseases.

Motivations of PGDB
The biomedical literature is growing explosively, and the MEDLINE database is a primary repository for it. The prostate is a male sex gland and a common site of urological disorders. Prostatic diseases, including prostate cancer, benign prostatic hypertrophy, infection and inflammation, affect millions of men worldwide. Genetic factors, in combination with other factors such as environment and diet, play critical roles in both the physiological and pathological processes of the prostate. A large number of genetic and molecular events have been documented in the literature and are represented by thousands of records in MEDLINE.

A fundamental limitation of MEDLINE and similar resources is that the information they contain is not represented in a structured format. As a result, both retrieval effectiveness and precision are poor. For example, a typical question a scientist might ask when searching MEDLINE is: “What genes have been found mutated in human prostate cancer?” Searching MEDLINE with the query "prostate cancer" AND mutation AND human returned 714 records as of July 24, 2002; fewer than half of them are relevant to the question, and many are redundant.

Another problem hindering efficient retrieval of gene-related information from literature databases is the use of non-standard gene names by scientists. For example, the CDKN2A gene, commonly known as p16, appears in the literature under many aliases, including ARF, P16, CMM2, INK4, MTS1, TP16, CDK4I, CDKN2, INK4A, p14ARF and p16INK4. Querying MEDLINE with any single one of these names misses relevant records.

In view of these problems, PGDB was constructed to: 1) catalog gene-related facts about the prostate and prostatic diseases that have accumulated in the literature and will continue to accumulate; 2) store the information in a structured format for fast and easy access; 3) annotate the genes to deliver value-added information.

Gene Inclusion Criteria
Two general categories of genes are currently included in PGDB. The first category comprises genes documented in the literature as involved in one of the following molecular events in the normal or diseased prostate: gene mutation, amplification, methylation, gross deletion, polymorphism, and over-expression. The second category comprises genes specifically expressed in the prostate. Evidence for this category comes from the SAGEmap and UniGene databases hosted by the National Center for Biotechnology Information (NCBI). For EST expression, a UniGene cluster must have at least 2 member ESTs, all of which were derived from prostate libraries; for SAGE expression, a gene is defined as prostate specific only if it has a tag count of more than 1, with all tags derived from prostate libraries. Most of the genes in this category are UniGene clusters of ESTs.
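
The two prostate-specificity rules above can be sketched as simple predicates. This is only an illustration of the stated thresholds: the function names and input shapes (a list of tissue labels per member EST, a tissue-to-count mapping for SAGE tags) are assumptions and do not describe how PGDB itself was implemented.

```python
def is_prostate_specific_est(est_tissues):
    """est_tissues: one source-tissue label per member EST of a UniGene cluster.

    Rule from the text: at least 2 member ESTs, all from prostate libraries.
    """
    return len(est_tissues) >= 2 and all(t == "prostate" for t in est_tissues)


def is_prostate_specific_sage(tag_counts_by_tissue):
    """tag_counts_by_tissue: hypothetical mapping of tissue -> tag count for a gene.

    Rule from the text: tag count of more than 1, all from prostate libraries.
    """
    total = sum(tag_counts_by_tissue.values())
    prostate = tag_counts_by_tissue.get("prostate", 0)
    return total > 1 and prostate == total
```

For instance, a cluster with ESTs from prostate and liver libraries fails the EST rule, and a gene with a single prostate tag fails the SAGE rule.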

Data Sources of PGDB
PGDB uses data from the following databases.

· MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

· UniGene at http://www.ncbi.nlm.nih.gov/UniGene/

· SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/

· dbSNP at http://www.ncbi.nlm.nih.gov/SNP/

· LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/

· Gene Ontology at http://www.geneontology.org

· NCBI’s Gene Expression Omnibus (GEO), a gene expression and hybridization array data repository, at http://www.ncbi.nlm.nih.gov/geo/

Process of Construction
The construction of PGDB is a multi-stage process.

Data retrieval: MEDLINE citation abstracts were retrieved using the Entrez query tool. A typical query consists of three key words: “prostate”, “human” and the word for the event. For example, the query for the event of gene mutation in the prostate was (("prostate"[MeSH Terms] OR prostate[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word])). MEDLINE records that are reviews or that lack abstracts were excluded. Genes involved in over-expression were retrieved from the OMIM database using “prostate” as the query word. Data from the other databases were retrieved through either FTP or HTTP.
Data extraction: MEDLINE abstracts from each query were carefully read by two scientists to identify true relationships between a gene and the prostate and to extract the gene name, the type of molecular or genetic event, and the type of prostatic disease. A list of genes was thus generated for further annotation.
Data annotation: Annotation was performed automatically using programs written in Perl. Information from other databases, such as alias names, a summary of gene function, gene ontology and SNPs, was extracted and added to each gene.
Expression analysis: To provide relative expression levels in all tissues for each gene, expression data were analyzed as described under Analysis of Expression.
File generation: PGDB is stored and maintained in a single denormalized flat file, from which the front-end web pages are generated automatically for display.
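
As an illustration of the data retrieval stage, the query pattern quoted above can be assembled programmatically. The sketch below is written in Python for brevity, although the text states that PGDB's own programs were written in Perl. The helper names `term` and `build_query` are hypothetical, and the sketch only reproduces the single-word pattern shown in the example; it does not submit anything to Entrez.

```python
def term(word):
    """Search one word both as a MeSH term and as a text word."""
    return f'("{word}"[MeSH Terms] OR {word}[Text Word])'


def build_query(event, organ="prostate"):
    """Combine the organ and the event word into one Entrez query string."""
    return f"({term(organ)} AND {term(event)})"


# Reproduces the mutation query quoted in the data retrieval step.
print(build_query("mutation"))
```

Calling `build_query` with another single-word event, such as "amplification" or "methylation", yields the analogous query for that event.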

Analysis of Expression
For each gene collected in PGDB, expression levels were analyzed using both SAGE and EST data and pooled by tissue type. For EST-derived expression, the number of ESTs for each gene in each library was first normalized to ESTs per million and then pooled by tissue to obtain the average expression level in each tissue. For SAGE data, only reliable tag-to-gene mappings, as defined by the SAGEmap database, were used. For each gene, the tag frequency in each library was likewise normalized to tags per million. Special measures were taken to handle multiple tag assignments. If one SAGE tag mapped to n genes, the tag frequency for each gene in each library was divided by n. If more than one tag mapped to a gene, the tag frequency for the gene was the sum of the frequencies of all its tags.
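
The SAGE rules above (normalize to tags per million, divide by n for a tag shared by n genes, sum over all tags of a gene) can be sketched as follows. The function names, the dictionary inputs, and the tag sequences and gene labels in the usage example are illustrative assumptions, not PGDB's actual code or data.

```python
def per_million(count, library_size):
    """Normalize a raw count to a count per million tags (or ESTs)."""
    return count * 1_000_000 / library_size


def sage_gene_frequency(tag_counts, tag_to_genes, gene, library_size):
    """Tag frequency per million for one gene in one SAGE library.

    tag_counts:   tag -> raw count in this library
    tag_to_genes: tag -> list of genes the tag is reliably mapped to
    """
    total = 0.0
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, ())
        if gene in genes:
            # A tag mapped to n genes contributes 1/n of its frequency to each.
            total += per_million(count, library_size) / len(genes)
    return total


# Hypothetical library of 50,000 tags: one tag shared by two genes,
# one tag unique to G1.
counts = {"AAAAAAAAAA": 10, "CCCCCCCCCC": 5}
mapping = {"AAAAAAAAAA": ["G1", "G2"], "CCCCCCCCCC": ["G1"]}
print(sage_gene_frequency(counts, mapping, "G1", 50_000))
```

In this made-up library, the shared tag contributes 100 per million to each of G1 and G2, and the unique tag adds another 100 per million to G1.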

Interpretation of expression
For each gene in PGDB, SAGE and/or EST expression data are given. An example is provided below; each lettered item is explained in the notes that follow the example.

EST (11 ESTs^a, 6 libraries^b)
Tissue	Breadth^c	CPM^d
muscle
prostate
uncharacterized tissue
uterus

SAGE (2858419 tags^a, 53 libraries^b)
Tissue	Breadth^c	CPM^d
ovary
pancreas
prostate
skin
stomach

a. Total ESTs or SAGE tags: Total number of ESTs or SAGE tags representing this gene in all libraries from all tissues.

b. Total Libraries: Total number of libraries expressing this gene.

c. Breadth: Percentage of libraries expressing this gene out of the total libraries in a tissue pool.

d. Tag count per million (CPM): The number of tags in a library that map to the gene is first normalized to a count per million and then averaged across the libraries expressing this gene.
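
Notes c and d can be read as the following calculation for a single tissue pool. This is a minimal sketch under the assumption that a library counts as "expressing" the gene when its per-million value is greater than zero; the function name and input shape are hypothetical.

```python
def breadth_and_cpm(cpm_by_library):
    """Summarize one tissue pool for one gene.

    cpm_by_library: library -> per-million value for the gene, with 0 for
    libraries in the pool where the gene was not observed.
    Returns (breadth in percent, mean CPM over expressing libraries).
    """
    expressing = [v for v in cpm_by_library.values() if v > 0]
    if not expressing:
        return 0.0, 0.0
    breadth = 100.0 * len(expressing) / len(cpm_by_library)
    cpm = sum(expressing) / len(expressing)
    return breadth, cpm
```

Note that, following note d, the CPM average is taken only over libraries that express the gene, while Breadth uses every library in the tissue pool as its denominator.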

Data Structure and Format
PGDB is distributed and maintained in a single flat file. The fields of an entry are explained below.

Name: Official gene name as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official name is available, the interim name from LocusLink is used.
Symbol (Optional): Official gene symbol as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official symbol is available, the interim symbol from LocusLink is used.
Aliases (Optional): Other names and symbols used for the gene, from LocusLink.
Gene Products: The name of the product of this transcript.
Category: The type of molecular or genetic event and disease in which the gene is involved, or the type of expression derived from analysis of EST and SAGE expression data.
UniGene (Optional): UniGene ID for the gene.
Reference Sequences: mRNA or protein sequence from RefSeq.
OMIM and SNP (Optional): OMIM ID for the gene and a link to NCBI SNPs for the locus.
Locus (Optional): LocusLink ID, chromosome and cytoband for the gene, with links to the UCSC and Ensembl genome databases.
Summary (Optional): A summary description of the gene, its products, its significance, and mutant phenotypes, from LocusLink.
Gene Ontology (Optional): Gene Ontology terms for the gene, from LocusLink and Gene Ontology.
Expression: Expression information derived from analysis of EST and SAGE data, pooled by tissue type. Details are given under Analysis of Expression.
Evidence (Optional): Supporting references from PubMed, listed by type of molecular event and disease and sorted by year of publication. The item “Other key publications related to this gene” links to a list of publications related to the gene (from LocusLink).
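
To make the required and optional fields concrete, an entry could be modeled in memory as below. The on-disk layout of the flat file is not described here, so this sketch covers only the field list; the class name, attribute names, types and the `build_alias_index` helper are all assumptions, and the values in the usage example are placeholders apart from the CDKN2A symbol and aliases mentioned earlier in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PGDBEntry:
    # Fields not marked "(Optional)" in the field list.
    name: str
    gene_products: str
    category: List[str]
    reference_sequences: List[str]
    expression: Dict[str, dict] = field(default_factory=dict)
    # Fields marked "(Optional)".
    symbol: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    unigene: Optional[str] = None
    omim: Optional[str] = None
    locus: Optional[str] = None
    summary: Optional[str] = None
    gene_ontology: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


def build_alias_index(entries):
    """Map every name, symbol and alias (lower-cased) to its entry."""
    index = {}
    for entry in entries:
        for key in [entry.name, entry.symbol, *entry.aliases]:
            if key:
                index[key.lower()] = entry
    return index


# Placeholder values; only the symbol and aliases come from this document.
entry = PGDBEntry(
    name="placeholder gene name",
    gene_products="placeholder product",
    category=["placeholder category"],
    reference_sequences=[],
    symbol="CDKN2A",
    aliases=["p16", "INK4A", "MTS1"],
)
index = build_alias_index([entry])
```

An index of this kind lets a lookup by any alias, such as "p16" or "INK4A", reach the same entry as the official symbol.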

Web Interface

Search: PGDB uses the free search engine ht://Dig, available from http://www.htdig.org. Searchable fields include gene name and symbol, aliases, UniGene ID, OMIM ID, and LocusLink ID.
Browse: PGDB content can be browsed by molecular event and by disease.