- Definition of PGDB
- Motivations of PGDB
- Gene Inclusion Criteria
- Data Sources of PGDB
- Process of Construction
- Data Structure and Format
- Web Interface
Definition of PGDB
PGDB stands for Prostate
Gene DataBase. It is a curated and integrated database of
genes related to the prostate and prostate diseases.
Motivations of PGDB
The biomedical literature is growing explosively. The MEDLINE database is a primary repository
for such data. The prostate is a male sex gland and a common site of urological
disorders. Prostatic diseases, including prostate cancer, benign prostatic
hypertrophy, infection, and inflammation, affect millions of men worldwide.
Genetic factors, in combination with other factors such as environment and
diet, play critical roles in both physiological and pathological processes of
the prostate. A large number of genetic and molecular events have been
documented in the literature and are represented by thousands of records in the
MEDLINE database. A fundamental
limitation of MEDLINE and other similar resources is that the information they
contain is not represented in a structured format. Thus, both retrieval
effectiveness and precision are poor. For example, a typical question
scientists may ask when searching the MEDLINE database is: “What genes have been
found mutated in human prostate cancer?” To answer the question, they may
search MEDLINE using the query
"prostate cancer" AND mutation AND human, which returned 714
records as of July 24, 2002, of which fewer than half are relevant to the
question and many are redundant. Another problem hindering efficient
retrieval of gene-related information from literature databases is the use of
non-standard gene names by scientists. For example, many different aliases
have been used in the literature for the CDKN2A gene, commonly known as p16,
including ARF, P16, CMM2, INK4, MTS1, TP16,
CDK4I, CDKN2, INK4A, p14ARF, and p16INK4. Querying
MEDLINE with any single alias will miss relevant records.
In view of the existing problems, PGDB was constructed to: 1) catalog gene-related
facts about the prostate and prostatic diseases that have accumulated in the
literature database in past years and will accumulate in years to come; 2) store the information in
a structured format for fast and easy access; 3) annotate the data to deliver value-added information.
Gene Inclusion Criteria
Two general categories of genes are
currently included in PGDB. The first category comprises genes that have been
documented in the literature to be involved in one of the following molecular
events in normal or diseased prostate: gene mutation,
amplification, methylation, gross deletion, polymorphism, and over-expression.
The second category comprises genes specifically expressed in the prostate.
Evidence for this category comes from the SAGEmap
database and the UniGene
database hosted by the National Center for
Biotechnology Information (NCBI). For EST expression, a UniGene cluster
must have at least 2 member ESTs, all of which were derived from prostate
libraries; for SAGE expression, a gene defined as prostate specific must
have a tag count of more than 1, with all tags derived from prostate libraries. Most
of the genes in this category are UniGene clusters of ESTs.
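As an illustration only, the two prostate-specificity rules above can be sketched in Python. The function names and the (tissue, count) input layout are assumptions made for this sketch; they are not taken from PGDB's own programs.

```python
from typing import Iterable, Tuple

# (tissue of the source library, number of ESTs or SAGE tags observed there).
# This layout is hypothetical and was chosen only for the sketch.
Observation = Tuple[str, int]


def _all_from_prostate(observations: Iterable[Observation], min_total: int) -> bool:
    """True if at least `min_total` counts exist and every one comes from prostate."""
    observed = [(tissue, n) for tissue, n in observations if n > 0]
    total = sum(n for _, n in observed)
    return total >= min_total and all(tissue == "prostate" for tissue, _ in observed)


def is_prostate_specific_est(ests: Iterable[Observation]) -> bool:
    """EST rule: at least 2 member ESTs, all derived from prostate libraries."""
    return _all_from_prostate(ests, min_total=2)


def is_prostate_specific_sage(tags: Iterable[Observation]) -> bool:
    """SAGE rule: tag count of more than 1 (at least 2), all from prostate libraries."""
    return _all_from_prostate(tags, min_total=2)
```

Under these assumptions, a cluster with one EST in each of two prostate libraries passes the EST rule, while a single EST, or any EST from a non-prostate library, fails it.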
Data Sources of PGDB
PGDB uses data from the following sources:
- MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
- UniGene at http://www.ncbi.nlm.nih.gov/UniGene/
- SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/
- dbSNP at http://www.ncbi.nlm.nih.gov/SNP/
- LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/
- Gene Ontology at http://www.geneontology.org
- NCBI’s Gene Expression Omnibus (GEO), a gene expression and hybridization array data repository, at http://www.ncbi.nlm.nih.gov/geo/
Process of Construction
Construction of PGDB is a multi-stage process.
- Data retrieval: MEDLINE citation abstracts were retrieved using the Entrez query tool. A typical query consisted of three keywords: “prostate”, “human”, and the word for the event. For example, the query for the event of gene mutation in the prostate was (("prostate"[MeSH Terms] OR prostate[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word])). MEDLINE records that were of review type or lacked abstracts were excluded. Genes involved in over-expression were retrieved from the OMIM database using “prostate” as the query word. Data from other databases were retrieved through either FTP or HTTP.
- Data extraction: MEDLINE abstracts from each query were carefully read by two scientists to identify true relationships between a gene and the prostate and to extract the gene name, the type of molecular and genetic event, and the type of prostatic disease. A list of genes was thus generated for further annotation.
- Data annotations: Data annotations were performed automatically using programs written in Perl. Information from other databases, such as alias names, summaries of gene function, gene ontology, and SNPs, was extracted and added to each extracted gene.
- Expression analysis: To provide relative expression levels in all tissues for each gene, expression data were analyzed as stated below.
- File generation: PGDB is stored and
maintained in a single denormalized flat file, from which front-end web
pages are further generated automatically for display.
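The retrieval queries described in the first step follow a regular pattern, which can be sketched as a small Python helper. This is only an illustration: the function names are hypothetical, and because the example query in the text does not show how the “human” keyword was written, the sketch leaves any additional terms to the caller as an optional argument.

```python
from typing import Sequence


def field_term(word: str) -> str:
    """Search a word both as a MeSH term and as a text word."""
    return f'("{word}"[MeSH Terms] OR {word}[Text Word])'


def build_event_query(event: str, organ: str = "prostate",
                      extra_terms: Sequence[str] = ()) -> str:
    """Join the organ term, the event term, and any extra terms with AND."""
    parts = [field_term(organ), field_term(event), *extra_terms]
    return "(" + " AND ".join(parts) + ")"


# Reproduces the gene mutation query quoted in the data retrieval step.
print(build_event_query("mutation"))
```

Calling the helper with other event words, for example "methylation", would give the corresponding query under the same assumed pattern.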
For each gene
collected in PGDB, levels of expression were analyzed using both SAGE and
EST data and pooled by tissue type.
For expression derived from ESTs, the number of ESTs for each gene in
each library was first normalized to the number of ESTs per million and then
pooled by tissue to obtain the average level of expression in each tissue. When
calculating expression from SAGE data, only reliable mapping data, as
defined by the SAGEmap database, were used. For each gene, the tag frequency in each library
was also normalized to the number of tags per million. Special measures were
taken to deal with the problem of multiple tag assignments. If one SAGE tag was
mapped to n genes, the tag frequency for each gene in each library was
divided by n. If one gene had more than one tag mapped to it, the
tag frequency for the gene was the sum of the tag frequencies of all its tags.
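The SAGE normalization and the handling of multiple tag assignments described above can be sketched in Python as follows. This is a minimal illustration for a single library; the function names and the dictionary-based inputs are assumptions, not PGDB's actual code or data layout.

```python
from collections import defaultdict
from typing import Dict, List


def per_million(count: float, library_total: int) -> float:
    """Normalize a raw count to a count per million tags (or ESTs)."""
    return count * 1_000_000 / library_total


def gene_tag_frequencies(tag_counts: Dict[str, int],
                         tag_to_genes: Dict[str, List[str]],
                         library_total: int) -> Dict[str, float]:
    """Per-gene tag frequency (per million) in one SAGE library.

    A tag mapped to n genes contributes 1/n of its frequency to each gene;
    a gene with several tags receives the sum over all of its tags.
    """
    freq: Dict[str, float] = defaultdict(float)
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, [])
        if not genes:
            continue  # tag without a reliable mapping is ignored
        share = per_million(count, library_total) / len(genes)
        for gene in genes:
            freq[gene] += share
    return dict(freq)
```

For example, in a library of 2,000,000 tags, a tag seen 4 times and mapped to two genes gives each gene 1.0 per million, and a second tag seen 2 times and mapped only to the first gene adds another 1.0 per million to that gene.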
For each gene in the
PGDB database, SAGE and/or EST expression data are given. An
example is provided below. The items are explained as follows:
- ESTs or SAGE tags: Total number of ESTs or SAGE tags representing this gene in all libraries from all tissues.
- Libraries: Total number of libraries expressing this gene.
- Percentage of libraries expressing this gene out of the total libraries in a tissue pool.
- Count per million (CPM): The number of tags from a library that map to the gene is first normalized to a tag count per million and then averaged among the libraries expressing this gene.
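To make the reported items concrete, the following Python sketch computes them for one gene within one tissue pool. The record type, its attribute names, and the per-pool scope are assumptions made for illustration; only the definitions of the items come from the text above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PoolSummary:
    total_tags: int           # ESTs or SAGE tags representing the gene
    libraries: int            # libraries expressing the gene
    percent_libraries: float  # expressing libraries as % of libraries in the pool
    cpm: float                # mean count per million over expressing libraries


def summarize_pool(libs: List[Tuple[int, int]]) -> PoolSummary:
    """libs holds one (gene_count, library_total) pair per library in the pool."""
    expressing = [(count, total) for count, total in libs if count > 0]
    n = len(expressing)
    total_tags = sum(count for count, _ in expressing)
    percent = 100.0 * n / len(libs) if libs else 0.0
    # Normalize each expressing library to per-million, then average.
    cpm = sum(count * 1_000_000 / total for count, total in expressing) / n if n else 0.0
    return PoolSummary(total_tags, n, percent, cpm)
```

Note that the average is taken only over libraries in which the gene is observed, as the CPM definition states, so libraries with a zero count affect the percentage but not the CPM.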
Data Structure and Format
PGDB is distributed and maintained in a single flat file. The fields
of each entry are explained below.
- Gene name as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official name is available, the interim name from LocusLink is used.
- Official gene symbol as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official symbol is available, the interim symbol from LocusLink is used.
- Other names and symbols used for the gene, from LocusLink.
- The name of the product of this transcript.
- Types of molecular or genetic events and diseases in which the gene is involved, or type of expression derived by analysis of EST and SAGE data.
- UniGene ID for the gene.
- Sequences: mRNA or protein sequence from RefSeq.
- OMIM and SNP (Optional): OMIM entry for the gene and a link to NCBI SNPs for the locus.
- Chromosome and cytoband for the gene, including links to the UCSC and Ensembl genome databases.
- A summary description of the gene, its products, its significance, and mutant phenotypes, from LocusLink.
- (Optional): Gene Ontology for the gene, from LocusLink and Gene Ontology.
- Expression information derived from analysis of EST and SAGE data and pooled by tissue type. Details are given above.
- Supporting references listed by type of molecular event and disease, sorted by year of publication, from PubMed.
- Other key publications related to this gene: links to a list of publications related to the gene (from
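As a rough sketch of how one entry with the fields listed above might be modeled in a program, the Python record below uses hypothetical attribute names; the actual field labels and layout of the PGDB flat file are not given in this description. The example values are the CDKN2A symbol and aliases mentioned earlier on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneEntry:
    """One PGDB entry. Attribute names are illustrative, not the file's labels."""
    name: str
    symbol: str
    aliases: List[str] = field(default_factory=list)
    product: Optional[str] = None
    events: List[str] = field(default_factory=list)    # molecular events, diseases
    unigene_id: Optional[str] = None
    omim_id: Optional[str] = None                       # optional field
    go_terms: List[str] = field(default_factory=list)  # optional field
    references: List[str] = field(default_factory=list)

    def all_names(self) -> List[str]:
        """Every name under which the gene could be looked up."""
        return [self.symbol, self.name, *self.aliases]


entry = GeneEntry(
    name="cyclin-dependent kinase inhibitor 2A",
    symbol="CDKN2A",
    aliases=["ARF", "P16", "CMM2", "INK4", "MTS1", "TP16",
             "CDK4I", "CDKN2", "INK4A", "p14ARF", "p16INK4"],
)
```

Keeping the aliases alongside the official symbol in one record is what allows a lookup by any of the names to reach the same entry.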
Web Interface
PGDB search uses the free search engine ht://dig from http://www.htdig.org.
Searchable fields include gene name and symbol, aliases, UniGene ID,
OMIM ID, and LocusLink ID.
PGDB content can be browsed by molecular event and by disease.