geneSmash Documentation
geneSmash is a mash-up of various sources of information about human genes. The primary sources at the time of this writing are:
- The gene_info file from the NCBI Entrez gene FTP site.
- The gene2unigene file from the NCBI Entrez gene FTP site.
- The refFlat.txt file from the UCSC Genome Browser.
- The hsa.gff file from miRBase.
- Human gene expression array annotation information extracted from the manufacturers' (Affymetrix, Agilent and Illumina) websites.
- Affymetrix annotation files are obtained from NetAffx Analysis Center
- Illumina probe annotation is acquired from this location
- Agilent annotation information is taken from Agilent earray portal
Currently, geneSmash contains probe annotation information for various human gene expression array platforms from the manufacturers listed above.
Web Site
The geneSmash web site provides one set of search tools built upon this infrastructure. You can enter the official gene symbol to get back the genome location, along with links out to the source databases at Entrez Gene or at the UCSC Genome Browser. Alternatively, you can search for genes by alias, by gene expression probe in a microarray, by cytoband location, or by giving a range of base positions in the human genome. You can also write your own programs (see below) or add your own web applications on top of a local copy of the database.
Mirrors and Replication
CouchDB provides native support for database replication. You can use those facilities to make (and maintain) a local copy of the entire geneSmash database. Because replication copies items at the granularity of an individual document (which in this instance means the collection of information about one gene), it is much gentler on network resources than copying the entire source files from the NCBI or UCSC. This advantage becomes particularly pronounced during maintenance, since a second replication will only copy the documents that have changed since the last time you replicated.
If you replicate the database, we request that you maintain links to the geneSmash logo and to the University of Texas M.D. Anderson Cancer Center.
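As a rough sketch of how a replication could be requested: CouchDB accepts a JSON document with "source" and "target" fields, sent by HTTP POST to the _replicate endpoint of the server that should hold the copy. The R function below only builds that JSON body. The host name HOSTNAME is a placeholder, and the local database name "genesmash" and the helper name replicationBody are assumptions made for illustration.

```r
# Build the JSON body for a CouchDB replication request.
# sourceUrl: URL of the remote geneSmash database (HOSTNAME is a placeholder).
# targetDb:  name of the local database that receives the copy (assumed name).
# createTarget: ask CouchDB to create the target database if it is missing.
replicationBody <- function(sourceUrl, targetDb, createTarget = TRUE) {
  sprintf('{"source": "%s", "target": "%s", "create_target": %s}',
          sourceUrl, targetDb, if (createTarget) "true" else "false")
}

body <- replicationBody("http://HOSTNAME/genesmash", "genesmash")
cat(body, "\n")
```

The resulting string would be posted to /_replicate on your local CouchDB server with the content type application/json, using whatever HTTP client you prefer. Repeating the same request later transfers only the documents that have changed.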
Programming Interface
Because geneSmash is implemented using CouchDB, all of the data is available through a RESTful interface. Calls are made to the server using standard HTTP, and responses are sent in JSON format.
Database Overview
The database "design document" (which serves as the equivalent of a database schema from a relational database) is available (in JSON format) via the call /genesmash/_design/basic
Information on individual genes
The primary key (in CouchDB terms, the _id) in geneSmash is the NCBI Entrez Gene database identifier. For example, suppose we are interested in the tumor suppressor gene p53, whose Entrez Gene id is 7157. To get all of the geneSmash information about p53, you would make an HTTP call to the URL /genesmash/7157. To get the data on a different gene whose Entrez Gene id is known, just replace 7157 in the URL with the id of the gene of interest.
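The URL construction described here can be sketched in R as follows. This is a minimal illustration, not part of geneSmash itself: the function name geneDocUrl is hypothetical, and HOSTNAME is a placeholder for the address of the server.

```r
# Build the URL of the geneSmash document for one Entrez Gene id.
# 'host' is a placeholder for the geneSmash server address.
geneDocUrl <- function(host, entrezId) {
  # Avoid scientific notation for large numeric ids (e.g. 100000 -> "1e+05").
  id <- if (is.numeric(entrezId)) {
    format(entrezId, scientific = FALSE)
  } else {
    as.character(entrezId)
  }
  # Entrez Gene ids are positive integers; reject anything else early.
  if (!grepl("^[0-9]+$", id)) {
    stop("Entrez Gene id must be a positive integer: ", id)
  }
  paste(host, "genesmash", id, sep = "/")
}

geneDocUrl("http://HOSTNAME", 7157)
```

Checking the id before building the URL catches the common mistake of passing a gene symbol where an Entrez Gene id is expected.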
Queries based on other identifiers
Of course, in many circumstances, you do not know the Entrez Gene id but have some other way to refer to the gene. One common example occurs when you know the official HGNC symbol for a gene. We have designed the geneSmash database to allow queries based on some of these other identifiers.
Queries in CouchDB are implemented by defining views in the design document. You can use the call above to get a copy of the design document and see the complete list of views that have been defined. The view by_symbol allows you to query based on the HGNC symbol. For genes without an HGNC symbol, you can query the database using symbols provided by Entrez Gene. So, the HTTP request
/genesmash/_design/basic/_view/by_symbol?key="EGFR"&include_docs=true
will get all of the geneSmash information on the gene EGFR.
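A view request like the one above could be assembled in R along the following lines. This is only a sketch: the function name viewUrl is hypothetical, HOSTNAME is a placeholder, and the key is assumed to be a plain string without embedded double quotes. The key has to be a JSON string, so it is wrapped in double quotes and then percent-encoded so that the quotes survive in the URL.

```r
# Build the URL for a keyed query against a geneSmash view.
# host: server address (placeholder); view: view name such as "by_symbol".
viewUrl <- function(host, view, key, includeDocs = TRUE) {
  base <- paste(host, "genesmash/_design/basic/_view", view, sep = "/")
  # CouchDB expects the key as JSON, so a string key needs double quotes.
  jsonKey <- paste0('"', key, '"')
  # Percent-encode the quoted key (the quote character becomes %22).
  query <- paste0("key=", URLencode(jsonKey, reserved = TRUE))
  if (includeDocs) {
    query <- paste0(query, "&include_docs=true")
  }
  paste(base, query, sep = "?")
}

viewUrl("http://HOSTNAME", "by_symbol", "EGFR")
```

Setting includeDocs to FALSE returns only the view rows, which is lighter when the full gene documents are not needed.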
The views defined at present include:
by_alias
by_cytoband
by_ensembl
by_location
by_mir
by_symbol
by_unigene
by_probe
gene_location
all
maxlength
minlength
Information on all genes
If you omit the key parameter when you invoke a CouchDB view, then the response contains information on all the documents that are relevant to the view. For example, the HTTP request
/genesmash/_design/basic/_view/by_symbol
returns geneSmash information on all genes, sorted by the HGNC symbol.
Now, you might be hesitant to follow the previous link, but I encourage you to go ahead. For all permanent views, CouchDB pre-computes the responses. So even querying for all genes is very fast, since most of the time the server already knows the answer and just has to transmit the bytes over the network.
Using geneSmash in other programs
Because the interface to geneSmash (like the default interface for any CouchDB application) uses only HTTP and JSON, it can be integrated directly into any modern programming language without imposing the overhead of a new specialized programming library. For instance, the following code examples show how to use geneSmash in the R statistical programming environment:
- To get the genomic coordinates of a gene based on its HGNC symbol:
    library(rjson.krc)
    getGeneLocations <- function(sym) {
      host <- "HOSTNAME"
      giUrl <- paste(host, "genesmash/_design/basic/_view/by_symbol", sep='/')
      whatever <- paste(giUrl, "?key=\"", sym, "\"&include_docs=true", sep='')
      junk <- paste(readLines(whatever), collapse='')
      stuff <- fromJSON(junk)
      rows <- stuff[["rows"]][[1]]$doc$Maps
      data.frame(Build=unlist(lapply(rows, function(x) x$NCBI)),
                 Chromosome=unlist(lapply(rows, function(x) x$Chromosome)),
                 TranscriptionStart=unlist(lapply(rows, function(x) x$TranscriptionStart)),
                 TranscriptionEnd=unlist(lapply(rows, function(x) x$TranscriptionEnd)))
    }
    getGeneLocations("TP53")
Some notes on the code:
- The first line loads an R package that converts between JSON objects and R objects. The version of the rjson package currently at CRAN has some limitations that make it work poorly in the current context. We have patched the package, and you can get a copy from our R repository at http://bioinformatics.mdanderson.org/OOMPA. Note that you will need to supply this repository name to the install.packages function. New: Beginning with version 0.7, the RJSONIO package that Duncan Temple Lang maintains at Omegahat can be used instead of rjson.krc. That implementation is recommended, especially since it is much faster.
- The first three lines of the getGeneLocations function use the symbol argument to construct an appropriate URL. The call to paste(readLines(...)) actually makes the HTTP request to the geneSmash server. The next line converts the JSON response into an R object, and the final lines extract the relevant part of the response.
- If you actually run the code, you may at first be surprised to get back an entire data frame instead of a single response. However, there are two reasons why the answer is not unique. First, we have loaded the mapping data for several different builds of the genome into geneSmash, and you get answers for every build. Second, many genes have alternative splice forms; each one has a slightly different transcription start and end (even within a single build of the genome). If you explore the "Maps" element, you will discover that it contains the start and end positions of all of the exons for every known alternative splice form in multiple builds of the genome.
- This version of the code does not perform error checking on the result, so it should not be used in production code as written. Failure can occur because the server is not available, because the symbol passed as an argument is not a valid HGNC symbol, or because no mapping location is known; all three conditions should be checked.
- To get information about a gene based on the microarray probe identifier:
    library(rjson.krc)
    getProbeInfo <- function(Manufacturer, ProbeID) {
      host <- "HOSTNAME"
      giUrl <- paste(host, "genesmash/_design/basic/_view/by_probe2", sep='/')
      link <- paste(giUrl,
                    "?startkey=[\"", Manufacturer, "\",\"", ProbeID, "\"]",
                    "&endkey=[\"", Manufacturer, "\",\"", ProbeID, "\",\"\\u9999\"]",
                    "&include_docs=true", sep='')
      JSON_data <- paste(readLines(link), collapse='')
      gene_data <- fromJSON(JSON_data)
      geneInfo <- NA
      if (is.list(gene_data) && (length(gene_data$rows) > 0)) {
        rows <- gene_data[["rows"]][[1]]$doc
        geneID <- unlist(rows['_id'])
        sym <- unlist(rows['Symbol'])
        genbankID <- ifelse(is.null(unlist(rows['GenBank'])), NA, unlist(rows['GenBank']))
        unigeneID <- ifelse(is.null(unlist(rows['UniGene'])), NA, unlist(rows['UniGene']))
        desc <- ifelse(is.null(unlist(rows['Description'])), NA, unlist(rows['Description']))
        chr <- ifelse(is.null(unlist(rows['Chromosome'])), NA, unlist(rows['Chromosome']))
        geneInfo <- data.frame(row.names = ProbeID, EntrezGeneID = geneID,
                               GenbankID = genbankID, UnigeneID = unigeneID,
                               Symbol = sym, Description = desc, Chromosome = chr)
      }
      geneInfo
    }
    getProbeInfo("Affymetrix", "205241_at")
    getProbeInfo("Agilent", "A_23_P142045")
Note: Only microarray probes associated with an NCBI Entrez Gene record are included in the geneSmash database.
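The notes on getGeneLocations point out three failure conditions that the example does not check: the server is unavailable, the symbol is not known, or no mapping location exists. One possible way to guard against them is sketched below using only base R. The helper names safeFetch and extractMaps are hypothetical, and the sketch assumes the response has already been converted to an R list with the "rows", "doc" and "Maps" elements used in the example.

```r
# Fetch a URL and return the response text, or NULL if the request fails
# (for example because the server is not available).
safeFetch <- function(url) {
  tryCatch(
    suppressWarnings(paste(readLines(url, warn = FALSE), collapse = "")),
    error = function(e) NULL
  )
}

# Extract the "Maps" element from an already parsed by_symbol response,
# stopping with a clear message when the symbol or the mapping is missing.
extractMaps <- function(parsed) {
  if (!is.list(parsed) || length(parsed$rows) == 0) {
    stop("no gene found for this symbol")
  }
  maps <- parsed$rows[[1]]$doc$Maps
  if (length(maps) == 0) {
    stop("no mapping location is known for this gene")
  }
  maps
}

# Example with a hand-built response in the same shape as the parsed JSON.
parsed <- list(rows = list(list(doc = list(
  Maps = list(list(NCBI = "36", Chromosome = "17"))))))
length(extractMaps(parsed))
```

In getGeneLocations, the result of safeFetch would be tested for NULL before calling fromJSON, and extractMaps would replace the direct access to the "Maps" element.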