geneSmash Documentation

geneSmash is a mash-up of various sources of information about human genes. The primary sources at the time of this writing are:

- The gene_info file from the NCBI Entrez Gene FTP site.
- The gene2unigene file from the NCBI Entrez Gene FTP site.
- The refFlat.txt file from the UCSC Genome Browser.
- The hsa.gff file from miRBase.
- Annotation information for human gene expression arrays, extracted from the manufacturers' (Affymetrix, Agilent, and Illumina) websites:
  - Affymetrix annotation files are obtained from the NetAffx™ Analysis Center.
  - Illumina probe annotation is acquired from the Illumina website.
  - Agilent annotation information is taken from the Agilent eArray portal.

Currently, probe annotation information for various human gene expression array platforms from these manufacturers is available in geneSmash.

Other sources may be incorporated in the future. These sources of information have been combined into a simple CouchDB database. As a consequence, we can build tools that make it possible to find the genomic location of a gene from its symbol, or to map easily between other classes of gene identifiers.

Web Site

The geneSmash web site provides one set of search tools built upon this infrastructure. You can enter the official gene symbol to get back the genome location, along with links out to the source databases at Entrez Gene or at the UCSC Genome Browser. Alternatively, you can search for genes by alias, by gene expression probe in a microarray, by cytoband location, or by giving a range of base positions in the human genome. You can also write your own programs (see below) or build your own web applications on top of a local copy of the database.

Mirrors and Replication

CouchDB provides native support for database replication. You can use those facilities to make (and maintain) a local copy of the entire geneSmash database. Because replication copies items at the granularity of an individual document (which in this instance means the collection of information about one gene), it is much gentler on network resources than copying the entire source files from the NCBI or UCSC. This advantage becomes particularly pronounced during maintenance, since a second replication will only copy the documents that have changed since the last time you replicated.
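
As a rough illustration, a one-off replication can be requested by POSTing a small JSON body to the `_replicate` endpoint of the CouchDB server that should receive the copy. The sketch below only builds that body. The helper name `replicationBody` is hypothetical, "HOSTNAME" is a placeholder for the geneSmash server, and the local server is assumed to listen on CouchDB's default port 5984; recent CouchDB versions will usually also require credentials.

```r
# Build the JSON body for a CouchDB replication request.
# ASSUMPTION: the source and target URLs below are placeholders, not the
# real addresses of the geneSmash server or of your local CouchDB.
replicationBody <- function(source, target, createTarget = TRUE) {
  sprintf('{"source": "%s", "target": "%s", "create_target": %s}',
          source, target, if (createTarget) "true" else "false")
}

body <- replicationBody("http://HOSTNAME/genesmash",
                        "http://localhost:5984/genesmash")
cat(body, "\n")

# The body could then be sent to the local server, for example with curl:
#   curl -X POST http://localhost:5984/_replicate \
#        -H "Content-Type: application/json" -d '<body>'
```

Running the same request again later transfers only the documents that have changed in the meantime.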

If you replicate the database, we request that you maintain links to the geneSmash logo and to the University of Texas M.D. Anderson Cancer Center.

Programming Interface

Because geneSmash is implemented using CouchDB, all of the data is available through a RESTful interface. Calls are made to the server using standard HTTP, and responses are sent in JSON format.

Database Overview

The database "design document" (which serves as the equivalent of a database schema from a relational database) is available (in JSON format) via the call /genesmash/_design/basic
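
A CouchDB design document stores its view definitions under a `views` field, so the available views can be listed once the document has been parsed. The following is a minimal sketch, assuming the jsonlite package is used for parsing (a substitution made here for illustration) and that "HOSTNAME" stands in for the real server. The helper names `listViews` and `fetchDesignDoc` are hypothetical.

```r
# List the names of the views defined in a parsed CouchDB design document.
# `designDoc` is the design document as a nested R list.
listViews <- function(designDoc) {
  sort(names(designDoc[["views"]]))
}

# Fetch and parse the design document (needs network access and jsonlite).
# ASSUMPTION: "http://HOSTNAME" is a placeholder for the geneSmash server.
fetchDesignDoc <- function(host = "http://HOSTNAME") {
  url <- paste(host, "genesmash/_design/basic", sep = "/")
  jsonlite::fromJSON(url, simplifyVector = FALSE)
}

# Offline illustration with a hand-made stand-in for the real document:
example <- list(`_id` = "_design/basic",
                views = list(by_symbol = list(map = "..."),
                             by_alias  = list(map = "...")))
print(listViews(example))
```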

Information on individual genes

The primary key (in CouchDB terms, the _id) in geneSmash is the NCBI Entrez Gene database identifier. For example, suppose we are interested in the tumor suppressor gene p53 (official symbol TP53), whose Entrez Gene id is 7157. To get all of the geneSmash information about p53, you would make an HTTP call to the URL: /genesmash/7157.

In order to get the data on a different gene whose Entrez Gene id is known, just replace 7157 in the URL by the id of the gene of interest.
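
The lookup by Entrez Gene id can be sketched in R as follows. The function names `geneUrl` and `getGene` are hypothetical, "HOSTNAME" is a placeholder for the server address, and jsonlite is assumed for parsing the response.

```r
# Build the URL of the geneSmash document for one Entrez Gene id.
# ASSUMPTION: "http://HOSTNAME" is a placeholder for the geneSmash server.
geneUrl <- function(entrezId, host = "http://HOSTNAME") {
  # format() keeps large numeric ids out of scientific notation
  id <- format(entrezId, scientific = FALSE, trim = TRUE)
  paste(host, "genesmash", id, sep = "/")
}

# Fetch the document as a nested R list (needs network access and jsonlite).
getGene <- function(entrezId, host = "http://HOSTNAME") {
  jsonlite::fromJSON(geneUrl(entrezId, host), simplifyVector = FALSE)
}

print(geneUrl(7157))
```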

Queries based on other identifiers

Of course, in many circumstances, you do not know the Entrez Gene id but have some other way to refer to the gene. One common example occurs when you know the official HGNC symbol for a gene. We have designed the geneSmash database to allow queries based on some of these other identifiers.

Queries in CouchDB are implemented by defining views in the design document. You can use the call above to get a copy of the design document and see the complete list of views that have been defined. The view by_symbol allows you to query based on the HGNC symbol. For genes without an HGNC symbol, you can query the database using symbols provided by Entrez Gene. So, the HTTP request /genesmash/_design/basic/_view/by_symbol?key="EGFR"&include_docs=true will get all of the geneSmash information on the gene EGFR. The views defined at present are:

by_alias
by_cytoband
by_ensembl
by_location
by_mir
by_symbol
by_unigene
by_probe
gene_location
all
maxlength
minlength
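
Because every view is queried through the same URL pattern, a small helper can build the request for any of them. This is a sketch that assumes the view takes a plain string key, as by_symbol does; other views may expect a different key structure. The name `viewUrl` is hypothetical and "HOSTNAME" is a placeholder.

```r
# Build a query URL for one of the geneSmash views.
# ASSUMPTIONS: "http://HOSTNAME" is a placeholder, and the key is a single
# string (the key format of views other than by_symbol is not checked here).
viewUrl <- function(view, key = NULL, includeDocs = TRUE,
                    host = "http://HOSTNAME") {
  url <- paste(host, "genesmash/_design/basic/_view", view, sep = "/")
  params <- character(0)
  if (!is.null(key)) {
    # CouchDB expects the key as JSON, so a string key is wrapped in double
    # quotes; the quotes are percent-encoded to keep the URL valid.
    quoted <- paste0('"', key, '"')
    params <- c(params,
                paste0("key=", utils::URLencode(quoted, reserved = TRUE)))
  }
  if (includeDocs) params <- c(params, "include_docs=true")
  if (length(params) > 0) {
    url <- paste0(url, "?", paste(params, collapse = "&"))
  }
  url
}

print(viewUrl("by_symbol", "EGFR"))
```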

Information on all genes

If you omit the key parameter when you invoke a CouchDB view, then the response contains information on all the documents that are relevant to the view. For example, the HTTP request /genesmash/_design/basic/_view/by_symbol returns geneSmash information on all genes, sorted by the HGNC symbol.

Now, you might be hesitant to issue the previous request, but I encourage you to go ahead. For all permanent views, CouchDB pre-computes the responses. So even querying for all genes is very fast, since most of the time the server already knows the answer and just has to transmit the bytes over the network.
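
If you would rather look at a few rows at a time, CouchDB views also accept the standard `limit` and `skip` query options. The sketch below builds such a request and pulls the keys out of a parsed response. The helper names are hypothetical, "HOSTNAME" is a placeholder, and the sample response is made up purely to show the shape (`total_rows`, `offset`, `rows`) of a view result.

```r
# Build a view URL that returns at most `limit` rows, skipping the first `skip`.
# ASSUMPTION: "http://HOSTNAME" is a placeholder for the geneSmash server.
pagedViewUrl <- function(view, limit = 10, skip = 0,
                         host = "http://HOSTNAME") {
  base <- paste(host, "genesmash/_design/basic/_view", view, sep = "/")
  sprintf("%s?limit=%d&skip=%d", base, as.integer(limit), as.integer(skip))
}

# Extract the keys from a parsed view response (a nested R list).
rowKeys <- function(resp) {
  unlist(lapply(resp$rows, function(r) r$key))
}

print(pagedViewUrl("by_symbol", limit = 5, skip = 10))

# Made-up response, for illustration only (not real geneSmash data):
sample <- list(total_rows = 2, offset = 0,
               rows = list(list(id = "1", key = "AAA", value = NULL),
                           list(id = "2", key = "BBB", value = NULL)))
print(rowKeys(sample))
```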

Using geneSmash in other programs

Because the interface to geneSmash (like the default interface for any CouchDB application) only uses HTTP and JSON, it can be integrated directly into all modern programming languages without imposing the overhead of a new specialized programming library. For instance, the following code examples show you how to use geneSmash in the R statistical programming environment.

To get the genomic coordinates of a gene based on its HGNC symbol:
```
library(rjson.krc)  # converts between JSON and R objects
getGeneLocations <- function(sym) {
  host <- "HOSTNAME"  # replace with the address of the geneSmash server
  viewUrl <- paste(host, "genesmash/_design/basic/_view/by_symbol", sep='/')
  queryUrl <- paste(viewUrl, "?key=\"", sym, "\"&include_docs=true", sep='')
  # Make the HTTP request and collapse the response into a single string
  response <- paste(readLines(queryUrl), collapse='')
  # Convert the JSON text into nested R lists
  result <- fromJSON(response)
  # Take the Maps element of the first matching document
  maps <- result[["rows"]][[1]]$doc$Maps
  data.frame(Build=unlist(lapply(maps, function(x) x$NCBI)),
             Chromosome=unlist(lapply(maps, function(x) x$Chromosome)),
             TranscriptionStart=unlist(lapply(maps, function(x) x$TranscriptionStart)),
             TranscriptionEnd=unlist(lapply(maps, function(x) x$TranscriptionEnd)))
}
getGeneLocations("TP53")
```
Some notes on the code:
- The first line loads an R package that converts between JSON objects and R objects. The version of the rjson package currently at CRAN has some limitations that make it work poorly in the current context. We have patched the package, and you can get a copy from our R repository at http://bioinformatics.mdanderson.org/OOMPA. Note that you will need to supply this repository name to the install.packages function (through its repos argument). New: Beginning with version 0.7, the RJSONIO package that Duncan Temple Lang maintains at Omegahat can be used instead of rjson.krc. That implementation is recommended, especially since it is much faster.
- The first three lines of the getGeneLocations function use the symbol argument to construct an appropriate URL. The call to paste(readLines(...)) actually makes the HTTP request to the geneSmash server. The call to fromJSON then converts the JSON response into an R object, and the final lines extract the relevant part of the response.
- If you actually run the code, you may at first be surprised to get back an entire data frame instead of a single response. However, there are two reasons why the answer is not unique. First, we have loaded the mapping data for several different builds of the genome into geneSmash, and you get answers for every build. Second, many genes have alternative splice forms; each one has a slightly different transcription start and end (even within a single build of the genome). If you explore the "Maps" element, you will discover that it contains the start and end positions of all of the exons for every known alternative splice form in multiple builds of the genome.
- This version of the code does not perform error checking on the result, so it should not be used in production code as written. Failure can occur because the server is not available, because the symbol passed as an argument is not a valid HGNC symbol, or because no mapping location is known; all three conditions should be checked.
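
The three failure conditions named in the last note could be checked along the following lines. This is only a sketch under stated assumptions: the name `safeGeneLocations` is hypothetical, "HOSTNAME" is a placeholder, jsonlite is swapped in for the JSON parsing, and the `fetch` argument exists solely so that the logic can be tried without a server. The values in `fakeFetch` are made up.

```r
# A variant of getGeneLocations that checks for the three failure modes.
# ASSUMPTIONS: "http://HOSTNAME" is a placeholder; `fetch` must return the
# parsed view response as a nested list (the default uses jsonlite).
safeGeneLocations <- function(sym, host = "http://HOSTNAME",
                              fetch = function(url) {
                                jsonlite::fromJSON(url, simplifyVector = FALSE)
                              }) {
  url <- paste0(host, "/genesmash/_design/basic/_view/by_symbol",
                "?key=%22", utils::URLencode(sym, reserved = TRUE),
                "%22&include_docs=true")
  # Condition 1: the server is unavailable or the response is unreadable
  resp <- tryCatch(fetch(url), error = function(e) NULL)
  if (is.null(resp)) stop("geneSmash server could not be reached: ", url)
  # Condition 2: the symbol matches no document
  if (length(resp$rows) == 0) stop("no gene found for symbol '", sym, "'")
  # Condition 3: the document carries no mapping information
  maps <- resp$rows[[1]]$doc$Maps
  if (length(maps) == 0) stop("no mapping location known for '", sym, "'")
  # Missing fields become NA so that all columns keep the same length
  field <- function(name) {
    unlist(lapply(maps, function(m) if (is.null(m[[name]])) NA else m[[name]]))
  }
  data.frame(Build = field("NCBI"), Chromosome = field("Chromosome"),
             TranscriptionStart = field("TranscriptionStart"),
             TranscriptionEnd = field("TranscriptionEnd"))
}

# Offline trial with a made-up response (not real geneSmash data):
fakeFetch <- function(url) {
  list(rows = list(list(doc = list(Maps = list(
    list(NCBI = "buildX", Chromosome = "chrN",
         TranscriptionStart = 100, TranscriptionEnd = 200))))))
}
print(safeGeneLocations("TP53", fetch = fakeFetch))
```

Whether to signal these conditions with stop(), as here, or to return NA is a design choice that depends on how the calling code wants to handle failures.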

To get information about a gene based on a microarray probe identifier:


```
library(rjson.krc)
getProbeInfo <- function(Manufacturer, ProbeID) {
  host <- "HOSTNAME"  # replace with the address of the geneSmash server
  giUrl <- paste(host, "genesmash/_design/basic/_view/by_probe2", sep='/')
  # Select the range of keys that begin with [Manufacturer, ProbeID]
  link <- paste(giUrl, "?startkey=[\"", Manufacturer, "\",\"", ProbeID, "\"]",
                "&endkey=[\"", Manufacturer, "\",\"", ProbeID, "\",\"\\u9999\"]",
                "&include_docs=true", sep='')
  JSON_data <- paste(readLines(link), collapse='')
  gene_data <- fromJSON(JSON_data)
  geneInfo <- NA  # returned as-is when the probe is not found
  # && short-circuits, so $rows is only inspected when gene_data is a list
  if (is.list(gene_data) && (length(gene_data$rows) > 0)) {
    doc <- gene_data[["rows"]][[1]]$doc
    geneID <- unlist(doc['_id'])
    sym <- unlist(doc['Symbol'])
    # Optional fields fall back to NA when absent from the document
    genbankID <- ifelse(is.null(unlist(doc['GenBank'])), NA, unlist(doc['GenBank']))
    unigeneID <- ifelse(is.null(unlist(doc['UniGene'])), NA, unlist(doc['UniGene']))
    desc <- ifelse(is.null(unlist(doc['Description'])), NA, unlist(doc['Description']))
    chr <- ifelse(is.null(unlist(doc['Chromosome'])), NA, unlist(doc['Chromosome']))
    geneInfo <- data.frame(row.names = ProbeID, EntrezGeneID = geneID,
                           GenbankID = genbankID, UnigeneID = unigeneID,
                           Symbol = sym, Description = desc, Chromosome = chr)
  }
  geneInfo
}
getProbeInfo("Affymetrix", "205241_at")
getProbeInfo("Agilent", "A_23_P142045")
```

Note: Only microarray probes associated with an NCBI Entrez Gene entry are included in the geneSmash database.