targetHub Documentation

targetHub is a database of miRNA-mRNA interactions. The interaction data are obtained from various external data sources and, in some cases, computed in-house by algorithms implemented for miRNA target prediction.

miRNA and 3'UTR data

miRBase (Version 18), a standard repository for miRNA data, is used as the reference for human miRNA data in targetHub. 3'UTR sequence data for the human genome are extracted from the UCSC Genome Browser site. Human mRNA transcripts in the 3'UTR data are annotated with Entrez Gene IDs so that targetHub data can be used in conjunction with geneSmash. Transcripts that do not map to standard chromosomes are filtered out of the sequence data. The filtered human 3'UTR sequences and the mature miRNA sequences from miRBase are used to compute target predictions with algorithms such as miRanda.

Targets for mature miRNA originating from the same stem-loop

The mature miRNAs originating from the same hairpin loop are on different strands. As their sequences are complementary, the targets for these miRNA sequences are completely different. The algorithms used to find miRNA targets rely mostly on the seed sequences of the mature miRNA. The identifiers these algorithms use for mature miRNA are not standardized, as the nomenclature is still evolving.

If two mature miRNA sequences are excised from the same stem-loop miRNA, they can be named in different ways.

The miRNA/miRNA* notation distinguishes the two mature miRNAs of the same hairpin loop by expression level; miRNA* represents the one with the lower expression level (example: hsa-miR-103a and hsa-miR-103a*).
The current method is to name the arm of origin, 5p or 3p (example: hsa-miR-103a-1-5p). In Version 18 of miRBase, mature miRNA sequences from all human precursors are designated with the -5p and -3p convention rather than miR/miR*.

If the database of a target prediction algorithm states only the name of the stem-loop (hairpin) miRNA, the entry is taken to represent the more highly expressed mature miRNA of that loop. A more precise annotation of mature miRNA (such as the 5p/3p convention) in the miRNA target databases makes the target predictions easier to interpret.

Stem-loop miRNA nomenclature

Guidelines for miRNA nomenclature are provided by miRBase (Griffiths-Jones et al, 2006; Meyers et al, 2008; Griffiths-Jones et al, 2008). Some of them are elaborated in this section, as they may be useful when querying targetHub.

Paralogous miRNAs whose mature sequences differ by at most two bases are given letter suffixes (example: hsa-miR-103a, hsa-miR-103b).
Distinct stem-loop miRNAs that give rise to identical (in every base) mature miRNAs are given numbered suffixes (example: hsa-miR-103a-1, hsa-miR-103a-2).
If miRNAs are present at a similar locus on opposite strands of the genome, the distinction was previously made in some cases with 'S' for sense and 'AS' for antisense (example: hsa-miR-103 and hsa-miR-103-AS).
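
The naming rules above can be made concrete with a small parser. The following Python sketch is only an illustration: the regular expression and the function name parse_mirna_name are assumptions introduced here, not part of miRBase or targetHub, and the pattern covers only the letter-suffix, numbered-suffix, 5p/3p and star forms described in this section (older forms such as -AS are not matched).

```python
import re

# Illustrative approximation of the naming rules described above;
# this is not an official miRBase grammar.
MIRNA_NAME = re.compile(
    r"^(?P<species>[a-z]{3})-"             # three-letter species prefix, e.g. hsa
    r"(?P<kind>miR|mir|let)-"              # miRNA prefix as written in the name
    r"(?P<number>\d+)"                     # family number, e.g. 103
    r"(?P<letter>[a-z])?"                  # letter suffix for paralogues, e.g. a
    r"(?:-(?P<copy>\d+))?"                 # numbered suffix for distinct stem-loops
    r"(?:-(?P<arm>5p|3p)|(?P<star>\*))?$"  # arm, or the older star notation
)


def parse_mirna_name(name):
    """Split a miRNA name into its parts; return None if it does not match."""
    match = MIRNA_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["star"] = parts["star"] is not None  # True for names ending in '*'
    return parts


if __name__ == "__main__":
    for example in ("hsa-miR-103a-1-5p", "hsa-miR-103a*", "hsa-miR-103b"):
        print(example, parse_mirna_name(example))
```

Splitting a name this way makes it easier to see whether two identifiers from different data sources refer to the same stem-loop, the same paralogue, or the same arm.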

miRNA Target Prediction

This section describes the miRNA-target interaction data sources and methods used to set up the data in targetHub.

miRTarBase

miRTarBase is a curated database of experimentally validated microRNA-target interactions. The human version of miRTarBase [Release 2.5, October 2011] is downloaded from the miRTarBase website. Each record of the database is associated with the type of experimental evidence and the relevant PubMed article supporting the interaction. As the nomenclature for miRNA is not standardized, miRNA names in miRTarBase follow various conventions (described above). miRNAs that match the current version of miRBase are retained as they are in targetHub, while candidates that have no matching identifier are manually curated by mapping through previous miRBase identifiers.

TargetScan

The miRNA-mRNA interaction data predicted by the TargetScan algorithm (Lewis et al, 2005; Grimson et al, 2007; Friedman et al, 2009) are obtained from the TargetScan website [Version 6.1, March 2012]. TargetScan provides two metrics to assess the importance of a miR-target interaction: probability of conserved targeting (Pct) and total context score (TCS). Pct is a Bayesian estimate of the probability that a miR site on the 3'UTR of an mRNA is conserved due to miR targeting, while TCS represents the strength of the sequence features that facilitate miR-target hybridization/cleavage. TargetScan predicts miR targets for miR families instead of individual miRs. targetHub specifies the miR family for which the target is identified, along with the number of conserved and non-conserved probable sites of miRNA-mRNA interaction in the mRNA transcript.

PicTar

Like TargetScan, PicTar looks for identical seed sequences to predict miRNA-mRNA interactions. The PicTar target predictions were computed in 2005, whereas TargetScan was updated in 2012. As the PicTar data are relatively old, there are many discrepancies between the current miRBase naming convention and the names used by PicTar. In case of a discrepancy, the miRNA name is manually curated and the original PicTar identifier is retained. In many cases the predictions of PicTar and TargetScan overlap well, compared to other algorithms such as miRanda. PicTar derives an overall score to assess the strength of the miR-target interaction. This is the maximum likelihood that a given 3'UTR sequence is targeted by a fixed set of microRNAs. The PicTar algorithm scores any 3'UTR that has at least one aligned conserved predicted binding site for a microRNA, but then incorporates all possible binding sites into the score, even if they appear to be non-conserved.

Two levels of conservation can be chosen for the PicTar algorithm:

Conservation among four vertebrates: human, mouse, rat, and dog [termed as picTar4 in targetHub]
Conservation among five vertebrates: human, mouse, rat, dog, and chicken [termed as picTar5 in targetHub]

The PicTar miRNA-mRNA interaction data are obtained from the UCSC Genome Browser website (available only for build hg17). To fit the track conventions of the UCSC browser (integers), all scores were scaled by the maximum score of all microRNA 3'UTR scores observed.

miRanda

miRanda was one of the first bioinformatic methods to predict the target genes of microRNAs. The algorithm adds empirical rules to increase the weight of certain significant positions in the miRNA. More recently (Betel et al, 2010), a machine learning method (mirSVR) was integrated with miRanda to predict the extent of downregulation of a specific mRNA by a given miRNA (mirSVR score). This method is reported to be capable of finding non-canonical and non-conserved miR target sites. The miRanda code is obtained from the miRanda website and is used to compute the miRNA targets for the current version (18) of miRBase, using strict mode and the default cutoff score (140).
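
As a rough illustration of the run described above, the Python sketch below assembles a miRanda command line with strict mode and a score cutoff of 140. It assumes that a `miranda` executable is installed and on the PATH; the three file names are hypothetical placeholders, and this is not the exact invocation used to build targetHub.

```python
import subprocess


def build_miranda_command(mirna_fasta, utr_fasta, out_file, score_cutoff=140):
    """Return the argument list for a strict-mode miRanda run."""
    return [
        "miranda", mirna_fasta, utr_fasta,  # query miRNAs, then reference 3'UTRs
        "-strict",                          # strict seed-pairing mode
        "-sc", str(score_cutoff),           # score threshold
        "-out", out_file,                   # write results to this file
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = build_miranda_command("mature_hsa.fa", "utr3_hsa.fa", "miranda_hits.txt")
    print(" ".join(cmd))
    # To actually run the prediction (requires miRanda to be installed):
    # subprocess.run(cmd, check=True)
```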

Implementation

Database

targetHub is built using CouchDB. CouchDB is an open source, non-relational, document-oriented database system. CouchDB’s built-in web administration console communicates with the database using HTTP requests. The RESTful JSON API provided by CouchDB makes the database accessible from any environment that can issue HTTP requests (see examples). The database "design document" (which serves as the equivalent of a database schema in a relational database) is available here in JSON format.
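
Because every query is an HTTP GET against a CouchDB view, a client only needs to build a URL and decode the JSON reply. The Python sketch below shows one way to do this. The database name (tarhub), design document (basic) and view name (by_geneIDmethod) are taken from the code examples later in this document; HOSTNAME is left as a placeholder, and the helper names view_url and fetch_rows are assumptions made for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def view_url(host, database, view, key):
    """Build the URL of a CouchDB view query for a single key."""
    # CouchDB expects the key as JSON text, so serialize it and then
    # percent-encode it (this also encodes '+' as %2B).
    encoded_key = quote(json.dumps(key, separators=(",", ":")), safe="")
    return f"{host}/{database}/_design/basic/_view/{view}?key={encoded_key}"


def fetch_rows(url):
    """Request a view and return the list under "rows" (no error handling)."""
    with urlopen(url) as response:
        return json.load(response)["rows"]


if __name__ == "__main__":
    # HOSTNAME is a placeholder for the targetHub server address.
    print(view_url("http://HOSTNAME", "tarhub", "by_geneIDmethod",
                   ["7157", "miranda+targetscan"]))
```

Serializing the key with a JSON library, rather than concatenating strings, keeps the quoting correct for both string and numeric key components.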

Interface

The gene-based search interface for the website is built using geneSmash, a gene-centric CouchDB database. The geneSmash web service converts any of the supported input gene identifier types to an Entrez Gene identifier. The Entrez Gene identifiers are passed to targetHub to retrieve the relevant miRNA-target interactions. Queries with miRNA identifiers are passed directly to targetHub.

Replication

CouchDB provides native support for database replication. You can use this facility to make (and maintain) a local copy of the targetHub database. If you want an identical interface, you should also replicate geneSmash along with targetHub. If you replicate the database, we request that you maintain links to targetHub, geneSmash and the University of Texas M.D. Anderson Cancer Center.
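
Replication is normally started by sending a JSON document to the `_replicate` endpoint of the CouchDB server that should hold the copy. The Python sketch below only prepares such a request. The local server address and the database URLs are placeholders, and a real server will usually also require administrator credentials, which are omitted here.

```python
import json
from urllib.request import Request, urlopen


def replication_request(local_couch, source_db_url, target_db_url):
    """Prepare a POST to /_replicate that copies source into target."""
    body = {
        "source": source_db_url,
        "target": target_db_url,
        "create_target": True,  # create the local database if it is missing
    }
    return Request(
        f"{local_couch}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder addresses: HOSTNAME stands for the targetHub server.
    req = replication_request("http://localhost:5984",
                              "http://HOSTNAME/tarhub",
                              "http://localhost:5984/tarhub")
    print(req.get_method(), req.full_url)
    # To start the replication (requires a running local CouchDB):
    # urlopen(req)
```

The same request with the geneSmash database as the source would copy geneSmash alongside targetHub.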

Web Site

Search

targetHub can be searched for miRNA-target relationships using either a miRBase miRNA identifier (stem-loop or mature) or any of the gene identifier types supported by geneSmash. Any one of the following identifiers can be used to search targetHub:

miRBase Identifier [stem-loop miRNA]
mature miR Identifier
HUGO Gene Symbol
Gene Symbol Alias
Ensembl Identifier
Entrez Gene Identifier
NCBI Unigene Identifier
Human Gene Expression Array Identifiers of
- Affymetrix
- Illumina
- Agilent

Download

The search results can be downloaded as a tab-delimited (TSV) file. Each record in the TSV file represents a miRNA-target interaction predicted or reported by a specific method (e.g. TargetScan). The first six columns in the file represent a generic miRNA-target interaction in targetHub. The last four columns are specific to the method used to derive the miRNA-target interaction. The following table describes the last four columns for each method in targetHub.

Method	Param1	Param2	Param3	Param4
TargetScan	Context Score	Aggregate Pct	Total Conserved Sites	Representative miRNA
miRTarBase	Experiment Type	Evidence Level	InteractionID	PubmedID
miRanda	Score	Energy	Transcript Location	Position
picTar (4 & 5)	Score	Position	miRNA	-
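
When a downloaded file is processed in a script, the table above can be turned into a lookup that names the last four fields of each record. The Python sketch below is an illustration under stated assumptions: the function name label_method_columns is introduced here, and the caller must supply the method of the record, since the exact spelling of method names and the layout of the first six columns are not described above.

```python
# Names of the last four columns for each method, copied from the table above.
# picTar has only three method-specific fields, so the fourth is None.
METHOD_COLUMNS = {
    "targetscan": ("Context Score", "Aggregate Pct",
                   "Total Conserved Sites", "Representative miRNA"),
    "mirtarbase": ("Experiment Type", "Evidence Level",
                   "InteractionID", "PubmedID"),
    "miranda": ("Score", "Energy", "Transcript Location", "Position"),
    "pictar": ("Score", "Position", "miRNA", None),
}


def label_method_columns(method, fields):
    """Name the last four fields of one TSV record according to its method."""
    key = method.lower()
    if key.startswith("pictar"):  # covers picTar4 and picTar5
        key = "pictar"
    names = METHOD_COLUMNS[key]
    values = fields[-4:]
    return {name: value for name, value in zip(names, values) if name is not None}


if __name__ == "__main__":
    # Made-up record: six generic columns followed by four method columns.
    record = "c1\tc2\tc3\tc4\tc5\tc6\t150\t-20.5\tloc\t120".split("\t")
    print(label_method_columns("miRanda", record))
```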

Illustration

After a search by stem-loop miRNA or by gene, the web site illustrates the targets predicted by the methods described above with a Venn diagram. The predicted targets are summarized for the given stem-loop miRNA or gene. miRNA-gene interactions defined by the two different mature miRNAs (3p and 5p) of the same stem-loop miRNA are considered different interactions for this count. A search by mature miRNA does not generate an illustration, as multiple stem-loop miRNAs can be associated with one mature miRNA.
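
The counts behind such a diagram can be sketched with plain set operations. The Python example below is not the code used by the web site; it assumes each interaction is represented as a (mature miRNA, gene) pair, so that 3p and 5p interactions with the same gene are counted separately, and the sample data are made-up placeholders.

```python
from itertools import combinations


def venn_region_counts(targets_by_method):
    """Count the interactions found by exactly each combination of methods."""
    methods = sorted(targets_by_method)
    counts = {}
    for size in range(1, len(methods) + 1):
        for combo in combinations(methods, size):
            # Interactions reported by every method in the combination ...
            inside = set.intersection(*(targets_by_method[m] for m in combo))
            # ... minus those also reported by any method outside it.
            others = [targets_by_method[m] for m in methods if m not in combo]
            outside = set.union(*others) if others else set()
            counts[combo] = len(inside - outside)
    return counts


if __name__ == "__main__":
    # Placeholder interactions: (mature miRNA, gene) pairs.
    example = {
        "miranda": {("miR-X-3p", "geneA"), ("miR-X-5p", "geneA")},
        "targetscan": {("miR-X-3p", "geneA")},
    }
    print(venn_region_counts(example))
```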

Using targetHub in other programs

The following code example shows how to use targetHub to extract the miRNAs targeting a gene, as predicted by specific methods, in the Perl programming environment.
```
use LWP::Simple;   # provides get() for simple HTTP requests
use JSON;          # provides decode_json()
sub getTargetingMirna {
  my ($EntrezGeneID, $method, $couchdbView, $dataLink, $tData, %tData, @targetData, @targetList);
  $EntrezGeneID = $_[0]; $method = $_[1];
  $couchdbView = "HOSTNAME/tarhub/_design/basic/_view/by_geneIDmethod";
  $method =~ s/\+/%2B/g;            # URL-encode '+' so it is not read as a space
  $dataLink = $couchdbView . '?key=["' . $EntrezGeneID . '","' . $method . '"]';
  $tData = get $dataLink;           # HTTP GET; returns undef on failure (not checked here)
  $tData = decode_json $tData;      # JSON text -> hash reference
  %tData = %$tData;
  @targetData = @{$tData{"rows"}};  # array of row hashes
  for my $row (@targetData) {
    # One tab-separated line per interaction: document id, then view value
    push @targetList, $row->{"id"} . "\t" . $row->{"value"} . "\n";
  }
  return @targetList;
}
my @targetList = getTargetingMirna("7157", "miranda+targetscan");
print @targetList;
```
Some notes on the code:
- The top two lines load the Perl packages LWP::Simple and JSON, which are used to handle the HTTP request and to convert JSON objects to Perl data structures, respectively. The first two lines of the getTargetingMirna function declare and initialize variables. The next three lines use the Entrez Gene identifier and prediction method arguments to construct an appropriate URL to query targetHub. The next line actually makes the HTTP request to the targetHub server. The following three lines convert the JSON response into a Perl array of hashes. The final lines extract the relevant part of the response.
- This version of the code does not perform error checking on the result, so it should not be used as-is in production code. Failure can occur because the server is not available, because the gene identifier passed as an argument is not a valid Entrez Gene identifier, or because no known miRNA-target interaction is available for the gene; all three conditions should be checked.
The following code example shows how to use targetHub to extract the miRNAs targeting a gene by evidence count in the R statistical programming environment.
```
library(RJSONIO)
getTargetingMirna <- function(sym, evidence_count) {
  giUrl <- "HOSTNAME/genesmash/_design/basic/_view/by_symbol"
  glink <- paste(giUrl, "?key=\"", sym, "\"", sep='')
  gene_data <- fromJSON(paste(readLines(glink), collapse=''))
  target_Info <- NA  # returned as-is when no gene or no interaction is found
  if (length(gene_data[["rows"]]) != 0) {
    GeneID <- as.character(gene_data$rows[[1]]["id"])
    # Extract miRNA-target interactions for the gene from targetHub
    diUrl <- 'HOSTNAME/tarhub/_design/basic/_view/by_geneIDcount'
    link <- paste(diUrl, '?key=["', GeneID, '",', evidence_count, ']', sep='')
    JSON_data <- paste(readLines(link), collapse='')
    target_data <- fromJSON(JSON_data)
    if (is.list(target_data) && (length(target_data$rows) > 0)) {
      target_data <- target_data$rows
      # One row per interaction: document id and the corresponding view value
      target_Info <- matrix(nrow = length(target_data), ncol = 2)
      colnames(target_Info) <- c("miRNA-gene_interaction", "corresponding_mature_miR")
      for (i in seq_along(target_data)) {
        target_Info[i, 1] <- unlist(target_data[[i]]$id)
        target_Info[i, 2] <- unlist(target_data[[i]]$value)
      }
    }
  }
  target_Info
}
getTargetingMirna("TP53", 2)
```
Some notes on the code:
- The first line loads an R package that converts between JSON objects and R objects. The first two lines of the function getTargetingMirna use the symbol argument to construct an appropriate URL. The next line actually makes the HTTP request to the geneSmash server and converts the JSON response into an R object. This step converts an alternate gene identifier to an Entrez Gene identifier. The Entrez Gene identifier obtained from geneSmash is then used to retrieve miRNA-target data with a similar HTTP request to targetHub. The final lines extract the relevant part of the response.
- This version of the code does not perform error checking on the result, so it should not be used as-is in production code. Failure can occur because the server is not available, because the symbol passed as an argument is not a valid HUGO symbol, or because no known miRNA-target interaction is available for the gene; all three conditions should be checked.
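
As a rough illustration of the checks recommended in the notes, the Python sketch below separates the three failure conditions. The names TargetHubError, rows_from_response and fetch_rows are assumptions introduced here; the sketch relies only on the "rows" list that the examples above read from each view response, and treats an empty list as "nothing found".

```python
import json
from urllib.request import urlopen


class TargetHubError(Exception):
    """Raised when a lookup cannot return any rows."""


def rows_from_response(text, what):
    """Decode a view response and insist on at least one row."""
    try:
        data = json.loads(text)
    except ValueError as err:
        raise TargetHubError(f"{what}: response is not valid JSON") from err
    rows = data.get("rows") if isinstance(data, dict) else None
    if not rows:
        # An empty geneSmash result suggests an unknown identifier; an empty
        # targetHub result suggests no known interaction for the gene.
        raise TargetHubError(f"{what}: no matching rows")
    return rows


def fetch_rows(url, what):
    """Request a view, reporting an unreachable server as TargetHubError."""
    try:
        with urlopen(url, timeout=30) as response:
            text = response.read().decode("utf-8")
    except OSError as err:  # URLError and HTTPError are subclasses of OSError
        raise TargetHubError(f"{what}: server not reachable") from err
    return rows_from_response(text, what)


if __name__ == "__main__":
    try:
        rows_from_response('{"rows": []}', "gene lookup")
    except TargetHubError as err:
        print(err)
```

Passing a different label for the geneSmash call and the targetHub call lets the caller tell an invalid identifier apart from a gene with no recorded interactions.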