We developed CompareProspector to use comparative genomics information in sequence motif finding. CompareProspector is built upon BioProspector (Liu X et al., 2001), which is an extension of the original Gibbs sampler (Liu JS et al., 1995) with improved flexibility and performance.
CompareProspector takes as input a list of sequences from one species that are predicted to share common regulatory element(s). Such sequences can be obtained from high-throughput genomics techniques such as gene expression profile clustering or chromatin immunoprecipitation followed by microarray (ChIP-chip). It also takes as input a list of percent identity values representing the cross-species conservation of each nucleotide.
In the Gibbs sampling iterations, CompareProspector biases motif finding towards sequences conserved across species. First, the user can specify two thresholds on the percent identity (WPID) values: Tch (high conservation threshold) and Tcl (low conservation threshold). In BioProspector, a site score Ax is calculated for every site x in the input sequence as the ratio of the probability of generating x from the motif model to the probability of generating x from the background distribution, and a new site is sampled with probability proportional to Ax. In CompareProspector, during the initial iterations of Gibbs sampling, only positions whose WPID values are above Tch are sampled. Subsequently, the WPID cutoff is gradually decreased from Tch to Tcl to allow sampling of less conserved positions. In addition, the new site score A'x is weighted by sequence conservation (A'x = Ax × WPIDx, where WPIDx is the WPID of site x) to favor sampling of more conserved sequences. Sequences without orthologs are assigned Tcl as the WPIDx for all x, so they participate in sampling only in later iterations. Finally, in the original BioProspector, sites with a high enough score Ax are automatically added to the motif without sampling. CompareProspector restricts automatic additions to sites whose WPIDs are above Tch. This step further down-weights the influence of divergent sites and of sequences without orthologs.
The output of CompareProspector includes a list of the highest-scoring motifs as position-specific probability matrices, the individual sites used to construct each motif, and the locations of the sites on the input sequences.
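The conservation-weighted sampling step described above can be illustrated with a minimal Python sketch. It is not the CompareProspector implementation. The function names, the linear cutoff schedule, the `n_strict` and `score_threshold` parameters, and the use of WPID values as fractions between 0 and 1 are all assumptions made for illustration. The text says only that the cutoff is "gradually decreased". The sketch also treats a position as eligible when its WPID is at or above the cutoff, so that sequences assigned Tcl can be sampled once the cutoff reaches Tcl.

```python
import random


def wpid_cutoff(iteration, n_iterations, t_high, t_low, n_strict=0):
    """Hypothetical schedule: hold the cutoff at Tch for n_strict iterations,
    then lower it linearly so that it reaches Tcl on the last iteration."""
    if iteration < n_strict:
        return t_high
    remaining = n_iterations - 1 - n_strict
    if remaining <= 0:
        return t_low
    frac = (iteration - n_strict) / remaining
    return t_high - frac * (t_high - t_low)


def weighted_scores(site_scores, wpid, cutoff):
    """Return A'x = Ax * WPIDx for eligible sites and 0 for the rest.
    Assumption: a site is eligible when WPIDx >= cutoff."""
    return [a * w if w >= cutoff else 0.0 for a, w in zip(site_scores, wpid)]


def sample_site(site_scores, wpid, cutoff, rng=random):
    """Sample a site index with probability proportional to A'x.
    Returns None when no site passes the current cutoff."""
    weights = weighted_scores(site_scores, wpid, cutoff)
    if sum(weights) == 0:
        return None
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def auto_add_allowed(site_score, site_wpid, score_threshold, t_high):
    """Automatic addition without sampling: the site needs a high score Ax
    and a WPID that reaches Tch. score_threshold is a hypothetical parameter."""
    return site_score >= score_threshold and site_wpid >= t_high


if __name__ == "__main__":
    # Toy values: three candidate sites in one sequence.
    scores = [2.0, 5.0, 1.0]   # Ax from the motif/background ratio
    wpid = [0.9, 0.5, 0.7]     # per-site conservation as fractions
    rng = random.Random(0)
    for it in range(5):
        cutoff = wpid_cutoff(it, 5, t_high=0.8, t_low=0.4)
        print(it, round(cutoff, 2), sample_site(scores, wpid, cutoff, rng))
```

In this toy run only the first site is eligible at the start, and the less conserved sites become candidates as the cutoff drops. A sequence without orthologs would be represented by setting every entry of its `wpid` list to Tcl.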
Liu, J. S., Neuwald, A. F. and Lawrence, C. E. (1995). "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies." Journal of the American Statistical Association 90: 1156-1170.
Liu, X., Brutlag, D. L. and Liu, J. S. (2001). "BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes." Pac Symp Biocomput: 127-138.