**Overview**

We developed CompareProspector to take advantage of comparative genomics information to aid sequence motif finding. CompareProspector is built upon BioProspector (Liu X et al, 2000), which is an extension of the original Gibbs sampler (Liu JS et al, 1995) with improved flexibility and performance.

CompareProspector takes as input a list of sequences from one species that are predicted to share one or more common regulatory elements. Such sequences can be obtained from high-throughput genomics techniques such as gene expression profile clustering or chromatin immunoprecipitation followed by microarray (ChIP-chip). It also takes as input a list of percent identity (WPID) values representing the cross-species conservation of each nucleotide.

In the Gibbs sampling iterations,
CompareProspector biases the motif finding towards sequences conserved across
species. First of all, the user can specify two WPID thresholds, T_{ch}
(high conservation threshold) and T_{cl} (low conservation threshold).
In BioProspector, a site score *Ax* is calculated for every site *x*
in the input sequence as the ratio of the probability of generating *x*
from the motif model over the probability of generating *x* from the
background distribution. A new site is sampled with probability proportional to
*Ax*. In CompareProspector, during initial iterations of Gibbs sampling,
only positions whose WPID values are above T_{ch} are sampled.
Subsequently, the WPID cutoff is gradually decreased from T_{ch} to T_{cl}
to allow sampling of less conserved positions. The new site score *A'x* is
weighted by sequence conservation (*A'x* = *Ax* × *WPIDx*, where *WPIDx* is the
WPID of site *x*) to favor sampling of more conserved sequences. Sequences
without orthologs are assigned T_{cl} as the *WPIDx* for all *x*,
so they only participate in sampling
in later iterations. Finally,
in the original BioProspector, sites with a high enough score *Ax* are
automatically added to the motif without sampling. CompareProspector restricts
automatic additions to sites whose WPIDs are above T_{ch}.
This step further down-weights the influence of divergent sites and sequences
without orthologs. The output of CompareProspector includes a list of
highest-scoring motifs as position-specific probability matrices, the individual
sites used to construct each motif, and the locations of the sites on the input
sequences.
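
The conservation-biased sampling described above can be illustrated with a minimal Python sketch. This is not the CompareProspector source code: the function names, the `n_initial` and `n_total` iteration counts, the `score_threshold` parameter, and the linear form of the cutoff schedule are all assumptions made for illustration (the text only says the cutoff is "gradually decreased"). The sketch also treats "above" the cutoff as inclusive, so that sequences assigned T_{cl} become eligible once the cutoff reaches T_{cl}.

```python
import random


def wpid_cutoff(iteration, n_initial, n_total, t_ch, t_cl):
    """WPID cutoff for a given iteration (hypothetical schedule).

    The cutoff stays at t_ch for the first n_initial iterations, then
    decreases linearly to t_cl by the last iteration. The linear shape
    is an assumption, not taken from the source.
    """
    if iteration < n_initial:
        return t_ch
    if n_total - 1 <= n_initial:
        return t_cl
    frac = (iteration - n_initial) / (n_total - 1 - n_initial)
    return t_ch - frac * (t_ch - t_cl)


def weighted_scores(site_scores, wpids, cutoff):
    """Conservation-weighted scores A'x = Ax * WPIDx.

    Positions whose WPID is below the current cutoff get weight 0,
    so they cannot be sampled in this iteration.
    """
    return [a * w if w >= cutoff else 0.0 for a, w in zip(site_scores, wpids)]


def sample_site(site_scores, wpids, cutoff, rng=random):
    """Sample a site index with probability proportional to A'x.

    Returns None when no position passes the cutoff.
    """
    weights = weighted_scores(site_scores, wpids, cutoff)
    if sum(weights) == 0:
        return None
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def auto_add_sites(site_scores, wpids, score_threshold, t_ch):
    """Indices of sites added to the motif without sampling.

    A site qualifies only if its score Ax reaches score_threshold
    (a hypothetical parameter) and its WPID passes t_ch.
    """
    return [
        i
        for i, (a, w) in enumerate(zip(site_scores, wpids))
        if a >= score_threshold and w >= t_ch
    ]
```

For a sequence without an ortholog, `wpids` would be a list filled with T_{cl}, so `sample_site` returns `None` for that sequence until the cutoff has dropped to T_{cl}, and `auto_add_sites` never selects any of its sites when T_{ch} is greater than T_{cl}.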

