****************************************************** Overview: BioProspector Release 2. Feburary 2004 A DNA seqence motif finding program. Reference Liu X, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001;:127-38. BioProspector is free to academia. Please do not distribute the executable of the program to others without a license or consent of the owners. For details about obtaining this program, please refer to: http://motif.stanford.edu/distributions/ ****************************************************** Start Using BioProspector: Once you download and unzip BioPropsector.2004.zip, you will see several files: BioPropsector.xxx (xxx the platform of BioPropsector executables) genomebg.xxx (genomebg can calculate the genome base distribution) BioProspector.2004.README ConvertFASTA.pl First make the program run, say you are under a Mac OSX xterm, type: mv BioProspector.mac BioPropsector chmod 555 BioProspector If you need to run genomebg.xxx under Linux, do: mv genomebg.linux genombg chmod 555 genomebg Note: if you need to run it under windows with Cygwin, you need to do: mv BioProspector.cygwin BioProspector.exe Now you can run the program by typing: ./BioProspector As you see, typing the program without any parameters gives you a short list on how to set parameters. ****************************************************** Input Sequences: The following shows the details about parameter specification: Usage: ./BioProspector -i seqfile (options) The input sequence should be in a plain text file, not in word or html. Right now, the program only recogonize restricted FASTA format: >sequence1 name ATGGTGACGAC sequence1 as ONE LINE >sequence2 name GTAGCCTCATG sequence2 as ONE LINE If you have the following sequence format: >seq1 ATCTGAATGCA AGCTGCACACGTTTTT CAGATAAA >seq2 AGTCAGACTACA AGCTGCACACTTTT CAGATAAA >seq3 1 AGTCG GTCAC GCACG CACAC 20 21 CGATT CAAAT TGTGA CGACG 40 You can use ConvertFASTA.pl to convert your input into the format BioProspector takes. To run this program, type: perl convertFASTA.pl < input > input.new input.new will have the sequences converted into the right format. ****************************************************** Description of optional parameters -W For one-block motif or the first block of a two-block motif. Allowed -W range is [4, 50]. -o -f Precomputed background distribution (to speed up the program), which can be obtained by running the included program genomebg. Included are pre-computed yeast whole genome and yeast intergenic sequence distribution. To run genomebg, type: ./genomebg -i inputSequenceFile -o outputDistributionFile InputSequenceFile contains whole genome (or just intergenic) sequences in the same restricted FASTA format. The outputDistributionFile is the one you can use on BioProspector by specifying -f. -b If you want to use another sequence file which contains sequences that represent the background. Should be the same format as seqfile. -n BioProspector reinitializes the program for certain number of times to avoid local maxima, and only reports the top ones among the -n tries. Allowed -n range is (0, 200]. -r The best -r motifs among the -n try are reported at the end. Allowed -r range is (0, n]. -w Do not specify this for one-block motifs. Allowed -w range is [4, 50]. -p 1 [if two-block motif is palindrome (default 0)] Palindromes are two-block motifs, and the two blocks should set to be the same length. Otherwise, BioProspector uses the average of blk1 and blk2 width as the width for both motif blocks. If a motif of wid is a one block palindrome (e.g. ATTGGATCCAAT), then specify -W and -w to be each of wid/2 long with -G and -g to be 0. -G This should only be set if -W and -w are both set (i.e, only set this for two-blk motifs). Allowed -G range is [0, 50]. -g This should only be set if -W and -w are both set. Allowed -G range is [0, 50]. Also (G - g) should be smaller than 20. E.g. -G 50 -g 35 is OK, but -G 30 -g 5 is not. -d 1 [if only need to examine forward (default 2)] BioProspector automatically checks both forward and backward strands, so only specify this if you want the program the check only the forward strand. -a 1 [if every sequence contains the motif (default 0)] -h 1 [if want more degenerate sites (default fewer sites)] If the motif is very degenerate and you want to get more sites, then specify "-h 1 " which will do a final refinement which allows more generate sites to be put in the motif. Default is motif finding without refinement step. -e If you expect to see one site every 1000 bases, then you can specify -e 1000. Of course, most of the time we don't know how frequent the motif occurs (we don't even know what it is), in which case don't specify -e at all. Example 1 ./BioProspector -i inseq -b backseq -o out & ==> find motif of width 10 from inseq (restricted FASTA). Search motifs from both DNA strands. Use sequences in backseq as the background base distribution. Output the best 10 motifs to to out and run the program in background (&). You could start checking result (or progress of the program) with "more out" or "less out" while program is still running. Example 2 ./BioProspector -i inseq -W 15 -f yeast_all.bg -n 100 -r 20 -d 1 -a 1 -h 1 -o out ==> find motif of width 15 from sequence file inseq (restricted FASTA), use yeast genome as the background distribution and search forward strand. Try motif finding 100 times, and report the best 20. Every sequence should contain at least one site for each motif, and automatically refine the motif after convergence. Print the output in file out. ****************************************************** Output format: 1. BioProspector tries -n times (re-initialize) to look for motifs, and prints out the motif found at each try in the following format: Try #, motif score, consensus of blk1, reverse consensus of blk1, consensus of blk2, reverse consensus of blk2, total number of aligned sites. If you specified a motif refinement, then after each Try, a second line Ref # is also printed. 2. The highest -r motifs among -n tries. 1) Motif width (blk 1, blk 2); Gap range [minGap, maxGap]; Motif score; number of aligned sites. 2) Motif probability matrix (one line per motif column), Consensus, Reverse Consensus, Degenerate, Reverse degenerate, all in transfac format. 3) Sequence alignment in fASTA format: sequence name, length, site number in the sequence, starting alignment position of site (r 53 means position 53 (from left to right) in reverse direction, f 47 is forward direction position 47). Based on the sequence length, and site starting position relative to its beginning (5' end), you can easily calculate the site distance to its end (3'end, which might be transcription start site), and the sequence of the aligned site. You can cut out this alignment and paste it into: http://weblogo.berkeley.edu/logo.cgi to get the sequence logo of the motif. ****************************************************** Contact information: For questions, requests and bug report: please contact X. Shirley Liu at xsliu@jimmy.harvard.edu. ******************************************************