The sections below give detailed instructions on how to use the programs that make up Scansite.
Motif Scan of a Protein from a Public Database
This program will scan one protein for all of the motifs in Scansite (or a subset of them, at your choosing). To scan your protein of interest using its entry in public databases (Swiss-Prot, TrEMBL, Genpept, Ensembl), you will need its accession number or ID in that database.
In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.
Under the "Motif Scan" heading, click "Scan a Protein by Accession Number or ID". You will see the Motif Scan input page as shown in Figure 2.
Select which public database you will be accessing (Swiss-Prot, TrEMBL, Genpept, or Ensembl) from the drop-down box.
Choose which motifs in Scansite's database to scan. To scan for all motifs, click the checkbox labeled "Look for all motifs". To scan only for motifs you specify, click the checkbox labeled "Look only for motifs and groups selected below". Select one or more items in the "Individual Motifs" list, and/or one or more items in the "Motif Groups" list.
Choose the stringency level desired: high, medium, or low. This sets how well a sequence must score relative to all subsequences that match the motif within the entire vertebrate collection of Swiss-Prot proteins. High stringency indicates that the motif identified in the query sequence is within the top 0.2% of all matching sequences contained in vertebrate Swiss-Prot proteins. Medium and low stringency correspond to the top 1% and 5% of sequence matches, respectively. Sites identified under high-stringency scoring are likely to be correct, though some real sites may fail to be identified (i.e., the false-negative rate is non-zero). In contrast, medium- and low-stringency scoring have a much lower false-negative rate but tend to over-call motif sites, producing more false-positive hits.
To show domains recognized in your sequence, check the box labeled "Show predicted domains in sequence". Otherwise, uncheck it.
Click "Submit Request".
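The stringency cutoffs described in the steps above can be pictured with a short Python sketch. It is only an illustration of the 0.2%, 1%, and 5% thresholds, not Scansite's own code; the function names are hypothetical, and treating the cutoffs as inclusive is an assumption.

```python
# Cutoffs taken from the text: top 0.2 %, 1 % and 5 % of matching sequences.
# Assumption: a site exactly at a cutoff still passes (inclusive comparison).
STRINGENCY_CUTOFFS = {"high": 0.2, "medium": 1.0, "low": 5.0}


def passes_stringency(percentile, level):
    """Return True if a site ranked in the top `percentile` percent passes `level`."""
    return percentile <= STRINGENCY_CUTOFFS[level]


def levels_reporting(percentile):
    """List every stringency level at which a site would be reported."""
    return [lvl for lvl in ("high", "medium", "low")
            if passes_stringency(percentile, lvl)]


print(levels_reporting(0.15))  # reported at all three levels
print(levels_reporting(3.0))   # reported only at low stringency
```

A site that passes at high stringency also passes at medium and low, which is why lowering the stringency can only lengthen the list of reported sites.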
Figure 3 shows an example of the output. Your protein is drawn as a thin rectangle. If any sites were found, they are marked above the rectangle with a short-hand name of the domain type (such as "Y_Kin", "SH2", or "PDZ"). If you requested the domains in your sequence to be shown, they will be marked as colored boxes with their names and residue ranges annotated below the rectangle. If phosphorylation and domain-binding sites already known from the literature are present, these will be marked below the domain names in a row labeled "Mapped sites" (none are present in this example). Further down, a plot of the surface accessibility indicates residues that are likely to be near the protein surface and thus able to interact with other proteins. At the bottom, a simple ruler indicates every hundredth position in the input protein sequence.
Below the protein image is a table listing the details of the sites found (see Figure 4). Similar motifs are grouped together (for example, all tyrosine-kinase domains). The table indicates the motif name and Gene Card (if one exists) for each site found. The next line lists each site found for that motif, with its score, the percentile that the site's score falls into compared with all vertebrate proteins in Swiss-Prot, the sequence surrounding that site, and the solvent accessibility at that position. Clicking on the Gene Card takes you to that entry on the Gene Card site (25). Clicking on the near-site sequence displays the full protein sequence with the site location highlighted. Clicking on the score displays a histogram showing where this score falls in the distribution of all vertebrate Swiss-Prot proteins that have been scored for this motif.
Notes
Most of the program execution time for Motif Scan is spent retrieving the Pfam file for protein domains from the Pfam server in St. Louis. If you do not need domain information and have many proteins to scan, you can save time by turning off the "show domains" option.
If motifs you expect are not found in your protein at Scansite's default "High" threshold setting, try using the "Medium" or "Low" settings. The "Low" setting often overwhelms the graphical display unless you are scanning with a small number of selected motifs. However, even if the graphical display looks cluttered, the table of results is always easily readable (just longer).
Scansite scores are ranked on a 0 to ∞ scale, where 0 means a protein sequence perfectly matches the optimal binding pattern, and larger numbers indicate progressively poorer matches to the optimal consensus sequence. Lower scores in the output are thus better matches. (In the matrices and during early parts of program execution, higher scores are better, so you should still use higher numbers in matrices to indicate high affinities.)
Clicking on the 15-residue sequence displayed in the results table shows the position of the site within the full protein sequence. The page generated also lets you submit this 15-mer peptide to BLAST, so you can check whether the site's sequence is conserved in organisms expected to be physiologically similar to the source of this hit.
If the sites found by Motif Scan seem believable, you can use the motifs for those sites to search databases for other hits. In favorable cases, this can allow you to piece together parts of a pathway, if the interacting parts of different proteins can be connected.
|
If your protein is not in the public databases (or at least not in the ones Scansite uses), you can use this program to enter your sequence directly. Apart from the input method, it is identical to the previous program. To scan your protein by copying its sequence directly into Scansite, you will need a text file containing this sequence. It can be in any text format. If numbers, spaces, or other invalid characters are included, Scansite will remove them.
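If you would rather tidy a sequence yourself before pasting it in, the kind of clean-up described above can be sketched in Python. This is an illustration, not Scansite's own code: the function name is hypothetical, and it assumes that only the 20 standard one-letter amino acid codes count as valid characters. It does not treat FASTA header lines specially, so remove any header line first.

```python
import re

# Assumption: the 20 standard one-letter amino acid codes are the valid characters.
VALID_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def clean_sequence(raw):
    """Upper-case the input and drop every character that is not a standard residue."""
    return re.sub(f"[^{VALID_RESIDUES}]", "", raw.upper())


# A sequence pasted from a numbered, space-separated listing.
raw = """  1 MKTAYIAKQR QISFVKSHFS
        21 RQLEERLGLI"""
print(clean_sequence(raw))
```

Comparing the cleaned length with the length you expect is a quick way to confirm that nothing but formatting was removed.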
In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.
Under the "Motif Scan" heading, click "Scan a Protein by Input Sequence". You will see the Motif Scan input page shown in Figure 5.
Enter the protein name in the text box labeled "Protein Name".
Open the text file containing your protein sequence and copy it into the text box labeled "Sequence".
Choose which motifs in Scansite's database to scan. To scan for all motifs, click the checkbox labeled "Look for all motifs". To scan only for motifs you specify, click the checkbox labeled "Look only for motifs and groups selected below". Select one or more items in the "Individual Motifs" list, and/or one or more items in the "Motif Groups" list.
Choose the stringency level desired: high, medium, or low. This sets how high a sequence must score to be reported.
To show domains recognized in your sequence, check the box labeled "Show predicted domains in sequence". Otherwise, uncheck it.
Click "Submit Request".
The output will resemble that in Figures 3 and 4 as before. Some differences are that a description of the protein is no longer shown, and no mapped sites will be displayed even if some sites have been mapped for your protein, because this information cannot easily be inferred from the input sequence alone.
|
This program searches all proteins in a selected database for matches to a Scansite motif.
In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.
Under the "Database Search" heading, click "Search Using a Scansite Motif". You will see a page of options as shown in Figure 6.
In the list box labeled "Select Motif to Use", scroll through the list of available motifs and select the one you want to search with.
Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).
If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 6 through 12 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 13.
In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.
To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.
For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.
If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.
For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.
To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.
To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.
Now that all the restriction options have been specified (or no restrictions at all, if you so chose), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.
Click on "Submit" to start the search.
This program can take several minutes to run if you select the larger databases (TrEMBL, Genpept). When it finishes, you will see output like that shown in Figure 7. The proteins are sorted by score, but you can view them sorted by molecular weight or isoelectric point by clicking the "Sort by Molecular Weight" or "Sort by Isoelectric Point" links, which are near the top of the page. For each protein retrieved, its score, ID or accession number, description, site position, site sequence, molecular weight, and isoelectric point are shown. Clicking the score will show a histogram of how good this score is relative to all proteins that were scored in the selected database or database subset that you searched. Clicking the ID or accession number will take you to this protein's entry in the database that was searched. Lastly, clicking the small "Submit" button on the left of any entry will submit it to Scansite's Motif Scan program described earlier.
Notes
The search options in this and the other Scansite database programs are intended to address common problems with database searches. For many searches, you may only be interested in matches from humans or a model organism. Restricting your search to proteins from a single species is done by entering that species name in the "Single species" text box. Scansite uses a MySQL database, and the regular expression syntax supported by MySQL allows certain helpful wildcards. For example, if you're tired of writing out "Caenorhabditis elegans", you can use "C.* elegans" instead. In a regular expression, the period (.) matches any single character, and the asterisk extends that match to multiple characters (or even zero characters). You could also do genus-wide searches, by entering "Rattus" for example. If you try doing that with "Mus", you will accidentally match "Thermus aquaticus" as well, but you can avoid that by entering "^Mus". The caret symbol (^) requires the text to match at the beginning of the species name. Another pitfall to avoid is specifying an invertebrate species like "Drosophila melanogaster" when your Organism Class setting is "Mammals". You will get no hits, because no entry in Genpept has a source organism that is both a mammal and a fruit fly. (At least, the Genpept curators hope no entry like that is present.)
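The species patterns discussed above can be tried out locally before running a long search. The Python sketch below approximates the behavior with the standard `re` module rather than MySQL itself; the species list and function name are made up for illustration, and case-insensitive matching is an assumption (it is what lets "Mus" hit "Thermus").

```python
import re

# Hypothetical species names standing in for database entries.
SPECIES = [
    "Caenorhabditis elegans",
    "Mus musculus",
    "Thermus aquaticus",
    "Rattus norvegicus",
]


def matching_species(pattern, names=SPECIES):
    """Return the names in which the pattern is found anywhere (unanchored search)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


print(matching_species("C.* elegans"))  # abbreviation for Caenorhabditis elegans
print(matching_species("Mus"))          # also picks up Thermus aquaticus
print(matching_species("^Mus"))         # anchored at the start of the name
print(matching_species("Rattus"))       # genus-wide search
```

Note that an abbreviation such as "C.* elegans" would match any other species whose name fits the same shape, so it is worth checking the species column of the results.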
The molecular weight, isoelectric point, and phosphorylation options are intended for use in conjunction with two-dimensional gel electrophoresis experiments. When you find a few spots appearing reproducibly on a 2D gel under a particular test condition and not under the control, you could use Scansite to find what proteins are expected to be in that region of the 2-D gel by putting in a molecular weight range and isoelectric point. You could simultaneously constrain the species to match the cell line you used in the experiment. If it is an experiment involving possible phosphorylation events, you can see how much a putative phosphorylation would move it on the gel.
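To get a feel for how phosphorylation shifts a protein relative to a molecular weight window, here is a rough Python sketch. It assumes that each phosphorylation adds about 79.98 Da (the average mass of one HPO3 group); how Scansite itself computes the shifted weight and isoelectric point is not described here, and the protein names and weights are invented.

```python
PHOSPHATE_MASS_DA = 79.98  # assumed average mass added per phosphorylation (HPO3)


def shifted_weight(weight_da, n_phospho):
    """Molecular weight after adding `n_phospho` phosphate groups (0 to 3)."""
    if not 0 <= n_phospho <= 3:
        raise ValueError("0 to 3 phosphorylations are allowed")
    return weight_da + n_phospho * PHOSPHATE_MASS_DA


def in_weight_range(weight_da, low, high, n_phospho=0):
    """True if the (possibly phosphorylated) weight lies within [low, high]."""
    return low <= shifted_weight(weight_da, n_phospho) <= high


# Hypothetical unmodified molecular weights in daltons.
proteins = {"protA": 19900.0, "protB": 22500.0, "protC": 24950.0}

hits = [name for name, mw in proteins.items()
        if in_weight_range(mw, 20000, 25000, n_phospho=2)]
print(hits)
```

In this example two phosphorylations move protA into the 20,000 to 25,000 Da window and push protC out of it. The change in isoelectric point is not modeled in this sketch.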
The keyword search is useful primarily for searching Swiss-Prot, because of its detailed annotations. It might also be useful in Genpept if you are searching for novel proteins, in which case you could search for terms like "hypothetical".
The "Sequence contains" text field is a quick way to restrict your search to proteins containing a consensus sequence. Unlike the protocol herein titled "Database Search Using a Consensus Sequence", in this case the desired consensus sequence does not need to be part of the motif being searched for. Regular expressions can be used here too. For example, the sequence "PXXP" is represented as "P..P" in regular expression syntax. More details about regular expression options are given in the "Examples" link to the right of the "Sequence contains" field.
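The "PXXP" to "P..P" translation mentioned above can be sketched in Python as follows. The helper name and the test sequences are hypothetical, and Python's `re` module is used here only as a stand-in for the regular expression matching done on the server.

```python
import re


def consensus_to_regex(consensus):
    """Replace each 'X' (any residue) with '.', the regex single-character wildcard."""
    return consensus.upper().replace("X", ".")


pattern = consensus_to_regex("PXXP")
print(pattern)  # P..P

# Hypothetical sequences: the first contains P-A-K-P, the second does not fit.
print(bool(re.search(pattern, "MSLPAKPTRE")))
print(bool(re.search(pattern, "MSLPAPTRE")))
```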
|
Scansite's database searches can be made much more relevant to your own research by creating your own motifs. To use your own binding motif in a database search, you will need to define it in a text file and import the text file into Scansite. Detailed instructions are given on the FAQ page. As in the last program, this too searches all proteins in a database for matches to a motif.
In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.
Under the "Database Search" heading, click "Search Using an Input Motif". You will see the motif input page shown in Figure 8.
In the text box labeled "Motif Name", enter a name to identify your motif.
In the text box labeled "File of matrix values", type the location of your matrix file on your file system, or click the "Browse" button to select it from a directory listing. When finished, click the "SUBMIT" button. (See the FAQ for instructions on how to make the matrix file.) You will see the File Upload Verification page shown in Figure 9.
Your matrix will be displayed at this point. Verify that it is in the correct format (see the "Materials" section). If everything looks correct, click on "Yes, I would like to continue with this matrix." If some editing is required, click on "Yes, but I would like to edit this matrix." If it looks wrong, click on "No, I will upload the file again", and return to Step 4. If you have chosen to continue, you will see a page similar to Fig. 6, with the name of your motif displayed at the top.
Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).
If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 8 through 14 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 15.
In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.
To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.
For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.
If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.
For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.
To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.
To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.
Now that all the restriction options have been specified (or no restrictions at all, if you so chose), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.
Click on "Submit" to start the search.
The output will look like that from the previous Database Search program in Fig. 7.
Notes
Most of the difficulties encountered with this program are with making the matrix properly. See the FAQ for complete details. Briefly, to avoid the most common problems, make sure your matrix meets these criteria:
Fixed residues should have the value 21, and non-fixed residues should avoid this value.
The center position, row 8, should have a fixed residue.
The matrix should contain 15 rows of values.
For positions that you have no affinity information for, or that you know have a negligible role in selection, use 1's for all the residues. If you have included the N and C terminus columns ("$" and "*"), use 0's in those columns at this position.
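Part of the checklist above can be automated before uploading. The Python sketch below checks only the row count and the fixed center residue. It is not a Scansite tool: it assumes the matrix has already been read into a list of rows, each a dictionary from residue letter to value, because the file format itself is described in the FAQ rather than here. All names are hypothetical.

```python
FIXED_VALUE = 21   # value that marks a fixed residue, per the checklist
N_ROWS = 15        # the matrix should contain 15 rows
CENTER_ROW = 8     # 1-based index of the center position


def check_matrix(rows):
    """Return a list of problems found; an empty list means these checks pass."""
    problems = []
    if len(rows) != N_ROWS:
        problems.append(f"expected {N_ROWS} rows, found {len(rows)}")
    if len(rows) >= CENTER_ROW and FIXED_VALUE not in rows[CENTER_ROW - 1].values():
        problems.append(f"row {CENTER_ROW} has no fixed residue (value {FIXED_VALUE})")
    return problems


# A position with no affinity information: 1's for all residues.
flat_row = {aa: 1 for aa in "ACDEFGHIKLMNPQRSTVWY"}
# Center row with S and T as fixed residues.
center = dict(flat_row, S=21, T=21)
matrix = [dict(flat_row) for _ in range(7)] + [center] + [dict(flat_row) for _ in range(7)]

print(check_matrix(matrix))       # no problems expected
print(check_matrix(matrix[:14]))  # one row short
```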
Making an effective matrix can be a challenging task. You will often need to change some values to keep the resulting output reasonable. If your affinity values are from an experimental source, we recommend that your changes preserve the rank ordering of the raw values, so that your motif is as strongly grounded in experiment as possible.
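One way to confirm that an edit has kept the rank ordering of the raw values is to look for any pair of residues whose relative order has flipped. The Python sketch below takes that reading of "preserving rank order" (new ties are tolerated, inversions are not). The affinity values and function name are invented for illustration.

```python
from itertools import combinations


def preserves_rank_order(raw, edited):
    """Return True if no pair of residues swaps its relative order after editing."""
    for a, b in combinations(raw, 2):
        # A negative product means the two differences have opposite signs.
        if (raw[a] - raw[b]) * (edited[a] - edited[b]) < 0:
            return False
    return True


raw = {"D": 3.2, "E": 2.1, "K": 0.4}       # hypothetical experimental affinities
rescaled = {"D": 6.0, "E": 4.0, "K": 0.5}  # order D > E > K is kept
swapped = {"D": 2.0, "E": 4.0, "K": 0.5}   # D and E are inverted

print(preserves_rank_order(raw, rescaled))
print(preserves_rank_order(raw, swapped))
```

The check is run one matrix position at a time, since rank order only has meaning among the residues of a single row.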
Database Search Using a Consensus Sequence
If you have insufficient information to make a full matrix of binding affinities, but you have a tentative consensus sequence describing your motif, you can still put this into Scansite and search the database for matches. To use this option, just have the consensus sequence available so you can enter it.
An alternative to making your own full matrix is to specify a consensus sequence as a binding motif. Scansite will then construct a rough matrix matching the characteristics of your consensus sequence, and this can be used to search the databases. The results in this case will be less quantitative. However, many users have found this program useful for quickly finding proteins with certain sequence characteristics.
In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.
Under the "Database Search" heading, click "Search Using Quick Matrix Method for Making a Motif". You will see the input page shown in Figure 10.
In the text box labeled "Motif Name", enter a name to identify the motif that will be created.
On the page there are two rows of small text boxes, labeled as positions -7 to +7, with 0 being the required fixed position. The top row is labeled "Primary Preference", and the second row is labeled "Secondary Preference". Start by entering your fixed residue in the only box at position 0. You can use the slash (/) to enter two fixed residues, such as "S/T".
In the "Primary Preference" row (top one), enter the residues of your consensus sequence in their position relative to your fixed residue. You can use the slash (/) to enter two residues, such as "D/E". Wildcards can be used, which are "$" for hydrophobic residues (G, A, V, I, L, M), "@" for aromatics (F, Y, W), "!" for neutral hydrophilics (S, T, N, Q), "#" for positive hydrophilics (H, K, R), and "&" for negative hydrophilics (D, E). Scansite will give residues in this top row a score of 9.0. For positions with no residue preference, leave the box blank or use "X".
In the "Secondary Preference" row (bottom one), you can enter alternative residues at some positions if desired. These will be given a lower score of 4.5, which allows you to specify a weaker affinity for some residue types. The same wildcards can be used as in the last step. When you are finished, click the "Submit" button at the bottom of the page.
You will see a page similar to Fig. 6, with the name and schematic description of your consensus sequence displayed at the top.
Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).
If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 10 through 16 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 17.
In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.
To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.
For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.
If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.
For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.
To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.
To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.
Now that all the restriction options have been specified (or no restrictions at all, if you so chose), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.
Click on "Submit" to start the search.
The output will again look like the Database Search results shown in Fig. 7.
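The Primary/Secondary Preference scheme used in the steps above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the rough matrix that Scansite actually builds: the scores 9.0 and 4.5 come from the steps, while the fixed-residue value of 21 and the baseline of 1 for unspecified residues are borrowed from the matrix notes in the previous protocol. Only four of the wildcard symbols are included for brevity, and all function names are hypothetical.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
# Subset of the wildcard symbols described in the steps.
WILDCARDS = {"$": "GAVILM", "@": "FYW", "#": "HKR", "&": "DE"}
# 9.0 and 4.5 are from the steps; 21 and 1 are assumptions (see lead-in).
FIXED, PRIMARY, SECONDARY, BASELINE = 21.0, 9.0, 4.5, 1.0


def expand(entry):
    """Turn a box entry such as 'D/E', '$' or 'X' into a set of residues."""
    residues = set()
    for token in entry.upper().split("/"):
        token = token.strip()
        if token in ("", "X"):
            continue  # blank or X means no preference
        residues.update(WILDCARDS.get(token, token))
    return residues


def quick_matrix(fixed, primary, secondary=None):
    """Build 15 rows for positions -7..+7; the dicts map position -> box entry."""
    secondary = secondary or {}
    rows = []
    for pos in range(-7, 8):
        row = dict.fromkeys(RESIDUES, BASELINE)
        if pos == 0:
            for aa in expand(fixed):
                row[aa] = FIXED
        else:
            for aa in expand(secondary.get(pos, "")):
                row[aa] = SECONDARY
            # Primary preferences are applied last so they win over secondary ones.
            for aa in expand(primary.get(pos, "")):
                row[aa] = PRIMARY
        rows.append(row)
    return rows


matrix = quick_matrix("S/T", primary={-3: "R", 1: "$"}, secondary={-3: "K"})
print(len(matrix))
print(matrix[7]["S"], matrix[4]["R"], matrix[4]["K"], matrix[8]["L"])
```

Here position -3 is row index 4 and position 0 is row index 7, so the example encodes a fixed S/T with a strong preference for R and a weaker one for K three residues upstream, plus a hydrophobic residue at +1.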