Tutorials for Scansite Programs

The sections below give detailed instructions on how to use the programs that make up Scansite.

Motif Scan of a Protein from a Public Database
Motif Scan of an Input Protein Sequence
Database Search Using a Scansite Motif
Database Search Using an Input Motif
Database Search Using the Quick Matrix Method

Fig. 1: Scansite home page (http://scansite.mit.edu).

Motif Scan of a Protein from a Public Database

Fig. 2: Motif Scan input form.

This program will scan one protein for all of the motifs in Scansite (or a subset of them, of your choosing). To scan your protein of interest using its entry in a public database (Swiss-Prot, TrEMBL, Genpept, or Ensembl), you will need its accession number or ID in that database.

  1. In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.

  2. Under the "Motif Scan" heading, click "Scan a Protein by Accession Number or ID". You will see the Motif Scan input page as shown in Figure 2.

  3. Enter the accession number or ID in the text field labeled "Protein ID or Accession Number". For databases that assign both an accession number and an ID to each entry (Swiss-Prot, Genpept), you can enter either one here. If you don't know the accession number or ID, you can use the links on this page ("Search Swiss-Prot/TrEMBL for entry name", etc.) to find them.

  4. Select which public database you will be accessing (Swiss-Prot, TrEMBL, Genpept, or Ensembl) from the drop-down box.

  5. Choose which motifs in Scansite's database to scan. To scan for all motifs, click the checkbox labeled "Look for all motifs". To scan only for motifs you specify, click the checkbox labeled "Look only for motifs and groups selected below". Select one or more items in the "Individual Motifs" list, and/or one or more items in the "Motif Groups" list.

  6. Choose the stringency level desired: high, medium, or low. This sets how well a site must score relative to all subsequences that match the motif within the entire vertebrate collection of Swiss-Prot proteins. High stringency indicates that the motif identified in the query sequence is within the top 0.2% of all matching sequences contained in vertebrate Swiss-Prot proteins. Medium and low stringency correspond to the top 1% and 5% of sequence matches, respectively. Sites identified under high-stringency scoring are likely to be correct, though some real sites may fail to be identified (i.e., the false-negative rate is non-zero). In contrast, medium- and low-stringency scoring have a much lower rate of false negatives but tend to over-call motif sites, resulting in increasing numbers of false-positive hits.

  7. To show domains recognized in your sequence, check the box labeled "Show predicted domains in sequence". Otherwise, uncheck it.

  8. Click "Submit Request".
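
The stringency cutoffs in step 6 amount to a simple threshold rule. The following is a minimal sketch in Python, assuming a site's rank is expressed as a "top percent" value (0.2 means the site is within the best 0.2% of matches). The names STRINGENCY_CUTOFFS, passes_stringency, and strictest_level are hypothetical and are not part of Scansite.

```python
# Illustrative only: these names are not part of Scansite.
# Cutoffs are the "top X%" values quoted in step 6.
STRINGENCY_CUTOFFS = {"high": 0.2, "medium": 1.0, "low": 5.0}


def passes_stringency(top_percent: float, level: str) -> bool:
    """Return True if a site ranked within the top `top_percent` percent
    of matches would be reported at the given stringency level."""
    if level not in STRINGENCY_CUTOFFS:
        raise ValueError(f"unknown stringency level: {level!r}")
    return top_percent <= STRINGENCY_CUTOFFS[level]


def strictest_level(top_percent: float):
    """Return the strictest level at which the site is still reported,
    or None if it falls outside even the low-stringency cutoff."""
    for level in ("high", "medium", "low"):
        if passes_stringency(top_percent, level):
            return level
    return None
```

For example, a site ranked in the top 0.5% would be missed at high stringency but reported at medium or low stringency.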

Fig. 3: Motif Scan graph of sites found.

Figure 3 shows an example of the output. Your protein is drawn as a thin rectangle. If any sites were found, they are marked above the rectangle with a short-hand name of the domain type (such as "Y_Kin", "SH2", or "PDZ"). If you requested the domains in your sequence to be shown, they will be marked as colored boxes with their names and residue ranges annotated below the rectangle. If phosphorylation and domain-binding sites already known from the literature are present, these will be marked below the domain names in a row labeled "Mapped sites" (none are present in this example). Further down, a plot of the surface accessibility indicates residues that are likely to be near the protein surface and thus able to interact with other proteins. At the bottom, a simple ruler indicates every hundredth position in the input protein sequence.

Fig. 4: Motif Scan table of sites found.

Below the protein image is a table listing the details of the sites found (see Figure 4). Similar motifs are grouped together (for example, all tyrosine-kinase domains). The table indicates the motif name and Gene Card (if one exists) for each site found. The next line lists each site found for that motif, with its score, the percentile into which that score falls compared with all vertebrate proteins in Swiss-Prot, the sequence surrounding that site, and the solvent accessibility at that position. Clicking on the Gene Card takes you to that entry on the Gene Card site (25). Clicking on the near-site sequence displays the full protein sequence with the site location highlighted. Clicking on the score displays a histogram showing where this score falls in the distribution of all vertebrate Swiss-Prot proteins that have been scored for this motif.
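
The percentile reported in the table can be thought of as the share of reference scores that are at least as good as the site's score. The sketch below illustrates that idea in Python. It is not Scansite's implementation: the function name top_percent is hypothetical, and because this tutorial does not state whether lower or higher scores are better, the direction is left as a parameter.

```python
def top_percent(score, reference_scores, lower_is_better=True):
    """Percentage of reference scores that are at least as good as `score`.

    `lower_is_better` is an assumption left to the caller; set it to False
    if higher scores indicate better matches.
    """
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    if lower_is_better:
        as_good = sum(1 for s in reference_scores if s <= score)
    else:
        as_good = sum(1 for s in reference_scores if s >= score)
    return 100.0 * as_good / len(reference_scores)
```

A value of 0.2 or less from such a calculation would correspond to the high-stringency cutoff described earlier.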


Motif Scan of an Input Protein Sequence

Fig. 5: Motif Scan form for sequence input.

If your protein is not in the public databases (or at least not in the ones Scansite uses), you can use this program to enter your sequence directly. It differs from the previous program only in the input method. To scan your protein by copying its sequence directly into Scansite, you will need a text file containing this sequence. It can be in any text format. If numbers, spaces, or other invalid characters are included, Scansite will remove them.
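
If you want to preview what a cleaned sequence looks like before pasting it in, the removal of invalid characters can be approximated as follows. This is a minimal sketch, not Scansite's own code: the function name clean_sequence is hypothetical, and it assumes that only the 20 standard one-letter amino-acid codes are valid and that lowercase letters are converted to uppercase.

```python
# Assumption: the 20 standard amino-acid one-letter codes are the valid set.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Drop digits, whitespace, and any other character that is not a
    standard residue code; uppercase the rest."""
    return "".join(ch for ch in raw.upper() if ch in VALID_RESIDUES)
```

Note that under these assumptions a header line (such as a FASTA description line) would not be removed as a whole. Its letters would be kept as residues, so any such line should be deleted before the sequence is cleaned.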

  1. In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.

  2. Under the "Motif Scan" heading, click "Scan a Protein by Input Sequence". You will see the Motif Scan input page shown in Figure 5.

  3. Enter the protein name in the text box labeled "Protein Name".

  4. Open the text file containing your protein sequence and copy it into the text box labeled "Sequence".

  5. Choose which motifs in Scansite's database to scan. To scan for all motifs, click the checkbox labeled "Look for all motifs". To scan only for motifs you specify, click the checkbox labeled "Look only for motifs and groups selected below". Select one or more items in the "Individual Motifs" list, and/or one or more items in the "Motif Groups" list.

  6. Choose the stringency level desired: high, medium, or low. This sets how high a sequence must score to be reported.

  7. To show domains recognized in your sequence, check the box labeled "Show predicted domains in sequence". Otherwise, uncheck it.

  8. Click "Submit Request".

The output will resemble that in Figures 3 and 4 as before. Some differences are that a description of the protein is no longer shown, and no mapped sites will be displayed even if some sites have been mapped for your protein, because this information cannot easily be inferred from the input sequence alone.


Database Search Using a Scansite Motif

Fig. 6: Database Search input page.

This program searches all proteins in a selected database for matches to a Scansite motif.

  1. In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.

  2. Under the "Database Search" heading, click "Search Using a Scansite Motif". You will see a page of options as shown in Figure 6.

  3. In the list box labeled "Select Motif to Use", scroll through the list of available motifs and select the one you want to search with.

  4. Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).

  5. If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 6 through 12 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 13.

  6. In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.

  7. To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.

  8. For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.

  9. If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.

  10. For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.

  11. To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.

  12. To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.

  13. Now that the restriction options have been set (or left blank, if you chose no restrictions), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.

  14. Click on "Submit" to start the search.
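
The restriction options in steps 5 through 13 above behave like a chain of optional filters followed by a cap on the list size. The sketch below shows that logic in Python for an in-memory list of records. Everything in it is illustrative: Protein, restrict, and the field names are hypothetical. The mass shift of about 79.98 Da per phosphate group is a common approximation and not necessarily the value Scansite uses. The effect of phosphorylation on the isoelectric point is not modeled, and the keyword test is a plain case-insensitive substring match.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Approximate average mass added per phosphate group (HPO3), in daltons.
# Assumption: Scansite's own adjustment may differ.
PHOSPHATE_MASS_DA = 79.98


@dataclass
class Protein:
    accession: str
    description: str
    mol_weight: float  # daltons, unmodified protein
    pi: float          # isoelectric point, unmodified protein
    score: float


def restrict(
    proteins: List[Protein],
    mw_range: Optional[Tuple[float, float]] = None,
    pi_range: Optional[Tuple[float, float]] = None,
    keyword: str = "",
    phospho_sites: int = 0,
    max_results: int = 50,
) -> List[Protein]:
    """Apply each restriction only if it was supplied (None or "" means
    'left blank'), then cap the list at max_results."""
    kept = []
    for p in proteins:
        # Shift the weight to account for the requested phosphorylations.
        mw = p.mol_weight + phospho_sites * PHOSPHATE_MASS_DA
        if mw_range is not None and not (mw_range[0] <= mw <= mw_range[1]):
            continue
        if pi_range is not None and not (pi_range[0] <= p.pi <= pi_range[1]):
            continue
        if keyword and keyword.lower() not in p.description.lower():
            continue
        kept.append(p)
    return kept[:max_results]
```

In this sketch, a protein of 24,950 Da falls inside a 20,000 to 25,000 Da window when no phosphorylations are requested, but outside it when one phosphorylation is requested.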

Fig. 7: Database Search table of proteins and sites found.

This program can take several minutes to run if you select the larger databases (TrEMBL, Genpept). When it finishes, you will see output like that shown in Figure 7. The proteins are sorted by score, but you can view them sorted by molecular weight or isoelectric point by clicking the "Sort by Molecular Weight" or "Sort by Isoelectric Point" links, which are near the top of the page. For each protein retrieved, its score, ID or accession number, description, site position, site sequence, molecular weight, and isoelectric point are shown. Clicking the score will show a histogram of how good this score is relative to all proteins that were scored in the selected database or database subset that you searched. Clicking the ID or accession number will take you to this protein's entry in the database that was searched. Lastly, clicking the small "Submit" button on the left of any entry will submit it to Scansite's Motif Scan program described earlier.


Database Search Using an Input Motif

Fig. 8: Page for adding a motif to Scansite.

Scansite's database searches can be made much more relevant to your own research by creating your own motifs. To use your own binding motif in a database search, you will need to define it in a text file and import the text file into Scansite. Detailed instructions are given on the FAQ page. As in the last program, this too searches all proteins in a database for matches to a motif.

  1. In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.

  2. Under the "Database Search" heading, click "Search Using an Input Motif". You will see the motif input page shown in Figure 8.

  3. In the text box labeled "Motif Name", enter a name to identify your motif.

  4. In the text box labeled "File of matrix values", type the location of your matrix file on your file system, or click the "Browse" button to select it from a directory listing. When finished, click the "SUBMIT" button. (See the FAQ for instructions on how to make the matrix file.) You will see the File Upload Verification page shown in Figure 9.

    Fig. 9: File Upload Verification page (for input motifs).

  5. Your matrix will be displayed at this point. Verify that it is in the correct format (see the "Materials" section). If everything looks correct, click on "Yes, I would like to continue with this matrix." If some editing is required, click on "Yes, but I would like to edit this matrix." If it looks wrong, click on "No, I will upload the file again", and return to Step 4. If you have chosen to continue, you will see a page similar to Fig. 6, with the name of your motif displayed at the top.

  6. Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).

  7. If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 8 through 14 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 15.

  8. In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.

  9. To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.

  10. For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.

  11. If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.

  12. For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.

  13. To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.

  14. To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.

  15. Now that the restriction options have been set (or left blank, if you chose no restrictions), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.

  16. Click on "Submit" to start the search.

The output will look like that from the previous Database Search program in Fig. 7.

Notes

Most of the difficulties encountered with this program are with making the matrix properly. See the FAQ for complete details. Briefly, to avoid the most common problems, make sure your matrix meets these criteria:

Making an effective matrix can be a challenging task. You will often need to change some values to keep the resulting output reasonable. If your affinity values are from an experimental source, we recommend that your changes preserve the rank ordering of the raw values, so that your motif is as strongly grounded in experiment as possible.


Database Search Using the Quick Matrix Method

Fig. 10: Quick matrix method.

If you do not have enough information to make a full matrix of binding affinities, but you have a tentative consensus sequence describing your motif, you can enter the consensus sequence as a binding motif instead. Scansite will construct a rough matrix matching the characteristics of your consensus sequence, and this matrix can be used to search the databases. The results in this case will be less quantitative. However, many users have found this program useful for quickly finding proteins with certain sequence characteristics. To use this option, have the consensus sequence available so you can enter it.

  1. In a web browser, go to the URL http://scansite.mit.edu. You will see the Scansite home page as shown in Figure 1.

  2. Under the "Database Search" heading, click "Search Using Quick Matrix Method for Making a Motif". You will see the input page shown in Figure 10.

  3. In the text box labeled "Motif Name", enter a name to identify the motif that will be created.

  4. On the page there are two rows of small text boxes, labeled as positions -7 to +7, with 0 being the required fixed position. The top row is labeled "Primary Preference", and the second row is labeled "Secondary Preference". Start by entering your fixed residue in the only box at position 0. You can use the slash (/) to enter two fixed residues, such as "S/T".

  5. In the "Primary Preference" row (top one), enter the residues of your consensus sequence in their positions relative to your fixed residue. You can use the slash (/) to enter two residues, such as "D/E". Wildcards can be used, which are "$" for hydrophobic residues (G, A, V, I, L, M), "@" for aromatics (F, Y, W), "!" for neutral hydrophilics (S, T, N, Q), "#" for positive hydrophilics (H, K, R), and "&" for negative hydrophilics (D, E). Scansite will give residues in this top row a score of 9.0. For positions with no residue preference, leave the box blank or use "X".

  6. In the "Secondary Preference" row (bottom one), you can enter alternative residues at some positions if desired. These will be given a lower score of 4.5, which allows you to specify a weaker affinity for some residue types. The same wildcards can be used as in the last step. When you are finished, click the "Submit" button at the bottom of the page.

  7. You will see a page similar to Fig. 6, with the name and schematic description of your consensus sequence displayed at the top.

  8. Select the database to search using this matrix (Swiss-Prot, TrEMBL, Genpept, or Ensembl).

  9. If you are looking for proteins with specific characteristics, you can restrict the search to get a shorter list of more relevant results. Steps 10 through 16 guide you through these choices. However, if you do not want to specify any of these, you can skip ahead to Step 17.

  10. In the "Organism class" drop-down list, select which class to search. The choices are Mammals, Vertebrates, Invertebrates, Plants, Fungi, Bacteria, or All. The default choice is Mammals. If you do not want to restrict your search by class, choose All. For a description of each class, click on the "Class descriptions" link.

  11. To search only among proteins in a given species, write the genus and species in the "Single species" text box, such as "Homo sapiens", "Caenorhabditis elegans", etc. Abbreviations and wild cards can be used here. Click the "Examples" link for more details. To avoid any restrictions on species, leave this text box blank.

  12. For targeting proteins in a range of molecular weights, fill in weights in daltons in the two boxes labeled "Molecular weight range", such as "20,000" to "25,000". To avoid molecular weight restrictions, leave these fields blank.

  13. If you are seeking only proteins with a characteristic isoelectric point, fill in these values in the two boxes labeled "Isoelectric point range", such as "5.5" to "6.0". To avoid isoelectric point restrictions, leave these fields blank.

  14. For proteins likely to be phosphorylated at one or more sites, click on the number of phosphorylations you want to target under "Phosphorylated sites" (0 to 3 phosphorylations are allowed). This will affect the calculated molecular weights and isoelectric points if you are using either of those restrictions above. If you do not need to account for phosphorylations, leave this option set at 0.

  15. To look for proteins in a functional category, enter a search term in the "Keyword search" field, such as "oxidoreductase", "cytochrome", "kinase", "membrane", or perhaps "hypothetical" (to focus on novel genes). For other examples, click on the "Examples" link. Leaving this blank will skip the keyword search function.

  16. To look for proteins that contain a consensus sequence or other sequence part, enter it in the "Sequence contains" text box. Wild cards are permitted here. For wildcard details, click the "Examples" link. Leave this field blank to avoid restricting your search by sequence parts.

  17. Now that the restriction options have been set (or left blank, if you chose no restrictions), select how large an output list you want. The choices are 50, 100, 200, 300, 400, 500, 1000, or 2000 proteins.

  18. Click on "Submit" to start the search.
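
Steps 4 through 6 above describe how the primary and secondary preference rows are turned into scores. The sketch below shows one way to picture the resulting rough matrix in Python. It is an illustration under stated assumptions, not Scansite's implementation: the names WILDCARDS, expand, and quick_matrix are hypothetical, and the "!" group is taken here to be S, T, N, Q. The sketch also assumes that a primary preference overrides a secondary one when both name the same residue at a position, and it does not model how the fixed position or unlisted residues are scored.

```python
# Wildcard groups as described in step 5.
# Assumption: "!" (neutral hydrophilics) is taken to be S, T, N, Q.
WILDCARDS = {
    "$": "GAVILM",  # hydrophobic
    "@": "FYW",     # aromatic
    "!": "STNQ",    # neutral hydrophilic
    "#": "HKR",     # positive hydrophilic
    "&": "DE",      # negative hydrophilic
}
PRIMARY_SCORE = 9.0
SECONDARY_SCORE = 4.5


def expand(entry: str) -> set:
    """Expand one box entry such as "D/E", "$", or "X" into a set of
    residues. A blank entry or "X" means no preference (empty set)."""
    residues = set()
    for token in entry.upper().split("/"):
        token = token.strip()
        if token in ("", "X"):
            continue
        residues.update(WILDCARDS.get(token, token))
    return residues


def quick_matrix(fixed: str, primary: dict, secondary: dict = None) -> dict:
    """Build {"fixed": residues at position 0,
              "scores": {position: {residue: score}}}.

    `primary` and `secondary` map positions (-7..+7, excluding 0) to box
    entries. Secondary entries are applied first so that primary entries
    win on overlap (an assumption of this sketch).
    """
    scores = {}
    rows = ((secondary or {}, SECONDARY_SCORE), (primary, PRIMARY_SCORE))
    for row, score in rows:
        for pos, entry in row.items():
            if pos == 0 or not -7 <= pos <= 7:
                raise ValueError(f"invalid position for a preference: {pos}")
            for res in expand(entry):
                scores.setdefault(pos, {})[res] = score
    return {"fixed": expand(fixed), "scores": scores}
```

For instance, quick_matrix("S/T", {-3: "R", 1: "$"}, {-3: "K"}) would record S and T as the fixed residues, give R a score of 9.0 and K a score of 4.5 at position -3, and give each of G, A, V, I, L, and M a score of 9.0 at position +1.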

The output will again look like the Database Search results shown in Fig. 7.