A new version of Scansite has been developed in recent months to make better use of internal databases, to allow for easier additions of new matrices, and to improve the displayed graphics. Many of the changes were made simply to facilitate further development, but some will benefit users more directly. This page summarizes the changes most likely to interest users.
For the curious, the development-related changes are that Scansite matrices are now stored in a MySQL database (as are protein sequences after our previous update earlier this year), and all the programs have been rewritten to be more modular and non-redundant. This should make the code more maintainable and allow us to add new features more easily.
A new category of programs allows users to find all sequences in a database that match an exact subsequence. For example, how many proteins in Swiss-Prot contain the sequence PXXP, and what are their IDs? Residues can be grouped manually (such as "S/T/Y") or with wildcards ("$" for aliphatic residues, "@" for aromatics, etc.). Unlike the database search programs, these matches are not scored in any way; these programs simply list which proteins contain the specified sequence.
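As a rough illustration of how such a pattern might be matched, the sketch below translates a pattern like PXXP or S/TP into a Python regular expression and lists the proteins that contain it. This is not Scansite's own code: the function names are hypothetical, and the residues assigned to the "$" and "@" wildcards are assumptions, since the exact sets are not listed here. Slash-separated groups are assumed to contain plain residue letters only.

```python
import re

# Assumed residue classes; the exact sets Scansite uses are not given here.
WILDCARDS = {
    "X": ".",       # any residue
    "$": "[AVLI]",  # aliphatic (assumed membership)
    "@": "[FWY]",   # aromatic (assumed membership)
}

def pattern_to_regex(pattern):
    """Translate a pattern such as 'PXXP' or 'S/TP' into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        # Collect one position, which may be a slash-separated group like S/T/Y.
        group = [pattern[i]]
        i += 1
        while i + 1 < len(pattern) and pattern[i] == "/":
            group.append(pattern[i + 1])
            i += 2
        if len(group) > 1:
            parts.append("[" + "".join(re.escape(r) for r in group) + "]")
        else:
            parts.append(WILDCARDS.get(group[0], re.escape(group[0])))
    return re.compile("".join(parts))

def proteins_containing(pattern, proteins):
    """Return the IDs of proteins whose sequence contains the pattern (unscored)."""
    regex = pattern_to_regex(pattern)
    return [pid for pid, seq in proteins.items() if regex.search(seq)]
```

For example, with a small dictionary of made-up sequences, proteins_containing("PXXP", {"A": "MAPKLPGG", "B": "MKKSP"}) lists only "A". As in the programs described above, matches are simply reported, not scored.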
One sequence match program allows searching with regular expressions. These are powerful text-matching expressions that allow identifying more complex sequence patterns. More information can be found on the input page, Search Database for Regular Expression, and in this description.
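To give a feel for what a regular expression adds over a fixed pattern, the sketch below uses Python's re module to report where an expression with a variable-length gap matches a sequence. The syntax Scansite accepts may differ from Python's, and the function name and the example sequence are made up for illustration.

```python
import re

def find_matches(expression, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in re.finditer(expression, sequence)]

# An arginine, then one or two residues of any kind, then S or T, then proline.
matches = find_matches(r"R.{1,2}[ST]P", "MRKSPAARAATPG")
# matches == [(2, "RKSP"), (8, "RAATP")]
```

The gap of one or two residues is the part that a fixed-length pattern cannot express.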
The motifs in Scansite 2.0 can be grouped with others of similar type. While you can still scan with all motifs or with a few that you specify, you can now search for all motifs of a given type, such as all tyrosine kinase domains, all SH3 domains, all PDZ domains, and so on. You can also combine single motifs with groups of motifs in your scan. The output generated by our motif scanning programs now marks only the group type on the display, but includes each motif in that group (and its score) in the more detailed table below the figure.
In the past, the map of sites produced by our motif scanner programs has suffered from overlapping text in the domains and occasionally also in the site names. The new version avoids these overlaps more reliably, even with larger numbers of sites. At low stringencies, when numerous sites are often detected, the layout can still be overloaded, resulting in overlaps. In these cases, users are advised to scan using only a few motifs at a time.
Another difference in our graphical displays is that motifs of similar type are grouped. For example, if several SH2 domains recognize the same site, the graphical display marks it simply "SH2", and the output table below it lists each kind of SH2 domain with its score.
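The grouping described above can be pictured as collapsing hits that share a site and a group type into one label for the figure, while keeping every motif and score for the table. The sketch below is only an illustration under assumed data structures; the tuple layout, the function name, and the choice to sort details by score are not taken from Scansite.

```python
from collections import defaultdict

def group_hits(hits):
    """Collapse hits that share a site and motif group into one display entry.

    hits: iterable of (site, group, motif, score) tuples.
    Returns {(site, group): [(motif, score), ...]}, details sorted by score.
    """
    grouped = defaultdict(list)
    for site, group, motif, score in hits:
        grouped[(site, group)].append((motif, score))
    return {key: sorted(details, key=lambda d: d[1])
            for key, details in grouped.items()}

# Hypothetical hits: two SH2 motifs recognize the same site.
hits = [
    ("Y315", "SH2", "SH2 motif A", 0.42),
    ("Y315", "SH2", "SH2 motif B", 0.31),
    ("S20", "Kinase", "Kinase motif X", 0.50),
]
grouped = group_hits(hits)
```

Here the figure would carry a single "SH2" label at Y315, and the table would list both SH2 motifs with their scores.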
Some users have requested that the graphical output of the motif scanner be drawn as a single image that can be easily downloaded and inserted into notebooks or papers. This version does so: each map is now a single PNG image. In addition, we have made the domains recognized by Pfam more pronounced by narrowing the drawing of the non-domain regions of the protein sequence.
The matrix format used to represent domain binding affinities has been expanded to allow the optional use of U (selenocysteine), as well as residues B (Asp/Asn), Z (Glu/Gln), and X (any residue), which are found in some Genbank sequences. The ability to specify motifs that are specific for the C-terminus has been available in the past by specifying a value for the fictional residue "*" (asterisk), signifying the C terminus. In version 2.0, we have added the ability to specify a score for the N terminus, represented by the fictional residue "$". While these fundamental changes have required extensive changes to the Scansite programs, users can still insert their previous input matrices unchanged. Scansite will put in default values for any missing columns. This even applies to the 20 amino acids themselves -- if you have no data for cysteine in your peptide library screen, for example, leaving it out of your matrix will result in Scansite giving it the default score of 1 at each position.
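The default-filling behavior can be sketched as follows. The representation of a matrix as a list of per-position dictionaries and the function name are assumptions made for illustration. The default of 1 is stated above for the 20 amino acids; applying the same default to the extra columns (U, B, Z, X, "*", "$") is an assumption of this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Selenocysteine, Asp/Asn, Glu/Gln, any residue, C terminus, N terminus.
EXTRA_COLUMNS = "UBZX*$"
# Stated default for amino acids; assumed here for the extra columns too.
DEFAULT_SCORE = 1.0

def fill_missing_columns(matrix):
    """Return a copy of a position-by-residue matrix with every column present.

    matrix: list of dicts, one per position, mapping residue symbol -> score.
    """
    columns = AMINO_ACIDS + EXTRA_COLUMNS
    return [{c: row.get(c, DEFAULT_SCORE) for c in columns} for row in matrix]

# An older-style matrix with no cysteine data and no terminus columns.
old_matrix = [{"S": 5.0, "T": 3.0}, {"P": 9.0}]
filled = fill_missing_columns(old_matrix)
```

After filling, filled[0]["C"] is 1.0 while the supplied scores (such as 5.0 for S) are kept unchanged.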
Matrices can now have up to 5 alternative residues at the same "fixed position". (This number can be raised easily, but then it starts being less and less of a fixed position.) Previously the limit was 3 residues, which required us to break the PDK1 Binding domain, which needs 4 residues at the fixed position, into two separate matrices, arbitrarily dubbed PDK1 Bind A and PDK1 Bind B. With the new limit, this motif can be (and has been) properly represented as a single matrix.
In addition to the surface accessibility plot below the site-mapped protein, the value of the surface accessibility function is displayed for each site in the output table.
In the previous version of Scansite, users could search for hits to two matrices at once, either two Scansite matrices or two input matrices. In the new version, you can search for proteins matching up to 5 motifs at once, and these 5 can be a combination of Scansite matrices and input matrices.
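Conceptually, a search with several motifs keeps the proteins that have a hit for every motif in the query. The sketch below shows that idea as a set intersection. It is an assumption about the logic, not a description of how Scansite performs or scores the combined search, and the function name and input layout are hypothetical.

```python
MAX_MOTIFS = 5  # limit on motifs per search in the new version

def proteins_matching_all(hits_by_motif):
    """Return the sorted IDs of proteins that have a hit for every motif given.

    hits_by_motif: dict mapping motif name -> set of protein IDs with a hit.
    Scansite matrices and user-supplied matrices are treated alike.
    """
    if not 1 <= len(hits_by_motif) <= MAX_MOTIFS:
        raise ValueError(
            f"expected 1 to {MAX_MOTIFS} motifs, got {len(hits_by_motif)}")
    return sorted(set.intersection(*map(set, hits_by_motif.values())))

# Hypothetical hit lists for two Scansite matrices and one input matrix.
hits = {
    "Scansite motif A": {"P1", "P2", "P3"},
    "Scansite motif B": {"P2", "P3"},
    "input matrix": {"P3", "P4"},
}
common = proteins_matching_all(hits)
```

Only "P3" has a hit for all three motifs in this made-up example.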
There are three small changes to the Scansite motifs.
Previously, Motif Scan would let you input proteins by Swiss-Prot ID (e.g. "VAV_HUMAN"), Genpept ID ("3282619"), TrEMBL accession number ("Q9UWR0"), or Ensembl accession number ("ENSP00000269862"). The database format has been expanded so that you can now also use the Swiss-Prot accession number ("P15498") or the Genbank accession number ("AAC25011"). (Note: while Genpept and Swiss-Prot assign both an ID and an accession number to each entry, TrEMBL and Ensembl assign only an accession number to their entries.)
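The identifier types above differ enough in shape that they can usually be told apart by pattern. The sketch below guesses the type of an identifier with regular expressions. The patterns are simplified assumptions fitted to the examples given here, not the rules Scansite applies, and they do not cover every valid identifier. Swiss-Prot and TrEMBL accession numbers share one format, so the sketch cannot distinguish between them.

```python
import re

# Patterns are checked in order; the first full match wins.
ID_PATTERNS = [
    ("Ensembl accession", re.compile(r"ENSP\d{11}")),
    ("Swiss-Prot ID", re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")),
    ("Swiss-Prot or TrEMBL accession", re.compile(r"[OPQ]\d[A-Z0-9]{3}\d")),
    ("Genbank accession", re.compile(r"[A-Z]{3}\d{5}")),
    ("Genpept ID", re.compile(r"\d+")),
]

def classify_identifier(identifier):
    """Return a best-guess identifier type, or None if no pattern matches."""
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return None
```

For example, classify_identifier("VAV_HUMAN") returns "Swiss-Prot ID" and classify_identifier("AAC25011") returns "Genbank accession".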
The database searches allow you to restrict your search to various organism classes (mammals, invertebrates, and others). There is now a new category to search only among viral proteins.