SimPlot version 2.4

March, 1999  

"SimPlot", software, documentation, and SimPlot icon copyright (c) 1998,1999 Stuart C. Ray, M.D., All rights reserved

First, I want to thank:

If you share this program with others, please distribute the installer rather than just the executable.

This program is "HelloWare".  If you use it, please send me an email or letter saying so.  This program is NOT in the public domain, and may not be sold by anyone other than the author, nor may it be included in any collection of software (such as a CD-ROM) for distribution without the author's written consent.

Disclaimer:  This software is distributed "as is", with no warranty expressed or implied.  So if it breaks your computer or distorts your results, I will not compensate you in any way. While I have made a reasonable effort to test this software, you should make sure the results make sense to you.

Before contacting me about a problem, please check the known problems section below to make sure it is not already on the list.

  Features | Background | Installation | How to use it | Version history | Known problems | Contacting the author  


Features

features have been added since version 2.1


Background

I created SimPlot in order to learn more about HIV-1 intersubtype recombination analysis when I encountered a mosaic HIV-1 genome during analysis of some clones from international isolates.  There is a program available for doing this sort of analysis, at the Los Alamos National Lab's Human Retroviruses and AIDS Database Web site (http://hiv-web.lanl.gov.).  The program is called the Recombination Identification Program (RIP), and the direct link is: http://hiv-web.lanl.gov./HTML/rip.html. It has very nice online documentation, and is also described in Siepel AC and Korber BT, Scanning the Database for Recombinant HIV-1 Genomes, in the Human Retroviruses and AIDS Compendium, 1995 (available from the Los Alamos site as an Adobe Acrobat file).

I wanted to do some customization, so here is SimPlot.  While the output from SimPlot bears a passing resemblance to that of RIP, I have used RIP very little, and modeled SimPlot after figures in various published reports.  RIP does some things that SimPlot does not, like the "informative mode", which limits the comparison to sites that contain at least 1 mismatch among the reference sequences.  If you find SimPlot useful and want that feature I may add it.  Similarity plots are only a screening tool, and as such SimPlot is pretty utilitarian.

SimPlot allows identification of one Query sequence, generally the one you suspect is mosaic, and the rest of the sequences are Reference sequences (or can be ignored - see the Select Function).  The graph that is generated is a set of lines (or optionally strings of points) that reflect the similarity (or distance) of each Reference sequence (or Group) to the query sequence.  In order to generate this plot a sliding window is passed across the alignment in small steps (the window size and step size are selectable).
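The sliding-window idea described above can be sketched in a few lines of Python. This is only an illustration, not SimPlot's own code: the function name, the choice to skip gapped columns, and the use of plain percent identity (rather than a corrected distance) are all assumptions made for the example.

```python
def window_similarity(query, reference, window, step):
    """Return (center_position, percent_identity) pairs for a sliding window.

    Illustrative sketch only; SimPlot's actual calculation may differ.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    points = []
    for start in range(0, len(query) - window + 1, step):
        matches = compared = 0
        q_win = query[start:start + window]
        r_win = reference[start:start + window]
        for q, r in zip(q_win, r_win):
            if q == "-" or r == "-":
                continue  # assumption: ignore columns with a gap in either sequence
            compared += 1
            if q.upper() == r.upper():
                matches += 1
        if compared:  # skip windows that contain nothing but gaps
            points.append((start + window // 2, 100.0 * matches / compared))
    return points


if __name__ == "__main__":
    # One curve like this would be drawn per Reference sequence (or Group).
    for center, identity in window_similarity("ACGTACGTAC", "ACGTTCGTAC", 4, 2):
        print(center, identity)
```

A smaller window gives finer resolution but a noisier curve; a larger step simply plots fewer points.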

Bootscanning is a new addition and this documentation will not do justice to this analytical approach. I recommend Mika Salminen, et al. Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res Hum Retroviruses. 1995 Nov;11(11):1423-5.

The informative sites module is largely based on Robertson,D.L., Hahn,B.H., and Sharp,P.M. Recombination in AIDS viruses. J Mol Evol 1995;40(3):249-59.

I recommend that you use the reference alignments available from the Los Alamos WWW site for your reference sequences.  Since SimPlot can now create on-the-fly consensus sequences, all you need is the full alignment. You may also want the majority consensus alignment for faster alignment of query sequences to the reference sequences.


Installation

This is a 32-bit program, so you need to be running Windows 95 or 98.  Similarity plots and FindSites work fine under Windows NT, but BootScan does not.  SimPlot takes up less than a megabyte of hard drive space.  I plan to test its memory requirements, but have not done so yet.  It allocates memory dynamically, meaning that it uses what it needs for the data set you use.  On my machine with 48 MB RAM it can handle at least 15 sequences of 9.7 kb each.  Please let me know if you have any memory problems.  I recommend using a screen resolution higher than 640x480 if you can, such as 800x600 or 1024x768, to avoid the need for screen scrolling. Resizing the window onscreen will not result in higher resolution plots.

To install, just run the Installer and follow the prompts. Please do not distribute the program by itself.

To perform BootScanning, you also need Joe Felsenstein's PHYLIP suite, available from his FTP site (ftp://evolution.genetics.washington.edu/pub/phylip/). There is more information about these and other programs on the PHYLIP WWW site. The files you need to download are phylip3x.exe and phylip3y.exe. After downloading them, first send Dr. Felsenstein an email to let him know you have done so. Now you can create a directory wherever you want on your hard drive, and run these self-extracting files. If you only want to keep the files needed for bootscanning, then keep: DOS4GW.EXE, SEQBOOT.EXE, DNADIST.EXE, DNAPARS.EXE, NEIGHBOR.EXE, FITCH.EXE, and CONSENSE.EXE. Please note that if you already have the Windows version of PHYLIP installed on your machine, these programs will not interfere with each other as long as you keep them in separate folders.

***Note*** Because PHYLIP programs run as 16-bit applications, they do not understand long (32-bit Windows, i.e. Win95/98) directory names. No directory name in the path to the PHYLIP programs may be longer than 8 characters, nor should any of those directory names contain spaces. Any directory that would be okay under older versions of DOS or Windows 3.1 will work. Capitalization does not matter. For example:

Pending the availability of 32-bit PHYLIP programs (Dr. Felsenstein says they will be available soon), I suggest 'c:\phylip'.
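The directory naming rule for PHYLIP can be checked before running a BootScan with a short script. The sketch below is not part of SimPlot; the function name is hypothetical, and it tests only the two conditions stated above (no directory name longer than 8 characters, no spaces).

```python
from pathlib import PureWindowsPath


def bad_directory_names(path):
    """Return the directory names in `path` that are too long or contain spaces."""
    p = PureWindowsPath(path)
    # Drop the drive/root part (e.g. 'c:\\') so only directory names are checked.
    names = p.parts[1:] if p.anchor else p.parts
    return [name for name in names if len(name) > 8 or " " in name]


if __name__ == "__main__":
    for candidate in (r"c:\phylip", r"C:\Program Files\phylip"):
        problems = bad_directory_names(candidate)
        print(candidate, "is OK" if not problems else "has problems: %s" % problems)
```

An empty list means every directory name in the path satisfies both conditions.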


How to use SimPlot

SimPlot reads most sequence file formats.  The format is automatically detected using code based on Don Gilbert's ReadSeq code (please see first page of this file). First, prepare your sequences by aligning them, and save them in a standard format such as FASTA/Pearson format.  SimPlot can use no more than 26 groups of sequences (or individual sequences) - you can select the sequences you want to analyze.

First I will discuss using individual sequences as reference sequences, using the gagtest.fsa file included with SimPlot (so you can follow along). See below for the discussion of sequence groups and consensus sequence generation.

When you run the program, you will see:

Use the File menu to Open or ReOpen a sequence file.  You can also use the Ctrl-O (^O) key combination.  We will use the example file gagtest.fsa to demonstrate. If the file is read successfully, you will see:

Then after some manipulation:

This should be familiar - it is similar to Windows Explorer. The Sequences are the terminal branches. Each is contained within a group, with the same name as the sequence by default. You can drag any sequence into any group with the mouse. Groups cannot be nested within each other. You cannot rename a sequence. Group names, not sequence names, will be displayed in the plot legends. SimPlot supplies one such group. The buttons on the right should be self-explanatory. The order in which the sequences appear will be the order they appear in the plot legend. Sequences below any group named "hidden" will be hidden. Any hidden sequences will be ignored for the remainder of the analysis, but of course SimPlot will not affect the actual file on your disk - all of this happens in memory. In the picture on the right, group A has been moved down, group F and the Brazilian sequence have been moved up, and I am dragging the "Hidden" separator onto group A, which will hide it and the rest of the groups below, leaving four groups to be analyzed (the current limit for BootScan and FindSites).

Now click on the SPlotPage tab near the bottom. You can select a Query sequence:

Now you can either do the plot (^D or under the Commands menu), or alter some of the options.  The most apparent options are the Window size and Step size, available by clicking the status bar at the bottom of the window (or under the Options menu).  Note that the reference type (Individual sequences in this example) can be changed, but will have no effect, because this example only has one sequence per Group (see next section for more on groups).

Now hit ^D (or DoSimPlot under the Commands menu) and see:

The labels on the plot indicate that 93BR029 is the Query sequence and the rest are reference sequences.  The window and step size are the default values. At this point the user can customize the plot using the status bar to change window and step sizes as described above.  There are also a number of options available from the Options menu.

 

Options include:

To zoom in on a plot area, click (with the Left mouse button) on the upper left corner of the region of interest, and while holding the mouse button down, drag (i.e. don't let up on the mouse button yet) down and to the right to enclose the area of interest in the box that appears (example depicted at left).  When you release the mouse button the plot will redraw at the new level of magnification.  If you are dealing with a really big alignment this may take a second or so.  Below I explain how to zoom back out.  While you are zoomed in you can pan around the plot by clicking the Right mouse button and dragging as if you were moving a piece of paper.

In order to return to the original magnification level, click and drag up and to the left.  It does not matter how large an area you enclose - this is a signal to end zooming.

 At any level of magnification you can get more info about a particular point.  This can be especially useful if you want to know where the point of apparent crossover is located.  For scatter plots just click on a point.  For line plots you need to click on a vertex (a data point used to plot the line, as depicted at left). When you click on the point with the LEFT BUTTON the dialog depicted at left is displayed.

When you click on the same point with the RIGHT mouse button, you are able to change the color, and this color is saved when you Save Settings. For example, if the curve you chose represents the 3rd reference sequence, then any saved change will affect the color of the 3rd reference sequence in future use. The default color settings are restored when you Restore Default Settings (under the Options menu).

At any time, the current plot can be:

The bitmap and metafiles for Save and Copy will work in any Windows program that can handle these formats, like Word, WordPerfect, PowerPoint, Harvard Graphics, or the Paint program that comes as part of Windows. The great advantage of the metafile is that in many programs you can edit each individual element (so that you can change fonts, colors, etc.). This does not seem to work as well when pasted from the clipboard, so if you have trouble, try saving to disk and then loading instead.

Now let's perform bootscanning. Click on the BootScan tab, choose a Query sequence (if the one you want is not already marked), and click the Do Bootscan button under the Commands menu. If this is the first time you have run a BootScan, you will be asked where the PHYLIP files can be found. Just follow the prompts and navigate to the folder that contains the PHYLIP files listed above in Installation. Once the bootscan is running, note the status bar, which indicates the tree number and the PHYLIP program that is currently running. You should see something like this after a few minutes:

Note where the hourglass cursor is. During the BootScan you can interrupt it by clicking on the Stop Bootscan button.

You cannot resume - you have to start over. Give it a few seconds to stop - while it is running a PHYLIP program it has to wait.

Once the run has stopped, you can do the same things (zoom, click, change colors) that were described above for similarity/distance plots.

If you chose to save CONSENSE output files, they can be found in the PHYLIP directory. Under the File menu there is an item "View PHYLIP Dir" that will take you straight there. The plan is to add a feature to SimPlot that allows you to re-scan these files and selectively examine relationships that look interesting, to view the consensus trees, etc. For now, the 'SimPlot CONSENSE Map' file should help in figuring out what the taxa are. Each file's name tells you the tree number (nXXX) and its center position along the alignment (pXXXXX).

When you are ready to find informative sites for maximum chi-squared analysis, you need to have exactly 4 sequences selected (note that I did not say "4 groups"). Then click on the FindSites tab and the analysis will be run immediately. The output is pretty self-explanatory, and if you are familiar with the theory behind it this should be enough to get you going [please refer to Robertson,D.L., Hahn,B.H., and Sharp,P.M. Recombination in AIDS viruses. J Mol Evol 1995;40(3):249-59].
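For readers new to the idea, here is a rough Python sketch of what an "informative site" means for four aligned sequences. It uses the generic four-sequence definition (two nucleotides, each shared by exactly two sequences) and is an assumption about the general approach, not a description of what FindSites actually computes. The function name and the labels ref_a, ref_b and outgroup are hypothetical.

```python
def informative_sites(query, ref_a, ref_b, outgroup):
    """List the 1-based positions at which the query pairs with each other sequence.

    Illustrative only: sites with gaps or ambiguity codes are skipped.
    """
    support = {"ref_a": [], "ref_b": [], "outgroup": []}
    columns = zip(query, ref_a, ref_b, outgroup)
    for pos, column in enumerate(columns, start=1):
        q, a, b, o = (x.upper() for x in column)
        if any(x not in "ACGT" for x in (q, a, b, o)):
            continue  # assumption: ignore gaps and ambiguous bases
        if q == a and b == o and q != b:
            support["ref_a"].append(pos)      # query groups with ref_a
        elif q == b and a == o and q != a:
            support["ref_b"].append(pos)      # query groups with ref_b
        elif q == o and a == b and q != a:
            support["outgroup"].append(pos)   # query groups with the outgroup
    return support


if __name__ == "__main__":
    print(informative_sites("AACG", "ATCA", "CAGG", "CTGA"))
```

A run of sites supporting one reference followed by a run supporting the other is the kind of pattern that suggests a crossover between them.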

Working with sequence groups

It is probably more accurate to base these analyses on regional similarity to (or distance from) a group of sequences rather than a single representative. One way of doing this (which we used in JVirology 1999; 73:152-160) is to use threshold consensus alignments. The problem with this approach is that each time a sequence is added to the alignment you have to re-create the threshold consensus alignment, and if you want to try multiple thresholds that means more files. The last thing I need is more alignments to maintain.

SimPlot 2 will create the consensus sequences for you. At runtime, groups are created as described in the previous section. Alternatively, you can prepare your alignment by adding a sequence (it must be the first sequence) named "simplot", containing a string of letters (a-z, no numbers or punctuation, but lower- and uppercase are okay) which represent the group assignments. Hence, using a FASTA file as an example, if it begins with:

>simplot
aaabbbbbcccddeeffggg
>A
ATGAGAGTGATGGGGATACAGAGGAATTATCAACACTTG---TGGAGA--
----------------------TGG---GG-ACTATGATCTTTGGGATGA
TAATAATTTGT---AGTGCT----CAGAA---AA-TTGTGGGTCAC-GTC
...

SimPlot will interpret this to mean that there are 20 sequences in the alignment, the first 3 in group a, the next 5 in group b, 3 in group c, and so on. They can be intermingled as long as the "simplot" sequence reflects the exact order in which the sequences appear. The letters used need not be consecutive. I know this is a bit clumsy, but it is pretty easy to maintain such files, and it is also easy to let the letters be HIV subtypes, for instance.
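To make the layout concrete, the following Python sketch reads a FASTA file that starts with a "simplot" sequence and lists the members of each group. It is not SimPlot's parser; the function name is hypothetical, and folding the group letters to lowercase is an assumption (the text above says only that both cases are accepted).

```python
def read_groups(fasta_text):
    """Map each group letter to the names of its sequences, in file order."""
    names, chunks = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:].strip())
            chunks.append([])
        elif names:
            chunks[-1].append(line)
    if not names or names[0].lower() != "simplot":
        raise ValueError('the first sequence must be named "simplot"')
    letters = "".join(chunks[0]).lower()  # assumption: case is not significant
    members = names[1:]
    if len(letters) != len(members):
        raise ValueError("need exactly one group letter per sequence")
    groups = {}
    for letter, name in zip(letters, members):
        groups.setdefault(letter, []).append(name)
    return groups


if __name__ == "__main__":
    example = ">simplot\naab\n>s1\nACGT\n>s2\nACGA\n>s3\nAC-T\n"
    print(read_groups(example))
```

The length check catches the most likely maintenance slip: adding a sequence to the alignment without adding its letter to the "simplot" line.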

Now you need to decide how you want the groups compared to your query sequence. Options include averaging the distance in the sliding window for each reference sequence, or comparing to a consensus sequence. The consensus sequence, in turn, can have a threshold from 0% (simple consensus: most common residue is used) to 100% (strict consensus: only sites that are 100% conserved have a residue in the consensus). Selection among these options is available by clicking the status bar at the bottom of the program window, or using the Reference Type menu item under Options.
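The threshold consensus can be illustrated with a short Python sketch. Again this is not SimPlot's code: the function name, the "?" placeholder for unresolved sites, counting gaps like any other character, and breaking ties by first occurrence are all assumptions made for the example.

```python
from collections import Counter


def threshold_consensus(sequences, threshold, unresolved="?"):
    """Build a consensus in which a residue is kept only if its share of the
    column is at least `threshold` percent (0 = simple, 100 = strict)."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*sequences):
        # Most common character in this column (ties: first one encountered).
        residue, count = Counter(c.upper() for c in column).most_common(1)[0]
        share = 100.0 * count / len(column)
        consensus.append(residue if share >= threshold else unresolved)
    return "".join(consensus)


if __name__ == "__main__":
    group = ["ACGT", "ACGA", "ACTA"]
    for t in (0, 50, 100):
        print(t, threshold_consensus(group, t))
```

Raising the threshold leaves fewer resolved sites, so the comparison to the query rests on less data but on better conserved positions.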

Choosing among these options is left as an exercise for the user :-)


Version History

version 2.4 - March 4, 1999

version 2.31 - February 26, 1999

version 2.3 - February 22, 1999

version 2.22-2.24 - bug fixes

version 2.21 - February 10, 1999

version 2.2 - February 6, 1999

version 2.1 - January 21, 1999

version 2.0 - January 3, 1999 (thanks to Mika Salminen for suggesting many of these changes)

version 1.4beta - August 6, 1998 (never fully released)

version 1.3 - July 31, 1998

version 1.2.2 - May 19, 1998

version 1.2.1 - May 18, 1998

version 1.2 - May 11, 1998

version 1.1 - April 26, 1998 - First public version


Known Problems/Plans

Planned improvements:

Please contact me to report other problems or suggest additions/priorities for the improvement list.


Contacting the author

I welcome comments and suggestions.

Stuart Ray, M.D.
email: sray@jhmi.edu
home page: http://www.welch.jhu.edu/~sray
snail mail:
  Division of Infectious Diseases
  Johns Hopkins University School of Medicine
  720 Rutland Avenue, Ross 1159
  Baltimore, MD, 21205

Windows is a registered trademark of the Microsoft Corporation.
