

Pitfalls

Visualization of data is powerful because it generally involves some form of data transformation that smooths, summarizes, or otherwise distorts the data in a standardized way.  Once validated (under particular conditions), this sort of simplification permits rapid assessment of complex relationships.  The danger is that the simplification can mislead: one must remain aware of the assumptions behind each method, and of the additional methods that should be applied to validate results.

Similarity plotting, bootscanning, and informative sites analyses are no exception.  This page provides a few examples of misleading results, and ways to avoid them.

Low Bootstrap Values are Not Informative

The bootscan on the Bootscanning page illustrates this issue.  Between positions 1200 and 1400, the bootstrap support for monophyly of the query sequence (AC_IN.21301) and the subtype A reference sequence drops below 50%, while the support for monophyly of the query and the subtype G reference sequence rises to 60%.  As in "usual" phylogenetic analysis, bootstrap support at this level is not convincing evidence of a relationship, and the region requires further study.

While phylogenetic analysis using other tools is the definitive way to investigate this result, here are some other suggestions:
  • Check the alignment - it is not possible to infer phylogenetic relationships from a poor alignment.  For a coding sequence, consider aligning the amino acid sequences and then applying the resulting alignment to the nucleotides.  A tool called SyncAlign to facilitate this will soon be available on the SCRoftware page.
  • Examine the Similarity/Distance plot in the same region.  If multiple sequences are highly similar or highly divergent in the region of interest, low bootstrap values are expected.
  • If they are highly similar and the degree of diversity is comparable to that of contiguous regions, the sequences involved may share the same parental sequence, in which case the "true" parental sequence may be missing.  If the degree of diversity is particularly low in the region of interest, purifying selection may be operating.
  • If they are divergent and the degree of diversity is higher than in other regions, unequal rates of evolution may be the cause (or the alignment is incorrect, as above).  If the degree of diversity is similar to other regions, then the parental sequence for that region of the query may not be present among the reference sequences.
  • Set marks on either side of the region and use the QuickTree command to examine a simple phylogenetic tree for the region.  Remember that bootscanning (as currently implemented in SimPlot) will give low bootstrap values when more than 2 groups cluster equally closely (e.g. there is a monophyletic group containing 3 groups, and no strongly supported 2-group clade).  Detailed phylogenetic analysis may indicate that a different grouping of sequences exists within the area of interest.
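
The first suggestion in the list above (align the amino acids, then apply that alignment to the nucleotides) can be sketched in a few lines of Python.  This is only an illustration of the general idea, not a description of how SyncAlign works.  The function name thread_codons and the toy sequences are hypothetical, and the sketch assumes each coding sequence is in frame, contains exactly three nucleotides per aligned residue, and has no trailing stop codon.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Apply the gaps of an aligned protein to its unaligned coding sequence.

    Assumption: `cds` is in frame and holds exactly one codon for every
    non-gap character in `aligned_protein` (no trailing stop codon).
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * residues:
        raise ValueError(
            f"expected {3 * residues} nucleotides, got {len(cds)}"
        )

    pieces = []
    pos = 0  # index of the next unused nucleotide in cds
    for aa in aligned_protein:
        if aa == gap:
            # One amino acid gap becomes a three-nucleotide gap.
            pieces.append(gap * 3)
        else:
            # Copy the codon that encodes this residue.
            pieces.append(cds[pos:pos + 3])
            pos += 3
    return "".join(pieces)


# Toy example (hypothetical sequences): the reference lacks one residue.
aligned = {"ref": "MK-LV", "query": "MKTLV"}
coding = {"ref": "ATGAAACTGGTG", "query": "ATGAAGACCCTGGTT"}

nucleotide_alignment = {
    name: thread_codons(aligned[name], coding[name]) for name in aligned
}
for name, seq in nucleotide_alignment.items():
    print(f"{name:6s}{seq}")
```

Because gaps are inserted only in multiples of three, the reading frame is preserved and every column of the nucleotide alignment corresponds to a codon position in the protein alignment.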
     
Beware High Bootstrap Values in Regions of Low Similarity

Imagine a putative recombinant query sequence and 4 reference sequences labeled A, B, C, and H.  A bootscan analysis shows high bootstrap support for the query clustering with reference A across much of the alignment, except for a midportion in which clustering of the query with reference B is strongly supported.  Similarity plotting shows that reference A is highly similar to the query across most of the alignment, except for the midportion noted in the bootscan.  There, reference B has the highest similarity, but that similarity is lower than the scores for A in the other portions of the alignment.  An important possibility is the existence of another parental sequence, say "D", that is closely related to B.  This is the case for HIV-1 group M, in which subtypes B and D form a monophyletic group across the entire genome.  If a subtype A/D recombinant sequence were analyzed with reference sequences from all subtypes except D, the bootscan might appear to support mosaicism of subtypes A and B.  However, the similarity plot might reveal lower than expected similarity in the region clustering with subtype B.
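
The cross-check described above, comparing the similarity plot against the bootscan, can be sketched as follows.  This is a rough illustration, not something SimPlot does for you: the function name, the input layout (one value per sliding window), the similarity numbers, and the 0.05 margin are all hypothetical and chosen only to mirror the A/B example.

```python
from itertools import groupby
from statistics import mean


def flag_low_similarity_segments(best_ref, similarity, margin=0.05):
    """Flag bootscan segments whose similarity is lower than expected.

    best_ref   : per-window name of the reference the query clusters with
    similarity : reference name -> per-window similarity to the query
    margin     : hypothetical tolerance; a segment is flagged when its mean
                 similarity is more than `margin` below the best segment
    Returns (reference, first_window, end_window_exclusive, mean_similarity).
    """
    segments = []
    start = 0
    # Collapse runs of consecutive windows that share a clustering reference.
    for ref, run in groupby(best_ref):
        length = len(list(run))
        sims = similarity[ref][start:start + length]
        segments.append((ref, start, start + length, mean(sims)))
        start += length

    # Use the best-matching segment as the yardstick for "expected" similarity.
    baseline = max(seg[3] for seg in segments)
    return [seg for seg in segments if baseline - seg[3] > margin]


# Hypothetical data: the query clusters with A, then B, then A again.
best_ref = ["A", "A", "A", "B", "B", "B", "A", "A", "A"]
similarity = {
    "A": [0.95, 0.96, 0.94, 0.80, 0.79, 0.81, 0.95, 0.96, 0.95],
    "B": [0.82, 0.81, 0.83, 0.88, 0.89, 0.87, 0.82, 0.80, 0.81],
}

flagged = flag_low_similarity_segments(best_ref, similarity)
for ref, first, end, sim in flagged:
    print(f"windows {first}-{end - 1}: clusters with {ref}, "
          f"mean similarity {sim:.2f}")
```

In this toy data only the middle segment is flagged: the query clusters with B there, yet is less similar to B than it is to A elsewhere, which is the pattern that should prompt a search for a missing parental sequence such as D.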

... more to come - and suggestions are welcome ...