As access to high-performance computing has increased over the years, the scientific community has in turn sought to analyze increasingly complex data. Even with sophisticated methods of analysis, discovering and understanding relationships in multi-dimensional data can be difficult, and this difficulty is dramatically amplified with each added dimension. Exploratory data analysis (EDA) has traditionally been used to discover patterns and garner substantive understanding of data by emphasizing the use of graphical representation (Behrens, 1997). However, methods for visualizing data beyond two or three dimensions are rarely used--most mathematical and statistical packages provide only basic 3-dimensional plotting.
With the exception of visualization tools that extend 3-dimensional plotting to 4 or 5 dimensions via time (motion) and/or physical features, most multidimensional displays represent data as symbols, or icons, where components of the icon represent values of variables. Icons in which data are mapped onto the lengths of graphical components have included profiles, stars and polygons (Siegel, Goldwyn, & Friedman, 1971), and glyphs (Anderson, 1960). Capitalizing on the human ability to perceive and remember variation in human faces, Chernoff (1973) introduced the use of cartoon faces to represent data by mapping variables onto the shape and size of such features as the area of the face, the curvature of the mouth, and the size and shape of the nose, eyebrows, and eyes.
In general, a graph is considered successful if the pattern it presents can be comprehended immediately. This can only be accomplished effectively by considering the properties of human perception and cognition. The advantage of Chernoff's faces comes from using a familiar object to integrate many variables. The disadvantage is that the faces carry too much meaning of their own: if, for example, a smile is not assigned to a variable with a correspondingly positive interpretation, the display becomes unintuitive. By taking advantage of the visual texture perception inherent in humans, icons can also be displayed densely, creating textural gradients and contours that indicate potentially interesting structures in the data, as in the Exvis system (Grinstein, Pickett, & Williams, 1989).
The goal of the current project is to combine the power of a very familiar object, a tree, with the human ability to detect change in patterns, in order to create a usable tool that is effective in deciphering relationships in a dataset. Observations are represented as binary trees in which variables are mapped onto branch lengths and onto the angles between branches and the trunk. Using a graphical user interface, the trees can then be sorted into groups based on their similarities; trees that look the same should therefore share some deeper similarity in the underlying data.
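The mapping from an observation to a tree can be illustrated with a minimal sketch. The layout used here is an assumption made for illustration, not TreeView's actual scheme: the first value is the trunk length, the remaining values are (angle in degrees, length) pairs, and branches attach in level order, bending alternately left and right. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: turns one observation into the line segments of a binary tree. */
final class TreeGeometrySketch {

    /** One straight segment of the tree, running from (x1, y1) to (x2, y2). */
    static final class Segment {
        final double x1, y1, x2, y2;
        final double heading; // direction of growth in radians; PI / 2 is straight up

        Segment(double x1, double y1, double length, double heading) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x1 + length * Math.cos(heading);
            this.y2 = y1 + length * Math.sin(heading);
            this.heading = heading;
        }
    }

    /**
     * Assumed layout: features[0] is the trunk length, followed by
     * (angle in degrees, length) pairs. Branch i (1-based) grows from the
     * tip of segment (i - 1) / 2; odd branches bend left, even ones right.
     * Coordinates use a y-up convention, not screen coordinates.
     */
    static List<Segment> build(double[] features) {
        List<Segment> segments = new ArrayList<>();
        // The trunk starts at the origin and points straight up.
        segments.add(new Segment(0, 0, features[0], Math.PI / 2));

        int branch = 1;
        for (int f = 1; f + 1 < features.length; f += 2, branch++) {
            Segment parent = segments.get((branch - 1) / 2);
            double turn = Math.toRadians(features[f]);
            double heading = parent.heading + (branch % 2 == 1 ? turn : -turn);
            segments.add(new Segment(parent.x2, parent.y2, features[f + 1], heading));
        }
        return segments;
    }
}
```

Under these assumptions, an observation with five values yields a trunk and two branches. In practice the raw values would first need to be rescaled to sensible ranges for lengths and angles, a step this sketch leaves out.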
A Lindenmayer system, or L-system, is a mathematical theory of plant development often used in computer graphics to render realistic flora. The central idea behind L-systems is rewriting--creating complex objects by successively replacing parts of a simple initial object (Prusinkiewicz & Lindenmayer, 1990). Although this project does not currently use the L-system algorithm to generate trees, the idea is the same: by manipulating aspects of the rewriting, categorically different trees can be generated.
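The rewriting idea can be shown with a short sketch of a deterministic, context-free L-system. This only illustrates the general technique; as noted above, TreeView does not generate its trees this way, and the class name and rule sets used here are chosen for illustration.

```java
import java.util.Map;

/** Illustrative sketch of L-system string rewriting (not part of TreeView). */
final class LSystemSketch {

    /**
     * Applies the production rules to every symbol of the current string,
     * in parallel, for the given number of iterations. Symbols without a
     * rule are copied unchanged.
     */
    static String rewrite(String axiom, Map<Character, String> rules, int iterations) {
        String current = axiom;
        for (int i = 0; i < iterations; i++) {
            StringBuilder next = new StringBuilder();
            for (char symbol : current.toCharArray()) {
                next.append(rules.getOrDefault(symbol, String.valueOf(symbol)));
            }
            current = next.toString();
        }
        return current;
    }
}
```

With Lindenmayer's classic algae rules (A becomes AB, B becomes A), three iterations starting from "A" produce "ABAAB". With a branching rule such as F becomes F[+F][-F], where brackets conventionally mark a branch and + and - mark turns, one iteration turns "F" into "F[+F][-F]". Changing the rules changes the category of plant that results.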
Humans have the ability to categorize the natural world by reducing large degrees of freedom down to "oak trees" or "maple trees", for instance. The overall shape of a tree, however, is not simply based on pair-wise correlations. It is the outcome of higher-order relationships, including interactions among many components. Yet, even with such complexity, a young child can easily make such distinctions. L-systems, which are simple procedural algorithms, can generate natural-looking trees, and humans can categorize these just as easily. Therefore, the shape of trees whose components are based upon data should provide a holistic view of the complex interactions that exist in that data.
The TreeView tool presented here was developed in Java, which was chosen for several reasons. First, it is object-oriented: the trees are themselves objects that go to the data, ask for their properties, and, when prompted, draw themselves. Second, 2-dimensional rendering in Java is not only easy to accomplish but also surprisingly fast. Even when variables are being remapped, there is no noticeable time lag, which is important in maintaining interactive continuity for the user. Third, it is portable: it can run on any system that has a Java interpreter (e.g., Windows, Apple, UNIX, or Linux). Because a graphical user interface is the basis of the program, and the interface is generally one of the most difficult parts of a program to recode for individual systems, portability was an important factor in the decision to use Java.
After opening the program, the user sees a menu with the option to load a dataset. The dataset should be formatted as a rectangular array, with observations as rows and variables as columns, in a delimited ASCII file (e.g., variables separated by commas, tabs, or spaces). The data file is then automatically parsed, variables are initially assigned to features of the tree, and the trees are generated on a Cartesian plane, where the X and Y axes can also be mapped to variables.
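A minimal sketch of parsing such a file follows. It is not TreeView's parser: it assumes that every field is numeric, that there is no header row, and that runs of commas, tabs, and spaces all count as a single delimiter. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parses delimited text into a rectangular array. */
final class DelimitedDataSketch {

    /**
     * Converts lines of numbers into rows (observations) and columns (variables).
     * Blank lines are skipped; a row with a different number of fields than the
     * first row is rejected because the data must be rectangular.
     */
    static double[][] parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Assumption: any run of commas and whitespace separates two fields.
            String[] fields = trimmed.split("[,\\s]+");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i]);
            }
            if (!rows.isEmpty() && rows.get(0).length != row.length) {
                throw new IllegalArgumentException("Row has " + row.length
                        + " fields, expected " + rows.get(0).length);
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }
}
```

The lines could be obtained with `java.nio.file.Files.readAllLines`. A file that also contains non-numeric columns, such as species labels, would need those columns handled separately.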
After the data is loaded, the user can interact with the trees on the plane. The trees can be dragged around the plane with the mouse in order to be sorted. Interaction should be familiar to the user, as it is similar to common drag-and-drop interfaces: trees can be selected by single-clicking on their trunks and deselected by clicking in white space. Several trees can also be selected at once by clicking in white space and dragging a selection rectangle around the desired trees; the entire selection can then be moved to other areas of the plane. A zoom feature is available via the mouse wheel.
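The rectangle selection described above can be sketched as follows. The sketch assumes that each tree is identified by the plane coordinates of the base of its trunk, stored as an (x, y) pair; how TreeView actually hit-tests trees is not stated, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of selecting and moving trees with a selection rectangle. */
final class RubberBandSketch {

    /**
     * Returns the indices of all trees whose position lies inside the rectangle
     * spanned by the mouse-press and mouse-release points. The corners are
     * normalized first, so the drag may run in any direction.
     */
    static List<Integer> select(double[][] positions,
                                double pressX, double pressY,
                                double releaseX, double releaseY) {
        double left = Math.min(pressX, releaseX);
        double right = Math.max(pressX, releaseX);
        double top = Math.min(pressY, releaseY);
        double bottom = Math.max(pressY, releaseY);

        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            double x = positions[i][0];
            double y = positions[i][1];
            if (x >= left && x <= right && y >= top && y <= bottom) {
                selected.add(i);
            }
        }
        return selected;
    }

    /** Shifts every selected tree by the same offset, moving the group as a unit. */
    static void move(double[][] positions, List<Integer> selected, double dx, double dy) {
        for (int index : selected) {
            positions[index][0] += dx;
            positions[index][1] += dy;
        }
    }
}
```

In a Swing or AWT program, the press and release points would typically come from mouse events after converting screen coordinates to plane coordinates, which matters once zooming is involved.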
The trees can also be painted: once the user has grouped trees by similarity, he or she can paint them any color to differentiate them during later interaction. The color remains constant for the rest of the session unless it is changed. This is especially useful when combined with one of the most important aspects of the program: feature mapping.
Variables are initially assigned to tree features in an essentially arbitrary way. For example, the first variable might be the X coordinate, the second the Y coordinate, the third the trunk length, followed by the angle of the first branch from the trunk, the length of the first branch, the angle of the second branch, the length of the second branch, and so on. These assignments can be changed on the fly. If some of the trees (grouped by location or color) show an initial similarity, a remapping might therefore either confirm that similarity, in that the trees remain similar to the others in their previous group, or indicate that a regrouping is in order.
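One way to picture feature mapping is as a small lookup table from tree features to data columns. The sketch below assumes the default order given in the example above and models a remapping as a swap between two features; the enum, the class name, and the fallback convention for features with no variable are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of a remappable assignment of variables to tree features. */
final class FeatureMappingSketch {

    /** Tree features in the order of the example above. */
    enum Feature {
        X, Y, TRUNK_LENGTH, BRANCH1_ANGLE, BRANCH1_LENGTH, BRANCH2_ANGLE, BRANCH2_LENGTH
    }

    static final int UNMAPPED = -1;

    // columnOf[feature.ordinal()] holds the data column that drives the feature.
    private final int[] columnOf = new int[Feature.values().length];

    /** Default mapping: variable i drives feature i; leftover features stay unmapped. */
    FeatureMappingSketch(int variableCount) {
        Arrays.fill(columnOf, UNMAPPED);
        for (int i = 0; i < Math.min(variableCount, columnOf.length); i++) {
            columnOf[i] = i;
        }
    }

    /** Remaps on the fly by exchanging the variables behind two features. */
    void swap(Feature a, Feature b) {
        int column = columnOf[a.ordinal()];
        columnOf[a.ordinal()] = columnOf[b.ordinal()];
        columnOf[b.ordinal()] = column;
    }

    /** Looks up the value that drives a feature, or the fallback if it is unmapped. */
    double value(double[] observation, Feature feature, double fallback) {
        int column = columnOf[feature.ordinal()];
        return column == UNMAPPED ? fallback : observation[column];
    }
}
```

Because the trees only read their values through the mapping, a remapping changes how every tree is drawn without touching the data or any positions and colors the user has already assigned.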
Because the data can include many observations, and therefore many trees, sorting by hand may be somewhat time-consuming. An option is therefore provided to save the tree locations on the plane in an external file, which can be opened in a later session after the same data file has been reloaded.
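Saving and restoring tree locations might look like the sketch below. The file format is an assumption, since TreeView's actual format is not described: one "x,y" line per tree, in the same order as the rows of the data file, which is why the same data file has to be reloaded first. The class name is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of saving and restoring tree positions. */
final class LayoutFileSketch {

    /** Writes one "x,y" line per tree, in data-file row order. */
    static void save(Path file, double[][] positions) {
        List<String> lines = new ArrayList<>();
        for (double[] position : positions) {
            lines.add(position[0] + "," + position[1]);
        }
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the positions back; row i belongs to observation i of the data file. */
    static double[][] load(Path file) {
        List<String> lines;
        try {
            lines = Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double[][] positions = new double[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] xy = lines.get(i).split(",");
            positions[i] = new double[] {Double.parseDouble(xy[0]), Double.parseDouble(xy[1])};
        }
        return positions;
    }
}
```

A more defensive version would also check that the number of saved positions matches the number of observations in the reloaded data file.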
Example 1 comes from a classic dataset often used in discriminant analysis and cluster analysis, but also to test multivariate visualization techniques. The dataset contains 150 random samples of flowers from three iris species, collected by Anderson (1935) and originally published by Fisher (1936). There are 50 observations for each species, each recording sepal length, sepal width, petal length, and petal width.
The following screenshots, Figure 1.1a and Figure 1.1b, show TreeView running on the iris dataset. The three species are easily recognizable.
Figure 1.2 shows the same data mapping as in the previous figures, except the trees have been painted to aid in categorization.
In Figure 1.3, the mapping of variables onto the x and y axes has been changed, but the initial attempt at categorization is preserved via the painting from Figure 1.2.
The iris dataset was easily sorted into three distinct groups using TreeView. However, that dataset contains only four variables. Although the following dataset has not been fully explored, it should provide a good example of the promise of using TreeView with data of higher dimensions. Example 2 uses a dataset containing 62 samples, each with measurements for 2000 genes, from colon tissue of colon-cancer patients (Alon et al., 1999). Forty of the observations come from malignant tissue in tumors, while 22 are from non-malignant regions of the colon. For this example, 50 random dimensions out of the original 2000 were extracted and loaded into TreeView. The trees were sorted for demonstration, but no remapping was conducted to help categorize the ambiguous trees; the sorting is based purely upon the original random mapping of variables to tree components. For some of the trees in the following screenshot it is unclear which category they belong to, but this would need to be tested experimentally in a usability study.
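How the 50 dimensions were drawn is not described. One common way to pick a random subset of columns without repeats is a partial Fisher-Yates shuffle, sketched below under that assumption. The class name, method names, and the use of a fixed seed for repeatability are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: draws a random subset of columns from a wide dataset. */
final class ColumnSampleSketch {

    /** Picks `count` distinct column indices out of `total` with a partial Fisher-Yates shuffle. */
    static int[] sampleColumns(int total, int count, long seed) {
        if (count < 0 || count > total) {
            throw new IllegalArgumentException("count must be between 0 and total");
        }
        int[] indices = new int[total];
        for (int i = 0; i < total; i++) {
            indices[i] = i;
        }
        Random random = new Random(seed);
        for (int i = 0; i < count; i++) {
            // Swap position i with a random position from i to the end.
            int j = i + random.nextInt(total - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        return Arrays.copyOf(indices, count);
    }

    /** Copies only the chosen columns of every observation. */
    static double[][] project(double[][] data, int[] columns) {
        double[][] reduced = new double[data.length][columns.length];
        for (int row = 0; row < data.length; row++) {
            for (int c = 0; c < columns.length; c++) {
                reduced[row][c] = data[row][columns[c]];
            }
        }
        return reduced;
    }
}
```

Calling `sampleColumns(2000, 50, seed)` and then `project` would reduce a 62-by-2000 array to 62-by-50; recording the seed makes it possible to recover the same subset later.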
In Figure 2.1, the trees in red and blue are from malignant tissue. Approximately 10 trees that looked very similar to the odd-looking trees in blue have been removed for the sake of space. The trees in green and the cluster in black are from normal tissue. Several very small trees like those in black were also removed because they were too small to be meaningful in this demonstration. It is left to the reader to determine the significance of the appearance of the trees in each group. The goal, however, is to demonstrate that even with high-dimensional data, the tool holds much promise.
Although it is in the initial phases of development, TreeView offers a promising new look at multi-dimensional data. The tool exploits the human ability to efficiently differentiate between individual types of trees, but does not share the problem of other icon-based systems: variables can be assigned to features of the tree entirely at random, if desired, without obstructing understanding. Plans for future releases of the tool include a fractal index to summarize each tree quantitatively, a neural network to recognize patterns in the groupings that the user has created and to apply those patterns automatically to the rest of the dataset, and usability testing to determine experimentally how efficient this tool is compared to other methods of visualizing multidimensional data.
Executable Jar File: The source code is also contained in this file.
Download either the Java Runtime Environment (JRE) or Development Kit (JDK) at java.sun.com in order to execute the above file.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745-6750.
Anderson, E. (1935). The irises of the Gaspe peninsula. Bulletin of the American Iris Society, 59, 2-5.
Anderson, E. (1960). A semi-graphical method for the analysis of complex problems. Technometrics, 2, 387-392.
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
Chernoff, H. (1973). The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68, 361-368.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Grinstein, G., Pickett, R., & Williams, M. G. (1989). EXVIS: An exploratory visualization environment. Graphics Interface '89, 254-261.
Prusinkiewicz, P., & Lindenmayer, A. (1990). The Algorithmic Beauty of Plants. New York: Springer-Verlag New York Inc.
Siegel, J. H., Goldwyn, R. M., & Friedman, H. P. (1971). Pattern and process in the evolution of human septic shock. Surgery, 70, 232-245.