Assignment 4: Exploratory Data Analysis
(Assigned Sept 21, due Oct 10 at 5pm)
Summary of Results
I was really pleased with almost all of these assignments. Your
critiques and comparisons of the tools were also very helpful, and I'll
pass them on to the tools' developers.
The two most useful visualizations were TableLens/Eureka and
parallel coordinates in XMDV. Spotfire was a close third (especially
because of its ability to filter the data flexibly). One person found
glyphs in XMDV to be useful. Many others also found the scatterplot
matrices of XMDV to be useful. Whether or not a tool was useful
depended on the kind of dataset used.
TableLens was found to be very stable and able to handle datasets of
any size. It has features for categorizing subsets of the data and for
zooming in on them; some people made more use of these than others.
Parallel coordinates was extremely useful for finding multidimensional
relationships. People seemed to have no problem understanding and
using it, modulo the fact that the particular implementation we used
had problems. From the results of this assignment I think we can
confidently say that this should become a standard analysis technique.
Spotfire was especially useful when its scatterplot view was paired
with its ability to dynamically filter which subset of data points is
being viewed. One person figured out how to make it show multiple
scatterplots simultaneously, which helped very much.
One observation: the visualizations seemed to work better when better
color choices were made (and these were never the defaults that come
with the tool, with the exception of TableLens).
Below are the best examples of EDA, in my opinion, although I must
say that almost everyone did a great job. I chose writeups that (a)
had an interesting dataset, (b) made good use of more than one
visualization type, and (c) did a good job writing up the analysis,
telling a story about the data.
It should be noted that most of these were done by pairs of
people, which is what I asked people to do. I think the interplay
between two people working together tends to produce the best results.
- Monica Fernandes. Population survey data; especially good use of
categories in TableLens/Eureka, and nice color choices for parallel
coordinates.
- Barbara Rosario and Michal Feldman. Biomedical data. Nice analytical
use of parallel coordinates, and integration of this with scatterplot
matrices. (Less nice: the use of Word format instead of HTML.)
- Francis Li and Sam Madden. Homicide data. Nice use of TableLens, and
integration with parallel coordinates.
- Philip Buonadonna and Jason Hill.
Very interesting dataset about a chapped lips study. Nice description
of how to use these visualizations to analyze the results of a controlled experiment.
- Jennifer English and Joanna Plattner. Education data. Nice
interaction between TableLens and Spotfire.
- Valerie Lanard. Primate sleep dataset. The only person to use (and
like) glyphs.
Assignment
In this assignment you will use a set of tools to perform exploratory
data analysis on a dataset. The three tasks for which visualization is
an important tool are exploration (searching a data set for
interesting phenomena), confirmation (validating or refuting a
hypothesis about features you believe to be present in the data), and
presentation (conveying information to others). Our focus in this
assignment will be on the first two tasks.
It is important that you start this assignment soon. A good strategy
is to explore the dataset, then go away and think about it, then go
back to the data again. Doing it all at the end in a rush is less
likely to produce interesting results.
You are encouraged to work in pairs for this assignment. Both
Matt Ward and Marti Hearst will be looking over the results.
- Obtain a dataset or use one of those supplied. This dataset must
have at least five dimensions (fields), and at least one of the important
fields should be nominal data (i.e., have no inherent ordering).
(Note that for some tools, like XMDV, you have to convert nominal data
into numbered codes; for example, Japan = 1, USA = 2, Europe = 3. A
minimal coding sketch appears after this list.) The richer the dataset
is, the more likely you are to find something interesting.
- Think about what kinds of relationships you expect to find in the
data. Write down at least three hypotheses about what you will find
in the dataset. For example, you might hypothesize that senior citizens
spend more time on web pages, on average, than younger people.
- See if you can find evidence to verify or refute any of these
hypotheses. The idea is that these tools should help formulate
hypotheses that would then be analyzed with rigorous statistical
tests. (But we won't be doing this part in this class; for more
information on how to do that part, see the web pages for
Rashmi Sinha's Quantitative Methods class.)
Use the analysis tools to look for, e.g. (a scripted sketch of these
checks appears after this list):
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- Use the tools to explore the dataset and look for other
unexpected kinds of relations. Note what features of the visualized
data attracted your attention. If the tool supports it, try to
highlight or otherwise isolate the subset of the data containing an
interesting feature.
- Write up a 3-5 page summary (not counting screen shots) of what you
found -- both expected and unexpected. This can
include relationships that did not appear even though you thought they
might. Try to report on at least one surprising piece of information.
You should include some screen shots to help convey your discoveries
(the third task of visualization - presentation!) but please scale
them so they aren't too large.
- Comment on the tool or tools you tried. What features did you find
useful? Which ones were intuitive to use, and which were hard to
understand? Was there any functionality that the tool did not have that
you wished it did? In other words, how would you improve on the tool?
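Below is a minimal sketch of the nominal-to-numeric recoding mentioned
in the first step, written in Python with pandas. The file name
"cars.csv" and the column name "origin" are hypothetical placeholders,
not files we supply; substitute your own dataset.

    import pandas as pd

    df = pd.read_csv("cars.csv")  # hypothetical input file

    # Map each distinct nominal value to a small integer code,
    # e.g. Europe = 1, Japan = 2, USA = 3. Keep the mapping around
    # so the codes can be translated back when reading the plots.
    codes = {name: i + 1 for i, name in enumerate(sorted(df["origin"].unique()))}
    df["origin_code"] = df["origin"].map(codes)

    print(codes)  # e.g. {'Europe': 1, 'Japan': 2, 'USA': 3}

    # Write the coded table back out; converting it to a tool's own
    # input format (e.g. XMDV's) is a separate step if needed.
    df.to_csv("cars_coded.csv", index=False)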
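And here is a rough scripted version of the checks listed above
(pairwise correlations, outliers, trends), again in Python with pandas;
the visualization tools let you do the same things visually. The file
and column names are again hypothetical placeholders.

    import pandas as pd

    df = pd.read_csv("cars_coded.csv")  # hypothetical input file
    numeric = df.select_dtypes("number")

    # Relationships between pairs of variables: correlation matrix.
    print(numeric.corr().round(2))

    # Outliers: flag rows more than 3 standard deviations from the
    # mean in any numeric field.
    z = (numeric - numeric.mean()) / numeric.std()
    print(df[(z.abs() > 3).any(axis=1)])

    # Trends: track an average across an ordered field
    # (model_year and mpg are hypothetical column names).
    print(df.groupby("model_year")["mpg"].mean())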
Sample Datasets:
Programs (that are or will be) available on the lab machines: