In this assignment you will use a set of tools to perform exploratory
data analysis on a dataset. The three tasks for which visualization is
an important tool are exploration (searching a data set for
interesting phenomena), confirmation (validating or refuting a
hypothesis about features you believe to be present in the data), and
presentation (conveying information to others). Our focus in this
assignment will be on the first two tasks.
It is important that you start this assignment soon. A good strategy
is to explore the dataset, then go away and think about it, then go
back to the data again. Doing it all at the end in a rush is less
likely to produce interesting results.
You are encouraged to work in pairs for this assignment.
- Obtain a dataset or use one of those supplied. This dataset must
have at least five dimensions (fields) and at least one of the important
fields should be nominal data (i.e., have no inherent ordering).
(Note that for some tools like Xmdv, you have to convert nominal data
into numbered codes. For example, Japan = 1, USA = 2, Europe = 3.)
The dataset should have at least 300 records.
The richer it is, the more likely you are to find something
interesting.
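For tools that require numeric codes, the nominal-to-number conversion mentioned above can be scripted rather than done by hand. A minimal Python sketch (the records and the `region` field are hypothetical examples):

```python
# Sketch: convert a nominal field to numbered codes for tools
# (like Xmdv) that require numeric categories.
# The records and the 'region' field are hypothetical examples.
records = [
    {"model": "Civic", "region": "Japan"},
    {"model": "Mustang", "region": "USA"},
    {"model": "Golf", "region": "Europe"},
    {"model": "Accord", "region": "Japan"},
]

# Assign a stable code to each distinct nominal value, in order seen.
codes = {}
for r in records:
    codes.setdefault(r["region"], len(codes) + 1)
    r["region_code"] = codes[r["region"]]

print(codes)  # {'Japan': 1, 'USA': 2, 'Europe': 3}
```

Keeping the mapping in one place makes it easy to decode the numbers back to labels when writing up results.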
- Think about what kinds of relationships you expect to find in the
data. Write down at least three hypotheses about what you will find
in the dataset. For example, you might hypothesize that senior
citizens spend more time on web pages, on average, than younger people.
- See if you can find evidence to verify or refute any of these
hypotheses. The idea is that these tools should help formulate
hypotheses that would then be analyzed with rigorous statistical
tests. (But we won't be doing this part in this class; for more
information on how to do that part, see the web pages for
Rashmi Sinha's Quantitative Methods class.)
Use the analysis tools to look for, e.g.,
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- Use the tools to explore around the dataset and look for other
unexpected kinds of relations. Note what features of the visualized
data attracted your attention or focus. If the tool supports it, try to
highlight or otherwise isolate the subset of the data containing an
interesting feature.
- Write up a 3-5 page summary (not counting screen shots) of what you
found -- both expected and unexpected. This can
include relationships that did not appear even though you thought they
might. Try to report on at least one surprising piece of information.
You should include some screen shots to help convey your discoveries
(the third task of visualization - presentation!) but please scale
them so they aren't too large.
- Comment on the tool or tools you tried. What features did you find
useful? Which ones were intuitive to use, and which were hard to
understand? Was there any functionality that the tool did not have that
you wished it did? In other words, how would you improve on the tool?
Assignment Results
Some general lessons:
- In many cases the anticipated hypotheses did not hold. Only through
extensive exploration were interesting new relations (that nevertheless
make sense) found.
- Once a relationship is found using one tool, use another tool to
verify it.
- If the dataset is large, start with a reduced one.
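The last lesson is easy to automate: a small sketch for drawing a reproducible random subset before loading a large file into a slow tool (the record count and seed are arbitrary choices):

```python
import random

# Sketch: draw a reproducible random subset of a large dataset so
# exploratory tools stay responsive. The records here are stand-ins
# for real rows; the seed and sizes are arbitrary.
random.seed(42)
records = list(range(10_000))          # stand-in for the full dataset
sample = random.sample(records, 300)   # reduced set for exploration

print(len(sample))  # 300
```

Fixing the seed means the same subset can be reloaded across tools, which helps when verifying a finding with a second tool.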
Below are some comments on techniques used and findings from
people's assignments.
Movie Remake Database,
Linda Harjono and Saifon Obromsook
The dataset focuses on how much time elapses between movie remakes and how
similar the remakes are to the original. Computed and added values
for variables that were not present in the original dataset.
- Parallax to explore most hypotheses
- Parallax plus Spotfire found a trend that combined attributes
- Eureka found erroneous data points
- Eureka led them to an unusually behaving set of data ("That's
Entertainment" is composed of verbatim pieces from many movies).
- Parallax to compare different time periods.
Women Wage Earners in Philadelphia in 1893,
Stephanie Hornung and Susanne Eklund
- All three interfaces to explore initial hypotheses; most were
found not to hold.
- Spotfire color coding to make a scatterplot useful
- Parallax used for interrelations among several variables (e.g.,
for the hypothesis that different industries hire women with different
marital and child-rearing status). Found interesting differences
between Telephone operators and Clothing manufacturers' hiring habits.
- Eureka to find relationships between hours worked and wages
(found industry-specific results with respect to Saturday hours)
- Unanticipated Discoveries
- Spotfire to find that women in the clothing industry sew their own
clothes, while those in office jobs do not (they hypothesized that
office jobs require more sophisticated store-bought clothes than
factory jobs)
- Parallax to find relationships between vacation days and illness,
verified with Spotfire
"Easterly" dataset on international labor and political
issues, Leah Zagreus
Added a category code, removed records with missing data.
Unexpected Discoveries:
- Parallax shows an interesting relationship when showing countries
with zero strikes -- it turns out all are OPEC nations.
- Nice use of the Zebra filter in Parallax to see changes over time.
School Effectiveness in Inner London (1985), Maggie Law
and Vivien Petras
Note Figures 7 and 8: Parallax allows us to see relationships among
several variables simultaneously. These figures show the relationship
between high exam scores and low need for food assistance, and to some
extent the converse. However, they also show a different pattern in
ethnic makeup of the two groups, which might not be noticed at first.
Strikes in OECD Countries, Craig Rixford
- Spotfire to find outliers
- All hypotheses wrong (!!). Strike patterns correlated with
countries. Once this was realized, Eureka made it clear.
- Eureka "spotlight" feature to look at high and low ends of strike
spectrum.
Web News Dataset,
Jean-Anne Fitzpatrick, James Reffell, Moryma Aydelott
A difficult dataset, involving term frequencies, which are very
high-dimensional. They resolved the issue by pre-selecting certain term
groups. This resulted in readable and interesting graphs, but reduced
likelihood of unexpected discoveries.
- Eureka for comparing term co-occurrences across time and websites.
- Spotfire to confirm spiking behavior.
- Spotfire to find very interesting outliers (Houston News had more
Enron coverage; Detroit Free Press had more KMart coverage,
corresponding to the companies' hometowns).
Auto Fatality Data, Sarah Waterson and Wayne Kao.
A very large dataset -- perhaps too large for Parallax.
- Eureka for comparing pairs of attributes
About the Tools:
- Parallax has serious usability issues; is not walk-up-and-use;
needs undo; complex data formatting; misleadingly hides many data
points behind one line; many other issues. However, once
you get used to it, it is very powerful. Spotting trends is easy, but
drilling down is difficult. Too slow on large datasets.
- Eureka is polished, but has some problems with distortion for
focus-plus-context. Also needs undo. It is also quite powerful,
especially the sorting feature.
- Spotfire is polished and fast, very good for scatterplots, but less
powerful, especially for seeing relationships among more than two
variables, and awkward with nominal data. It does make it easy to see
information about individual data points.