SIMS 247: Information Visualization and Presentation

In this assignment you will use a set of tools to perform exploratory data analysis on a dataset. The three tasks for which visualization is an important tool are exploration (searching a dataset for interesting phenomena), confirmation (validating or refuting a hypothesis about features you believe to be present in the data), and presentation (conveying information to others). Our focus in this assignment will be the first two tasks.

It is important that you start this assignment soon. A good strategy is to explore the dataset, then go away and think about it, then go back to the data again. Doing it all at the end in a rush is less likely to produce interesting results.

You are encouraged to work in pairs for this assignment.

  1. Obtain a dataset or use one of those supplied. This dataset must have at least five dimensions (fields), and at least one of the important fields should be nominal data (i.e., have no inherent ordering). (Note that for some tools, like Xmdv, you have to convert nominal data into numbered codes. For example, Japan = 1, USA = 2, Europe = 3.) The dataset should have at least 300 records. The richer it is, the more likely you are to find something interesting.

  2. Think about what kinds of relationships you expect to find in the data. Write down at least three hypotheses about what you will find in the dataset. For example, you might hypothesize that senior citizens spend more time on web pages on average than younger people.

  3. See if you can find evidence to support or refute any of these hypotheses. The idea is that these tools should help formulate hypotheses that would then be analyzed with rigorous statistical tests. (We won't be doing that part in this class; for more information on how to do it, see the web pages for Rashmi Sinha's Quantitative Methods class.) Use the analysis tools to look for, e.g.,

    • relationships between pairs of variables (correlations, clusters)
    • outliers of various kinds
    • trends

  4. Use the tools to explore the dataset and look for other, unexpected kinds of relationships. Note which features of the visualized data attracted your attention or focus. If the tool supports it, try to highlight or otherwise isolate the subset of the data containing an interesting feature.

  5. Write up a 3-5 page summary (not counting screen shots) of what you found, both expected and unexpected. This can include relationships that did not appear even though you thought they might. Try to report on at least one surprising piece of information. You should include some screen shots to help convey your discoveries (the third task of visualization - presentation!) but please scale them so they aren't too large.

  6. Comment on the tool or tools you tried. What features did you find useful? Which ones were intuitive to use, and which were hard to understand? Was there any functionality that the tool did not have that you wished it did? In other words, how would you improve on the tool?
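
For tools that need numbered codes for nominal fields (as noted in step 1 for Xmdv), the conversion might be sketched in Python as below. This is only an illustration and not part of any of the tools: the function name encode_nominal and the sample values are hypothetical, and codes are assumed to be assigned in order of first appearance.

```python
def encode_nominal(values):
    """Map each distinct nominal value to an integer code.

    Codes start at 1 and are assigned in order of first appearance
    (an assumption for this sketch; any consistent mapping would do).
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical sample column; a real dataset would supply these values.
origins = ["Japan", "USA", "Europe", "USA", "Japan"]
encoded, codes = encode_nominal(origins)
print(codes)    # {'Japan': 1, 'USA': 2, 'Europe': 3}
print(encoded)  # [1, 2, 3, 2, 1]
```

Keep the code table next to your plots: the numbers are labels only, so any apparent ordering or distance between them in a visualization carries no meaning.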

Sample Datasets:

Programs that are available on the lab machines:

Assignment Results

Some general lessons:

Below are some comments on the techniques used and the findings reported in people's assignments.

Movie Remake Database, Linda Harjono and Saifon Obromsook

Women Wage Earners in Philadelphia in 1893, Stephanie Hornung and Susanne Eklund

"Easterly" dataset on international labor and political issues, Leah Zagreus

School Effectiveness in Inner London (1985), Maggie Law and Vivien Petras

Strikes in OECD Countries, Craig Rixford

Web News Dataset, Jean-Anne Fitzpatrick, James Reffell, Moryma Aydelott

Auto Fatality Data, Sarah Waterson and Wayne Kao

About the Tools: