In this assignment you will use a set of tools to perform exploratory
data analysis on a dataset. The three tasks for which visualization is
an important tool are exploration (searching a data set for
interesting phenomena), confirmation (validating or refuting a
hypothesis about features you believe to be present in the data), and
presentation (conveying information to others). Our focus in this
assignment will be on the first two tasks.
It is important that you start this assignment soon. A good strategy
is to explore the dataset, then go away and think about it, then go
back to the data again. Doing it all at the end in a rush is less
likely to produce interesting results.
You are encouraged to work in pairs for this assignment.
- Obtain a dataset or use one of those supplied. This dataset must
have at least five dimensions (fields) and at least one of the important
fields should be nominal data (i.e., have no inherent ordering).
(Note that for some tools like Xmdv, you have to convert nominal data
into numbered codes. For example, Japan = 1, USA = 2, Europe = 3.)
The dataset should have at least 300 records.
The richer it is, the more likely you are to find something
interesting.
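For tools that require numeric codes, the nominal-to-number conversion mentioned above can be scripted rather than done by hand. A minimal Python sketch (the records and the `region` field are hypothetical examples):

```python
# Sketch: convert a nominal field to numbered codes for tools
# (like Xmdv) that require numeric categories.
# The records and the 'region' field are hypothetical examples.
records = [
    {"model": "Civic", "region": "Japan"},
    {"model": "Mustang", "region": "USA"},
    {"model": "Golf", "region": "Europe"},
    {"model": "Accord", "region": "Japan"},
]

# Assign a stable code to each distinct nominal value, in order seen.
codes = {}
for r in records:
    codes.setdefault(r["region"], len(codes) + 1)
    r["region_code"] = codes[r["region"]]

print(codes)  # {'Japan': 1, 'USA': 2, 'Europe': 3}
```

Keeping the mapping in one place makes it easy to decode the numbers back to labels when writing up results.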
- Think about what kinds of relationships you expect to find in the
data. Write down at least three hypotheses about what you will find
in the dataset. For example, you might hypothesize that senior
citizens spend more time on web pages, on average, than younger people.
- See if you can find evidence to verify or refute any of these
hypotheses. The idea is that these tools should help formulate
hypotheses that would then be analyzed with rigorous statistical
tests. (But we won't be doing this part in this class; for more
information on how to do that part, see the web pages for
Rashmi Sinha's Quantitative Methods class.)
Use the analysis tools to look for, e.g.,
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- Use the tools to explore around the dataset and look for other
unexpected kinds of relations. Note what features of the visualized
data attracted your attention or focus. If the tool supports it, try to
highlight or otherwise isolate the subset of the data containing an
interesting feature.
- Write up a 3-5 page summary (not counting screen shots) of what you
found -- both expected and unexpected. This can
include relationships that did not appear even though you thought they
might. Try to report on at least one surprising piece of information.
You should include some screen shots to help convey your discoveries
(the third task of visualization - presentation!) but please scale
them so they aren't too large.
- Comment on the tool or tools you tried. What features did you find
useful? Which ones were intuitive to use, and which were hard to
understand? Was there any functionality that the tool did not have that
you wished it did? In other words, how would you improve on the tool?
Assignment Results
Some general lessons:
- In many cases the anticipated hypotheses did not hold. Only through
extensive exploration were interesting new relations (that nevertheless
make sense) found.
- Once a relationship is found using one tool, use another tool to
verify it.
- If the dataset is large, start with a reduced one.
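The last lesson is easy to automate: a small sketch for drawing a reproducible random subset before loading a large file into a slow tool (the record count and seed are arbitrary choices):

```python
import random

# Sketch: draw a reproducible random subset of a large dataset so
# exploratory tools stay responsive. The records here are stand-ins
# for real rows; the seed and sizes are arbitrary.
random.seed(42)
records = list(range(10_000))          # stand-in for the full dataset
sample = random.sample(records, 300)   # reduced set for exploration

print(len(sample))  # 300
```

Fixing the seed means the same subset can be reloaded across tools, which helps when verifying a finding with a second tool.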
Below are some comments on techniques used and findings from
people's assignments.
Movie Remake Database,
Linda Harjono and Saifon Obromsook
The dataset focuses on how much time elapses between movie remakes and how
similar the remakes are to the original. Computed and added values
for variables that were not present in the original dataset.
- Parallax to explore most hypotheses
- Parallax plus Spotfire found a trend that combined attributes
- Eureka found erroneous data points
- Eureka led them to an unusually behaving set of data ("That's
Entertainment" is composed of verbatim pieces from many movies).
- Parallax to compare different time periods.
Women Wage Earners in Philadelphia in 1893,
Stephanie Hornung and Susanne Eklund
- All three interfaces to explore initial hypotheses; most were
found not to hold.
- Spotfire color coding to make a scatterplot useful
- Parallax used for interrelations among several variables (e.g.,
for the hypothesis that different industries hire women with different
marital and child-rearing status). Found interesting differences
between Telephone operators and Clothing manufacturers' hiring habits.
- Eureka to find relationships between hours worked and wages
(found industry-specific results with respect to Saturday hours)
- Unanticipated Discoveries
- Spotfire to find that women in the clothing industry sew their own
clothes, while those in office jobs do not (they hypothesized that
office jobs require more sophisticated store-bought clothes than
factory jobs)
- Parallax to find relationships between vacation days and illness,
verified with Spotfire
"Easterly" dataset on international labor and political
issues, Leah Zagreus
Added a category code, removed records with missing data.
Unexpected Discoveries:
- Parallax shows an interesting relationship when showing countries
with zero strikes -- it turns out all are OPEC nations.
- Nice use of the Zebra filter in Parallax to see changes over time.
School Effectiveness in Inner London (1985), Maggie Law
and Vivien Petras
Note Figures 7 and 8: Parallax allows us to see relationships among
several variables simultaneously. These figures show the relationship
between high exam scores and low need for food assistance, and to some
extent the converse. However, they also show a different pattern in
ethnic makeup of the two groups, which might not be noticed at first.
Strikes in OECD Countries, Craig Rixford
- Spotfire to find outliers
- All hypotheses wrong (!!). Strike patterns correlated with
countries. Once this was realized, Eureka made it clear.
- Eureka "spotlight" feature to look at high and low ends of strike
spectrum.
Web News Dataset,
Jean-Anne Fitzpatrick, James Reffell, Moryma Aydelott
A difficult dataset, involving term frequencies, which are very
high-dimensional. They resolved the issue by pre-selecting certain term
groups. This resulted in readable and interesting graphs, but reduced
likelihood of unexpected discoveries.
- Eureka for comparing term co-occurrences across time and websites.
- Spotfire to confirm spiking behavior.
- Spotfire to find very interesting outliers (Houston News had more
Enron coverage; Detroit Free Press had more KMart coverage,
corresponding to the companies' hometowns).
Auto Fatality Data, Sarah Waterson and Wayne Kao.
A very large dataset -- perhaps too large for Parallax.
- Eureka for comparing pairs of attributes
About the Tools:
- Parallax has serious usability issues; is not walk-up-and-use;
needs undo; complex data formatting; misleadingly hides many data
points behind one line; many other issues. However, once
you get used to it, it is very powerful. Spotting trends is easy, but
drilling down is difficult. Too slow on large datasets.
- Eureka is polished, but has some problems with distortion for
focus-plus-context. Also needs undo. It is also quite powerful,
especially the sorting feature.
- Spotfire is polished and fast, very good for scatterplots, but less
powerful, especially for seeing relationships among more than two
variables, and awkward with nominal data. It does make it easy to see
information about individual data points.