Assignment 4: Exploratory Data Analysis
(Assigned Sept 21, due Oct 10 at 5pm)
Summary of Results
I was really pleased with almost all of these assignments. Your
critiques and comparisons of the tools were also very helpful, and I'll
pass them on to the tools' developers.
The two most useful visualizations were TableLens/Eureka and
parallel coordinates in XMDV. Spotfire was a close third (especially
because of its ability to filter the data flexibly). One person found
glyphs in XMDV to be useful. Many others also found the scatterplot
matrices of XMDV to be useful. Whether or not a tool was useful
depended on the kind of dataset used.
TableLens was found to be very stable and able to handle datasets of
any size. It has features for categorizing subsets of the data and for
zooming in on them; some people made more use of these than others.
Parallel coordinates was extremely useful for finding multidimensional
relationships. People seemed to have no problem understanding and
using it, modulo the fact that the particular implementation we used
had problems. From the results of this assignment I think we can
confidently say that this should become a standard analysis technique.
Spotfire was especially useful when its scatterplot view was paired
with its ability to dynamically filter which subset of data points is
being viewed. One person figured out how to make it show multiple
scatterplots simultaneously, which helped very much.
One observation: the visualizations seemed to work better when better
color choices were made (and these were never the defaults that come
with the tool, with the exception of TableLens).
Below are the best examples of EDA, in my opinion, although I must
say that almost everyone did a great job. I chose writeups that (a)
had an interesting dataset, (b) made good use of more than one
visualization type, and (c) did a good job writing up the analysis,
telling a story about the data.
It should be noted that most of these were done by pairs of
people, which is what I asked people to do. I think the interplay
between two people working together tends to produce the best results.
- Monica Fernandes. Population survey data; especially good use of
categories in TableLens/Eureka, and nice color choices for parallel
coordinates.
- Barbara Rosario and Michal Feldman. Biomedical data. Nice analytical
use of parallel coordinates, and integration of this with scatterplot
matrices. (Less nice: the use of Word format instead of HTML.)
- Francis Li and Sam Madden. Homicide data. Nice use of TableLens, and
integration with parallel coordinates.
- Philip Buonadonna and Jason Hill.
Very interesting dataset about a chapped lips study. Nice description
of how to use these visualizations to analyze the results of a controlled experiment.
- Jennifer English and Joanna Plattner. Education data. Nice
interaction between TableLens and Spotfire.
- Valerie Lanard. Primate sleep dataset. The only person to use (and
like) glyphs.
Assignment
In this assignment you will use a set of tools to perform exploratory
data analysis on a dataset. The three tasks for which visualization is
an important tool are exploration (searching a data set for
interesting phenomena), confirmation (validating or refuting a
hypothesis about features you believe to be present in the data), and
presentation (conveying information to others). Our focus in this
assignment will be on the first two tasks.
It is important that you start this assignment soon. A good strategy
is to explore the dataset, then go away and think about it, then go
back to the data again. Doing it all at the end in a rush is less
likely to produce interesting results.
You are encouraged to work in pairs for this assignment. Both
Matt Ward and Marti Hearst will be looking over the results.
- Obtain a dataset or use one of those supplied. This dataset must
have at least five dimensions (fields), and at least one of the important
fields should be nominal data (i.e., have no inherent ordering).
(Note that for some tools, like XMDV, you have to convert nominal data
into numbered codes; for example, Japan = 1, USA = 2, Europe = 3. A
minimal coding sketch appears after this list.) The richer the dataset
is, the more likely you are to find something interesting.
- Think about what kinds of relationships you expect to find in the
data. Write down at least three hypotheses about what you will find
in the dataset. For example, you might hypothesize that senior citizens
spend more time on web pages, on average, than younger people.
- See if you can find evidence to verify or refute any of these
hypotheses. The idea is that these tools should help formulate
hypotheses that would then be analyzed with rigorous statistical
tests. (But we won't be doing this part in this class; for more
information on how to do that part, see the web pages for
Rashmi Sinha's Quantitative Methods class.)
Use the analysis tools to look for, e.g. (a scripted sketch of these
checks appears after this list):
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- Use the tools to explore the dataset and look for other
unexpected kinds of relations. Note what features of the visualized
data attracted your attention. If the tool supports it, try to
highlight or otherwise isolate the subset of the data containing an
interesting feature.
- Write up a 3-5 page summary (not counting screen shots) of what you
found -- both expected and unexpected. This can
include relationships that did not appear even though you thought they
might. Try to report on at least one surprising piece of information.
You should include some screen shots to help convey your discoveries
(the third task of visualization - presentation!) but please scale
them so they aren't too large.
- Comment on the tool or tools you tried. What features did you find
useful? Which ones were intuitive to use, and which were hard to
understand? Was there any functionality that the tool did not have that
you wished it did? In other words, how would you improve on the tool?
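Below is a minimal sketch of the nominal-to-numeric recoding mentioned
in the first step, written in Python with pandas. The file name
"cars.csv" and the column name "origin" are hypothetical placeholders,
not files we supply; substitute your own dataset.

    import pandas as pd

    df = pd.read_csv("cars.csv")  # hypothetical input file

    # Map each distinct nominal value to a small integer code,
    # e.g. Europe = 1, Japan = 2, USA = 3. Keep the mapping around
    # so the codes can be translated back when reading the plots.
    codes = {name: i + 1 for i, name in enumerate(sorted(df["origin"].unique()))}
    df["origin_code"] = df["origin"].map(codes)

    print(codes)  # e.g. {'Europe': 1, 'Japan': 2, 'USA': 3}

    # Write the coded table back out; converting it to a tool's own
    # input format (e.g. XMDV's) is a separate step if needed.
    df.to_csv("cars_coded.csv", index=False)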
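And here is a rough scripted version of the checks listed above
(pairwise correlations, outliers, trends), again in Python with pandas;
the visualization tools let you do the same things visually. The file
and column names are again hypothetical placeholders.

    import pandas as pd

    df = pd.read_csv("cars_coded.csv")  # hypothetical input file
    numeric = df.select_dtypes("number")

    # Relationships between pairs of variables: correlation matrix.
    print(numeric.corr().round(2))

    # Outliers: flag rows more than 3 standard deviations from the
    # mean in any numeric field.
    z = (numeric - numeric.mean()) / numeric.std()
    print(df[(z.abs() > 3).any(axis=1)])

    # Trends: track an average across an ordered field
    # (model_year and mpg are hypothetical column names).
    print(df.groupby("model_year")["mpg"].mean())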
Sample Datasets:
Programs (that are or will be) available on the lab machines: