Assignment 3: Exploratory Data Analysis

In this assignment you will use a set of tools to perform exploratory data analysis on a dataset. The three tasks for which visualization is an important tool are exploration (searching a dataset for interesting phenomena), confirmation (validating or refuting a hypothesis about features you believe to be present in the data), and presentation (conveying information to others). Our focus in this assignment will be the first two tasks.

  • Think about what kinds of relationships you expect to find in the data. Write down at least three hypotheses about what you will find in the dataset.

  • See if you can find evidence to support or refute any of these hypotheses. The idea is that these tools should help you formulate hypotheses that would then be analyzed with rigorous statistical tests. (We won't be doing that part in this class; for more information on how to do it, see the web pages for Rashmi Sinha's Quantitative Methods class.) Use the analysis tools to look for, e.g.,

    • relationships between pairs of variables (correlations, clusters)
    • outliers of various kinds
    • trends

  • Use the tools to explore the dataset and look for other, unexpected kinds of relationships. Note which features of the visualized data attracted your attention. If the tool supports it, try to highlight or otherwise isolate the subset of the data containing an interesting feature.

  • Write up a 3-5 page summary (not counting screen shots) of what you found, both expected and unexpected. This can include relationships that did not appear even though you thought they might. Try to report on at least one surprising piece of information. You should include some screen shots to help convey your discoveries (the third task of visualization: presentation!) but please scale them so they aren't too large.

  • Comment on the tool or tools you tried. What features did you find useful? Which ones were intuitive to use, and which were hard to understand? Was there any functionality that the tool did not have that you wished it did? In other words, how would you improve on the tool?
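    The assignment itself is meant to be done in the visualization tools, but the pairwise relationships and outliers mentioned above can also be cross-checked numerically. The following is only a rough sketch, not part of the required work: the function names `pearson` and `outliers`, the z-score threshold of 3, and the sample values are all assumptions made for illustration, and none of it replaces a rigorous statistical test.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sx * sy)


def outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean.

    The threshold is an assumed rule of thumb, not a course requirement.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


if __name__ == "__main__":
    # Made-up numbers, purely to show the calls.
    a = [1.0, 2.0, 3.0, 4.0, 5.0]
    b = [2.1, 3.9, 6.2, 8.0, 9.8]
    print("correlation:", round(pearson(a, b), 3))
    print("outlier rows:", outliers([10.0] * 20 + [1000.0]))
```

    A value flagged this way is only a candidate; look at it in the visualization tool before deciding whether it is interesting or simply a data-entry error.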

    The software you'll be using is:

    (On the lab machines, the programs are under Programs > Research & Analysis > Data Visualization.)

    The suggested dataset is Florida election data from the CMU Statistical Data Repository. (Note: when downloading these files, be sure to use the correct "save file" operation for your browser; IE tends to add extra characters that confuse the programs.)

    Optional: Obtain a different dataset of your choice. It must have at least five dimensions (fields) and at least one of the important fields should be nominal data (i.e., have no inherent ordering). The dataset should have at least 100 records. The richer it is the more likely you are to find something interesting, but stay away from datasets that have more than about 30 fields, as they tend to be too complex to evaluate without statistical help.
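    If you do pick your own dataset, the size guidelines above are easy to check before you start. A minimal sketch, assuming the data has been read into a list of dictionaries (for example with `csv.DictReader`); the function name `check_dataset` and the field names in the example are hypothetical, and because a program cannot reliably tell which fields are nominal, you supply that list yourself.

```python
import csv
import io


def check_dataset(rows, nominal_fields, min_fields=5, min_records=100, max_fields=30):
    """Return a list of guideline problems; an empty list means none were found.

    rows           -- list of dicts, one per record
    nominal_fields -- names of the fields you consider nominal (your judgment)
    """
    problems = []
    fields = list(rows[0].keys()) if rows else []
    if len(fields) < min_fields:
        problems.append(f"only {len(fields)} fields; at least {min_fields} are needed")
    if len(fields) > max_fields:
        problems.append(f"{len(fields)} fields; more than about {max_fields} is hard to evaluate")
    if len(rows) < min_records:
        problems.append(f"only {len(rows)} records; at least {min_records} are needed")
    if not any(name in fields for name in nominal_fields):
        problems.append("none of the listed nominal fields is present")
    return problems


if __name__ == "__main__":
    # Tiny made-up CSV, purely to show the call; a real file would be opened instead.
    text = "region,kind,x,y,z\nnorth,a,1,2,3\nsouth,b,4,5,6\n"
    rows = list(csv.DictReader(io.StringIO(text)))
    for problem in check_dataset(rows, nominal_fields=["region", "kind"]):
        print(problem)
```

    The sample above has enough fields but far too few records, so the only problem reported is the record count.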

    This assignment is due at class time on March 3rd. You are encouraged to work in pairs for this assignment. To turn this in, I'd prefer that you put your results at a URL somewhere and email the URL to me. (If you must do it in Word, mail me a URL; I don't like dealing with attachments.) Please turn in a hardcopy in class as well as emailing a URL to me.