Assignment 1: Exploratory Data Analysis using Visualization Tools
Due: Fri Feb 26th, 9pm
In this assignment you will use free or commercial, off-the-shelf visualization tools to perform exploratory data analysis on a real-world dataset. The goal is to gain practice generating and investigating hypotheses through visual analysis and to learn and critique leading visualization tools.
You may work in pairs for this assignment. You'll be turning it in online at a link to be made available later.
For this assignment, you should use visualization tools to analyze a data set. In particular:
- Think about what kinds of relationships you expect to find in the data. Write down at least three hypotheses about what you will find in the dataset.
- See if you can find evidence to verify or refute any of these hypotheses. The
idea is that these tools should help formulate hypotheses that would then be
analyzed with rigorous statistical tests (but we won't be doing this part in this
class). Use the analysis tools to look for, e.g.,
- relationships between pairs of variables (correlations, clusters)
- outliers of various kinds
- trends
- Use the visualization tools to explore the dataset and look for other, unexpected kinds of relationships. Note which features of the visualized data attracted your attention. If the tool supports it, try to highlight or otherwise isolate the subset of the data containing an interesting feature.
- In past years we've seen that some datasets are more amenable to analysis if some of the data is transformed (e.g., by computing averages or medians, by converting numbers to percentages, etc.). If you feel you need to do this, go ahead (Tableau supports this), but if it isn't needed, leave the data as is.
- Write up a discussion of what you found, both expected and unexpected. This can include relationships that did not appear even though you thought they might. Try to report on at least one surprising piece of information. Be sure to illustrate your points with screenshots, but please scale them so they aren't too large.
- In your discussion, comment on the tools you tried. What features did you find useful? Which ones were intuitive to use, and which were hard to understand? Was there any functionality that the tool did not have that you wished it did? In other words, how would you improve on the tool?
- To get a better idea of how to do this assignment, you may find it helpful to look at student writeups from the course in 2005. However, for this assignment you cannot use the FEC dataset that was used that year.
- Background reading (ppt)
- Original Data Set with commentary.
- Data in CSV format
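If your tool does not support the kind of transformation mentioned above (averages, medians, percentages), you could preprocess a CSV file yourself before loading it. The following is only a minimal sketch under stated assumptions: the sample data, the column names (`county`, `candidate`, `votes`), and the helper functions are all hypothetical and do not come from any of the course datasets.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

# Hypothetical sample data; the column names are made up for illustration
# and do not match any of the course datasets.
SAMPLE_CSV = """county,candidate,votes
Alpha,A,120
Alpha,B,80
Beta,A,30
Beta,B,70
Gamma,A,500
Gamma,B,500
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting votes to int."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["votes"] = int(row["votes"])
    return rows


def add_percentages(rows):
    """Add each row's share of its county's total votes, as a percentage."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["county"]] += row["votes"]
    for row in rows:
        row["pct"] = 100.0 * row["votes"] / totals[row["county"]]
    return rows


def summarize(rows, candidate):
    """Return the mean and median vote percentage for one candidate."""
    pcts = [row["pct"] for row in rows if row["candidate"] == candidate]
    return mean(pcts), median(pcts)


rows = add_percentages(load_rows(SAMPLE_CSV))
mean_a, median_a = summarize(rows, "A")
print(f"Candidate A: mean {mean_a:.1f}%, median {median_a:.1f}%")
```

Converting raw counts to percentages makes counties of very different sizes comparable, which is often what makes a relationship visible in a plot. If the raw values already show the pattern, this step is unnecessary.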
You may use a dataset of your own choosing, but it needs to be rich enough for you to discover interesting information by exploring it. It's best if the dataset has a mix of nominal, ordinal, and quantitative data. Here are some sample datasets that have been used in the past.
Florida 2000 Ballot Data
This data set is Florida election data from the
CMU Statistical Data Repository.
(Note: when downloading these files, be sure to use the correct
"save-file" operation for your browser ... IE tends to add extra
characters that confuse the programs.)
U.S. House of Representatives Roll Call Data
This contains roll call data from the 108th
House of Representatives: data about 1218 bills introduced
in the House and how each of its 439 members voted on them. The data
covers the years 2003 and 2004. The individual columns are a mix of
information about the bills and about the legislators, so there's
quite a bit of redundancy in the file for the sake of easier
processing in Tableau.
National Surveys of 8th Graders
From Beat, via his stats course:
"A nationally representative sample of eighth-graders were first
surveyed in the spring of 1988. A sample of these respondents were
then resurveyed through four follow-ups in 1990, 1992, 1994, and
2000. On the questionnaire, students reported on a range of topics
including: school, work, and home experiences; educational resources
and support; the role in education of their parents and peers;
neighborhood characteristics; educational and occupational
aspirations; and other student perceptions."
The .xls file contains 2000 records of students' responses to a variety of questions at different points in time. The codebook explains the question and answer codes.
UC Berkeley Student Data
From Askhan, via Matt:
Government Spending Data
From Srikanth:
Have you ever wanted to find more information on government spending? Have you ever wondered where federal contracting dollars and grant awards go? Or perhaps you would just like to know, as a citizen, what the government is really doing with your money.
Climate Change Data
From the UK Met Office:
The data downloadable from this page are a subset of the full HadCRUT3 record of global temperatures, which is one of the global temperature records that have underpinned IPCC assessment reports and numerous scientific studies.
You may use Tableau or other visualization tools of your choosing. In the past we've also used Spotfire, but we haven't obtained Spotfire licenses this time because Tableau now subsumes a lot of its functionality. If you prefer to use free software, there are a number of options. None of them is as full-featured as Tableau, but you are welcome to use any tool of your choice. Here are a few possibilities: XmdvTool, GGobi, Protovis.
Assignments will be graded based on the quality of the analysis, the
creativity in the use of the visualization tool(s), and the quality
of the writeup.