Jennifer English & Joanna Plattner

Infovis Assignment 4

October 10, 2000

 

 

 

The Dataset:

 

The dataset we used was from US News and World Report's US College Statistics. We found it using the CMU StatLib Repository. The dataset included 33 variables and 1302 records. We added three calculated variables to the dataset: the percentage of applicants accepted (Acceptance Rate), the average of in-state and out-of-state tuition (Average Tuition), and the percentage of part-time students (Percentage of Part-Time Students).

 

Those variables were:

 

Variable | Type | Note
State (postal code) | Nominal | Converted to numerical codes (1-51)
Public/private indicator (public=1, private=2) | Nominal |
Average Math SAT score | Ordinal |
Average Verbal SAT score | Ordinal |
Average Combined SAT score | Ordinal |
Average ACT score | Ordinal |
First quartile - Math SAT | | Variable removed - majority of respondents did not supply data
Third quartile - Math SAT | | Variable removed - majority of respondents did not supply data
First quartile - Verbal SAT | | Variable removed - majority of respondents did not supply data
Third quartile - Verbal SAT | | Variable removed - majority of respondents did not supply data
First quartile - ACT | | Variable removed - majority of respondents did not supply data
Third quartile - ACT | | Variable removed - majority of respondents did not supply data
Number of applications received | Interval |
Number of applicants accepted | Interval |
Acceptance Rate | Interval | Calculated from applications received and applicants accepted
Number of new students enrolled | Interval |
Pct. new students from top 10% of H.S. class | Interval |
Pct. new students from top 25% of H.S. class | Interval |
Number of full-time undergraduates | Interval |
Number of part-time undergraduates | Interval |
In-state tuition | Interval |
Out-of-state tuition | Interval |
Average Tuition | Interval | Calculated average of in-state and out-of-state tuition
Room and board costs | Interval |
Room costs | Interval |
Board costs | Interval |
Additional fees | Interval |
Estimated book costs | Interval |
Estimated personal spending | Interval |
Pct. of faculty with Ph.D.'s | Interval |
Pct. of faculty with terminal degree | Interval |
Student/faculty ratio | Interval |
Pct. alumni who donate | Interval |
Instructional expenditure per student | Interval |
Graduation rate | Interval |
Percentage of Part-Time Students | Interval | Calculated by dividing part-time undergraduates by total enrollment
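
The three calculated fields could be reproduced along the lines of the minimal pandas sketch below. This is only an illustration, not the steps we actually performed in the tools; the file name and column names (Applications Received, Applicants Accepted, In-State Tuition, Out-of-State Tuition, Full-Time Undergrads, Part-Time Undergrads) are stand-ins for whatever labels appear in the raw StatLib file.

    import pandas as pd

    # Assumed file and column names; adjust to match the raw StatLib extract.
    colleges = pd.read_csv("usnews_colleges.csv")

    # Acceptance Rate: share of applicants who were accepted.
    colleges["Acceptance Rate"] = (
        colleges["Applicants Accepted"] / colleges["Applications Received"]
    )

    # Average Tuition: simple mean of the in-state and out-of-state amounts.
    colleges["Average Tuition"] = (
        colleges["In-State Tuition"] + colleges["Out-of-State Tuition"]
    ) / 2

    # Percentage of Part-Time Students: part-timers over total undergraduate
    # enrollment (assumed here to be full-time plus part-time undergraduates).
    total_undergrads = colleges["Full-Time Undergrads"] + colleges["Part-Time Undergrads"]
    colleges["Pct Part-Time"] = colleges["Part-Time Undergrads"] / total_undergrads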

 

 

States were re-coded as follows:

 


State  Code      State  Code      State  Code
AK     1         KY     18        NY     35
AL     2         LA     19        OH     36
AR     3         MA     20        OK     37
AZ     4         MD     21        OR     38
CA     5         ME     22        PA     39
CO     6         MI     23        RI     40
CT     7         MN     24        SC     41
DC     8         MO     25        SD     42
DE     9         MS     26        TN     43
FL     10        MT     27        TX     44
GA     11        NC     28        UT     45
HI     12        ND     29        VA     46
IA     13        NE     30        VT     47
ID     14        NH     31        WA     48
IL     15        NJ     32        WI     49
IN     16        NM     33        WV     50
KS     17        NV     34        WY     51
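
The re-coding itself is just a 1-based numbering of the postal abbreviations in alphabetical order. A small sketch, reusing the assumed colleges DataFrame and a State column from the earlier example:

    # Sort the abbreviations that occur in the data and number them from 1.
    # This reproduces the AK=1 ... WY=51 table above.
    codes = {abbr: i for i, abbr in enumerate(sorted(colleges["State"].unique()), start=1)}
    colleges["State Code"] = colleges["State"].map(codes)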


 

 


 

The Analysis:

 

The following hypotheses were based on a preliminary "naked eye" look at the data.  Following each hypothesis is a discussion of the data exploration done with each of the tools used (Spotfire and Eureka).

 

Hypothesis One:

We expected to find that schools with a high rate of instructional expenditure per student would also have the highest tuition rates.

 

Results

Because there are two tuition amounts in this dataset (one for out-of-state and one for in-state), we first added a variable (Avg Tuition) which averaged these two amounts.  The analysis revealed that there is not much of a correlation between spending on students and tuition rates.  There were some general trends, however.
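
As a numeric cross-check of what the scatter plot suggests, one could compute the correlation directly. A sketch, again using the assumed column names from the dataset section ("Instructional Expenditure per Student" and the derived "Average Tuition"):

    # Pearson correlation between per-student instructional spending and
    # average tuition, dropping schools with missing values.
    pair = colleges[["Instructional Expenditure per Student", "Average Tuition"]].dropna()
    r = pair.corr().iloc[0, 1]
    print(f"Pearson r = {r:.2f}")   # a value near zero matches the "not much correlation" reading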

 

Overall, public schools (red) appear to spend more than they charge students.  This can be explained by taxpayer subsidies of public universities.  The story is slightly different for private schools.  The overall trend is towards tuition being slightly higher than spending per student.  This stands to reason, as administrative overhead is not included in instructional costs.

 

The outliers were the military academies; the Ivy League and other prestigious universities; and Westmont College, the University of Michigan at Ann Arbor, and the University of Vermont.

 

Military academies charge no tuition but spend a great deal on their students. This stands to reason, as they are funded by the military. Cadets pay for their education by promising to serve for a specified number of years.

 

The cost of out-of-state tuition seems to explain the three outlying public schools.  They are the first to disappear when one adjusts the "out-of-state tuition" slider to the left - eliminating the larger values in the out-of-state tuition variable.

 

The last set of outliers - Ivy League and other prestigious schools - charge much less than they spend per student. This is almost certainly because much of the educational spending per student at these schools is subsidized by endowments. It would be interesting to add data about endowments to try to get at the reasons behind these schools' ability to spend so much more than they charge students.

 

 

 

 

 

 

Figure 2 represents the same data shown in Figure 1, this time using Eureka. Both representations highlight the same trends. That is, the very highest spending institutions tend to charge very little, while for everyone else the difference seems to depend on whether a school is public or private. It ends up looking like not much of a correlation at all, because the data is so jagged.

 

You can very quickly select the values at the top of the scale for Instructional Spending per Student and see that they are predominantly military academies and (presumably) well-endowed private schools.

 

Alternatively, you can select the outliers at the right-most part of the educational spending per student column and see what schools are represented.  Just as you see in Spotfire, schools that spend a lot but charge no tuition tend to be military academies.

 

 


We also took a random cut at the data in the middle portion of the average tuition column:

 

 

This shows the trend of public schools being subsidized by taxpayers and most private schools being funded primarily by tuition.

 

Hypothesis Two:

Schools with the lowest student/faculty ratios will also have the highest graduation rates.  We hypothesized this based on the assumption that students who get more individualized attention from faculty are more likely to stay in school and graduate.  (Of course we are assuming here that if there are more faculty per student, students will seek and get more attention, which may not necessarily be true.)

 

 

Results

Wrong again.  When a scatter plot shows a distribution that is fairly flat, there is no correlation between the variables.  That looks like the case here.  Examination of the outliers reinforces the fact that there is no correlation present here.  The outlier on the right seems to have incorrect data as its graduation rate is over 100 percent.  The majority of students at the outlier at the top are part time students, suggesting that there may be many students who never intend to graduate, or who may have jobs and families that keep them from graduating.

 

 

We decided to cycle through the other variables and compare them to graduation rate to see if there seemed to be any correlations in the data.  The clearest trend found was between Combined SAT scores and graduation rates.
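
Cycling through variables by eye could also be approximated numerically. A sketch that ranks the numeric columns by their correlation with graduation rate (column names assumed, as before):

    # Correlate every numeric column with graduation rate and rank the results.
    numeric = colleges.select_dtypes("number")
    corr_with_grad = numeric.corrwith(numeric["Graduation Rate"]).drop("Graduation Rate")
    print(corr_with_grad.sort_values(ascending=False).head())   # Combined SAT would be expected near the top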

 

 

 

Even this correlation did not appear to be very strong.  We decided to get a different view of the data in Eureka to see if the same trends appeared.

 

 

 

In Eureka, we performed a similar exercise of sorting by graduation rates and then looking for correlations. We noticed the correlation we had seen in Spotfire between grad rates and SAT scores, but noticed a similar level of correlation (not much) between grad rates and students coming from the top ten percent of their high school classes. It would be interesting to know what the missing values in Combined SAT scores mean. Are they missing, or do these schools not require SAT scores? The answer would affect the correlations a great deal.

 

Overall, our exploration in Eureka confirmed our findings in Spotfire, which were that there do not seem to be any strong correlations between graduation rates and any other variables in this dataset. If anything, it seems that graduation rates are very much determined by the students themselves rather than the school, as both of the variables that show any correlation at all have to do with entering students' success in high school and on standardized exams. In the end, you just cannot draw any conclusions about graduation rates from this dataset.

 

Hypothesis Three:

As student/faculty ratio decreases, spending per student increases.

 

 

 

Results

The beginnings of a trend can be seen in Figure 10. By eliminating the outliers and zooming in on the central data, more of a trend can be seen in Figure 11. Though most of the data is clustered very heavily in the center of the plot, there does seem to be a correlation between the two variables, suggesting the obvious conclusion that faculty account for much of the cost of educating students.

 

 

The trend seen in Spotfire is very much the same in Eureka.  There is an inverse relationship between student-faculty ratio and educational spending per student.  Though the correlation does not appear to be incredibly strong, it is definitely evident in both tools.

 

 

Hypothesis Four:

We expected schools with more than half their student body enrolled part-time to also be large schools (enrollment > 10,000). Our assumption is based on the theory that large schools are more capable of accommodating part-time students because the fixed cost per student at large schools is lower, and thus there is less pressure to maximize tuition revenue per student. We did not have a statistical reason for picking 10,000 students as the threshold that defines a college as a "large school"; it just seemed like a reasonable breaking point.

 

Results

Our assumption was wrong.    The Spotfire scatter below (top of graph in figure 1) reveals that small schools are more likely to have high part-time enrollment numbers.  Out of the 83 schools where at least 50% of their undergraduate students were enrolled part time, 74 had a total undergraduate enrollment of less than 10,000 students.   There was one outlier, the University of Minnesota, which has 38,000+ undergraduate students, 57% of whom are enrolled part-time.
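
The counts quoted above (83 schools with at least 50% part-time enrollment, 74 of them under 10,000 undergraduates) could be reproduced with a filter along these lines; the column names are again assumptions carried over from the dataset section:

    # Schools where at least half of the undergraduates are part-time,
    # and how many of those have fewer than 10,000 undergraduates in total.
    total_undergrads = colleges["Full-Time Undergrads"] + colleges["Part-Time Undergrads"]
    high_pt = colleges[colleges["Pct Part-Time"] >= 0.5]

    print(len(high_pt))                                        # should match the 83 schools noted above
    print((total_undergrads[high_pt.index] < 10_000).sum())    # should match the 74 small schools noted above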

 

When we analyzed the same hypothesis using Eureka (figure 2), the same surprising conclusion was evident. At the left end of the graph (rotated to fit on this page), where the percentage of part-time students is the highest, the total enrollment column shows that enrollment at those schools is relatively small. Eureka revealed an interesting feature of the data that we missed entirely in Spotfire: at the far right end of the graph, where part-time enrollment is very low, total enrollment is very low as well. This does support our original hypothesis - small schools aren't equipped to handle a lot of part-time students.

We also took advantage of Eureka's category feature to split the Percentage Part Time field into three categories: a) no data, b) < 50% part time, and c) >= 50% part time.

This perspective made it much easier to appreciate just how few of the 1302 schools had 50% or more part time students (figure 3).
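
Eureka's split amounts to a three-way bucketing along these lines (a sketch only; the "no data" bucket covers missing values, and the column name is assumed as before):

    import pandas as pd

    # Mirror the Eureka categories: no data, < 50% part time, >= 50% part time.
    def pt_category(pct):
        if pd.isna(pct):
            return "no data"
        return ">= 50% part time" if pct >= 0.5 else "< 50% part time"

    colleges["PT Category"] = colleges["Pct Part-Time"].apply(pt_category)
    print(colleges["PT Category"].value_counts())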

 

 

 

Hypothesis Five

Next, we decided to explore the relationship between graduation rates and part-time enrollment. We expected schools with high part-time enrollment (>= 50%) to have lower than average graduation rates. Our theory is based on the idea that as the amount of time a student spends in school increases, so too does the likelihood that the student will change schools or drop out. We used each school's reported graduation rate to calculate the average graduation rate (60%) for all colleges (excluding schools where the data was missing).
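
The 60% benchmark is simply the mean of the reported graduation rates, ignoring schools with no value. For example (assumed column name, graduation rate assumed to be stored as a percentage):

    # pandas skips missing values by default when computing the mean.
    avg_grad_rate = colleges["Graduation Rate"].mean()
    print(f"average graduation rate = {avg_grad_rate:.0f}%")   # roughly 60% for this dataset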

 

Results

We took advantage of Spotfire’s ability to report multiple views of a given set of marked data points, in this case schools with high part-time enrollment, and added the total enrollment dimension to our analysis.   The resulting three-dimensional graph is found in figure 4.

 

At first glance the even distribution of colleges with high part-time enrollment along the x-axis (graduation rates), as indicated by data points shaded in green, seems to invalidate our assumption that those schools would have predominantly lower than average graduation rates. As shown below, however, closer review of the data revealed a clear correlation between part-time enrollment and graduation rates. The graph also revealed that one school reported an invalid graduation rate of 118%. There were other problems with the data that we didn't discover until we counted up the data points for the purpose of calculating ratios: ten of the original 83 colleges with high part-time enrollment were missing graduation rate data.

 

Next, we turned our attention toward the relationship between total enrollment (y-axis) and the data points of interest (mostly part time student colleges).  We immediately noticed an interesting correlation between enrollment and graduation rates.  All of the large schools with high part time enrollment had much lower than average graduation rates.    

 

 

Figure 4

 

 


We decided to take a closer look at the correlation between enrollment size and graduation rates for all the colleges in the dataset. Using Spotfire's slider bars, we narrowed the dataset to determine how graduation rates break down when colleges are grouped according to size (large vs. small). We split the analysis into two phases, one for small schools and one for large schools, because the data points were too densely packed to differentiate the values visually. Although we could have created two additional nominal fields for a) school size (big/small) and b) part-time enrollment density (high/low), it was more fun to use the tool to accomplish our objectives.

Figure 5 shows the view we used to analyze the small-school subset.

 

 

 


The table below is a more precise presentation of the results discovered via the Spotfire tool.      Regardless of a school’s total enrollment figure, if its student body consists mostly of part time students, the school is more likely to have lower than average graduation rates.

 

Percent of schools with low graduation rates

                                              | All Schools (with grad rates) | Schools with High Part Time Enrollment
Large Colleges (9 total, 8 with grad rates)   | 117/177 = 64%                 | 8/8 = 100%
Small Colleges (74 total, 65 with grad rates) | 477/1027 = 37%                | 35/64 = 54%
Total (83 total, 73 with grad rates)          | 594/1204 = 50%                | 43/73 = 59%
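
These ratios could be reproduced programmatically. A sketch, assuming the columns introduced earlier and treating "low graduation rate" as below the 60% average (graduation rate assumed to be stored as a percentage):

    # Share of schools with below-average graduation rates, split by size
    # (>= 10,000 undergraduates = large) and by part-time density (>= 50% = high).
    has_rate = colleges.dropna(subset=["Graduation Rate"])
    low_rate = has_rate["Graduation Rate"] < 60

    large   = (has_rate["Full-Time Undergrads"] + has_rate["Part-Time Undergrads"]) >= 10_000
    high_pt = has_rate["Pct Part-Time"] >= 0.5

    for label, size_mask in [("Large", large), ("Small", ~large)]:
        all_pct = low_rate[size_mask].mean()
        pt_pct  = low_rate[size_mask & high_pt].mean()
        print(f"{label}: all schools {all_pct:.0%}, high part-time {pt_pct:.0%}")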

 

 

We found analyzing the same three dimensions in Eureka to be more challenging than in Spotfire.  The only way we could visually track high versus low enrollment across the graduation rate and percent part time dimensions was to use the spotlight function.

 

We would have preferred to be able to group the enrollment field into two categories (large and small), but Eureka's categorization feature only allowed grouping of numeric data into categories defined by fixed increments. In the case of total enrollment, rather than two categories (< 10,000 and >= 10,000), the system would have automatically generated four categories because the largest college has 38,338 undergraduate students. Using the spotlight function instead, "large" schools are differentiated by the pink bar that spans all columns, as shown in figure 6.
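
To see why fixed increments force four categories here: with 10,000-student increments and a maximum of 38,338 students, the bins would be 0-10,000, 10,000-20,000, 20,000-30,000, and 30,000-40,000. A one-line check:

    import math
    print(math.ceil(38_338 / 10_000))   # 4 fixed-increment bins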

 

 

Figure 6 (above) shows quite dramatically that there is a definite correlation between part time enrollment and graduation rates.  The slope that is formed by the graduation rate data points for mostly part time student schools (see marker A) is much steeper than the graduation rate slope for schools whose enrollment consists mostly of full time students (see marker B).  Although high enrollment schools, indicated by the pink horizontal bands in the figure, are distributed across the entire graph, there is clearly a higher density of pink (i.e. schools with more than 10,000 students) in the low graduation rate sections of the graduation rate column.

 

One of the nicest features of Eureka is the fact that all the data points can be seen at once, increasing the opportunity to discover unexpected relationships and characteristics of the data.

For example, in figure 6 we can see immediately that not only are several schools missing graduation rate data, they are also all relatively small schools.

 

 

Comments on Tools:

 

We spent about six hours trying to get our dataset into XMDV.  Neither of us was able to install the software on our home machines, so we tried using the machines in the lab.  We had a fairly large dataset, so the data cleaning and preparation of the file took a long time.  After cleaning the data, replacing missing values and preparing the file header as directed, the data still crashed the program immediately upon loading.  We sent the data to Matt Ward to get some tips.  He recommended that we limit the bins to one, which we did.  We also limited the number of variables to twenty (as another group had success with the same number of variables).  Though Matt Ward could find no problems with our dataset, we were still, after all these remedies, unable to load the data.  We finally gave up.

 

In contrast, Spotfire deals with data very intuitively, allowing the user to cut and paste or import and export a variety of file formats. Spotfire's interface is intuitive and useful. The ability to zoom in and out on specific records and sets of records, and to move the sliders to get a better idea about certain ranges of the plotted values, is really helpful. Using the sliders for any variable in the set is also extremely useful for overlaying a third variable and for investigating trends and outliers. It is really nice that you can view the complete set of data about a subject by highlighting a data point; it allows you to very quickly come up with reasons why a particular case might be an outlier. Switching over to a bar chart is also really nice and makes it easier to look at nominal data.

 

Spotfire also supports multidimensional analysis.  If multiple visualization windows are open, all the views are linked so that if a subset is selected in one window, the same records are highlighted in the second window.    

 

Although overall we were very pleased with Spotfire, we do have a few improvements to suggest. First, it is hard to select a specific data point on the slider bars. For example, when we tried to isolate colleges in Minnesota (MN), the slider bar kept skimming over it, going straight from MI to MO. Second, although the ability to concurrently manipulate the same dataset across multiple visualization windows is arguably one of its most powerful features, if we had not stumbled upon an explanation for it while reading the software's introductory guided tour document, we never would have known that the windows were synchronized. We would have continued to assume that every window was autonomous, just like other MS Windows applications. The authors should make this feature of Spotfire much more visible to novice users. Last, when we analyzed the average tuition per student/average spending per student data in hypothesis 1, we searched in vain for a tool that would superimpose a diagonal line representing all points where x = y onto the graph. Such a line would have made it easier to visually identify colleges where tuition is less than spending and vice versa.

 

Eureka! This tool is named really appropriately. It is much more useful for examining data for correlations and patterns than Spotfire, which seems more useful for investigating hypotheses, zooming in on particular data points, and dynamically changing queries. It is easy to see interesting data quickly in Eureka. Then you can drill down to the individual records for more detail, and go back up to the surface to look for more trends based on the detailed information (see Hypothesis One). The trends we saw in Eureka did not contradict what we saw in Spotfire - but it was easier to see quickly whether there were correlations in the data.

 

Though it looks intimidating from a distance, the interface to Eureka is really intuitive.  After playing around for just a few minutes we both became very comfortable manipulating the data and experimenting with features of the program.  One complaint is that it is a little hard to "grab" what you need to - that is, it is hard to get the table to collapse itself after you have expanded it.  We found that we were using the menu commands frequently to "reset" the data and start over because the GUI did not seem to support "unselecting" the data.

 

Eureka also makes handling data easy. Bringing in data was simple (cut and paste or import). Once inside the application, it was easy to convert data. We could not see the acceptance rate data at first because we had brought it in as text, so we changed the data type within Eureka, which converted it to numbers.

 

Eureka's greatest strength is its ability to present all the data points in a dataset on a single screen while allowing easy access to the underlying detail. The automatically calculated summary fields, like the mean, were very useful.

 

We were pleased to see that Eureka incorporates a direct query tool, but we weren’t certain how to use it and we couldn’t afford to invest the time required to learn it for the purposes of this homework assignment.

 

Eureka is a well-designed, sophisticated visualization tool. Although we found some usability problems, it is possible that they are by-products of our inexperience with the system, not program design flaws. For example, despite following the instructions in the "help" system, we couldn't select and then hide multiple columns at once. We resorted to deleting columns one at a time and rearranging them so that the relevant columns appeared first.

 

With Spotfire it was easy to ignore outlier data points (like the school with a graduation rate of 118%), but in Eureka we never discovered how to mask individual data points. We also found it curious that the width of the data columns didn't have a clear meaning. In some cases, the high-value bars appeared to touch the right-hand border of the data column, but in other cases the high-value bars fell short. Is this phenomenon an illusion? Either way, the system should be changed to eliminate the ambiguity. Ideally, for numeric dimensions, the right-hand boundary in each column would correspond to the field's high value. Then one could derive meaning from the white space between short bars and the column boundary to the right.