Assignment 4: Exploratory Data Analysis

Francis Li and Sam Madden

Introduction

We downloaded the Chicago Homicide Dataset from the National Archive of Criminal Justice Data. This dataset is a comprehensive report of every homicide on police record in Chicago from 1965 to 1995. There are over 100 variables and 23,000 records. For the purposes of this project, we reduced the number of variables to 16, including demographics on the victim and first offender, the date and location of the incident, and details of the relationships between the victim and offender, including possible motives and weapons involved. Finally, for most of the analysis we removed records with missing data, reducing the number to about 8,000.
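The reduction described above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually ran: the column subset is a hypothetical selection of field names that appear later in this report, the sample records are made up, and we assume a missing value is stored as None or an empty string.

```python
# Hypothetical sketch: keep a subset of columns and drop incomplete records.
# KEEP is an illustrative subset, not the actual 16 variables we retained.
KEEP = ["OAGE", "ORACE", "V1RACE", "RELATION", "INTOXTOT"]


def reduce_records(records, keep=KEEP):
    """Project each record onto `keep` and drop records with missing values."""
    reduced = []
    for rec in records:
        row = {name: rec.get(name) for name in keep}
        # Assumption: missing data appears as None or an empty string.
        if all(value not in (None, "") for value in row.values()):
            reduced.append(row)
    return reduced


# Made-up sample records, only to show the shape of the input.
sample = [
    {"OAGE": 34, "ORACE": 2, "V1RACE": 2, "RELATION": 3, "INTOXTOT": 1, "EXTRA": "x"},
    {"OAGE": None, "ORACE": 1, "V1RACE": 1, "RELATION": 6, "INTOXTOT": 5, "EXTRA": "y"},
    {"OAGE": 19, "ORACE": 4, "V1RACE": 4, "RELATION": 6, "INTOXTOT": ""},
]

complete = reduce_records(sample)
print(len(complete), "of", len(sample), "records are complete")
```

The real dataset may encode missing values differently (for example, with a reserved numeric code), in which case the test inside reduce_records would need to change.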

Expected Observations

There are a number of common expectations about the typical characteristics of murderers. We identified a few of them:

Most murderers are male and African American or Latino.
Many murders involve drugs.
Many murders involve gangs.
There aren't very many killers or victims under the age of ten.
Many women are killed in sexual assaults.

Actual Observations

Starting with our expectations, the Eureka table lens with uniform scaling of the data makes it very easy to identify relative proportions between clusters. By sorting on offender race and gender as in Figure 1, we can see that most murderers are indeed male and African American or Latino (although the difference between Latinos and Whites is not large).

Figure 1: Gender and race of offenders

However, when looking at drugs as a factor in homicides, we found that our expectation was wrong. Just by sorting on the drug and/or liquor involvement column as in Figure 2, we can immediately see that nearly two-thirds of homicides show no evidence of drug or liquor use and no drug-related motive. Liquor is the next most commonly involved intoxicant. Drug use and/or motive accounts for only about one-tenth of murders in Chicago.

Figure 2: Drug and/or liquor use and/or motive

So what are the motives behind the murders in these cases? By putting Eureka into non-uniform mapping mode as in Figure 3, we see that with no liquor or drugs involved, the motive ranges from instrumental (e.g. robbery) to expressive (e.g. random, hate crimes), whereas with liquor use the motive is mostly expressive, and when drugs are involved the motive is primarily instrumental.

Figure 3: Motives behind murder involving liquor and/or drugs.
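The comparison in Figure 3 amounts to looking at the share of each motive within each intoxicant group, rather than at raw counts. A minimal sketch of that per-group normalization follows. The labels and sample pairs are made up for illustration and are not the dataset's actual codes, and the code is not a description of how Eureka computes its display.

```python
from collections import Counter, defaultdict


def motive_shares(records):
    """Return {intoxicant: {motive: share}}; shares sum to 1 within each intoxicant."""
    counts = defaultdict(Counter)
    for intoxicant, motive in records:
        counts[intoxicant][motive] += 1
    shares = {}
    for intoxicant, motive_counts in counts.items():
        total = sum(motive_counts.values())
        shares[intoxicant] = {m: n / total for m, n in motive_counts.items()}
    return shares


# Made-up (intoxicant, motive) pairs; labels are illustrative, not dataset codes.
sample = [
    ("none", "instrumental"), ("none", "expressive"),
    ("liquor", "expressive"), ("liquor", "expressive"), ("liquor", "instrumental"),
    ("drugs", "instrumental"), ("drugs", "instrumental"),
]

for intoxicant, row in motive_shares(sample).items():
    print(intoxicant, {m: round(s, 2) for m, s in row.items()})
```

Normalizing within each group lets a small group (such as drug-involved murders) be compared with a large one on equal footing, which is the effect we wanted from the non-uniform view.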

When investigating our hypothesis about gang-related murders, we found we were wrong again. Gang-related murders represent only about one-eighth of the homicides in Chicago (Figure 4). As we might expect though, both offenders and victims in gang-related murders are predominantly young males (Figure 5).

Figure 4: Motives behind murder. Gangs are not a majority.

Figure 5: Demographics behind offenders and victims in gang-related murders

With Eureka in non-uniform scaling mode, we can "zoom" into all the female victims. We find that sexual assault is far from the most common cause in murders of women. The real culprit is spousal attack (Figure 6).

Figure 6: Motives behind murders of women

XMDV makes it very easy to see the minimum and maximum values of a dataset. It's immediately obvious that there are few very young killers or victims. Indeed, except for two 2-year-old children, there are no victims under the age of ten. There are no offenders under the age of ten: the youngest male murderer is 11, and the youngest female is 17. These basic observations are clear from the XMDV screenshots shown in Figures 7, 8, and 9 below.

Figure 7: Murderers over the age of 75

The only murderers over the age of 75 (OAGE>75) were African American (ORACE=2). They only murdered other African Americans (V1RACE=2), and all of them had prior records (PRIOROF=2). There doesn't seem to be an intuitive reason for this to be true. One possible explanation is that most murderers and victims are African American (see Figure xxx) and have prior records (see Figure xxx), which would naturally lead us to expect that most elderly murderers would be African American, have prior records, and kill African Americans. Still, it is a surprise that of the 55 murders by people over 75, all were committed by African Americans.

Figure 8: Asian American murderers

Asian murderers (ORACE=4) are all under the age of 50. Furthermore, no Asians killed a child or parent (RELATION=2), any other family member except a spouse (RELATION=3), or any coworker in a legal or illegal business (RELATION=7,8). Finally, no Asians (or Hispanics, although this is not shown) killed anyone over the age of 75. Again, it's a bit hard to state conclusively that this indicates an actual difference between Asian murderers and other murderers, as there are far fewer Asian murderers than murderers of any other race, most murders don't involve family members or coworkers, most murderers are younger than 50, and most victims are under 75. However, there is a common perception in America that Asian Americans have a greater respect for their families and coworkers: our results seem to substantiate this stereotype.

Figure 9: Gang Related Murders

Looking further at Figure 8, there was only one murder committed by an Asian that was drug motivated, and none that were liquor motivated. Although a number of Asians committed murders involving alcohol (INTOXTOT=3), only one murder was drug motivated (INTOXTOT=5). One hypothesis might be that there is less gang involvement amongst Asians, although Figure 9 reveals that a number of Asians committed rival gang related murders (RELATION=6), and the one drug related murder wasn't among them. So perhaps Asian gangs are just less involved in drugs than other gangs. Some other gang related observations (which most readers would have already guessed) include the fact that gang members and their victims are all under the age of 50, and that gang murders often involve intoxicants and conflicts over intoxicants. Because XMDV doesn't provide a way to count the number of tuples with a specific attribute, it's not possible to say just how many murders are gang or drug related using Figure 9.
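The kind of count that XMDV could not give us is simple to express outside the tool. The sketch below counts records matching a conjunction of attribute values, which is roughly what a brush selects. The helper count_matching is hypothetical, the field codes ORACE=4, RELATION=6, INTOXTOT=3 and INTOXTOT=5 are the ones quoted in this report, and the sample records (including the INTOXTOT=1 value) are made up.

```python
def count_matching(records, **conditions):
    """Count records whose fields equal every value given in `conditions`."""
    return sum(
        all(rec.get(field) == value for field, value in conditions.items())
        for rec in records
    )


# Made-up sample records; these are not rows from the homicide dataset.
sample = [
    {"ORACE": 4, "RELATION": 6, "INTOXTOT": 1},
    {"ORACE": 4, "RELATION": 6, "INTOXTOT": 3},
    {"ORACE": 4, "RELATION": 1, "INTOXTOT": 5},
    {"ORACE": 2, "RELATION": 6, "INTOXTOT": 5},
]

gang = count_matching(sample, RELATION=6)
asian_gang = count_matching(sample, ORACE=4, RELATION=6)
asian_gang_drug = count_matching(sample, ORACE=4, RELATION=6, INTOXTOT=5)
print(gang, asian_gang, asian_gang_drug)
```

On this toy sample the last count is zero, mirroring the observation that the one drug motivated murder was not among the rival gang murders; on the real data the counts would of course differ.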

Another observation, unrelated to our hypotheses, concerns the number of homicides relative to the time of year. In Figure 10, we see the homicides sorted by time of day and month. August and the other summer months show a distinctly greater number of homicides than colder months such as February and November. We assume that the temperature during those periods of the year discourages crime. There is a slight increase in murders in December and January, not as significant as over the summer, which we associate with the holiday season. If you examine the shape of the time distribution for each month, you can see that most crimes occur (as would be expected) during the early morning or late night.

Figure 10: Time of incident
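A per-month tally like the one read off Figure 10 can also be produced directly. The sketch below is a minimal illustration assuming the month of each incident is available as an integer from 1 to 12. The sample values are made up, and we do not know the dataset's actual field name for the month, so none is used here.

```python
from collections import Counter

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def monthly_counts(months):
    """Return a 12-item list of incident counts for month numbers 1..12."""
    tally = Counter(months)
    return [tally.get(m, 0) for m in range(1, 13)]


# Made-up month numbers, one per incident; not real data.
sample_months = [8, 8, 8, 7, 7, 6, 12, 1, 2, 11]

counts = monthly_counts(sample_months)
for name, n in zip(MONTHS, counts):
    # Crude text histogram: one '#' per incident.
    print(f"{name} {'#' * n}")
```

The same tally could be grouped by hour of day within each month to reproduce the shape of the time distribution discussed above.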

Notes about the Tools

Eureka was extremely useful for comparing relative proportions in the data. The non-uniform scaling modes allowed us to "zoom in" to interesting areas of the data (as in Figure 6) or to compare different factors equally (as in Figure 3). Some difficulties arose when sorting columns: the lack of feedback about the order in which columns were sorted makes it difficult to predict the results, so getting the appropriate sorting was a matter of trial and error. Currently, the only way to start over is to "reset" the row order back to the original and begin again. A simple list showing the history of the column sorts would have been useful in this case. Finally, in some cases it would be nice to coordinate the data with additional views from standard statistical analysis, such as histograms showing the distribution of age or of time.

The XMDV package provides a number of visualizations, although only the parallel coordinates view was useful for this particular dataset. Whether this is due to some specific characteristic of our dataset, or to parallel coordinates being a strictly superior visualization, we're not sure. The actual XMDV implementation of parallel coordinates is somewhat flawed: the program crashes regularly. (I was never able to get version 4.2 to work on my Sony Vaio: it would open files and display them for a short period of time, then promptly crash.) The hierarchical parallel coordinates view is extremely hard to understand, and some features, such as the "Data Values Dialog", never seemed to work (see the figure at the end of this section). Having this feature would have eliminated the need for a lot of exploration with Eureka, which we used most often to count the number of points with a particular subset of data values. Those caveats aside, XMDV is a really nice environment for data exploration: once the data set has been loaded, it's surprisingly fun to ask simple questions about the data set, and usually surprisingly easy to answer them.

Figure 11: The "Data Values for Brushed Region" dialog never worked.

Discussion

The homicide data set used for this exploration proved to be very interesting: it confirmed a number of expected results, like the fact that most murderers are male, and offered other provocative insights, such as the fact that no Asian murderers ever killed parents or relatives. There are a number of unexplored domains in the dataset itself: it includes information about the specific murder weapon (e.g. dinner fork, drive shaft), motivation (e.g. revenge, psychosis), and relationship between murderers and victims (e.g. baby sitter, cab fare), at a much finer granularity than the summary fields we chose to use. It is likely that there are interesting trends to be discovered by exploring these more specific fields. With regard to tools, both XMDV and Eureka proved to be useful, despite the few caveats mentioned above. All in all, an interesting assignment.