by Christine Petrozzo, Vaidyanath Venkitasubramanian, Sayantan Mukhopadhyay
Novels, Authors, and Big Data Analysis
People have always looked for ways to discover insights about culture. Today, one of the most powerful tools at our disposal is the large volume of data made available by digital technology. Until recently, analyzing a handful of texts did not yield much about the context in which a text was written, or about the influences shaping a particular author in history. With the advent of big data and computational techniques, however, the humanities and social sciences have evolved: researchers can now identify statistical patterns and perform data-driven hypothesis testing.
One might assume that Charles Dickens or Mark Twain would be considered the most influential authors of their time. However, nothing is certain until there is enough data and conclusive evidence. In a recent article in the New York Times, Matthew L. Jockers, a researcher and author of “Macroanalysis: Digital Methods and Literary History,” used text mining on a corpus of books to assess the writing styles and themes of the most influential authors of the 19th century. The computational analysis found otherwise: Jane Austen and Walter Scott were the most influential. The point of this research, however, is not the ranking itself, but how computational and quantitative methods help us gain an understanding of the past.
Humanities & Its Data-Centric Specialties at a Macro Level
Unlike the quantitatively driven hard sciences, which have long relied on the law of large numbers and repeated experiments to test laws and theories about the world, the humanities traditionally approached data differently, and at a micro level, purely because of the limitations of available tools. Beyond analyzing the historical aspects of a corpus of books, data-centric subfields have emerged, such as culturometrics, stylometry, and cliometrics, paralleling the hard sciences' specialized fields. Culturometrics is a quantitative methodology for measuring cultural identity; stylometry analyzes linguistic writing styles through computational statistics, much like Jockers's approach; and cliometrics applies data models and mathematical methods to the analysis of economic history. With digital technologies such as machine reading, parsing, and algorithms now available, humanities research can be conducted at a macro scale, allowing researchers to ask new and more specific questions and form better hypotheses.
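To make the idea of stylometry a little more concrete, here is a minimal sketch (not Jockers's actual method) of one classic stylometric feature set: the relative frequencies of common function words, which tend to vary by author regardless of subject matter. The word list and example sentence below are illustrative assumptions, not data from any study.

```python
from collections import Counter

# A small, illustrative set of English function words. Real stylometric
# studies use much larger, carefully chosen lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def stylometric_profile(text):
    """Return the relative frequency of each function word in `text`,
    giving a simple numeric 'fingerprint' of its style."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

profile = stylometric_profile("the cat sat on the mat")
```

Profiles like this can then be compared across texts or authors; the point is only that style becomes a vector of numbers that statistics can operate on.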
Data is Only As Good as the Analyst
In the field of literature, quantitative analysis can be used to discern how style and content have changed over time, from Shakespeare to J.K. Rowling. The problem with this approach, though, is that a computational algorithm is only as good as the person who programmed it. This matters all the more because natural language is complex, carrying historical and cultural context within it. Computationally identifying differences between two similar pieces of work is no easy task.
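As a rough illustration of why this is hard, here is one naive baseline for comparing two texts: cosine similarity over bag-of-words frequency vectors. This is a sketch, not the approach used in Jockers's study, and it captures none of the historical or cultural context described above; two stylistically very different texts on the same topic can still score as highly similar.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts.
    Returns 1.0 for identical word distributions, 0.0 for no overlap."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

The measure is blind to word order, meaning, and context, which is exactly the kind of gap a human expert has to fill in.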
Of course, this problem exists in all computational fields. Even in the New York Times article, Jockers explains that there will always be a need for both the tools and the expert to make sense of the data. Data and models can be inaccurate, ambiguous, and incomplete. As a note of caution, mistakes in analysis or prediction can be accidental or intentional, but their impacts are felt across a society, a country, or the world, as analysts, experts, and researchers influence public policy, corporate strategy, and our picture of history. Diligence on the part of the researcher is therefore essential: the quantitative view never completes the picture, and other methods should be used to minimize oversight.