Mano Marks and Judd Antin Final Project
Wikipedia Buddy
Questions about information quality have surrounded Wikipedia since its inception. Critics have argued that because anyone can contribute, checks on the reliability or credibility of information on the site are unreliable or ineffective. Discussions of information quality, however, too frequently assume that quality is a standard and universal judgment. The perceptions of quality that surround Wikipedia are often personal and can be extremely diverse, varying for example by context of use, personal preference, and style judgments.
The goal of this project is to provide a companion to Wikipedia which can help individual
users make judgments about the quality of Wikipedia articles through visual tools based
on uncovering the notion of individual authorship. Rather than embedding pre-defined
quality metrics in these tools, we hope to visualize different dimensions of the data
based on user preferences, and provide analysis tools that allow users to explore
quality on their own terms. By visualizing the relationships between authors and
content, and the structure of Wikipedia itself, we hope to provide abstractions of
current articles and their histories that will help users make decisions about
information quality.
Wikipedia, the open source-style online encyclopedia project, has become a popular and successful experiment in collaborative knowledge production. Since its inception, the Wikipedia project has accumulated more than 2.6 million articles across more than 200 languages, and now receives more than 86,000,000 page views each day. The English language version alone hosts more than 800,000 articles and has 43,000 unique contributors. Wikipedia is, then, a force to be reckoned with. What was once envisioned as a kind of alternative knowledge source has now become mainstream.
Given Wikipedia’s newfound status as a mainstream media source, questions about its reliability and authority as a high-quality information source are becoming more widespread. Recent questions surrounding both the apparently defamatory entry for journalist and government official John Seigenthaler and the actions of former MTV VJ and podcasting guru Adam Curry have heightened scrutiny of information quality on Wikipedia.
Wikipedia handles information
quality with brute force, applying the open source community
principle that many eyeballs make for high-quality information. This simple but powerful
principle assumes that, given enough contributors, poor, incomplete, or inaccurate
information, as well as vandalism, will be weeded out. This model, however successful, is
an abstract one. There are no explicit checks on quality that users can see and interact
with. Users must decide as a general rule whether they believe the system works or not.
In the case of Wikipedia there is little information available for making determinations
of quality on a page-by-page basis.
Wikipedia Buddy is based on the idea of unveiling the notion of individual
authorship in Wikipedia. Authorship is an important means by which individuals make
judgments about information quality. By allowing users to interact with authors at the
level of individual lines of text, Wikipedia Buddy aims to provide the means for making
determinations of quality at a much finer level of resolution.
In order to
accomplish this goal, Wikipedia Buddy employs a variety of visual mappings and tools.
We divide the goals for Wikipedia Buddy (WB) into two categories: theoretical and practical.
We may not ultimately know whether Wikipedia Buddy is useful for addressing these goals until users have a chance to interact with it, and researchers have a chance to use its analytical power to critically examine authors and texts on Wikipedia.
Wikipedia Buddy was developed in Java using Swing for GUI development. The application also uses a MySQL database and implements the Java JDBC MySQL connector for database connectivity. Finally, we used a custom SAX XML parser (Xerces) to transfer Wikipedia's XML-based database dump into our own database (as described below).
The root data source for Wikipedia Buddy is a complete database dump of the English language Wikipedia downloaded on November 5th, 2005. This data is available for download from the Wikimedia Foundation’s servers in the form of a single 295-gigabyte XML file, compressed to approximately 4 GB. (Downloads are available at http://download.wikipedia.org.) Dealing with a dataset this large was a difficult task. In order to make use of this data, we constructed an algorithm which, by sifting through each revision of a given article, assigns an author to each line of text in the current revision. This mapping of authors to lines of text represents the core data source for the application.
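The attribution step can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not necessarily the algorithm used in Wikipedia Buddy: it credits each line of the current revision to the author of the earliest revision that contains an identical line. The names `LineAttribution`, `Revision`, and `AttributedLine` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: credit each line of the latest revision to an author. */
final class LineAttribution {

    /** One saved revision: who saved it and its text split into lines. */
    static final class Revision {
        final String author;
        final List<String> lines;

        Revision(String author, List<String> lines) {
            this.author = author;
            this.lines = lines;
        }
    }

    /** A line of the current revision paired with the author credited for it. */
    static final class AttributedLine {
        final String author;
        final String text;

        AttributedLine(String author, String text) {
            this.author = author;
            this.text = text;
        }
    }

    /**
     * @param history revisions in chronological order, oldest first
     * @return the lines of the newest revision, each with its credited author
     */
    static List<AttributedLine> attribute(List<Revision> history) {
        List<AttributedLine> result = new ArrayList<>();
        if (history.isEmpty()) {
            return result;
        }
        // Remember the first author who ever introduced each distinct line.
        Map<String, String> firstAuthor = new HashMap<>();
        for (Revision revision : history) {
            for (String line : revision.lines) {
                firstAuthor.putIfAbsent(line, revision.author);
            }
        }
        // Every line of the newest revision was seen in the loop above.
        Revision current = history.get(history.size() - 1);
        for (String line : current.lines) {
            result.add(new AttributedLine(firstAuthor.get(line), line));
        }
        return result;
    }
}
```

Because the sketch matches lines by exact text, a line that is edited even slightly is credited entirely to the editor. A finer-grained approach would need a diff between consecutive revisions.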
In order to create a working dataset for the application, we developed a custom parser which captured the data and stored it in a set of MySQL databases. We used the SAX XML parser to extract the data one article at a time. Unfortunately, we could not process the entire Wikipedia, as we continually encountered articles that, because of anomalies in the text, would cause the parser to crash. However, we were able to capture data on over 50,000 articles.
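A minimal sketch of reading the dump one article at a time with the JDK's SAX API is shown here. It assumes the dump uses the element names `page`, `title`, and `username`, and it only collects revision authors. The class `DumpReader` and its `PageSummary` type are hypothetical and do not reproduce our parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: stream a dump and hand over one page at a time. */
final class DumpReader {

    /** Title of a page plus the user name recorded on each of its revisions. */
    static final class PageSummary {
        final String title;
        final List<String> revisionAuthors;

        PageSummary(String title, List<String> revisionAuthors) {
            this.title = title;
            this.revisionAuthors = revisionAuthors;
        }
    }

    /** Parses the stream and calls onPage each time a page element closes. */
    static void read(InputStream dump, Consumer<PageSummary> onPage) {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String title;
            private List<String> authors = new ArrayList<>();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                text.setLength(0); // start collecting this element's text
                if ("page".equals(qName)) {
                    title = null;
                    authors = new ArrayList<>();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    title = text.toString();
                } else if ("username".equals(qName)) {
                    authors.add(text.toString());
                } else if ("page".equals(qName)) {
                    onPage.accept(new PageSummary(title, authors));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(dump, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse dump", e);
        }
    }
}
```

Because the callback fires per page, only one article's data needs to be held in memory at a time. A malformed page still aborts the whole parse, which matches the crashes described above.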
The data was preprocessed into three tables: (1) stats, which captures preprocessed statistics about each article; (2) authors, which captures the percentage contribution of each author to each article; and (3) authorText, which captures the actual text of each article, broken down into chunks and associated with authors.
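As an illustration of how the rows of the authors table could be derived from the chunks in authorText, here is a small sketch. It assumes, purely for illustration, that contribution is measured as each author's share of the characters in the current text. The names `ContributionShare` and `Chunk` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: percentage contribution per author for one article. */
final class ContributionShare {

    /** A chunk of article text and the author it is associated with. */
    static final class Chunk {
        final String author;
        final String text;

        Chunk(String author, String text) {
            this.author = author;
            this.text = text;
        }
    }

    /** Returns author -> percentage of characters (0 to 100), sorted by author name. */
    static Map<String, Double> percentByAuthor(List<Chunk> chunks) {
        Map<String, Integer> characters = new TreeMap<>();
        int total = 0;
        for (Chunk chunk : chunks) {
            characters.merge(chunk.author, chunk.text.length(), Integer::sum);
            total += chunk.text.length();
        }
        Map<String, Double> percent = new TreeMap<>();
        if (total == 0) {
            return percent; // no text, so no shares to report
        }
        for (Map.Entry<String, Integer> entry : characters.entrySet()) {
            percent.put(entry.getKey(), 100.0 * entry.getValue() / total);
        }
        return percent;
    }
}
```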
History Flow is a tool for visualizing the evolution of content in Wikipedia articles by visualizing both authorship and persistence of content. It operates on the same basic principle as Wikipedia Buddy: uncovering the individual authorship of portions of text and using that information to interrogate the history and quality of a Wikipedia article, as well as the community of authors that surrounds it. We believe that History Flow is an extremely valuable research tool with which we hope to interface, and from which we have borrowed ideas. However, we feel it is too complex to be useful in an immediate and intuitive way for the average user who is browsing through Wikipedia. Our goal is to provide tools and visualizations at a resolution that is sufficiently detailed to be useful and accurate, but not so complex as to be confusing.
TouchGraph provides WikiBrowser, a radial graph viewer for visualizing the links between authors and pages on wikis. We initially explored using a node-link style graph for visualizing the relationships between authors, but we quickly realized that such a visualization would be poorly suited to the large quantity of data.
A Comparison of the Readability of Graphs Using Node-Link and Matrix-Based Representations
Ghoniem et al., InfoVis 2004
Ghoniem et al. undertook an empirical comparison of the benefits and detriments of using node-link and matrix-based graphs. Their work indicates that matrix-based representations are far superior to node-link graphs when dealing with large and complex datasets.
ForumReader
Martin Wattenberg
ForumReader is a tool for browsing through forum text with visual representations of authors and threads that provide context. ForumReader was, in particular, an inspiration for the Document Overview visualization in Wikipedia Buddy.
PaperLens
Lee, Czerwinski, Robertson, and Bederson
PaperLens, like Wikipedia Buddy, aims to reveal the patterns of authorship that are woven throughout the body of literature produced over eight years of InfoVis conferences. PaperLens provided us with ample evidence that uncovering the connections between authors of a larger body of work would be both interesting and fruitful.
We suggest that Wikipedia Buddy helps address questions surrounding information quality on Wikipedia in three specific ways.
Wikipedia Buddy contains six content panels. In the following sections we will discuss each element of the application.
The Document Panel contains the text of the currently selected article. The user can interact with the document by right clicking on any section. A context menu appears which contains five choices.
The title list provides two straightforward functions:
The Document Overview pane shows
the entire article, though not in a readable form. This pane has a red highlighted box
that adjusts to the position of the text displayed in the Document Panel. The Document
Overview also mirrors all highlighting currently in the Document Panel. The Document
Overview thus provides a ‘focus plus context’-like feature which allows a
user to view text in the Document Panel as well as see the displayed text in the larger
context of the article.
The heart of the application from an information visualization perspective is the Author Matrix (AM). The AM is constructed by gathering a list of all the authors who have contributed to the current version of an article, and then finding all the other articles to which those users contributed. Each row represents an author, and each column an article. The first column displayed is always the selected article. Each cell in the AM is assigned a color which corresponds to an individual author’s contribution to an individual article.
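The construction of the AM can be sketched as follows. This is a simplified in-memory illustration that assumes the per-article author contributions are already available as a map, whereas the application reads them from the database. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: rows are authors, columns are articles. */
final class AuthorMatrix {
    final List<String> authors;  // row labels: authors of the selected article
    final List<String> articles; // column labels: the selected article comes first
    final double[][] cells;      // cells[row][column] = contribution, 0 if none

    private AuthorMatrix(List<String> authors, List<String> articles, double[][] cells) {
        this.authors = authors;
        this.articles = articles;
        this.cells = cells;
    }

    /**
     * @param selected      title of the currently selected article
     * @param contributions article title -> (author -> contribution)
     */
    static AuthorMatrix build(String selected,
                              Map<String, Map<String, Double>> contributions) {
        Map<String, Double> selectedShares =
                contributions.getOrDefault(selected, Map.of());
        // Sort names so that the row and column order is deterministic.
        List<String> authors = new ArrayList<>(new TreeSet<>(selectedShares.keySet()));

        List<String> articles = new ArrayList<>();
        articles.add(selected);
        for (String article : new TreeSet<>(contributions.keySet())) {
            if (article.equals(selected)) {
                continue;
            }
            // Keep only articles that share at least one author with the selection.
            Map<String, Double> shares = contributions.get(article);
            if (authors.stream().anyMatch(shares::containsKey)) {
                articles.add(article);
            }
        }

        double[][] cells = new double[authors.size()][articles.size()];
        for (int row = 0; row < authors.size(); row++) {
            for (int col = 0; col < articles.size(); col++) {
                cells[row][col] = contributions.get(articles.get(col))
                        .getOrDefault(authors.get(row), 0.0);
            }
        }
        return new AuthorMatrix(authors, articles, cells);
    }
}
```

Articles that share no author with the selected article never become columns, which keeps the matrix limited to the community of authors around the current article.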
Based on feedback from a design critique, we implemented a dynamic binning system for coloring both text in the Document Panel and cells in the AM. Our original design called for contributions to be evenly divided into ten bins. The current scheme, however, uses only four bins that we believe map more closely to users’ perceptions about authors. The four categories are ‘Typo Fixer,’ ‘Minor Contributor,’ ‘Average Contributor,’ and ‘Major Contributor.’ The thresholds for each bin are computed according to the following method:
A legend showing the results of this dynamic binning process appears below the AM.
While the Document Panel and the Document Overview are meant
to provide for analysis of a single article, the AM provides for analysis across
articles. The AM can be used, for instance, to find patterns of authorship among related
articles or to follow a single author’s content across articles. In order to
maximize data density, article titles are initially hidden, although individual columns
can be moved and expanded by the user. In addition, a tool-tip which shows both the
article title and that author’s contribution to it pops up whenever the user
hovers over a cell.
While the dynamic legend feature appears to work properly, we
face a conceptual issue related to color-coding the AM. Our current scheme dynamically
computes category thresholds for assigning colors, but only does so for the currently
selected article. Each of the other columns in the matrix is currently encoded with
those same thresholds. In order to correctly apply the dynamic binning technique, we
will likely need to pre-process this data for each column in every matrix, since doing
so on the fly would drastically affect performance because of the sheer number of
database calls.
The statistics panel contains both information about the
current article and a listing of the authors and pages that the user has selected. The
statistics provide a variety of background context on an article which may be useful for
making determinations of quality. These are the stats reported on:
The Statistics Panel also provides information about the currently selected author:
Finally, the Statistics Panel includes a list of the trusted authors and favorite pages selected by the user. Clicking on an author will highlight the contributions of that author to the current article. Clicking on a page will select that page in the application.
We began development of Wikipedia Buddy with the idea that we would
try not to make pre-judgements about the variables that influence quality and then
integrate them into the software. Rather, our goal was to uncover obscured information
related to authorship, and present that information using visual and textual tools so
that users could make their own determinations of quality. The statistics are intended
to provide contextual information about individual articles, though individual
statistics may only be useful for some users in some cases.
In order to gain a basic understanding of how Wikipedia Buddy is progressing and to identify potential issues as early as possible, we conducted a round of informal user testing. We interviewed 3 participants for approximately 30 minutes each and asked them to answer questions and perform a variety of tasks. (User Testing Script - PDF)
Our goal when we began this project was to create a tool that would be useful for average users. As such, we cannot simply ignore the questions of whether information quality on Wikipedia is a salient issue for many users and whether using the notion of trusted authors to address that problem is meaningful.
Our user
feedback has shown mixed results on these questions. One user commented that he does not
tend to view Wikipedia as an authoritative source regardless of how credible an
individual article seems. As such, for him, uncovering a page's individual authors was
not a meaningful activity, and he did not seem interested in exploring patterns of
authorship around individual articles or groups of articles.
Another user, however, showed quite a bit of enthusiasm for Wikipedia in general.
He was particularly interested in the matrix portion of Wikipedia Buddy, and expressed
interest in using the tool both to 'follow an author from page to page' and to explore
groups of related authors. Interestingly, this user imagined using our tool to find
patterns of authorship around a thematic group of articles, and perhaps identifying
where those patterns are not followed as a way of assisting determinations of
quality.
Our users contributed some valuable suggestions and critiques which we hope to incorporate into future iterations of the software. Here we summarize both suggestions about the visualization techniques we employed and suggestions for improving interaction with the application. We believe that the two categories are inextricably related, and so attention to both is necessary.
Each of our testers described the problem of focus when first looking at the Wikipedia Buddy screen. 'Where do I start?' one user asked, pointing out that each of the elements on the screen seems to be equally weighted, making it difficult to know where to begin.
Here, two out of three users immediately provided a simple yet profound observation that we had not considered. The main text window should include the title of the Wikipedia article at the top. This is true for at least two reasons:
While Java is a cross-platform language, much of the functionality in the main text window requires accessing a context menu with the right mouse button. One of our testers, who is primarily a Mac user, had trouble finding and interacting with the context menu through which many of the Document View functions must be accessed.
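One common way to make a Swing context menu work across platforms is to rely on `MouseEvent.isPopupTrigger()` rather than testing for the right button directly, and to check it on both press and release, since the triggering gesture differs by platform. The sketch below is a possible approach and does not describe the current implementation. The class name `PopupTriggerListener` is hypothetical.

```java
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.util.function.Consumer;

/** Hypothetical sketch: platform-independent detection of the popup gesture. */
final class PopupTriggerListener extends MouseAdapter {
    private final Consumer<MouseEvent> onPopup;

    PopupTriggerListener(Consumer<MouseEvent> onPopup) {
        this.onPopup = onPopup;
    }

    // Some platforms flag the popup gesture on press, others on release,
    // so both events are checked.
    @Override
    public void mousePressed(MouseEvent event) {
        fireIfPopupTrigger(event);
    }

    @Override
    public void mouseReleased(MouseEvent event) {
        fireIfPopupTrigger(event);
    }

    private void fireIfPopupTrigger(MouseEvent event) {
        if (event.isPopupTrigger()) {
            onPopup.accept(event);
        }
    }
}
```

A listener like this could be attached with something along the lines of `documentPanel.addMouseListener(new PopupTriggerListener(e -> menu.show(e.getComponent(), e.getX(), e.getY())))`, where `documentPanel` and `menu` stand in for the actual component and `JPopupMenu`. Swing's `JComponent.setComponentPopupMenu` is another option that handles the trigger internally.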
Wikipedia Buddy uses a MySQL database to store and access data. While the database is quite fast, constructing both the text pane with author information and the matrix requires a large number of queries. The time necessary to complete these queries introduces an element of latency into the interaction with the program when selecting a new title. We are aware of this problem, and have in fact drastically reduced latency over the course of several iterations. Our user testing illustrated, however, that this is still an important problem that we must overcome. All of our users were confused about why clicking on a new title does not elicit any immediate action. While providing some feedback on the order of 'Please Wait' could be a stop-gap measure, the ultimate solution will be to eliminate latency by pre-processing as much data as possible.
The choice of schemes for encoding authors and their relative percentage of contributions continues to be a difficult issue. Our users had mixed opinions about the option we implemented, which uses a single color and varies the saturation to indicate the size of an author's contributions. Users also suggested several other options. Below we discuss the pros and cons of three possible methods:
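For reference, the single-color, varying-saturation encoding can be sketched with `java.awt.Color` as follows. The hue value and the name `ContributionColor` are assumptions made for illustration, and the share is taken to be a fraction between 0 and 1. The application's actual colors and binning may differ.

```java
import java.awt.Color;

/** Hypothetical sketch: one hue, with saturation tracking contribution size. */
final class ContributionColor {
    private static final float HUE = 0.6f; // assumed blue hue, chosen arbitrarily

    /**
     * @param share an author's contribution as a fraction from 0 to 1
     * @return white for no contribution, fully saturated for the whole article
     */
    static Color forShare(double share) {
        // Clamp so that out-of-range input still yields a valid saturation.
        float saturation = (float) Math.max(0.0, Math.min(1.0, share));
        return Color.getHSBColor(HUE, saturation, 1.0f);
    }
}
```

Keeping brightness fixed means that small contributions fade toward white rather than toward black, so dark text drawn over them stays readable.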
Following a design critique completed at the end of November, we implemented several substantive changes which are reflected in the current version of Wikipedia Buddy.
Wikipedia Buddy shows a lot of promise. Preliminary work indicates that it has the potential to help users explore authorship and its relationship to subjective measures of quality in Wikipedia articles. We feel the following areas give us room for future work.
We would like to clean up the layout, including labeling all the panes and creating an easy way for Mac users to use the context menu.
Right now, our demo project is limited to only a small subset of Wikipedia articles. The large quantities of data that we were using slowed down our application considerably. To alleviate that, we preprocessed the data and then used a restricted dataset of 298 articles. These articles have about ten authors each, so they provide us with a rich dataset without unduly burdening the performance.
We can expand our database in the following ways.
Allowing users to sort the Author Matrix would be a major improvement to the user experience with Wikipedia Buddy. This would allow a user to actively explore the data rather than just passively consume it. Right now, the matrix does allow the user to move columns around, but doing so is non-intuitive. In addition, automated sorting of columns and rows would allow the user to interact with the matrix more quickly and accurately. We envision implementing sorting functions for the matrix based on:
In order to make Wikipedia Buddy usable, we need to implement parsing of the MediaWiki markup language. Unparsed MediaWiki markup is confusing and complex. This is also important because we would like to make the user experience of Wikipedia Buddy tightly coupled with Wikipedia itself.
We would like to allow linking between articles through in-line page links.
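As a starting point for in-line page links, internal links written as `[[Target]]` or `[[Target|label]]` can be pulled out of the raw markup with a regular expression. The sketch below is a rough illustration and not a full markup parser: it ignores nested constructs such as image captions that contain links. The class name `WikiLinks` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract internal link targets from wiki markup. */
final class WikiLinks {
    // Group 1 is the link target; the optional group 2 is the display label.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

    /** Returns the targets of all simple internal links, in order of appearance. */
    static List<String> targets(String markup) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LINK.matcher(markup);
        while (matcher.find()) {
            result.add(matcher.group(1).trim());
        }
        return result;
    }
}
```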
Our first three subjects have given us rich information. Once we have worked the kinks out of the application, more systematic user testing will be necessary.