
Mano Marks and Judd Antin

Final Project
IS247: Information Visualization and Presentation


 

Wikipedia Buddy

Abstract

Questions about information quality have surrounded Wikipedia since its inception. Critics have argued that because anyone can contribute, checks on the reliability or credibility of information on the site are unreliable or ineffective. Discussions of information quality, however, too frequently assume that quality is a standard and universal judgment. The perceptions of quality that surround Wikipedia are often personal and can be extremely diverse, varying, for example, with context of use, personal preference, and stylistic judgment.

The goal of this project is to provide a companion to Wikipedia which can help individual users make judgments about the quality of Wikipedia articles through visual tools based on uncovering the notion of individual authorship. Rather than embedding pre-defined quality metrics in these tools, we hope to visualize different dimensions of the data based on user preferences, and provide analysis tools that allow users to explore quality on their own terms. By visualizing the relationships between authors and content, and the structure of Wikipedia itself, we hope to provide abstractions of current articles and their histories that will help users make decisions about information quality.

Introduction

Wikipedia, the open source-style online encyclopedia project, has become a popular and successful experiment in collaborative knowledge production. Since its inception, the Wikipedia project has accumulated more than 2.6 million articles across more than 200 languages, and now receives more than 86,000,000 page views each day. The English language version alone hosts more than 800,000 articles and has 43,000 unique contributors. Wikipedia is, then, a force to be reckoned with. What was once envisioned as a kind of alternative knowledge source has now become mainstream.

Given its newfound status as a mainstream media source, questions about Wikipedia’s reliability and authority as a high quality information source are becoming more widespread. The recent controversies surrounding both the apparently defamatory entry for journalist and government official John Seigenthaler and the actions of former MTV VJ and podcasting guru Adam Curry have heightened scrutiny of information quality on Wikipedia.

Wikipedia handles information quality with brute force, applying the open source community principle that many eyeballs make for high quality information. This simple but powerful principle assumes that, given enough contributors, poor, incomplete, or inaccurate information, as well as vandalism, will be weeded out. This model, however successful, is an abstract one. There are no explicit checks on quality that users can see and interact with. Users must decide as a general rule whether they believe the system works or not. In the case of Wikipedia there is little information available for making determinations of quality on a page-by-page basis.

Wikipedia Buddy is based on the idea of unveiling the notion of individual authorship in Wikipedia. Authorship is an important means by which individuals make judgments about information quality. By allowing users to interact with authors at the level of individual lines of text, Wikipedia Buddy aims to provide the means for making determinations of quality at a much finer level of resolution.
In order to accomplish this goal, Wikipedia Buddy employs a variety of visual mappings and tools.

 

Goals

We divide the goals for Wikipedia Buddy (WB) into two categories: theoretical and practical.

Theoretical Goals

  1. To expose the notion of individual authorship in the context of Wikipedia articles
  2. To problematize information quality on Wikipedia and provide tools for resolving uncertainty about information quality
  3. To challenge whether Wikipedia’s system for ensuring information quality works at the level of individual articles
  4. To explore where articles are truly collaboratively authored and where authorship is dominated by a few individuals

Practical Goals

  1. To create simple visual and textual tools for helping Wikipedia users analyze the quality of individual articles and portions of articles
  2. To allow users to develop a body of trusted authors and identify content contributed by those authors across Wikipedia
  3. To help users see the links between authors and pages

We may not ultimately know whether Wikipedia Buddy is useful for addressing these goals until users have a chance to interact with it, and researchers have a chance to use its analytical power to critically examine authors and texts on Wikipedia.

 

Foundations & Data

Wikipedia Buddy was developed in Java using Swing for GUI development. The application also uses a MySQL database, accessed through the Java JDBC MySQL connector. Finally, we built a custom parser on top of the Xerces SAX XML parser to transfer Wikipedia's XML-based database dump into our own database (as described below).

The root data source for Wikipedia Buddy is a complete database dump of the English language Wikipedia downloaded on November 5, 2005. This data is available for download from the Wikimedia Foundation’s servers in the form of a single 295 gigabyte XML file, compressed to roughly 4 GB. (Downloads are available at http://download.wikipedia.org.) Dealing with a dataset this large was a difficult task. In order to make use of this data, we constructed an algorithm which, by sifting through each revision of a given article, assigns an author to each line of text in the current revision. This mapping of authors to lines of text is the core data source for the application.
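The following is a minimal sketch of this kind of per-line attribution pass; the class and method names are ours for illustration and do not reflect the project's actual code. It walks revisions from oldest to newest and credits each line to the author of the revision in which it first appeared.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: assigns an author to each line of the newest revision. */
public class LineAttribution {

    /** One revision: its author plus the full article text at that point, split into lines. */
    public static class Revision {
        final String author;
        final List<String> lines;
        Revision(String author, List<String> lines) { this.author = author; this.lines = lines; }
    }

    /**
     * Walks revisions oldest to newest. A line is credited to the author of the first
     * revision in which it appears; only lines still present in the current revision are kept.
     */
    public static Map<String, String> attribute(List<Revision> revisions) {
        Map<String, String> firstAuthor = new HashMap<>();
        for (Revision rev : revisions) {
            for (String line : rev.lines) {
                // Only lines we have not seen in an earlier revision are credited to this author.
                firstAuthor.putIfAbsent(line, rev.author);
            }
        }
        // Keep only the lines of the current (last) revision, each with its original author.
        Map<String, String> current = new LinkedHashMap<>();
        for (String line : revisions.get(revisions.size() - 1).lines) {
            current.put(line, firstAuthor.get(line));
        }
        return current;
    }
}
```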

In order to create a working dataset for the application, we developed a custom parser which captured the data and stored it in a set of MySQL tables. We used the SAX XML parser to extract the data one article at a time. Unfortunately, we could not process the entire Wikipedia: we continually encountered articles whose text contained anomalies that would cause the parser to crash. However, we were able to capture data on over 50,000 articles.
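As a rough illustration of this per-article extraction (not the project's actual parser), the handler below reads the dump's title and text elements and hands each completed article to a storage step; wrapping that step in a try/catch is one way to skip problematic articles rather than aborting the whole run.

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative SAX handler for the Wikipedia XML dump; simplified to one text per title. */
public class DumpHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) {
            title = buffer.toString();
        } else if ("text".equals(qName)) {
            try {
                store(title, buffer.toString()); // hand one article's text to the database layer
            } catch (Exception e) {
                // Skip articles whose text causes trouble instead of crashing the run.
                System.err.println("Skipping " + title + ": " + e.getMessage());
            }
        }
    }

    private void store(String articleTitle, String text) { /* insert into MySQL, omitted */ }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DumpHandler());
    }
}
```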

The data was preprocessed into three tables: (1) stats, which captures preprocessed statistics about each article; (2) authors, which captures the percentage contribution of each author to each article; and (3) authorText, which captures the actual text of each article, broken down into chunks and associated with authors.
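For reference, here is a sketch of what setting up these three tables might look like; the column names and types are our assumptions for illustration, not the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Illustrative schema setup via JDBC; connection details and column names are placeholders. */
public class CreateTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wpbuddy", "user", "password");
             Statement st = conn.createStatement()) {

            // (1) per-article statistics, precomputed during parsing
            st.executeUpdate("CREATE TABLE IF NOT EXISTS stats ("
                    + "title VARCHAR(255) PRIMARY KEY, totalAuthors INT, currentAuthors INT,"
                    + " revisions INT, maxAuthorPct DOUBLE, anonAuthorPct DOUBLE, anonContentPct DOUBLE)");

            // (2) each author's percentage contribution to each article
            st.executeUpdate("CREATE TABLE IF NOT EXISTS authors ("
                    + "title VARCHAR(255), author VARCHAR(255), pct DOUBLE,"
                    + " PRIMARY KEY (title, author))");

            // (3) the article text itself, broken into chunks tied to authors
            st.executeUpdate("CREATE TABLE IF NOT EXISTS authorText ("
                    + "title VARCHAR(255), chunkNo INT, author VARCHAR(255), chunk TEXT,"
                    + " PRIMARY KEY (title, chunkNo))");
        }
    }
}
```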

 

Related Work

 History Flow
Martin Wattenberg and Fernanda Viegas
History Flow is a tool for visualizing the evolution of content in Wikipedia articles, showing both authorship and the persistence of content. It operates on the same basic principle as Wikipedia Buddy: uncovering the individual authorship of portions of text and using that information to interrogate the history and quality of a Wikipedia article, as well as the community of authors that surrounds it. We believe that History Flow is an extremely valuable research tool with which we hope to interface, and from which we have borrowed ideas. However, we feel it is too complex to be useful in an immediate and intuitive way for the average user who is browsing through Wikipedia. Our goal is to provide tools and visualizations at a resolution that is sufficiently detailed to be useful and accurate, but not so complex as to be confusing.

TouchGraph WikiBrowser

TouchGraph provides WikiBrowser, a radial graph viewer for visualizing the links between authors and pages on wikis. We initially explored using a node-link style graph for visualizing the relationships between authors, but we quickly realized that such a visualization would be poorly suited to the large quantity of data.

A Comparison of the Readability of Graphs Using Node-Link and Matrix-Based Representations
Ghoniem et al., InfoVis 2004


Ghoniem et al. undertook an empirical comparison of the benefits and detriments of using node-link and matrix-based graphs. Their work indicates that matrix-based representations are far superior to node-link graphs when dealing with large and complex datasets.

 

ForumReader
Martin Wattenberg

ForumReader is a tool for browsing through forum text with visual representations of authors and threads that provide context. ForumReader was, in particular, an inspiration for the Document Overview visualization in Wikipedia Buddy.

PaperLens
Lee, Czerwinski, Robertson, and Bederson

PaperLens, like Wikipedia Buddy, aims to reveal the patterns of authorship that are woven throughout the body of literature produced over eight years of InfoVis conferences. PaperLens provided us with ample evidence that uncovering the connections between authors of a larger body of work would be both interesting and fruitful.

 

 

Component Description

We suggest that Wikipedia Buddy helps address questions surrounding information quality on Wikipedia in three specific ways.

  1. Wikipedia Buddy allows users to easily and quickly discover the author of any individual portion of text. Using textual highlighting we can show both how much text and which portions of text an author contributed.
  2. Wikipedia Buddy uses both highlighting functions and the author matrix to uncover the patterns of authorship behind both individual articles and groups of articles.
  3. The previously mentioned functions are meant to allow users to build a body of trusted authors and favorite pages which can be easily accessed through the application.

Screenshot: the Wikipedia Buddy interface with its components labeled.

Wikipedia Buddy contains six content panels. In the following sections we will discuss each element of the application.

Document Panel

The Document Panel contains the text of the currently selected article. The user can interact with the document by right-clicking on any section; a context menu appears which contains five choices.

  1. ‘Add <authorname> to Favorite Authors’: The author of the portion of text the user selected is added to a list of favorite authors in the statistics pane.
  2. ‘Add <pagetitle> to Favorite Pages’: The current page is added to a list of favorite pages in the statistics pane.
  3. ‘Highlight Current Author’: All text contributed by the selected author is highlighted. Highlighting is done in a shade of blue determined by the amount of contribution, as explained further in the discussion of the author matrix below; a sketch of this kind of highlighting also follows this list. This functionality is intended as a primary tool for exploring the contributions of individual authors within a single page.
  4. ‘Highlight All Authors’: All of the text in the article is highlighted with the color appropriate for each author. Highlighting the entire article allows the user to quickly scan the article and learn about the diversity of authorship. Here we take advantage of the ease with which users can pick out contrasting colors in a field. A monochrome overview indicates a homogeneous authorship pattern, whereas variation in color indicates diversity. By noting the saturation of the coloring in the document, a user can also quickly determine whether an article is dominated by a few authors, or whether many authors contributed smaller portions of text.
  5. ‘Clear All Highlighting’: All highlighting is removed from the text.
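As a rough sketch of how this kind of per-author highlighting can be done in Swing (our illustration, not the project's code), the snippet below paints a chunk of text in a blue shade whose saturation scales with the author's share of the article.

```java
import java.awt.Color;
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;

/** Illustrative Swing highlighting: blue saturation scales with an author's contribution. */
public class AuthorHighlighter {

    /** Map a contribution fraction (0..1) to a blue shade; a larger share means more saturation. */
    static Color shadeFor(double contribution) {
        float saturation = (float) Math.min(1.0, Math.max(0.1, contribution));
        return Color.getHSBColor(0.6f, saturation, 1.0f); // hue 0.6 is roughly blue
    }

    /** Highlight one chunk of text belonging to an author. */
    static void highlightChunk(JTextArea area, int start, int end, double contribution)
            throws BadLocationException {
        area.getHighlighter().addHighlight(start, end,
                new DefaultHighlighter.DefaultHighlightPainter(shadeFor(contribution)));
    }
}
```

'Highlight All Authors' would simply repeat the same call for every chunk stored in the authorText table.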

Title List

The title list provides two straightforward functions:

  1. selecting a title allows a user to view and interact with that article; and
  2. right-clicking on a title allows the user to add that title to the favorites list shown in the statistics pane.

Document Overview

The Document Overview pane shows the entire article, though not in a readable form. This pane has a red highlighted box that adjusts to the position of the text displayed in the Document Panel. The Document Overview also mirrors all highlighting currently in the Document Panel. Together, the two panes provide a ‘focus plus context’ feature, allowing a user to read text in the Document Panel while also seeing that text in the larger context of the article.

Author Matrix & Legend

The heart of the application from an information visualization perspective is the Author Matrix (AM). The AM is constructed by gathering a list of all the authors who have contributed to the current version of an article, and then finding all the other articles to which those authors contributed. Each row represents an author, and each column an article. The first column displayed is always the selected article. Each cell in the AM is assigned a color which corresponds to an individual author’s contribution to an individual article.
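A minimal sketch of how the AM's underlying data might be assembled from the authors table described earlier (table and column names are our assumptions): first the authors of the selected article are fetched, then every other article those authors touched.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative assembly of the Author Matrix data: author -> (article -> contribution %). */
public class AuthorMatrixData {

    public static Map<String, Map<String, Double>> build(Connection conn, String article)
            throws Exception {
        Map<String, Map<String, Double>> matrix = new LinkedHashMap<>();

        // 1. Authors of the currently selected article (the first column of the matrix).
        List<String> authors = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT author, pct FROM authors WHERE title = ?")) {
            ps.setString(1, article);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String author = rs.getString("author");
                    authors.add(author);
                    matrix.computeIfAbsent(author, a -> new LinkedHashMap<>())
                          .put(article, rs.getDouble("pct"));
                }
            }
        }

        // 2. Every other article those authors contributed to (the remaining columns).
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT title, pct FROM authors WHERE author = ?")) {
            for (String author : authors) {
                ps.setString(1, author);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        matrix.get(author).put(rs.getString("title"), rs.getDouble("pct"));
                    }
                }
            }
        }
        return matrix;
    }
}
```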

Based on feedback from a design critique, we implemented a dynamic binning system for coloring both text in the Document Panel and cells in the AM. Our original design called for contributions to be evenly divided into ten bins. The current scheme, however, uses only four bins that we believe map more closely to users’ perceptions of authors. The four categories are ‘Typo Fixer’, ‘Minor Contributor,’ ‘Average Contributor,’ and ‘Major Contributor.’ The thresholds for each bin are computed according to the following method (a sketch of the computation follows the list):

  1. Based on the size of the text in the current article, find the percentage of contribution which matches an arbitrary threshold for the ‘Typo Fixer’ category. In the current version that threshold is set at 200 characters.
  2. Find the median percent contribution for the article.
  3. Compute a value X which is the average of two differences: (a) between the threshold percent and the median percent; and (b) between the maximum percent contribution and the median percent. Use this value to set the lower (median – X) and upper (median + X) boundaries of the ‘Average Contributor’ bin.
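Read literally, the three steps above amount to the small computation sketched below; variable names are ours, and the code illustrates our reading of the method rather than the application's exact routine.

```java
/** Illustrative computation of the dynamic bin boundaries described above. */
public class ContributionBins {

    /**
     * @param articleChars total number of characters in the current article
     * @param medianPct    median percent contribution across the article's authors
     * @param maxPct       maximum percent contribution across the article's authors
     * @return {typo-fixer cut-off, lower bound, upper bound of the 'Average Contributor' bin}
     */
    public static double[] thresholds(int articleChars, double medianPct, double maxPct) {
        // 1. 'Typo Fixer' cut-off: the share of the article that 200 characters represents.
        double typoPct = 100.0 * 200.0 / articleChars;

        // 2-3. X is the average of the two distances from the median.
        double x = ((medianPct - typoPct) + (maxPct - medianPct)) / 2.0;

        double lower = medianPct - x; // lower bound of 'Average Contributor'
        double upper = medianPct + x; // upper bound of 'Average Contributor'

        // Bins: below typoPct -> Typo Fixer; below lower -> Minor; up to upper -> Average; above -> Major.
        return new double[] { typoPct, lower, upper };
    }
}
```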

A legend displaying the results of this dynamic binning process appears below the AM.

While the Document Pane and the Document Overview are meant to provide for analysis of a single article, the AM provides for analysis across articles. The AM can be used, for instance, to find patterns of authorship among related articles or to follow a single author’s content across articles. In order to maximize data density, article titles are initially hidden, although individual columns can be moved and expanded by the user. In addition, a tool-tip which shows both the article title and that author’s contribution to it pops up whenever the user hovers over a cell.

While the dynamic legend feature appears to work properly, we face a conceptual issue related to color-coding the AM. Our current scheme dynamically computes category thresholds for assigning colors, but only does so for the currently selected article. Each of the other columns in the matrix is currently encoded with those same thresholds. In order to correctly apply the dynamic binning technique, we will likely need to pre-process this data for each column in every matrix, since doing so on the fly would drastically affect performance because of the sheer number of database calls.

Statistics Panel

The statistics panel contains both information about the current article and a listing of the authors and pages that the user has selected. The statistics provide a variety of background context on an article which may be useful for making determinations of quality. The following statistics are reported (a sketch of how a few of them can be derived from our tables follows the list):

  1. Total Authors: The total number of authors who contributed to any revision of the article.
  2. Current authors: The total number of authors who contributed to the current revision of the article.
  3. % Current authors: The percentage of contributing authors who are represented in the current version. (i.e. Current Authors / Total Authors)
  4. Number of revisions: The total number of revisions made to the article.
  5. Max author %: The maximum percentage of text that any author has contributed to the current revision of the article.
  6. % Anonymous Authors: The percentage of authors who contributed anonymously to the current revision of the article. These authors show up as IP Addresses in the AM.
  7. % Anonymous content: The percentage of content in the current revision of the article which was contributed anonymously.
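As a small illustration (the table and column names are the same assumptions used in the schema sketch above), most of these figures would simply be read from the precomputed stats table, while ratios such as % Current Authors and the anonymous-author test follow directly:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Illustrative reads against the precomputed stats table; column names are assumptions. */
public class ArticleStats {

    /** Fetch the stored counts and derive % Current Authors = Current Authors / Total Authors. */
    public static double percentCurrentAuthors(Connection conn, String title) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT totalAuthors, currentAuthors FROM stats WHERE title = ?")) {
            ps.setString(1, title);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return 0.0;
                int total = rs.getInt("totalAuthors");
                return total == 0 ? 0.0 : 100.0 * rs.getInt("currentAuthors") / total;
            }
        }
    }

    /** Anonymous contributors appear as IP addresses, so a simple pattern test identifies them. */
    public static boolean isAnonymous(String author) {
        return author.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }
}
```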

The Statistics Panel also provides information about the currently selected author:

  1. Author name (or IP address)
  2. Percentage of contribution to the current article.

Finally, the Statistics Panel includes a list of the trusted authors and favorite pages selected by the user. Clicking on an author will highlight the contributions of that author to the current article. Clicking on a page will select that page in the application.

We began development of Wikipedia Buddy with the idea that we would avoid pre-judging which variables influence quality and building those judgments into the software. Rather, our goal was to uncover obscured information related to authorship, and to present that information using visual and textual tools so that users could make their own determinations of quality. The statistics are intended to provide contextual information about individual articles, though individual statistics may only be useful for some users in some cases.

 

User Testing

In order to gain a basic understanding of how Wikipedia Buddy is progressing and identify potential issues as early as possible, we conducted a round of informal user testing. We interviewed three participants for approximately 30 minutes each and asked them to answer questions and perform a variety of tasks. (User Testing Script - PDF)

Our goal when we began this project was to create a tool that would be useful for average users. As such, we cannot simply ignore the questions of whether information quality on Wikipedia is a salient issue for many users and whether using the notion of trusted authors to address that problem is meaningful.

Our user feedback has shown mixed results on these questions. One user commented that he does not tend to view Wikipedia as an authoritative source regardless of how credible an individual article seems. As such, for him, uncovering an article's individual authors was not a meaningful activity, and he did not seem interested in exploring patterns of authorship around individual articles or groups of articles.

Another user, however, showed quite a bit of enthusiasm for Wikipedia in general. He was particularly interested in the matrix portion of Wikipedia Buddy, and expressed interest in using the tool both to 'follow an author from page to page' and to explore groups of related authors. Interestingly, this user imagined using our tool to find patterns of authorship around a thematic group of articles, and perhaps identifying where those patterns are not followed as a way of assisting determinations of quality.

Our users contributed some valuable suggestions and critiques which we hope to incorporate into future iterations of the software. Here we summarize both suggestions about the visualization techniques we employed and suggestions for improving interaction with the application. We believe that the two categories are inextricably related, and so attention to both is necessary.

Focus

Each of our testers described the problem of focus when first looking at the WP Buddy screen. 'Where do I start?' one user asked, pointing out that each of the elements on the screen seems to be equally weighted, making it difficult to know where to begin.

Titles

Here, two out of three users immediately provided a simple yet profound observation that we had not considered. The main text window should include the title of the Wikipedia article at the top. This is true for at least two reasons:

Platform Issues

While Java is a cross-platform language, much of the functionality in the main text window requires accessing a context menu with the right mouse button. One of our testers, who is primarily a Mac user, had trouble finding and interacting with the context menu through which many of the Document View functions must be accessed.

Screen Real-Estate

One of the primary challenges we faced was deciding how to divide the limited screen real estate that is available to us. The Document View can only shrink so far and still remain functional. As such, each of the other components suffered. Our users suggested that we explore options for maximizing the screen real estate. One suggested that we might do away with the Document View entirely in favor of more space for the Author Matrix. We believe this is not a viable option, since users must be able to actively read and interact with the text in order to identify trusted authors. However, one viable option would be to create a tabbed layout for the center section of the application. By clicking on an individual tab the user could choose from, for instance, three distinct views:

Latency

Wikipedia Buddy uses a MySQL database to store and access data. While the database is quite fast, constructing both the text pane with author information and the matrix requires a large number of queries. The time needed to complete these queries introduces an element of latency into the interaction with the program when selecting a new title. We are aware of this problem, and have in fact drastically reduced latency over the course of several iterations. Our user testing illustrated, however, that this is still an important problem that we must overcome. All of our users were confused about why clicking on a new title does not elicit any immediate action. While providing some feedback, such as a 'Please Wait' message, could be a stop-gap measure, the ultimate solution will be to eliminate latency by pre-processing as much data as possible.
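One way to provide that immediate feedback, and to keep the interface responsive while the queries run, is to push the database work onto a background thread; the sketch below uses SwingWorker and is our illustration of the idea, not the current implementation.

```java
import java.awt.Cursor;
import javax.swing.JFrame;
import javax.swing.SwingWorker;

/** Illustrative use of SwingWorker to keep the UI responsive while article data loads. */
public class ArticleLoader extends SwingWorker<String, Void> {
    private final JFrame frame;
    private final String title;

    public ArticleLoader(JFrame frame, String title) {
        this.frame = frame;
        this.title = title;
        frame.setCursor(Cursor.getPredefinedCursor(Cursor.WAIT_CURSOR)); // immediate feedback
    }

    @Override
    protected String doInBackground() throws Exception {
        return fetchArticleText(title); // the slow database queries run off the event thread
    }

    @Override
    protected void done() {
        frame.setCursor(Cursor.getDefaultCursor());
        // update the Document Panel, Author Matrix, and statistics here
    }

    private String fetchArticleText(String articleTitle) { return ""; /* query MySQL, omitted */ }
}
```

A caller would construct the loader when a new title is selected and call execute() on it, leaving the event thread free to repaint.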

Color Encoding

The choice of schemes for encoding authors and relative % of contributions continues to be a difficult issue. Our users had mixed opinions about the option we implemented, using a single color and varying the saturation to indicate the size of an author's contributions. Users also suggested several other options. Below we discuss the pros and cons of three possible methods:

 

Current Updates

Following a design critique completed at the end of November, we implemented several substantive changes which are reflected in the current version of Wikipedia Buddy.

 

Conclusion and Future Work

Wikipedia Buddy shows a lot of promise. Preliminary work indicates that it has the potential to help users explore authorship and its relationship to subjective measures of quality in Wikipedia articles. We feel the following areas give us room for future work.