
Mano Marks and Judd Antin

Final Project
IS247: Information Visualization and Presentation


 

Wikipedia Buddy

Abstract

Questions about information quality have surrounded Wikipedia since its inception. Critics have argued that because anyone can contribute, checks on the reliability or credibility of information on the site are unreliable or ineffective. Discussions of information quality, however, too frequently assume that quality is a standard and universal judgment. The perceptions of quality that surround Wikipedia are often personal and can be extremely diverse, varying, for example, with context of use, personal preference, and stylistic judgment.

The goal of this project is to provide a companion to Wikipedia which can help individual users make judgments about the quality of Wikipedia articles through visual tools based on uncovering the notion of individual authorship. Rather than embedding pre-defined quality metrics in these tools, we hope to visualize different dimensions of the data based on user preferences, and provide analysis tools that allow users to explore quality on their own terms. By visualizing the relationships between authors and content, and the structure of Wikipedia itself, we hope to provide abstractions of current articles and their histories that will help users make decisions about information quality.

Introduction

Wikipedia, the open source-style online encyclopedia project, has become a popular and successful experiment in collaborative knowledge production. Since its inception, the Wikipedia project has accumulated more than 2.6 million articles across more than 200 languages, and now receives more than 86,000,000 page views each day. The English language version alone hosts more than 800,000 articles and has 43,000 unique contributors. Wikipedia is, then, a force to be reckoned with. What was once envisioned as a kind of alternative knowledge source has now become mainstream.

Given its newfound status as a mainstream media source, questions about Wikipedia’s reliability and authority as a high quality information source are becoming more widespread. The recent controversies surrounding both the apparently defamatory entry for journalist and government official John Seigenthaler and the actions of former MTV VJ and podcasting guru Adam Curry have heightened scrutiny of information quality on Wikipedia.

Wikipedia handles information quality with brute force, applying the open source community principle that many eyeballs make for high quality information. This simple but powerful principle assumes that, given enough contributors, poor, incomplete, or inaccurate information, as well as vandalism, will be weeded out. This model, however successful, is an abstract one. There are no explicit checks on quality that users can see and interact with. Users must decide as a general rule whether they believe the system works or not. In the case of Wikipedia there is little information available for making determinations of quality on a page-by-page basis.

Wikipedia Buddy is based on the idea of unveiling the notion of individual authorship in Wikipedia. Authorship is an important means by which individuals make judgments about information quality. By allowing users to interact with authors at the level of individual lines of text, Wikipedia Buddy aims to provide the means for making determinations of quality at a much finer level of resolution.
In order to accomplish this goal, Wikipedia Buddy employs a variety of visual mappings and tools.

 

Goals

We divide the goals for Wikipedia Buddy (WB) into two categories: theoretical and practical.

Theoretical Goals

  1. To expose the notion of individual authorship in the context of Wikipedia articles
  2. To problematize information quality on Wikipedia and provide tools for resolving uncertainty about information quality
  3. To challenge whether Wikipedia’s system for ensuring information quality works at the level of individual articles
  4. To explore where articles are truly collaboratively authored and where authorship is dominated by a few individuals

Practical Goals

  1. To create simple visual and textual tools for helping Wikipedia users analyze the quality of individual articles and portions of articles
  2. To allow users to develop a body of trusted authors and identify content contributed by those authors across Wikipedia
  3. To help users see the links between authors and pages

We may not ultimately know whether Wikipedia Buddy is useful for addressing these goals until users have a chance to interact with it, and researchers have a chance to use its analytical power to critically examine authors and texts on Wikipedia.

 

Foundations & Data

Wikipedia Buddy was developed in Java using Swing for GUI development. The application also uses a MySQL database, accessed through the Java JDBC MySQL connector. Finally, we built a custom parser on top of the Xerces SAX XML parser to transfer Wikipedia's XML-based database dump into our own database (as described below).

The root data source for Wikipedia Buddy is a complete database dump of the English language Wikipedia downloaded on November 5, 2005. This data is available for download from the Wikimedia Foundation’s servers in the form of a single 295 gigabyte XML file, compressed to roughly 4 GB. (Downloads are available at http://download.wikipedia.org.) Dealing with a dataset this large was a difficult task. In order to make use of this data, we constructed an algorithm which, by sifting through each revision of a given article, assigns an author to each line of text in the current revision. This mapping of authors to lines of text is the core data source for the application.
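The following is a minimal sketch of this kind of per-line attribution pass; the class and method names are ours for illustration and do not reflect the project's actual code. It walks revisions from oldest to newest and credits each line to the author of the revision in which it first appeared.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: assigns an author to each line of the newest revision. */
public class LineAttribution {

    /** One revision: its author plus the full article text at that point, split into lines. */
    public static class Revision {
        final String author;
        final List<String> lines;
        Revision(String author, List<String> lines) { this.author = author; this.lines = lines; }
    }

    /**
     * Walks revisions oldest to newest. A line is credited to the author of the first
     * revision in which it appears; only lines still present in the current revision are kept.
     */
    public static Map<String, String> attribute(List<Revision> revisions) {
        Map<String, String> firstAuthor = new HashMap<>();
        for (Revision rev : revisions) {
            for (String line : rev.lines) {
                // Only lines we have not seen in an earlier revision are credited to this author.
                firstAuthor.putIfAbsent(line, rev.author);
            }
        }
        // Keep only the lines of the current (last) revision, each with its original author.
        Map<String, String> current = new LinkedHashMap<>();
        for (String line : revisions.get(revisions.size() - 1).lines) {
            current.put(line, firstAuthor.get(line));
        }
        return current;
    }
}
```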

In order to create a working dataset for the application, we developed a custom parser which captured the data and stored it in a set of MySQL tables. We used the SAX XML parser to extract the data one article at a time. Unfortunately, we could not process the entire Wikipedia: we continually encountered articles whose text contained anomalies that would cause the parser to crash. However, we were able to capture data on over 50,000 articles.
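As a rough illustration of this per-article extraction (not the project's actual parser), the handler below reads the dump's title and text elements and hands each completed article to a storage step; wrapping that step in a try/catch is one way to skip problematic articles rather than aborting the whole run.

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative SAX handler for the Wikipedia XML dump; simplified to one text per title. */
public class DumpHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) {
            title = buffer.toString();
        } else if ("text".equals(qName)) {
            try {
                store(title, buffer.toString()); // hand one article's text to the database layer
            } catch (Exception e) {
                // Skip articles whose text causes trouble instead of crashing the run.
                System.err.println("Skipping " + title + ": " + e.getMessage());
            }
        }
    }

    private void store(String articleTitle, String text) { /* insert into MySQL, omitted */ }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DumpHandler());
    }
}
```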

The data was preprocessed into three tables: (1) stats, which captures preprocessed statistics about each article; (2) authors, which captures the percentage contribution of each author to each article; and (3) authorText, which captures the actual text of each article, broken down into chunks and associated with authors.
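For reference, here is a sketch of what setting up these three tables might look like; the column names and types are our assumptions for illustration, not the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Illustrative schema setup via JDBC; connection details and column names are placeholders. */
public class CreateTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/wpbuddy", "user", "password");
             Statement st = conn.createStatement()) {

            // (1) per-article statistics, precomputed during parsing
            st.executeUpdate("CREATE TABLE IF NOT EXISTS stats ("
                    + "title VARCHAR(255) PRIMARY KEY, totalAuthors INT, currentAuthors INT,"
                    + " revisions INT, maxAuthorPct DOUBLE, anonAuthorPct DOUBLE, anonContentPct DOUBLE)");

            // (2) each author's percentage contribution to each article
            st.executeUpdate("CREATE TABLE IF NOT EXISTS authors ("
                    + "title VARCHAR(255), author VARCHAR(255), pct DOUBLE,"
                    + " PRIMARY KEY (title, author))");

            // (3) the article text itself, broken into chunks tied to authors
            st.executeUpdate("CREATE TABLE IF NOT EXISTS authorText ("
                    + "title VARCHAR(255), chunkNo INT, author VARCHAR(255), chunk TEXT,"
                    + " PRIMARY KEY (title, chunkNo))");
        }
    }
}
```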

 

Related Work

 History Flow
Martin Wattenberg and Fernanda Viegas
History Flow is a tool for visualizing the evolution of content in Wikipedia articles, showing both authorship and the persistence of content. It operates on the same basic principle as Wikipedia Buddy: uncovering the individual authorship of portions of text and using that information to interrogate the history and quality of a Wikipedia article, as well as the community of authors that surrounds it. We believe that History Flow is an extremely valuable research tool with which we hope to interface, and from which we have borrowed ideas. However, we feel it is too complex to be useful in an immediate and intuitive way for the average user who is browsing through Wikipedia. Our goal is to provide tools and visualizations at a resolution that is sufficiently detailed to be useful and accurate, but not so complex as to be confusing.

TouchGraph WikiBrowser

TouchGraph provides WikiBrowser, a radial graph viewer for visualizing the links between authors and pages on wikis. We initially explored using a node-link style graph for visualizing the relationships between authors, but we quickly realized that such a visualization would be poorly suited to the large quantity of data.

A Comparison of the Readability of Graphs Using Node-Link and Matrix-Based Representations
Ghoniem et al., InfoVis 2004


Ghoniem et al. undertook an empirical comparison of the benefits and detriments of using node-link and matrix-based graphs. Their work indicates that matrix-based representations are far superior to node-link graphs when dealing with large and complex datasets.

 

ForumReader
Martin Wattenberg

ForumReader is a tool for browsing through forum text with visual representations of authors and threads that provide context. ForumReader was, in particular, an inspiration for the Document Overview visualization in Wikipedia Buddy.

PaperLens
Lee, Czerwinski, Robertson, and Bederson

PaperLens, like Wikipedia Buddy, aims to reveal the patterns of authorship that are woven throughout the body of literature produced over eight years of InfoVis conferences. PaperLens provided us with ample evidence that uncovering the connections between authors of a larger body of work would be both interesting and fruitful.

 

 

Component Description

We suggest that Wikipedia Buddy helps address questions surrounding information quality on Wikipedia in three specific ways.

  1. Wikipedia Buddy allows users to easily and quickly discover the author of any individual portion of text. Using textual highlighting we can show both how much text and which portions of text an author contributed.
  2. Wikipedia Buddy uses both highlighting functions and the author matrix to uncover the patterns of authorship behind both individual articles and groups of articles.
  3. The previously mentioned functions are meant to allow users to build a body of trusted authors and favorite pages which can be easily accessed through the application.

Screenshot: the Wikipedia Buddy interface with its components labeled.

Wikipedia Buddy contains six content panels. In the following sections we will discuss each element of the application.

Document Panel

The Document Panel contains the text of the currently selected article. The user can interact with the document by right-clicking on any section; a context menu appears which contains five choices.

  1. ‘Add <authorname> to Favorite Authors’: The author of the portion of text the user selected is added to a list of favorite authors in the statistics pane.
  2. ‘Add <pagetitle> to Favorite Pages’: The current page is added to a list of favorite pages in the statistics pane.
  3. ‘Highlight Current Author’: All text contributed by the selected author is highlighted. Highlighting is done in a shade of blue determined by the amount of contribution, as explained further in the discussion of the author matrix below; a sketch of this kind of highlighting also follows this list. This functionality is intended as a primary tool for exploring the contributions of individual authors within a single page.
  4. ‘Highlight All Authors’: All of the text in the article is highlighted with the color appropriate for each author. Highlighting the entire article allows the user to quickly scan the article and learn about the diversity of authorship. Here we take advantage of the ease with which users can pick out contrasting colors in a field. A monochrome overview indicates a homogeneous authorship pattern, whereas variation in color indicates diversity. By noting the saturation of the coloring in the document, a user can also quickly determine whether an article is dominated by a few authors, or whether many authors contributed smaller portions of text.
  5. ‘Clear All Highlighting’: All highlighting is removed from the text.
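As a rough sketch of how this kind of per-author highlighting can be done in Swing (our illustration, not the project's code), the snippet below paints a chunk of text in a blue shade whose saturation scales with the author's share of the article.

```java
import java.awt.Color;
import javax.swing.JTextArea;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;

/** Illustrative Swing highlighting: blue saturation scales with an author's contribution. */
public class AuthorHighlighter {

    /** Map a contribution fraction (0..1) to a blue shade; a larger share means more saturation. */
    static Color shadeFor(double contribution) {
        float saturation = (float) Math.min(1.0, Math.max(0.1, contribution));
        return Color.getHSBColor(0.6f, saturation, 1.0f); // hue 0.6 is roughly blue
    }

    /** Highlight one chunk of text belonging to an author. */
    static void highlightChunk(JTextArea area, int start, int end, double contribution)
            throws BadLocationException {
        area.getHighlighter().addHighlight(start, end,
                new DefaultHighlighter.DefaultHighlightPainter(shadeFor(contribution)));
    }
}
```

'Highlight All Authors' would simply repeat the same call for every chunk stored in the authorText table.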

Title List

The title list provides two straightforward functions:

  1. selecting a title allows a user to view and interact with that article; and
  2. right-clicking on a title allows the user to add that title to the favorites list shown in the statistics pane.

Document Overview

The Document Overview pane shows the entire article, though not in a readable form. This pane has a red highlighted box that adjusts to the position of the text displayed in the Document Panel. The Document Overview also mirrors all highlighting currently in the Document Panel. Together, the two panes provide a ‘focus plus context’ feature, allowing a user to read text in the Document Panel while also seeing that text in the larger context of the article.

Author Matrix & Legend

The heart of the application from an information visualization perspective is the Author Matrix (AM). The AM is constructed by gathering a list of all the authors who have contributed to the current version of an article, and then finding all the other articles to which those authors contributed. Each row represents an author, and each column an article. The first column displayed is always the selected article. Each cell in the AM is assigned a color which corresponds to an individual author’s contribution to an individual article.
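A minimal sketch of how the AM's underlying data might be assembled from the authors table described earlier (table and column names are our assumptions): first the authors of the selected article are fetched, then every other article those authors touched.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative assembly of the Author Matrix data: author -> (article -> contribution %). */
public class AuthorMatrixData {

    public static Map<String, Map<String, Double>> build(Connection conn, String article)
            throws Exception {
        Map<String, Map<String, Double>> matrix = new LinkedHashMap<>();

        // 1. Authors of the currently selected article (the first column of the matrix).
        List<String> authors = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT author, pct FROM authors WHERE title = ?")) {
            ps.setString(1, article);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String author = rs.getString("author");
                    authors.add(author);
                    matrix.computeIfAbsent(author, a -> new LinkedHashMap<>())
                          .put(article, rs.getDouble("pct"));
                }
            }
        }

        // 2. Every other article those authors contributed to (the remaining columns).
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT title, pct FROM authors WHERE author = ?")) {
            for (String author : authors) {
                ps.setString(1, author);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        matrix.get(author).put(rs.getString("title"), rs.getDouble("pct"));
                    }
                }
            }
        }
        return matrix;
    }
}
```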

Based on feedback from a design critique, we implemented a dynamic binning system for coloring both text in the Document Panel and cells in the AM. Our original design called for contributions to be evenly divided into ten bins. The current scheme, however, uses only four bins that we believe map more closely to users’ perceptions of authors. The four categories are ‘Typo Fixer’, ‘Minor Contributor,’ ‘Average Contributor,’ and ‘Major Contributor.’ The thresholds for each bin are computed according to the following method (a sketch of the computation follows the list):

  1. Based on the size of the text in the current article, find the percentage of contribution which matches an arbitrary threshold for the ‘Typo Fixer’ category. In the current version that threshold is set at 200 characters.
  2. Find the median percent contribution for the article.
  3. Compute a value X which is the average of two differences: (a) between the threshold percent and the median percent; and (b) between the maximum percent contribution and the median percent. Use this value to set the lower (median – X) and upper (median + X) boundaries of the ‘Average Contributor’ bin.
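Read literally, the three steps above amount to the small computation sketched below; variable names are ours, and the code illustrates our reading of the method rather than the application's exact routine.

```java
/** Illustrative computation of the dynamic bin boundaries described above. */
public class ContributionBins {

    /**
     * @param articleChars total number of characters in the current article
     * @param medianPct    median percent contribution across the article's authors
     * @param maxPct       maximum percent contribution across the article's authors
     * @return {typo-fixer cut-off, lower bound, upper bound of the 'Average Contributor' bin}
     */
    public static double[] thresholds(int articleChars, double medianPct, double maxPct) {
        // 1. 'Typo Fixer' cut-off: the share of the article that 200 characters represents.
        double typoPct = 100.0 * 200.0 / articleChars;

        // 2-3. X is the average of the two distances from the median.
        double x = ((medianPct - typoPct) + (maxPct - medianPct)) / 2.0;

        double lower = medianPct - x; // lower bound of 'Average Contributor'
        double upper = medianPct + x; // upper bound of 'Average Contributor'

        // Bins: below typoPct -> Typo Fixer; below lower -> Minor; up to upper -> Average; above -> Major.
        return new double[] { typoPct, lower, upper };
    }
}
```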

A legend displaying the results of this dynamic binning process appears below the AM.

While the Document Pane and the Document Overview are meant to provide for analysis of a single article, the AM provides for analysis across articles. The AM can be used, for instance, to find patterns of authorship among related articles or to follow a single author’s content across articles. In order to maximize data density, article titles are initially hidden, although individual columns can be moved and expanded by the user. In addition, a tool-tip which shows both the article title and that author’s contribution to it pops up whenever the user hovers over a cell.

While the dynamic legend feature appears to work properly, we face a conceptual issue related to color-coding the AM. Our current scheme dynamically computes category thresholds for assigning colors, but only does so for the currently selected article. Each of the other columns in the matrix is currently encoded with those same thresholds. In order to correctly apply the dynamic binning technique, we will likely need to pre-process this data for each column in every matrix, since doing so on the fly would drastically affect performance because of the sheer number of database calls.

Statistics Panel

The statistics panel contains both information about the current article and a listing of the authors and pages that the user has selected. The statistics provide a variety of background context on an article which may be useful for making determinations of quality. The following statistics are reported (a sketch of how a few of them can be derived from our tables follows the list):

  1. Total Authors: The total number of authors who contributed to any revision of the article.
  2. Current authors: The total number of authors who contributed to the current revision of the article.
  3. % Current authors: The percentage of contributing authors who are represented in the current version. (i.e. Current Authors / Total Authors)
  4. Number of revisions: The total number of revisions made to the article.
  5. Max author %: The maximum percentage of text that any author has contributed to the current revision of the article.
  6. % Anonymous Authors: The percentage of authors who contributed anonymously to the current revision of the article. These authors show up as IP Addresses in the AM.
  7. % Anonymous content: The percentage of content in the current revision of the article which was contributed anonymously.
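As a small illustration (the table and column names are the same assumptions used in the schema sketch above), most of these figures would simply be read from the precomputed stats table, while ratios such as % Current Authors and the anonymous-author test follow directly:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Illustrative reads against the precomputed stats table; column names are assumptions. */
public class ArticleStats {

    /** Fetch the stored counts and derive % Current Authors = Current Authors / Total Authors. */
    public static double percentCurrentAuthors(Connection conn, String title) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT totalAuthors, currentAuthors FROM stats WHERE title = ?")) {
            ps.setString(1, title);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return 0.0;
                int total = rs.getInt("totalAuthors");
                return total == 0 ? 0.0 : 100.0 * rs.getInt("currentAuthors") / total;
            }
        }
    }

    /** Anonymous contributors appear as IP addresses, so a simple pattern test identifies them. */
    public static boolean isAnonymous(String author) {
        return author.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }
}
```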

The Statistics Panel also provides information about the currently selected author:

  1. Author name (or IP address)
  2. Percentage of contribution to the current article.

Finally, the Statistics Panel includes a list of the trusted authors and favorite pages selected by the user. Clicking on an author will highlight the contributions of that author to the current article. Clicking on a page will select that page in the application.

We began development of Wikipedia Buddy with the idea that we would avoid pre-judging which variables influence quality and building those judgments into the software. Rather, our goal was to uncover obscured information related to authorship, and to present that information using visual and textual tools so that users could make their own determinations of quality. The statistics are intended to provide contextual information about individual articles, though individual statistics may only be useful for some users in some cases.

 

User Testing

In order to gain a basic understanding of how Wikipedia Buddy is progressing and identify potential issues as early as possible, we conducted a round of informal user testing. We interviewed three participants for approximately 30 minutes each and asked them to answer questions and perform a variety of tasks. (User Testing Script - PDF)

Our goal when we began this project was to create a tool that would be useful for average users. As such, we cannot simply ignore the questions of whether information quality on Wikipedia is a salient issue for many users and whether using the notion of trusted authors to address that problem is meaningful.

Our user feedback has shown mixed results on these questions. One user commented that he does not tend to view Wikipedia as an authoritative source regardless of how credible an individual article seems. As such, for him, uncovering an article's individual authors was not a meaningful activity, and he did not seem interested in exploring patterns of authorship around individual articles or groups of articles.

Another user, however, showed quite a bit of enthusiasm for Wikipedia in general. He was particularly interested in the matrix portion of Wikipedia Buddy, and expressed interest in using the tool both to 'follow an author from page to page' and to explore groups of related authors. Interestingly, this user imagined using our tool to find patterns of authorship around a thematic group of articles, and perhaps identifying where those patterns are not followed as a way of assisting determinations of quality.

Our users contributed some valuable suggestions and critiques which we hope to incorporate into future iterations of the software. Here we summarize both suggestions about the visualization techniques we employed and suggestions for improving interaction with the application. We believe that the two categories are inextricably related, and so attention to both is necessary.

Focus

Each of our testers described the problem of focus when first looking at the WP Buddy screen. 'Where do I start?' one user asked, pointing out that each of the elements on the screen seems to be equally weighted, making it difficult to know where to begin.

Titles

Here, two out of three users immediately provided a simple yet profound observation that we had not considered. The main text window should include the title of the Wikipedia article at the top. This is true for at least two reasons:

Platform Issues

While Java is a cross-platform language, much of the functionality in the main text window requires accessing a context menu with the right mouse button. One of our testers, who is primarily a Mac user, had trouble finding and interacting with the context menu through which many of the Document View functions must be accessed.

Screen Real-Estate

One of the primary challenges we faced was deciding how to divide the limited screen real estate that is available to us. The Document View can only shrink so far and still remain functional. As such, each of the other components suffered. Our users suggested that we explore options for maximizing the screen real estate. One suggested that we might do away with the Document View entirely in favor of more space for the Author Matrix. We believe this is not a viable option, since users must be able to actively read and interact with the text in order to identify trusted authors. However, one viable option would be to create a tabbed layout for the center section of the application. By clicking on an individual tab the user could choose from, for instance, three distinct views:

Latency

Wikipedia Buddy uses a MySQL database to store and access data. While the database is quite fast, constructing both the text pane with author information and the matrix requires a large number of queries. The time needed to complete these queries introduces an element of latency into the interaction with the program when selecting a new title. We are aware of this problem, and have in fact drastically reduced latency over the course of several iterations. Our user testing illustrated, however, that this is still an important problem that we must overcome. All of our users were confused about why clicking on a new title does not elicit any immediate action. While providing some feedback, such as a 'Please Wait' message, could be a stop-gap measure, the ultimate solution will be to eliminate latency by pre-processing as much data as possible.
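One way to provide that immediate feedback, and to keep the interface responsive while the queries run, is to push the database work onto a background thread; the sketch below uses SwingWorker and is our illustration of the idea, not the current implementation.

```java
import java.awt.Cursor;
import javax.swing.JFrame;
import javax.swing.SwingWorker;

/** Illustrative use of SwingWorker to keep the UI responsive while article data loads. */
public class ArticleLoader extends SwingWorker<String, Void> {
    private final JFrame frame;
    private final String title;

    public ArticleLoader(JFrame frame, String title) {
        this.frame = frame;
        this.title = title;
        frame.setCursor(Cursor.getPredefinedCursor(Cursor.WAIT_CURSOR)); // immediate feedback
    }

    @Override
    protected String doInBackground() throws Exception {
        return fetchArticleText(title); // the slow database queries run off the event thread
    }

    @Override
    protected void done() {
        frame.setCursor(Cursor.getDefaultCursor());
        // update the Document Panel, Author Matrix, and statistics here
    }

    private String fetchArticleText(String articleTitle) { return ""; /* query MySQL, omitted */ }
}
```

A caller would construct the loader when a new title is selected and call execute() on it, leaving the event thread free to repaint.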

Color Encoding

The choice of schemes for encoding authors and relative % of contributions continues to be a difficult issue. Our users had mixed opinions about the option we implemented, using a single color and varying the saturation to indicate the size of an author's contributions. Users also suggested several other options. Below we discuss the pros and cons of three possible methods:

 

Current Updates

Following a design critique completed at the end of November, we implemented several substantive changes which are reflected in the current version of Wikipedia Buddy.

 

Conclusion and Future Work

Wikipedia Buddy shows a lot of promise. Preliminary work indicates that it has the potential to help users explore authorship and its relationship to subjective measures of quality in Wikipedia articles. We feel the following areas give us room for future work.