ControVis - Information and Overview
This website was created for the course "Visualisierung 2" at the Vienna University of Technology. It demonstrates our implementation of a visual analysis tool introduced in this paper by Ulrik Brandes and Jürgen Lerner. The tool supports the visual analysis of controversy in user-generated encyclopedias, which is also where the name of our project comes from: ControversyVisualization.
Time for some theory
We first explain the theoretical background of the project, so that it is clear how the visualization works and what it shows. Our implementation visualizes controversies between the authors of Wikipedia articles. It does so using the revision history of an article and the relations between authors that can be derived from it.
The basic idea behind the visualization is to build a so-called "who revises whom" network. This is done by analyzing the list of revisions of a page, ordered by time. Consecutive revisions can be interpreted as one author revising the changes of the other. Since we are interested in visualizing conflicts, the revision edges (edges in the revision network between consecutive revisions) are weighted so that the weights can be interpreted as the disagreement between the authors.
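The construction of the "who revises whom" network can be sketched in a few lines of Java. This is a minimal sketch, not our actual implementation: the `Revision` class, the `disagreement` score attached to each revision, and the string edge keys are assumptions made for illustration. How the disagreement between two consecutive revisions is measured is described in the original paper and is simply taken as given here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RevisionNetwork {

    /** One revision in the page history (hypothetical helper type). */
    static final class Revision {
        final String author;
        final double disagreement; // assumed score against the previous revision

        Revision(String author, double disagreement) {
            this.author = author;
            this.disagreement = disagreement;
        }
    }

    /**
     * Walks the time-ordered history and accumulates one weighted edge
     * "reviser->revised" for every pair of consecutive revisions.
     */
    static Map<String, Double> build(List<Revision> history) {
        Map<String, Double> edges = new LinkedHashMap<>();
        for (int i = 1; i < history.size(); i++) {
            Revision previous = history.get(i - 1);
            Revision current = history.get(i);
            if (current.author.equals(previous.author)) {
                continue; // an author revising themselves is not a conflict
            }
            String key = current.author + "->" + previous.author;
            edges.merge(key, current.disagreement, Double::sum);
        }
        return edges;
    }

    public static void main(String[] args) {
        List<Revision> history = Arrays.asList(
                new Revision("A", 0.0),
                new Revision("B", 2.0),
                new Revision("A", 3.0),
                new Revision("B", 1.5));
        // Print every accumulated edge with its total weight.
        build(history).forEach((edge, weight) -> System.out.println(edge + " " + weight));
    }
}
```

Repeated revisions between the same pair of authors add up on a single edge, so heavily weighted edges point at the strongest disagreements.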
The final visualization of the conflicts is produced by plotting the revision network in 2D space and mapping the authors (themselves drawn as ellipses) onto one big ellipse. The mapping computes the eigenvectors belonging to the two smallest eigenvalues of the adjacency matrix built from the revision network, then normalizes the resulting coordinate values and maps them to an ellipse. A detailed explanation can be found in the original paper. Several visual properties of the visualization represent different characteristics of the revision network and the authors:
- Shape of the major ellipse: represents the overall skewness of the revision network.
- Shape of the author ellipses: ratio between the number of outgoing and incoming revisions (horizontal: being revised more often; vertical: revising more often).
- Color of the author ellipse: Edit frequency of the authors. Red means unsteady, black means steady edit behaviour.
- Connecting lines: the ten revision edges with highest weights.
- Color gradient of connecting lines: direction of the revision edges. Black means revising more often, white means being revised. Uniformly colored lines represent symmetric revision ratios.
- Bar chart: aggregated edit volume of the whole article over time. Can be used for filtering the network.
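The eigenvector-based placement of the authors described above can be illustrated with a small, dependency-free sketch. It rests on several assumptions: the adjacency matrix is symmetric (for example, both edge directions summed), a plain cyclic Jacobi iteration stands in for the eigenvalue routine of a matrix library, and the scaling onto semi-axes `a` and `b` is only one possible normalization. The names `SpectralLayout`, `jacobiEigen` and `layout` are hypothetical, and the exact normalization used in the original paper may differ.

```java
import java.util.Arrays;

class SpectralLayout {

    /** Diagonalizes a small symmetric matrix; column k of vec is eigenvector k. */
    static double[] jacobiEigen(double[][] a, double[][] vec) {
        int n = a.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) {
            m[i] = a[i].clone();
            Arrays.fill(vec[i], 0.0);
            vec[i][i] = 1.0;
        }
        for (int sweep = 0; sweep < 100; sweep++) {
            double off = 0.0; // squared size of the off-diagonal part
            for (int p = 0; p < n; p++)
                for (int q = p + 1; q < n; q++) off += m[p][q] * m[p][q];
            if (off < 1e-20) break;
            for (int p = 0; p < n; p++)
                for (int q = p + 1; q < n; q++) {
                    if (m[p][q] == 0.0) continue;
                    // Rotation angle that zeroes the entry m[p][q].
                    double theta = 0.5 * Math.atan(2 * m[p][q] / (m[q][q] - m[p][p]));
                    rotate(m, vec, p, q, Math.cos(theta), Math.sin(theta));
                }
        }
        double[] values = new double[n];
        for (int i = 0; i < n; i++) values[i] = m[i][i];
        return values;
    }

    /** Applies one Jacobi rotation in the (p, q) plane to m and accumulates it in vec. */
    private static void rotate(double[][] m, double[][] vec, int p, int q, double c, double s) {
        int n = m.length;
        for (int k = 0; k < n; k++) { // columns p and q of m
            double kp = m[k][p], kq = m[k][q];
            m[k][p] = c * kp - s * kq;
            m[k][q] = s * kp + c * kq;
        }
        for (int k = 0; k < n; k++) { // rows p and q of m
            double pk = m[p][k], qk = m[q][k];
            m[p][k] = c * pk - s * qk;
            m[q][k] = s * pk + c * qk;
        }
        for (int k = 0; k < n; k++) { // columns p and q of vec
            double kp = vec[k][p], kq = vec[k][q];
            vec[k][p] = c * kp - s * kq;
            vec[k][q] = s * kp + c * kq;
        }
    }

    /** Returns one (x, y) position per author inside an ellipse with semi-axes a and b (n >= 2). */
    static double[][] layout(double[][] adjacency, double a, double b) {
        int n = adjacency.length;
        double[][] vec = new double[n][n];
        double[] val = jacobiEigen(adjacency, vec);
        int first = 0; // index of the smallest eigenvalue
        for (int i = 1; i < n; i++) if (val[i] < val[first]) first = i;
        int second = (first == 0) ? 1 : 0; // index of the second smallest eigenvalue
        for (int i = 0; i < n; i++) if (i != first && val[i] < val[second]) second = i;
        double maxR = 0.0;
        for (int i = 0; i < n; i++) maxR = Math.max(maxR, Math.hypot(vec[i][first], vec[i][second]));
        double[][] pos = new double[n][2];
        for (int i = 0; i < n; i++) {
            pos[i][0] = a * vec[i][first] / maxR;  // assumed scaling, keeps points inside the ellipse
            pos[i][1] = b * vec[i][second] / maxR;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Two camps of authors, {0, 1} and {2, 3}, who only revise the other camp.
        double[][] adjacency = {{0, 0, 1, 1}, {0, 0, 1, 1}, {1, 1, 0, 0}, {1, 1, 0, 0}};
        for (double[] p : layout(adjacency, 2.0, 1.0)) System.out.println(Arrays.toString(p));
    }
}
```

In the two-camp example, the eigenvector of the smallest eigenvalue gives the two camps x-coordinates of opposite sign, which is the kind of separation the layout is meant to show. The sign of an eigenvector is arbitrary, so the picture may come out mirrored.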
Implementation
We chose to implement the visualization tool using web technologies because we thought it would be practical to offer the visualization online, just like the Wikipedia articles themselves. This decision turned out to be a big challenge, mainly because these technologies cannot easily handle the huge amount of data in the Wikipedia history. As a result, especially in the beginning, we had to overcome many minor and sometimes even major problems.
Achievements
- Extraction of revision information from Wikipedia XML files
- parsing large XML files in Java using a SAX parser
- writing the extracted data for single pages to JSON files
- providing JSON files with revision history information for pages with a large number of revisions
- Loading the prepared JSON files via an Applet and working around the security issues that come with that approach
- Complex computation of various revision network attributes used for visualization
- Creation and visualization of a revision-network-graph
- Placing of variable user-nodes
- Visualizing aggregated edit volume of a page over time in a bar chart
- Interactive selection of time intervals in the bar chart using an additional overview chart
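The extraction step listed above, reading revision information from a Wikipedia XML export with a SAX parser, might look roughly like the following sketch. It uses only the JDK's built-in SAX API and assumes the usual export layout, in which each `revision` element contains a `timestamp` and a `contributor` with either a `username` or an `ip`. The class name `RevisionExtractor` and the `{timestamp, author}` pairs it collects are illustrative assumptions. Writing the result to JSON (for which we used gson) is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RevisionExtractor extends DefaultHandler {
    /** Collected {timestamp, author} pairs in document order. */
    final List<String[]> revisions = new ArrayList<>();

    private final StringBuilder text = new StringBuilder();
    private boolean inRevision;
    private boolean capture; // true only inside the small elements we need
    private String timestamp;
    private String author;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("revision")) {
            inRevision = true;
            timestamp = null;
            author = null;
        }
        // Never buffer the (possibly huge) article text, only the fields of interest.
        capture = inRevision
                && (qName.equals("timestamp") || qName.equals("username") || qName.equals("ip"));
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (capture) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inRevision) return;
        if (qName.equals("timestamp")) {
            timestamp = text.toString().trim();
        } else if (qName.equals("username") || qName.equals("ip")) {
            author = text.toString().trim();
        } else if (qName.equals("revision")) {
            revisions.add(new String[] {timestamp, author});
            inRevision = false;
        }
        capture = false;
    }

    /** Streams the XML once and returns the collected revisions. */
    static List<String[]> parse(InputStream in) {
        try {
            RevisionExtractor handler = new RevisionExtractor();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.revisions;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse revision history", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>T</title>"
                + "<revision><timestamp>2010-01-01T00:00:00Z</timestamp>"
                + "<contributor><username>Alice</username></contributor><text>hello</text></revision>"
                + "</page></mediawiki>";
        for (String[] r : parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

Because SAX streams the document and the handler skips the article text, memory use stays small even for pages with a very long history.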
Used Libraries
- Processing.js and flot.js for the visualization
- The gson library for parsing the JSON files
- The ejml library for the matrix calculations
- The jQuery JavaScript library for solving various small JavaScript issues
Challenges
- Handling the huge XML files from Wikipedia containing the revision histories
- loading and parsing the XML files
- working efficiently with the huge amount of data
- Evading security issues with Applets and file handling
- Using the Processing.js library for visualization
- Communication between JavaScript, the Applet and Processing.js
- Dead Internet in Austria on the day of the hand-in
Open Issues
- Loading the XML history directly from Wikipedia pages
- Successful and fast eigenvector calculation for HUGE matrices
- Overlapping of the author ellipses, revision edges and labels