The goal of this project is to group documents with similar characteristics into clusters. Clustering poses several challenges, such as noisy data and outliers. These make it hard to find a fully automated clustering algorithm: every dataset differs, and a method that works well on one dataset will not necessarily fit new data. At the same time, the number of online documents has grown rapidly in recent years. To cluster such documents, Lee et al. developed a visual analysis system, iVisClustering, that performs interactive clustering of document data.

Overview of the implementation

We used the gensim package in Python for topic modeling and extended it with our own functions to make it more suitable for this task. The dashboard was implemented with Dash, which makes the views interactive for the user.

General Analysis Procedure: After the automatic clustering (into k clusters), the data is cleaned. The user can then perform cluster-level interactions, such as combining similar clusters and removing clusters. The meaning of each cluster is refined using the LDA inference algorithm. The last step is to fine-tune the clusters, for example by reviewing individual documents. With these five steps, the user can obtain meaningful clusters for the data.
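
The cluster-level interactions in this procedure can be illustrated with a small sketch in plain Python. It is not the project's code: the matrix `doc_topics` (one row of topic probabilities per document) and the helpers `assign_clusters`, `merge_clusters`, and `remove_cluster` are hypothetical names. Using the most probable topic as the cluster label, and renormalizing after a removal, are assumptions made for this example.

```python
def assign_clusters(doc_topics):
    """Label each document with the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topics]


def merge_clusters(doc_topics, keep, drop):
    """Merge topic `drop` into topic `keep` by adding its probability mass.

    The column `drop` is deleted, so later topic indices shift down by one.
    """
    merged = []
    for row in doc_topics:
        new_row = list(row)
        new_row[keep] += new_row[drop]
        del new_row[drop]
        merged.append(new_row)
    return merged


def remove_cluster(doc_topics, drop):
    """Drop topic `drop` and renormalize each row (assumed behaviour)."""
    result = []
    for row in doc_topics:
        rest = [p for i, p in enumerate(row) if i != drop]
        total = sum(rest)
        result.append([p / total for p in rest] if total > 0 else rest)
    return result


# Toy data: three documents, three topics.
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
]
labels_before = assign_clusters(doc_topics)
labels_after_merge = assign_clusters(merge_clusters(doc_topics, keep=1, drop=2))
without_first = remove_cluster(doc_topics, drop=0)
```

In this toy example the three documents start in three different clusters, and merging topic 2 into topic 1 puts the last two documents into the same cluster.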

Views of the dashboard

Cluster Relation View (A)

The Cluster Relation View represents the documents as colored points, where points with the same color belong to the same cluster. When the user clicks on a document, its title is shown on the left side and the document also appears in the Document View (F). If the user does not find the document relevant, it can be deleted by clicking the delete button. The cosine similarity slider colors the edges between documents whose cosine similarity is higher than the chosen value.
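
The edge selection driven by the slider can be sketched as follows. This is a minimal illustration, assuming each document is described by a numeric vector (for example its topic distribution); the function names `cosine_similarity` and `edges_above` and the toy vectors are hypothetical and not taken from the project.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def edges_above(vectors, threshold):
    """Return (i, j, similarity) for each pair whose similarity exceeds threshold."""
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim > threshold:
            edges.append((i, j, sim))
    return edges


# Toy vectors: documents 0 and 1 point in nearly the same direction.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
selected_edges = edges_above(vectors, threshold=0.9)
```

With a threshold of 0.9, only the edge between documents 0 and 1 is selected; raising or lowering the threshold plays the role of moving the slider.
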
Cluster Summary View (B)
If the user clicks on a cluster node in the Cluster Relation View (A), the Cluster Summary View appears. It gives a clearer overview of the clusters and their most frequent words. An unnecessary or unwanted cluster can be deleted by clicking the delete button. If two clusters are very similar, the user can merge the selected clusters.
Term Weight View (C)
If the user clicks on a cluster node in the Cluster Relation View (A), the Term Weight View appears. Its horizontal bar chart shows the words with the highest probability in the chosen cluster. The value of each bar is the probability of that word in the cluster.
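
Selecting the words for such a bar chart amounts to sorting the term probabilities of one cluster. The sketch below assumes the probabilities are available as a dictionary; the name `top_terms` and the toy values are hypothetical.

```python
def top_terms(term_probs, n=10):
    """Return the n (term, probability) pairs with the highest probability."""
    ranked = sorted(term_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Toy term probabilities for one cluster (made-up values).
cluster_terms = {"graph": 0.12, "layout": 0.08, "tree": 0.05, "node": 0.10}
bars = top_terms(cluster_terms, n=3)
```

Here `bars` holds the three most probable terms in descending order, ready to be passed to a bar chart.
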
Parallel Coordinates View (D)
A parallel coordinates plot is a simple way to visualize multi-dimensional data. Each line in the plot represents a document, and its color indicates the cluster the document belongs to. The slider above the plot lets the user filter out noisy documents.
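
One possible reading of this filter is sketched below. It assumes that each axis of the plot is a topic, that a document is "noisy" when even its strongest topic has a low probability, and that the slider sets this minimum probability; none of this is confirmed by the project description, and `filter_noisy` is a hypothetical name.

```python
def filter_noisy(doc_topics, min_confidence):
    """Keep the indices of documents whose strongest topic reaches min_confidence."""
    return [i for i, row in enumerate(doc_topics) if max(row) >= min_confidence]


# Toy data: document 1 is spread almost evenly over all topics.
topic_rows = [
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
    [0.10, 0.80, 0.10],
]
kept = filter_noisy(topic_rows, min_confidence=0.5)
```

With a minimum of 0.5, the evenly spread document 1 is filtered out and documents 0 and 2 remain.
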
Word Cloud (E)
The Word Cloud lets the user look at the most representative words of a cluster. The size of each word depends on its probability in the cluster, which helps the user get a better understanding of the chosen cluster.
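
How a probability could be turned into a word size is sketched below. The linear scaling between a minimum and a maximum font size is an assumption for illustration, and `font_sizes` is a hypothetical helper; the actual word cloud may scale differently.

```python
def font_sizes(term_probs, min_size=10, max_size=48):
    """Map each term's probability linearly onto a font size in points."""
    if not term_probs:
        return {}
    lowest = min(term_probs.values())
    highest = max(term_probs.values())
    span = highest - lowest
    sizes = {}
    for term, prob in term_probs.items():
        # If all probabilities are equal, every word gets the maximum size.
        scale = (prob - lowest) / span if span > 0 else 1.0
        sizes[term] = round(min_size + scale * (max_size - min_size))
    return sizes


# Toy term probabilities for one cluster (made-up values).
cloud_sizes = font_sizes({"graph": 0.12, "node": 0.10, "tree": 0.05})
```

The most probable word receives the maximum size, the least probable the minimum, and the others fall in between.
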
Document View (F)
This view shows the document itself, which helps the user understand why a certain document belongs to a cluster (in connection with the Cluster Relation View). It also highlights terms in different colors according to the topic each term belongs to.
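
The term highlighting can be sketched as a lookup from term to topic to color. The example below emits plain HTML with the standard library only; the mapping `term_topic`, the color list, and the function `highlight` are hypothetical, and the dashboard itself may build the highlighted text differently.

```python
import html


def highlight(text, term_topic, colors):
    """Wrap each known term in a <span> colored according to its topic."""
    pieces = []
    for token in text.split():
        key = token.strip(".,;:!?()").lower()  # ignore case and punctuation
        safe = html.escape(token)
        if key in term_topic:
            color = colors[term_topic[key] % len(colors)]
            pieces.append(f'<span style="color:{color}">{safe}</span>')
        else:
            pieces.append(safe)
    return " ".join(pieces)


# Toy mapping from term to topic index, and one color per topic.
term_topic = {"graph": 0, "users": 1}
colors = ["red", "blue"]
marked = highlight("Graph layout helps users", term_topic, colors)
```

Terms that are not assigned to any topic are left unstyled, so only topic-bearing words stand out in the document text.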

Corpus of documents

A collection of InfoVis and VAST papers published between 1997 and 2009, containing 454 documents. Download data

Links

Code Documentation

Readme/further information to run the program

Source Code (Github)

References

Lee et al. "iVisClustering: An Interactive Visual Document Clustering via Topic Modeling" 2012.

GENSIM topic modeling for humans

Interactive document clustering

via Topic Modeling

by Andras Dörömbözi and Timea Toth
