WordStream-Extension

Description

The WordStream Sentiment Analysis Visualization is an extension of the original WordStream tool, developed as part of the Visualization 2 course project. It adds two new tabs to WordStream dedicated to sentiment analysis, SentimentCloud and SentimentStream, which combine word cloud visualization with temporal analysis. The extension supports multiple datasets, interactive features, and customizable parameters for exploring how sentiment evolves over time.

This extension enhances WordStream’s capabilities by incorporating sentiment analysis, allowing users to:

Explore how sentiment associated with topics changes over time.
Compare sentiment trends across multiple categories or datasets.
Gain deeper insights into text-heavy datasets, such as news articles, social media posts, and other temporal data sources.

Paper

"WordStream" by Dang et al. (2019) presents an innovative interactive visualization tool for analyzing and illustrating the evolution of topics over time. It synthesizes two popular techniques, word clouds and stacked graphs, to create a hybrid visualization method that provides both temporal and spatial insights into text data. The tool is evaluated on datasets like political blogs, news articles, and academic publications.

The key contributions of the "WordStream" paper include the development of a hybrid visualization method that combines word clouds and stacked graphs. Word clouds represent important terms with varying font sizes based on frequency or significance, while stacked graphs depict temporal trends of topics, with stream layers representing the evolution of topic significance over time. The integration of word clouds within stream layers optimizes space usage and visually links terms to their corresponding time periods.

The design and implementation of the tool were carried out as an interactive prototype using D3.js, enabling users to explore topic trends dynamically. A space-sharing approach was introduced to maximize term placement efficiency while preserving the temporal context, and the tool allows customization of visual settings such as font scaling, number of displayed terms, and layout dimensions. The algorithms include a spiral placement algorithm for terms within stream layers, ensuring compactness and collision avoidance, with terms arranged to reflect their temporal context and stream orientation, providing an intuitive flow.
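
The spiral placement idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each term is an axis-aligned rectangle and that the available region is a plain rectangle, whereas WordStream places terms inside curved stream layers. The names `Box`, `place_on_spiral`, and `layout`, and the `step` and `spacing` parameters, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box (hypothetical helper, not from WordStream)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    def overlaps(self, other: "Box") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.x + self.w <= other.x
            or other.x + other.w <= self.x
            or self.y + self.h <= other.y
            or other.y + other.h <= self.y
        )


def place_on_spiral(w, h, placed, bounds, step=0.1, spacing=1.0, max_steps=20000):
    """Walk an Archimedean spiral outward from the centre of `bounds`.

    Returns the first Box of size (w, h) that lies inside `bounds` and
    collides with none of the boxes in `placed`, or None if none is found.
    """
    cx = bounds.x + bounds.w / 2
    cy = bounds.y + bounds.h / 2
    for i in range(max_steps):
        theta = i * step
        r = spacing * theta  # radius grows linearly with the angle
        candidate = Box(cx + r * math.cos(theta) - w / 2,
                        cy + r * math.sin(theta) - h / 2, w, h)
        inside = (candidate.x >= bounds.x
                  and candidate.y >= bounds.y
                  and candidate.x + w <= bounds.x + bounds.w
                  and candidate.y + h <= bounds.y + bounds.h)
        if inside and not any(candidate.overlaps(p) for p in placed):
            return candidate
    return None


def layout(sizes, bounds):
    """Place terms in the given order; terms that do not fit are skipped."""
    placed = []
    for w, h in sizes:
        box = place_on_spiral(w, h, placed, bounds)
        if box is not None:
            placed.append(box)
    return placed
```

Passing the term sizes in descending order of importance, for example `layout([(60, 20), (40, 15)], Box(0, 0, 200, 100))`, gives the most important terms the positions closest to the centre.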

The tool was evaluated in two ways. Quantitative metrics, such as compactness (the coverage efficiency of terms within layers), assessed layout quality across datasets. Qualitative feedback from informal studies with domain experts highlighted the tool’s usability for longitudinal trend analysis, as well as its limitations in handling highly cluttered streams and in showing term relationships explicitly.

Implementation

The implementation of the WordStream-Extension project involved several key components, including data acquisition and preprocessing, extending the existing WordStream visualization, and enhancing interactivity using D3.js.

Data Acquisition and Preprocessing

We integrated three new datasets into WordStream: Rotten Tomatoes movie reviews, CNN news articles, and Reddit posts from the /datasets subreddit. The preprocessing pipeline for these datasets included:

Data Collection: Acquired datasets from trusted sources, ensuring a diverse range of text corpora.
Data Cleaning: Performed text normalization, including lowercasing, removal of special characters, and handling of missing values to ensure consistency across datasets.
Keyword Extraction: Used the spaCy library in Python to extract relevant keywords from each dataset, enhancing the quality of the visualization.
Sentiment Analysis: Applied the VADER Sentiment Analyzer to compute sentiment scores for each text entry, categorizing sentiments as positive, neutral, or negative.
Additional Data Transformation: Required for the new visualizations we built. The data was aggregated by year, computing the average sentiment per text and the frequency of each keyword per year to support temporal analysis.
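
The cleaning and yearly aggregation steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline. It assumes sentiment scores and keywords have already been computed (for example with VADER and spaCy), and the record keys `year`, `text`, `sentiment`, and `keywords` are hypothetical. Skipping records with missing values is also an assumption about how such values are handled.

```python
import re
from collections import Counter, defaultdict
from statistics import mean


def normalize(text):
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def aggregate_by_year(records):
    """Return the average sentiment and keyword frequencies for each year.

    Each record is a dict with the hypothetical keys 'year', 'text',
    'sentiment' (a precomputed score), and 'keywords' (a list of strings).
    Records with missing text or sentiment are skipped (an assumption).
    """
    scores = defaultdict(list)
    keyword_counts = defaultdict(Counter)
    for rec in records:
        if not rec.get("text") or rec.get("sentiment") is None:
            continue
        year = rec["year"]
        scores[year].append(rec["sentiment"])
        keyword_counts[year].update(normalize(k) for k in rec.get("keywords", []))
    return {
        year: {
            "avg_sentiment": mean(scores[year]),
            "keywords": dict(keyword_counts[year]),
        }
        for year in sorted(scores)
    }
```

The result maps each year to one average score and one keyword-frequency table, which is the shape a per-year word cloud needs.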

Other datasets, containing social media posts and fact-check articles, were also added. However, for reasons we could not determine, these datasets proved difficult to visualize in the original WordStream view. So that the effort invested in them is not wasted, they can be visualized only in the SentimentCloud and SentimentStream tabs.

Extending the WordStream Visualization

Building upon the original WordStream tool, we introduced two new visualization tabs: SentimentCloud and SentimentStream.

Enhancing Interactivity with D3.js

To improve user engagement and interactivity, we used D3.js to implement the following:

Dynamic Sliders: Interactive sliders for sentiment thresholds and word ranking update the visualization in real time based on user input.
Diverging Color Schemes: The color scheme was changed from categorical to diverging to better represent the spectrum of sentiments.

For an explanation of the classes and functions we created and used, see Code Documentation.

Program

Running the Application

To run the WordStream-Extension application on the web, click here. Note that this version does not display some of the data because of GitHub's file size limit. To view WordStream-Extension with all data, follow the steps described below.

Ensure you have Python installed on your machine. Then, open the command line, navigate to the folder where WordStream-Extension is located, and execute the following command:

python -m http.server 8000

This command starts a simple HTTP server on port 8000. Navigate to http://localhost:8000 in your web browser to access the application.

Additional Datasets

We expanded the original WordStream with several new datasets to enhance analysis capabilities:

Rotten Tomatoes Movie Reviews
CNN News Articles
Reddit Posts from the /datasets Subreddit

Each dataset underwent extensive preprocessing, including data cleaning, keyword extraction with spaCy, sentiment analysis using VADER, and aggregation by year to facilitate temporal visualization.

Data Cleaning: Standardized text formats and removed inconsistencies.
Keyword Extraction: Identified significant terms to be visualized.
Sentiment Analysis: Calculated sentiment scores to categorize words by sentiment.
Data Aggregation: Compiled yearly sentiment averages and keyword frequencies.

SentimentCloud Tab

The SentimentCloud tab offers an interactive word cloud that visualizes sentiment scores:

Color-Coded Words: Red indicates negative sentiment, blue indicates positive sentiment.
Dataset Selection: Users can choose from multiple datasets to visualize sentiment dynamics.
Category Filtering: Allows filtering by categories such as 'person', 'organization', or 'country'.
Sentiment Threshold Sliders: Users can adjust sliders to set custom thresholds for classifying sentiment as positive or negative.
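
The effect of the threshold sliders can be sketched as a simple classification rule. This is an illustration only, not the extension's D3.js code. The function names are hypothetical, and the default cut-offs of -0.05 and 0.05 are the values commonly used with VADER's compound score, not necessarily the extension's defaults.

```python
def classify(score, neg_threshold=-0.05, pos_threshold=0.05):
    """Label a sentiment score using two adjustable thresholds."""
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


def split_words(word_scores, neg_threshold=-0.05, pos_threshold=0.05):
    """Group words by sentiment class, given a {word: score} mapping."""
    groups = {"positive": [], "neutral": [], "negative": []}
    for word, score in word_scores.items():
        groups[classify(score, neg_threshold, pos_threshold)].append(word)
    return groups
```

Moving a slider corresponds to calling `split_words` again with new threshold values and redrawing the cloud from the resulting groups.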

SentimentStream Tab

The SentimentStream tab integrates temporal analysis with sentiment visualization:

Yearly Word Clouds: For each year, separate word clouds display positive and negative sentiments.
Interactive Line Charts: Clicking on a word generates a line chart showing the sentiment evolution of that word over time.
Multiple Category Comparison: Users can select multiple categories to compare their sentiment trends across different datasets.
Adjustable Sentiment Thresholds: As in the SentimentCloud tab, sliders allow dynamic customization of sentiment classification.
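
The data behind a per-word line chart can be sketched as a yearly series of average scores. This is a minimal illustration, not the extension's code. The function name and the record keys `year`, `sentiment`, and `keywords` are hypothetical, and averaging the scores of all texts that mention the word is an assumption about how a word's sentiment is derived.

```python
from collections import defaultdict
from statistics import mean


def sentiment_series(records, word):
    """Return [(year, average sentiment)] for texts whose keywords include `word`.

    Each record is a dict with the hypothetical keys 'year', 'sentiment',
    and 'keywords'. Years in which the word does not occur are omitted.
    """
    by_year = defaultdict(list)
    for rec in records:
        if word in rec["keywords"]:
            by_year[rec["year"]].append(rec["sentiment"])
    return [(year, mean(values)) for year, values in sorted(by_year.items())]
```

Each `(year, score)` pair becomes one point on the line chart, so a word that is absent in a given year simply has no point there.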

Sentiment scores are visually represented using a diverging color scheme for clarity:

Positive Sentiments: Mapped to shades of blue, with intensity proportional to the sentiment score.
Negative Sentiments: Mapped to shades of red, similarly scaled.
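
A diverging mapping of this kind can be sketched as below. It is an illustration only: the extension uses D3.js color schemes, whereas this sketch assumes scores in the range -1 to 1 and a plain linear fade from white toward pure red or pure blue. The function name `sentiment_color` is hypothetical.

```python
def sentiment_color(score):
    """Map a score in [-1, 1] to a hex color on a red-white-blue scale.

    0 maps to white; intensity grows linearly with the absolute score
    (an assumption, not the extension's exact D3 scheme).
    """
    s = max(-1.0, min(1.0, score))  # clamp out-of-range scores
    fade = round(255 * (1 - abs(s)))  # channels that fade as intensity grows
    if s >= 0:
        r, g, b = fade, fade, 255  # toward blue for positive scores
    else:
        r, g, b = 255, fade, fade  # toward red for negative scores
    return f"#{r:02x}{g:02x}{b:02x}"
```

Because both halves of the scale meet at white, near-neutral words stay visually quiet while strongly positive or negative words stand out.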

Links

Paper (Dang et al. (2019))

Program

Code (GitHub repository)

Code Documentation

References

Dang, T., Nguyen, H. N., & Pham, V. (2019). WordStream: Interactive Visualization for Topic Evolution. In J. Johansson, F. Sadlo, & G. E. Marai (Eds.), EuroVis 2019 - Short Papers. The Eurographics Association. https://doi.org/10.2312/evs.20191178

iDataVisualizationLab. (2019). WordStream [Source code]. GitHub. https://github.com/iDataVisualizationLab/WordStream

Cui, W., Liu, S., Tan, L., Shi, C., Song, Y., Gao, Z., Qu, H., & Tong, X. (2011). TextFlow: Towards Better Understanding of Evolving Topics in Text. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2412-2421. https://doi.org/10.1109/TVCG.2011.239

Liu, S., Zhou, M. X., Pan, S., Qian, W., Cai, W., & Lian, X. (2009). Interactive, topic-based visual text summarization and analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (pp. 543-552). Association for Computing Machinery. https://doi.org/10.1145/1645953.1646023

Wang, X., Liu, S., Chen, Y., Peng, T.-Q., Su, J., Yang, J., & Guo, B. (2016). How ideas flow across multiple social groups. In 2016 IEEE Conference on Visual Analytics Science and Technology (VAST) (pp. 51-60). IEEE. https://doi.org/10.1109/VAST.2016.7883511

Rotten Tomatoes Movies and Critic Reviews Dataset

CNN Articles After Basic Cleaning

The Reddit Dataset

Social Media Sentiments Analysis Dataset

Fake and Real News Dataset