Welcome to Viz2 2021 Lee et al LDA’s documentation!¶

LDA class methods:¶

class LDA_simplified.LDA(k, path)[source]¶

Initialize the class: load and clean the data, then run the first LDA clustering with the given number of clusters k.

Parameters

k – desired number of clusters used in LDA; must be at least 2 and at most 10
path – load path of the data file

build_bag_of_words_model()[source]¶

Build the bag of words model from the cleaned data and the dictionary of the unique words

Returns: dictionary of all words, bag of words model of the documents
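
To make the two return values concrete, here is a minimal standard-library sketch of a bag-of-words model. The helper name `build_bow` and the `(word_id, count)` pair layout are assumptions for illustration only; the class itself may rely on a topic-modelling library and differ in detail.

```python
from collections import Counter

def build_bow(cleaned_docs):
    """Hypothetical helper: map each unique word to an id and count words per document."""
    # Dictionary of unique words; ids are assigned in sorted order for reproducibility.
    vocab = sorted({word for tokens in cleaned_docs.values() for word in tokens})
    word_to_id = {word: idx for idx, word in enumerate(vocab)}
    # One sparse vector per document: a sorted list of (word_id, count) pairs.
    bow = {
        title: sorted((word_to_id[w], c) for w, c in Counter(tokens).items())
        for title, tokens in cleaned_docs.items()
    }
    return word_to_id, bow

docs = {"doc a": ["topic", "model", "topic"], "doc b": ["graph", "model"]}
word_to_id, bow = build_bow(docs)
```

Each document becomes a short list of pairs instead of a dense vector over the whole vocabulary, which is the usual input shape for LDA implementations.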

build_term_higlights(doc_input)[source]¶

Setup the document view by highlighting the most relevant terms with their topic correspondence

Parameters: doc_input – Id of the document
Returns: str(): document as string, with the proper html tags assigned for specific terms (will be used in the visualization by dash.DangerouslySetInnerHTML())
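
The highlighting step could look roughly like the sketch below. The helper name `highlight_terms`, the whitespace tokenization, and the `<span>` markup are assumptions; the tags the class actually emits are not documented here.

```python
import html

def highlight_terms(text, term_colors):
    """Hypothetical helper: wrap known terms in colored <span> tags."""
    parts = []
    for token in text.split():
        # Escape first, so document text cannot inject markup of its own.
        escaped = html.escape(token)
        # Look the term up in lowercase, ignoring surrounding punctuation.
        color = term_colors.get(token.strip(".,;:!?()").lower())
        if color is not None:
            escaped = f'<span style="background-color: {color}">{escaped}</span>'
        parts.append(escaped)
    return " ".join(parts)

marked = highlight_terms("Topic models & graphs", {"topic": "rgba(31, 119, 180, 0.4)"})
```

Escaping matters because the resulting string is rendered as raw HTML rather than as plain text.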

calculate_cosine_similarity()[source]¶

Calculate the cosine similarity between the documents with the sklearn implementation. Duplicates are removed by keeping only the upper triangle elements of the result, in order to avoid duplicate edges on the graph.

Returns: cosine_sim_matrix, an np.array of shape (number_of_documents, number_of_documents)
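
The class uses the sklearn implementation; the sketch below is a pure-Python stand-in that only illustrates the idea of keeping the upper triangle. The helper name `cosine_upper_triangle` is hypothetical, and excluding the diagonal (self-similarity) is an assumption.

```python
import math

def cosine_upper_triangle(vectors):
    """Hypothetical helper: pairwise cosine similarity, strict upper triangle only."""
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # j > i, so each document pair appears exactly once
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Toy document vectors; documents 0 and 1 are identical.
doc_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sim = cosine_upper_triangle(doc_vectors)
```

Because the lower triangle stays at zero, a pair (i, j) can never yield a second edge (j, i) when the matrix is later turned into graph edges.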

clean_lemmatize_data()[source]¶

Clean the data following the paper's description, plus some further useful steps:

tokenization
removal of numeric characters
removal of punctuation
removal of stopwords
lemmatization of the cleaned data

Returns: dict<document_number, title> , dict<title, cleaned_text>
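
The cleaning steps listed above can be sketched on a single string as follows. The stopword set and the lemma lookup table are tiny illustrative stand-ins, not the resources the class uses, and the helper name `clean_text` is hypothetical. A real lemmatizer would replace the lookup table.

```python
import re
import string

# Assumption: toy stopword list and lemma table, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "are"}
LEMMAS = {"models": "model", "topics": "topic", "documents": "document"}

def clean_text(text):
    """Hypothetical helper mirroring the listed cleaning steps."""
    tokens = text.lower().split()                             # tokenize on whitespace
    tokens = [re.sub(r"\d+", "", t) for t in tokens]          # remove numeric characters
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(strip_punct) for t in tokens]       # remove punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # drop empties and stopwords
    return [LEMMAS.get(t, t) for t in tokens]                 # lemmatize via lookup stand-in

cleaned = clean_text("The 25 topics of the documents, in 2010: models!")
```

The order matters: stripping digits and punctuation first leaves empty tokens behind, which the stopword pass then discards.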

color_assign_to_topic(x)[source]¶

Parameters: x – topic id
Returns: color assigned to topic x (dict)

color_assign_to_topic_with_opacity(x)[source]¶

Assign color to topics with opacity

Parameters: x – topic id
Returns: opaque color for term highlighting
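
One plausible way to attach an opacity to a cluster color is to convert it to a CSS `rgba()` string. The helper name `hex_to_rgba` and the `#rrggbb` input format are assumptions; the color representation used inside the class is not documented here.

```python
def hex_to_rgba(hex_color, opacity):
    """Hypothetical helper: turn '#rrggbb' into a CSS rgba() string."""
    hex_color = hex_color.lstrip("#")
    # Parse the three two-digit hexadecimal channels.
    red, green, blue = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red}, {green}, {blue}, {opacity})"

color = hex_to_rgba("#1f77b4", 0.4)
```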

delete_cluster()[source]¶

Remove cluster from lda: modifying the state of the model

Returns: lda with the removed cluster

property filter_data¶

Filter for the paper-defined time interval (1994-2010) and separate the title from the document description.

Store in dictionary format: k: title, v: document text

Returns: Filtered data stored in dictionary<title, document text>
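
A minimal sketch of the interval filter, assuming each record carries `title`, `year`, and `text` fields and that both endpoints are inclusive. The record layout and the helper name `filter_by_year` are hypothetical.

```python
def filter_by_year(records, first_year=1994, last_year=2010):
    """Hypothetical helper: keep records inside the inclusive year interval."""
    return {
        rec["title"]: rec["text"]
        for rec in records
        if first_year <= rec["year"] <= last_year
    }

records = [
    {"title": "A", "year": 1993, "text": "too early"},
    {"title": "B", "year": 1994, "text": "kept"},
    {"title": "C", "year": 2011, "text": "too late"},
]
filtered = filter_by_year(records)
```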

filter_parall_coords_topic_contribution(value)[source]¶

Filter the parallel coordinates based on the input value (only entries > value are kept). The document-topic dataframe is filtered as well, so that the same filter applies in cytoscape.

Parameters: value – parallel coordinates filter threshold
Returns: filtered parallel coordinates and topic dataframe

format_topics_sentence()[source]¶

Build a pandas dataframe with several useful pieces of information: document-topic assignment, contribution, assigned color and keywords

Returns: pd.DataFrame(‘Document_No’, ‘Dominant_Topic’, ‘Topic_Perc_Contrib’, ‘Keywords’, ‘Text’, ‘Title’, ‘color’)

get_col()[source]¶

Returns: cluster colors (extracted from matplotlib colors)

get_color_with_opacity(id, is_node_id)[source]¶

Parameters

id – id of the node /cluster
is_node_id – bool, whether the selected cytoscape element is a document node or a cluster node

Returns

opaque color for the background

get_colormap_for_cluster()[source]¶

Build colormap for wordcloud

Returns: colormap related to current cluster (assigned by the already related cluster color)

get_document_nodes()[source]¶

Build a dictionary of document nodes for the cytoscape network visualization

Returns: dict<document_id,(document_title, document_color: color of the cluster,cluster)>

get_filtered_edges()[source]¶

Get the visible edges (edges between document nodes over the cosine similarity threshold)

Returns: [(node_0, node_1, cosine_similarity value),…]
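
The returned edge list can be derived from an upper-triangular similarity matrix as sketched below. The helper name `edges_above_threshold`, the strict `>` comparison, and the use of row and column indices as node ids are assumptions.

```python
def edges_above_threshold(sim, threshold):
    """Hypothetical helper: list (i, j, similarity) for upper-triangle entries above threshold."""
    n = len(sim)
    return [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only, so no duplicate edges
        if sim[i][j] > threshold
    ]

sim = [
    [0.0, 0.9, 0.2],
    [0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0],
]
edges = edges_above_threshold(sim, threshold=0.5)
```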

get_k()[source]¶

Returns: number of clusters specified

get_lda()[source]¶

Model LDA with the given cluster number and the built up bag of words model

Returns: lda model

get_most_relevant_topics()[source]¶

Extract the 4 most relevant terms for each topic (used for the topic node representation)

Returns: dict<cluster_id, [top_4_terms_for_cluster]>
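
Selecting the top terms amounts to sorting each topic's words by probability. The sketch below assumes the word probabilities are available as a nested dictionary; the helper name `top_terms` and that input layout are hypothetical.

```python
def top_terms(word_probs, n=4):
    """Hypothetical helper: return the n highest-probability words per topic."""
    return {
        topic_id: [
            word
            for word, _ in sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
        ]
        for topic_id, probs in word_probs.items()
    }

word_probs = {
    0: {"graph": 0.30, "node": 0.25, "edge": 0.20, "layout": 0.15, "color": 0.10},
    1: {"topic": 0.50, "word": 0.30, "corpus": 0.20},
}
top = top_terms(word_probs)
```

A topic with fewer than n words simply returns all of them, ordered by probability.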

get_parall_coord_df()[source]¶

Build pandas dataframe for the parallel coordinates view

Returns: pandas dataframe for documents with their dominant topic

get_top_n_word_probs_for_topic_i(topic_id, n=10)[source]¶

Term weight table input: extract the top n words for the currently selected topic

Parameters

topic_id – id of the cluster
n – number of words to be extracted

Returns

dataframe with the words and the related probabilities

get_top_topic_for_words()[source]¶

Build a topic-word-probability dataframe with the related opaque color for term highlighting in the document view

Returns: pandas.DataFrame(‘Word’, ‘Color’)

property get_topic_nodes¶

Build a dictionary of the topics, which will be the input for the cytoscape nodes; a random position is generated for each topic as well

Returns: dict<topic_id, (color, position)>

get_word_probabilities()[source]¶

Extract all the word probabilities from the lda model for each cluster

Returns: dict<topic_id, [word_probabilities]>

merge_cluster(cluster_ids)[source]¶

Merge the clusters selected from the checklist by summing their probabilities row-wise

Parameters: cluster_ids – cluster ids selected in checklist
Returns: model with merged clusters
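
As a rough illustration of the row-wise sum, the sketch below merges rows of a small topic-word weight matrix. The helper name `merge_rows`, the list-of-lists layout, and appending the merged row at the end are assumptions; whether the class renormalizes the merged row is not stated here.

```python
def merge_rows(topic_word, cluster_ids):
    """Hypothetical helper: replace the selected rows by their element-wise sum."""
    selected = set(cluster_ids)
    # Sum the selected rows column by column.
    merged = [sum(column) for column in zip(*(topic_word[i] for i in sorted(selected)))]
    # Keep the untouched rows in their original order, then append the merged row.
    kept = [row for i, row in enumerate(topic_word) if i not in selected]
    return kept + [merged]

topic_word = [
    [1, 0, 2],
    [0, 3, 1],
    [4, 4, 4],
]
merged = merge_rows(topic_word, [0, 1])
```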

read_data()[source]¶

Read the original dataset with the BeautifulSoup XML extractor

Returns: data stored in a BeautifulSoup instance

remove_document(value)[source]¶

Remove documents by clicking on the delete document button

Parameters: value – id of the document node
Returns: updated dictionary of nodes, by removing the marked document

reset_settings()[source]¶

Reset the view and the lda class itself with the original cluster number

set_indexed_topic_node_df()[source]¶

Build the topic dataframe in indexed format, for faster searching

Returns: indexed df

update_cosine_sim()[source]¶

Update cosine similarities

update_lda()[source]¶

Re-cluster with the new cluster number (reaction to the “update” button in the app)

update_lda_related_class_elements()[source]¶

Update the relevant class components after cluster merge, cluster delete, or re-clustering steps

Visualization methods:¶

visualization.build_cluster_merge_list()[source]¶

Prepare the checklist for the merge cluster functionality

Returns: dash checklist content with proper label

visualization.build_cluster_summary_view()[source]¶

Prepare the data for the cluster summary view

Returns: data in a format that the cluster summary view can consume

visualization.get_doc_topic_edges()[source]¶

Returns: invisible edges between document nodes and their dominant topic

visualization.get_graph_cos_sim_edges()[source]¶

Returns: edges between documents based on cosine similarity

visualization.get_graph_document_nodes()[source]¶

Returns: graph document nodes with the proper coloring in cytoscape format
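
Cytoscape elements are plain dictionaries with a `data` entry (holding at least an `id`) and an optional `classes` string that a stylesheet can target. The sketch below converts a document-node dictionary into that shape; the helper name `to_cytoscape_nodes` and the `cluster-<n>` class naming are assumptions, and the color is assumed to be applied through the stylesheet rather than stored on the node.

```python
def to_cytoscape_nodes(document_nodes):
    """Hypothetical helper: convert {doc_id: (title, color, cluster)} into Cytoscape elements."""
    return [
        {
            "data": {"id": str(doc_id), "label": title},
            # Assumed class naming; a stylesheet selector would map it to the cluster color.
            "classes": f"cluster-{cluster}",
        }
        for doc_id, (title, _color, cluster) in document_nodes.items()
    ]

nodes = to_cytoscape_nodes({0: ("Doc A", "#1f77b4", 2)})
```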

visualization.get_graph_topic_nodes()[source]¶

Extract the color from the class settings (it will be defined in the stylesheet); the label is the top 4 words

Returns: topic nodes in cytoscape format

visualization.plot_wordcloud(number_of_words=20)[source]¶

Wordcloud plot

Parameters: number_of_words – number of words to be plotted (default: 20)
Returns: Wordcloud plot (image, not interactive, but the words are not overlapped)

visualization.update_stylesheet()[source]¶

Update the stylesheet: define the class settings (colors etc.) for all the new clusters

Returns: Updated graph stylesheet
