Welcome to Viz2 2021 Lee et al LDA’s documentation!¶
LDA class methods:¶
- class LDA_simplified.LDA(k, path)[source]¶
Initialize the class: load and clean the data, then run the first LDA clustering with the given number of clusters k.
- Parameters
k – desired number of clusters used in LDA; should be at least 2 and at most 10
path – load path of the file
- build_bag_of_words_model()[source]¶
Build the dictionary of unique words and the bag-of-words model from the cleaned data
- Returns
dictionary of all words, bag of words model of the documents
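As an illustration of the two return values, here is a minimal standard-library sketch. It is not the class's implementation (which may rely on a topic-modelling library); the function name `build_bag_of_words` and the sample documents are assumptions made for this example.

```python
from collections import Counter


def build_bag_of_words(cleaned_docs):
    """Return (dictionary, corpus) for a list of token lists.

    dictionary maps each unique word to an integer id (hypothetical layout);
    corpus holds one sorted list of (word_id, count) pairs per document.
    """
    dictionary = {}
    for tokens in cleaned_docs:
        for token in tokens:
            # Assign ids in order of first appearance.
            dictionary.setdefault(token, len(dictionary))

    corpus = []
    for tokens in cleaned_docs:
        counts = Counter(dictionary[token] for token in tokens)
        corpus.append(sorted(counts.items()))
    return dictionary, corpus


docs = [["topic", "model", "topic"], ["graph", "model"]]
dictionary, corpus = build_bag_of_words(docs)
```

Each document becomes a sparse list of (word id, count) pairs, which is the usual input shape for LDA training.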
- build_term_higlights(doc_input)[source]¶
Setup the document view by highlighting the most relevant terms with their topic correspondence
- Parameters
doc_input – Id of the document
- Returns
str(): document as a string, with the proper HTML tags assigned to specific terms (used in the visualization by dash.DangerouslySetInnerHTML())
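The highlighting step can be pictured roughly as follows. This is a sketch under stated assumptions: the helper `highlight_terms`, the `term_colors` mapping, and the use of `<span>` tags with a background color are all hypothetical and not taken from the class.

```python
import html


def highlight_terms(text, term_colors):
    """Wrap every word found in term_colors in a colored <span> tag.

    term_colors is a hypothetical mapping: lowercase term -> CSS color.
    """
    pieces = []
    for word in text.split():
        color = term_colors.get(word.lower())
        escaped = html.escape(word)  # keep the raw text safe inside HTML
        if color is None:
            pieces.append(escaped)
        else:
            pieces.append(
                f'<span style="background-color: {color}">{escaped}</span>'
            )
    return " ".join(pieces)


marked = highlight_terms("Topic models rock", {"topic": "rgba(255, 0, 0, 0.4)"})
```

Escaping the words before wrapping them matters here because the resulting string is rendered as raw HTML.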
- calculate_cosine_similarity()[source]¶
Calculate the cosine similarity between the documents with the sklearn implementation. Duplicates are removed by keeping only the upper-triangle elements of the result, to avoid duplicate edges in the graph
- Returns
cosine_sim_matrix: np.array of shape (number_of_documents, number_of_documents)
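To make the upper-triangle idea concrete, the following pure-Python sketch computes pairwise cosine similarity and fills only the entries above the diagonal. It stands in for the sklearn call named above; the function name and the choice to leave the diagonal at zero are assumptions.

```python
import math


def cosine_similarity_upper(vectors):
    """Pairwise cosine similarity, stored only above the diagonal.

    Entries on and below the diagonal stay 0.0 (an assumption of this
    sketch), so each document pair appears exactly once.
    """
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if norms[i] == 0 or norms[j] == 0:
                continue  # similarity with an all-zero vector is left at 0.0
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


sim = cosine_similarity_upper([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

Because the matrix is symmetric, the lower triangle carries no extra information, which is why dropping it removes duplicate edges.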
- clean_lemmatize_data()[source]¶
Clean the data following the paper's description, plus some further useful steps:
tokenization
removal of numeric characters
removal of punctuation
removal of stopwords
lemmatization of the ‘cleaned data’
- Returns
dict<document_number, title> , dict<title, cleaned_text>
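The cleaning steps listed above can be sketched with the standard library alone. The stopword set and the lemma lookup table below are tiny hypothetical stand-ins; a real pipeline would normally use an NLP library's stopword list and lemmatizer.

```python
import re
import string

# Illustrative stand-ins, not the lists used by the class.
STOPWORDS = {"the", "of", "and", "a", "in", "are", "for"}
LEMMAS = {"models": "model", "topics": "topic", "documents": "document"}


def clean_and_lemmatize(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    tokens = text.lower().split()                      # tokenize on whitespace
    tokens = [re.sub(r"\d+", "", t) for t in tokens]   # remove numeric characters
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]      # remove punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]          # lookup-based lemmatization


cleaned = clean_and_lemmatize("The 3 topics of LDA models, in 2010!")
```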
- color_assign_to_topic_with_opacity(x)[source]¶
Assign color to topics with opacity
- Parameters
x – topic id
- Returns
opaque color for term highlighting
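One plausible way to attach an opacity to a topic color is to convert a hex color into a CSS `rgba(...)` string. The palette `TOPIC_COLORS`, the default opacity of 0.4, and both function names are assumptions for illustration only.

```python
# Hypothetical palette: topic id -> hex color.
TOPIC_COLORS = {0: "#1f77b4", 1: "#ff7f0e"}


def hex_to_rgba(hex_color, opacity=0.4):
    """Convert '#rrggbb' into a CSS 'rgba(r, g, b, opacity)' string."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {opacity})"


def color_for_topic(topic_id, opacity=0.4):
    """Look up the topic's base color and apply the opacity."""
    return hex_to_rgba(TOPIC_COLORS[topic_id], opacity)
```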
- delete_cluster()[source]¶
Remove a cluster from the LDA model, modifying the state of the model
- Returns
LDA model without the removed cluster
- property filter_data¶
Filter for the paper-defined time interval (1994-2010) and separate the title from the document description.
Store the result in dictionary format: k: title, v: document text
- Returns
Filtered data stored in dictionary<title, document text>
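A rough sketch of this filter is shown below. The record layout (dicts with `year`, `title`, and `abstract` keys) is hypothetical, and treating both interval bounds as inclusive is an assumption.

```python
def filter_by_year(records, first_year=1994, last_year=2010):
    """Keep records inside the interval and map title -> document text.

    Each record is assumed to be a dict with 'year', 'title' and
    'abstract' keys (a made-up layout for this example).
    """
    return {
        record["title"]: record["abstract"]
        for record in records
        if first_year <= record["year"] <= last_year
    }


records = [
    {"year": 1993, "title": "Too early", "abstract": "..."},
    {"year": 1994, "title": "First kept", "abstract": "text a"},
    {"year": 2011, "title": "Too late", "abstract": "..."},
]
filtered = filter_by_year(records)
```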
- filter_parall_coords_topic_contribution(value)[source]¶
Filter the parallel coordinates based on the input value (entries > value are kept); the document-topic dataframe is filtered as well, so the filter also applies in cytoscape
- Parameters
value – parallel coordinates filter threshold
- Returns
filtered parallel coordinates and topic dataframe
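The threshold rule can be illustrated on plain dictionaries instead of dataframes. The row layout and the use of the `Topic_Perc_Contrib` field as the filtered quantity are assumptions of this sketch.

```python
def filter_by_contribution(rows, value):
    """Keep only rows whose topic contribution is strictly above value."""
    return [row for row in rows if row["Topic_Perc_Contrib"] > value]


rows = [
    {"Document_No": 0, "Topic_Perc_Contrib": 0.91},
    {"Document_No": 1, "Topic_Perc_Contrib": 0.50},
    {"Document_No": 2, "Topic_Perc_Contrib": 0.32},
]
kept = filter_by_contribution(rows, 0.5)
```

The comparison is strict, so a row sitting exactly on the threshold is dropped.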
- format_topics_sentence()[source]¶
Build a pandas dataframe with several useful pieces of information: document-topic membership, contribution, assigned color, and keywords
- Returns
pd.DataFrame(‘Document_No’, ‘Dominant_Topic’, ‘Topic_Perc_Contrib’, ‘Keywords’, ‘Text’, ‘Title’, ‘color’)
- get_color_with_opacity(id, is_node_id)[source]¶
- Parameters
id – id of the node / cluster
is_node_id – bool, whether the selected cytoscape element is a document node or a cluster node
- Returns
opaque color for the background
- get_colormap_for_cluster()[source]¶
Build colormap for wordcloud
- Returns
colormap related to current cluster (assigned by the already related cluster color)
- get_document_nodes()[source]¶
Build the dictionary of document nodes for the cytoscape network visualization
- Returns
dict<document_id,(document_title, document_color: color of the cluster,cluster)>
- get_filtered_edges()[source]¶
Get the visible edges (edges between document nodes whose cosine similarity is above the threshold)
- Returns
[(node_0, node_1, cosine_similarity value),…]
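A minimal sketch of how such an edge list could be derived from an upper-triangle similarity matrix follows. The function name and the strict `>` comparison against the threshold are assumptions.

```python
def filtered_edges(sim, threshold):
    """Return (node_0, node_1, similarity) for pairs above the threshold.

    Only the upper triangle is scanned, so each pair is reported once.
    """
    edges = []
    for i, row in enumerate(sim):
        for j in range(i + 1, len(row)):
            if row[j] > threshold:
                edges.append((i, j, row[j]))
    return edges


sim = [
    [0.0, 0.9, 0.2],
    [0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0],
]
edges = filtered_edges(sim, 0.5)
```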
- get_lda()[source]¶
Fit the LDA model with the given number of clusters and the previously built bag-of-words model
- Returns
lda model
- get_most_relevant_topics()[source]¶
Extract the 4 most relevant terms for each topic (used for the topic node representation)
- Returns
dict<cluster_id, [top_4_terms_for_cluster]>
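For illustration, picking the top terms per topic amounts to sorting each topic's word probabilities. The input layout (topic id mapped to a word-to-probability dict) and the function name are hypothetical.

```python
def most_relevant_terms(topic_word_probs, n=4):
    """Return {topic_id: [n most probable words]}.

    topic_word_probs is assumed to map topic_id -> {word: probability}.
    """
    result = {}
    for topic_id, probs in topic_word_probs.items():
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        result[topic_id] = [word for word, _ in ranked[:n]]
    return result


probs = {
    0: {"graph": 0.10, "topic": 0.40, "model": 0.30, "node": 0.05, "edge": 0.15},
}
top_terms = most_relevant_terms(probs)
```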
- get_parall_coord_df()[source]¶
Build a pandas dataframe for the parallel coordinates view
- Returns
pandas dataframe for documents with their dominant topic
- get_top_n_word_probs_for_topic_i(topic_id, n=10)[source]¶
Term weight table input: extract the top n words for the currently selected topic
- Parameters
topic_id – id of the cluster
n – number of words to be extracted
- Returns
dataframe with the words and the related probabilities
- get_top_topic_for_words()[source]¶
Build a topic-word-probability dataframe with the related opaque color for term highlighting in the document view
- Returns
pandas.DataFrame(‘Word’, ‘Color’)
- property get_topic_nodes¶
Build the dictionary of topics that serves as input for the cytoscape topic nodes; a random position is generated for each topic as well
- Returns
dict<topic_id, (color, position)>
- get_word_probabilities()[source]¶
Extract all the word probabilities from the LDA model for each cluster
- Returns
dict<topic_id, [word_probabilities]>
- merge_cluster(cluster_ids)[source]¶
Merge the clusters selected from the checklist by summing their probabilities row-wise
- Parameters
cluster_ids – cluster ids selected in checklist
- Returns
model with merged clusters
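Read as an operation on a topic-word matrix (one row per cluster), the merge can be sketched like this. The matrix layout, the function name, and the decision to place the merged row at the position of the lowest selected id are assumptions; the sketch also does not renormalize the merged row.

```python
def merge_rows(topic_word, cluster_ids):
    """Sum the selected rows element-wise and drop the originals.

    The merged row takes the place of the lowest selected cluster id
    (an arbitrary choice made for this sketch).
    """
    ids = sorted(set(cluster_ids))
    n_words = len(topic_word[0])
    merged = [sum(topic_word[i][j] for i in ids) for j in range(n_words)]

    result = []
    for i, row in enumerate(topic_word):
        if i == ids[0]:
            result.append(merged)
        elif i not in ids:
            result.append(list(row))
    return result


topic_word = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
merged_model = merge_rows(topic_word, [0, 2])
```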
- read_data()[source]¶
Read the original dataset with the BeautifulSoup XML extractor
- Returns
data stored in a BeautifulSoup instance
- remove_document(value)[source]¶
Remove documents by clicking on the delete document button
- Parameters
value – id of the document node
- Returns
updated dictionary of nodes, by removing the marked document
- reset_settings()[source]¶
Reset the view and the LDA class itself to the original cluster number
- set_indexed_topic_node_df()[source]¶
topic dataframe in indexed format: for faster searching
- Returns
indexed df
- update_lda()[source]¶
Re-cluster with the new cluster number (triggered by the “update” button in the app).
Update the relevant class components after cluster merge / delete and re-clustering steps
Visualization methods:¶
- visualization.build_cluster_merge_list()[source]¶
Prepare the checklist for the merge cluster functionality
- Returns
dash checklist content with proper label
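Dash checklists accept their options as a list of dictionaries with `label` and `value` keys. The sketch below builds such a list from per-cluster top terms; the helper name, the label wording, and the input mapping are assumptions.

```python
def build_merge_checklist_options(top_terms):
    """Build checklist options from {cluster_id: [top terms]}.

    Returns a list of {'label': ..., 'value': ...} dicts, ordered by
    cluster id.
    """
    return [
        {"label": f"Cluster {cluster_id}: {', '.join(terms)}", "value": cluster_id}
        for cluster_id, terms in sorted(top_terms.items())
    ]


options = build_merge_checklist_options({1: ["graph", "node"], 0: ["topic", "model"]})
```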
- visualization.build_cluster_summary_view()[source]¶
Prepare the data for the cluster summary view
- Returns
data in a format that can be fed to the cluster summary view
- visualization.get_doc_topic_edges()[source]¶
- Returns
Invisible edges between document nodes and their dominant topic
- visualization.get_graph_cos_sim_edges()[source]¶
- Returns
edges between documents based on cosine similarity
- visualization.get_graph_document_nodes()[source]¶
- Returns
graph document nodes with the proper coloring in cytoscape format
- visualization.get_graph_topic_nodes()[source]¶
Extract the color from the class settings (it will be defined in the stylesheet); the label is the top 4 words
- Returns
topic nodes in cytoscape format
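In Dash Cytoscape, a node element is a dictionary with a `data` entry (including an `id`) and an optional `position` entry. The conversion below is a sketch: the `topic-` id prefix, the helper name, and carrying the color inside `data` are assumptions.

```python
def topic_nodes_to_elements(topic_nodes, top_terms):
    """Convert {topic_id: (color, (x, y))} into cytoscape node elements.

    The label is the space-joined list of top terms for the topic.
    """
    elements = []
    for topic_id, (color, (x, y)) in sorted(topic_nodes.items()):
        elements.append(
            {
                "data": {
                    "id": f"topic-{topic_id}",  # hypothetical id scheme
                    "label": " ".join(top_terms[topic_id]),
                    "color": color,
                },
                "position": {"x": x, "y": y},
            }
        )
    return elements


elements = topic_nodes_to_elements(
    {0: ("#1f77b4", (10, 20))},
    {0: ["topic", "model", "graph", "node"]},
)
```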