Data Analysis Insight: 01/20/25

Monday, January 20, 2025

Text Mining and Representing Tweets

I've been investigating text mining through the recommended book (Kwartler, 2017). One worked example processes Delta Air Lines customer support tweets: the text is lowercased, and stopwords and punctuation are removed. From the processed block of tweets a 'Term Document Matrix', or TDM, is made. This is a table with one row per word and one column per document, where each cell holds how often that word appears in that document. Hierarchical clustering of the words in the TDM produces the dendrogram in Fig 1.1, which essentially shows the way certain words cluster together. The 'Height' axis represents the "distance" in the matrix at which words, or groups of words, are joined.
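To make the steps above concrete for myself, here is a minimal sketch in plain Python rather than the R used in the book. Everything in it is an assumption for illustration: the four sample tweets are made up (they are not the Delta data), the stopword list is a tiny hand-picked one, and the names `preprocess`, `term_document_matrix` and `cluster` are my own. It uses Euclidean distance with complete linkage, which may not match the settings behind the book's figures.

```python
import math
import re
from collections import Counter

# Hypothetical sample tweets, made up for illustration (not the book's data).
tweets = [
    "Sorry for the delay, please DM your confirmation number.",
    "Please DM your confirmation number so we can help.",
    "We apologize for the delay on your flight!",
    "Sorry, your flight delay is due to weather.",
]

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"for", "the", "your", "so", "we", "can", "on", "is", "to", "due"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [w for w in words if w not in STOPWORDS]


def term_document_matrix(docs):
    """Return (terms, matrix) with one row per term, one column per document."""
    counts = [Counter(preprocess(d)) for d in docs]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]


def euclidean(a, b):
    """Euclidean distance between two term rows of the matrix."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(terms, matrix):
    """Agglomerative clustering with complete linkage.

    Returns a list of merges (left_terms, right_terms, height), where
    height is the distance at which the two groups were joined.
    """
    clusters = [(i,) for i in range(len(terms))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(
                    euclidean(matrix[a], matrix[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        left = [terms[k] for k in clusters[i]]
        right = [terms[k] for k in clusters[j]]
        merges.append((left, right, d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


terms, matrix = term_document_matrix(tweets)
merges = cluster(terms, matrix)
for left, right, height in merges:
    print(f"{height:.2f}  {left} + {right}")
```

Each printed line is one join in the tree, and the number in front is what a dendrogram would draw as the 'Height' of that join. Words that always appear in the same tweets have identical rows in the matrix, so they are joined at height 0.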

Figure 1.1. Dendrogram from Delta Air Lines tweets

Another way to look at this is by colouring the branches. Representing the dendrogram in a circle gives a different perspective too (Fig 1.2).

Figure 1.2. Circular dendrogram from Delta Air Lines tweets

I should reiterate that whilst these dendrograms formed the basis of my understanding, they are worked examples from Ted Kwartler's book. I just thought they looked cool!

References

Kwartler, T. (2017). Text mining in practice with R. John Wiley & Sons.
