Monday, January 20, 2025

Text Mining and Representing Tweets

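I've been investigating text mining through the recommended book (Kwartler, 2017). One worked example processes Delta airlines support tweets: the text is lowercased, and stopwords and punctuation are removed. From the processed block of tweets a 'Term Document Matrix', or TDM, is then built. This is a matrix with one row per word and one column per tweet, where each cell holds the number of times that word appears in that tweet. Clustering the words in the TDM hierarchically gives the dendrogram in Fig 1.1, which essentially shows the way certain words cluster together. The 'Height' axis represents the "distance" at which words, or groups of words, are joined: the lower the join, the more similar their pattern of use across the tweets.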
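To make the TDM idea concrete, here is a minimal sketch in plain Python rather than the R used in the book. The four tweets, the tiny stopword list, and the function names are all made up for illustration, and Euclidean distance is an assumption about how the "distance" between words might be measured, not necessarily the book's choice.

```python
import math
import string
from collections import Counter

# Hypothetical example tweets (not taken from the book's data set).
tweets = [
    "Sorry for the delay, please DM your confirmation number.",
    "Please DM your confirmation number and we will check the delay.",
    "We apologize for the lost bag!",
    "Sorry about your lost bag.",
]

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"for", "the", "your", "and", "we", "will", "about"}


def clean(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]


def term_document_matrix(docs):
    """Return (terms, matrix): one row per term, one column per document."""
    counts = [Counter(clean(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    return terms, [[count[term] for count in counts] for term in terms]


def euclidean(a, b):
    """Euclidean distance between two term rows (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


terms, tdm = term_document_matrix(tweets)
for term, row in zip(terms, tdm):
    print(f"{term:<14}{row}")

# Words that appear in exactly the same tweets are distance 0 apart.
bag, lost = tdm[terms.index("bag")], tdm[terms.index("lost")]
delay = tdm[terms.index("delay")]
print("bag vs lost: ", euclidean(bag, lost))
print("bag vs delay:", euclidean(bag, delay))
```

In this toy data 'bag' and 'lost' always occur together, so they would join at the very bottom of a dendrogram, whereas 'bag' and 'delay' never share a tweet and would only be joined much higher up.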

Figure 1.1. Dendrogram from Delta airlines tweets

Another way to look at this is by colouring the branches. Representing the dendrogram in a circle gives a different perspective too (Fig 1.2). 


Figure 1.2. Circular dendrogram from Delta airlines tweets

I should reiterate that whilst these dendrograms formed the basis of my understanding, they are worked examples from Ted Kwartler's book. I just thought they looked cool!

References

Kwartler, T. (2017). Text mining in practice with R. John Wiley & Sons.

Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional methods of vectorisation first proposed by Bengio et al in 2003, who developed the first langua...