Monday, February 10, 2025

Text Mining and Sentiment Analysis in R

Running through the same concepts as before, I started looking at TripAdvisor data for a specific pub in Nottingham - the Malt Cross on St James Street. Their page of reviews can be seen here. I did a simple 'screen scrape' to get the review text, did a bit of work in Excel to bring it into better shape, then went through the pre-processing steps - lowercasing, removing stopwords and removing punctuation. I didn't do any stemming - cutting words back to a common 'stem', so that for instance both 'liked' and 'likes' would be reduced to something like 'lik' to remove duplication - as I wanted whole words for a word cloud. This meant that 'disappointed' and 'disappointing', for example, would be shown separately, but I was prepared to go with that.
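To make the pre-processing steps concrete, here is a minimal sketch in base R. It is not the code I ran: the two reviews and the tiny stopword list are made up for illustration, and `preprocess` is a hypothetical helper name. It simply lowercases the text, strips punctuation and drops stopwords, with no stemming.

```r
# Toy reviews standing in for the scraped text (made up for illustration).
reviews <- c("The beer was great, and the staff were lovely!",
             "Disappointing food; the service was slow.")

# A tiny hand-picked stopword list (an assumption); a real run would use a fuller one.
stop_words <- c("the", "was", "and", "were", "a", "of")

# Hypothetical helper: returns the cleaned tokens for one review.
preprocess <- function(text, stop_words) {
  text  <- tolower(text)                        # lowercase everything
  text  <- gsub("[[:punct:]]", " ", text)       # replace punctuation with spaces
  words <- strsplit(text, "[[:space:]]+")[[1]]  # split on runs of whitespace
  words <- words[nzchar(words)]                 # drop any empty tokens
  words[!words %in% stop_words]                 # drop stopwords
}

tokens <- lapply(reviews, preprocess, stop_words = stop_words)
print(tokens)
```

Because no stemming is applied, 'disappointing' comes through as a whole word, which is what a word cloud needs.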

The "qdap" library in R (polarity() lives there rather than in "tm") has a lexicon of around 6,500 words described as positive or negative that can be used for sentiment analysis with text mining. The polarity() function scans through the data and assigns a value between -1 and +1 to the words in each sentence, then combines these into an overall positivity or negativity score for that sentence. Apparently people are on average positive in their comments (even on the Internet!), so these numbers should be scaled so that the mean value is zero. After scaling, the score distribution was as shown in figure 2.1. Slightly more positive than negative, I'd say.
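The idea of scoring and then scaling can be sketched in a few lines of base R. This is a simplification under stated assumptions, not what polarity() itself does: the five-word lexicon, the toy documents and the `score_doc` helper are all made up, and the score is just the sum of word polarities divided by the square root of the word count.

```r
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones.
lexicon <- c(great = 1, lovely = 1, friendly = 1,
             disappointing = -1, slow = -1)

# Tokenised toy reviews (made up for illustration).
docs <- list(c("beer", "great", "staff", "lovely"),
             c("disappointing", "food", "service", "slow"),
             c("friendly", "pub", "slow", "service", "great", "beer"))

# Simplified score: sum of word polarities divided by sqrt(word count).
score_doc <- function(words, lexicon) {
  hits <- lexicon[words]                    # NA where a word is not in the lexicon
  sum(hits, na.rm = TRUE) / sqrt(length(words))
}

scores <- vapply(docs, score_doc, numeric(1), lexicon = lexicon)

# Centre and scale so the mean is 0 and the standard deviation is 1.
scaled <- as.numeric(scale(scores))

print(scores)
print(scaled)
```

Centring matters because the raw scores lean positive; after scaling, a value above zero means 'more positive than the typical review' rather than just 'positive'.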

 

Figure 2.1. Frequency distribution (scaled) for polarity of documents


Taking individual words, I constructed a 'word cloud' (figure 2.2). Here we see positive words from the reviews in green and negative ones in red. The size of each word corresponds to how often it appears in the reviews.

Figure 2.2. Word cloud for positive and negative sentiments
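The data behind a cloud like this is just a table of words with a frequency and a colour. The sketch below builds that table in base R; the tokens and the positive and negative word lists are made up for illustration and stand in for the real reviews and lexicon.

```r
# Toy tokens pooled across reviews (made up for illustration).
words <- c("great", "friendly", "great", "slow", "lovely",
           "great", "disappointing", "slow", "friendly")

# Hypothetical word lists standing in for a real sentiment lexicon.
positive <- c("great", "friendly", "lovely")
negative <- c("slow", "disappointing")

# Count occurrences; the count drives the word's size in the cloud.
freq <- sort(table(words), decreasing = TRUE)

cloud <- data.frame(word = names(freq),
                    freq = as.integer(freq),
                    stringsAsFactors = FALSE)

# Green for positive words, red for negative, grey for anything else.
cloud$colour <- ifelse(cloud$word %in% positive, "green",
                ifelse(cloud$word %in% negative, "red", "grey"))

print(cloud)
```

A plotting package can then draw the cloud from these three columns. With the "wordcloud" package, for instance, I would expect to pass the words, the frequencies and the colour vector with ordered.colors = TRUE, though that call is not shown or tested here.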






Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional methods of vectorisation first proposed by Bengio et al in 2003, who developed the first langua...