I've become fascinated by NLP recently - although, jarringly, I still primarily translate those initials as 'neuro-linguistic programming', a "pseudoscientific approach to communication, personal development and psychotherapy" (Wikipedia), first proposed in 1975. No, we are of course talking about Natural Language Processing - which essentially covers the myriad ways of extracting meaning from human language, usually in the form of text.
I was looking at the TripAdvisor reviews for a pub customer recently and wondering how I could represent them in an easy-to-digest format. For sure I could show a table of ratings - but they do that on the site anyway, and raw numbers (a) are sometimes hard to process and (b) wouldn't necessarily give you the full flavour of the reviews. Time for some NLP! No, not neuro-linguistic programming. Natural language processing, yes... we've been over that.
The basic idea was to scrape the data, process it, then represent it in a meaningful way. Literal screen-scraping is tedious but not difficult; once I had a goodly number of reviews I edited/mashed them into a usable format: there were only a few hundred, so it was easiest to do this manually given the raw structure. The next step involved removing stopwords such as 'the', 'and' or 'of', followed by labelling the remaining words with sentiment from a lexicon (I used the standard R one). The lexicon in this case contained a large number of words with a sentiment score associated with each one - positive words like 'brilliant' or 'good' would have a score of +1. Negative words like 'broken' or 'useless' would score -1. So what we then had was a 'stopped' bunch of reviews scored for sentiment.
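As a rough sketch of those two steps, here is roughly what stopword removal and lexicon labelling might look like. It is written in Python rather than the R used for the real thing, and the stopword list, the lexicon and the function names are all tiny made-up stand-ins purely for illustration, not the actual lexicon or code.

```python
import re

# Hypothetical toy stand-ins for a real stopword list and sentiment lexicon.
STOPWORDS = {"the", "and", "of", "a", "was", "but", "is", "it"}
LEXICON = {"brilliant": 1, "good": 1, "broken": -1, "useless": -1}

def tokenize(review):
    """Lowercase a review and split it into words."""
    return re.findall(r"[a-z']+", review.lower())

def remove_stopwords(words):
    """Drop common words that carry no sentiment of their own."""
    return [w for w in words if w not in STOPWORDS]

def label_sentiment(words):
    """Pair each word with its lexicon score; words not in the lexicon are dropped."""
    return [(w, LEXICON[w]) for w in words if w in LEXICON]

review = "The food was brilliant but the jukebox is broken"
scored = label_sentiment(remove_stopwords(tokenize(review)))
print(scored)
```

With this toy lexicon only 'brilliant' and 'broken' survive, scored +1 and -1 respectively; a real lexicon would of course catch far more words.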
At this point you could take a measure of review-level sentiment - weighing up the sentiment of each review by adding the positive and negative scores, and perhaps totalling the positives against the negatives. Are the reviews good or bad overall? There's your answer. But what I wanted to do was visualise the scores in some way, so I decided to make a word cloud. By representing the positive words in green and the negative words in red, with the size of the words representing the frequency of their occurrence in classic word cloud style, I made what I think is a pleasingly striking representation of TripAdvisor reviews that says so much more than that table of data.
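For the curious, the scoring and colouring logic can be sketched in a few lines. Again this is Python rather than R, the lexicon and the sample reviews are invented for illustration, and the function names are hypothetical. It only works out the data a word cloud would need (frequency for size, colour for sentiment); the actual drawing is left to whichever plotting library you prefer.

```python
from collections import Counter

# Hypothetical toy lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"brilliant": 1, "good": 1, "broken": -1, "useless": -1}

def review_score(words):
    """Sum the lexicon scores of the words in one review (unknown words count as 0)."""
    return sum(LEXICON.get(w, 0) for w in words)

def cloud_entries(reviews):
    """Map each sentiment word to (frequency, colour) across all reviews."""
    counts = Counter(w for words in reviews for w in words if w in LEXICON)
    return {
        w: (n, "green" if LEXICON[w] > 0 else "red")
        for w, n in counts.items()
    }

# Made-up reviews, already tokenised and with stopwords removed.
reviews = [
    ["food", "brilliant", "staff", "good"],
    ["jukebox", "broken", "beer", "good"],
    ["service", "useless"],
]

scores = [review_score(r) for r in reviews]   # one score per review
overall = sum(scores)                          # positive means good on balance
entries = cloud_entries(reviews)               # size and colour for each word
print(scores, overall, entries)
```

Here 'good' appears twice, so it would be drawn larger than the other words, and in green.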