Wednesday, September 27, 2017

Word Cloud with R and Twitter


I'm teaching R in the morning to my Advanced Programming class, so I thought: let's do some exploration with Twitter and show the power of R and its packages! In this example, which I adapted from another source (links at the bottom), I plot word clouds by searching Twitter by hashtag. The tutorial I followed led me into a number of problems, which I had to solve before I finally got my word clouds!

Step 1: Installation using GitHub instead of CRAN

https://cran.r-project.org/web/packages/twitteR/README.html

In this step we install from GitHub, because the version installed with install.packages("twitteR") was causing authorization errors with Twitter. The following paragraph is copied from the README.html page referenced above.

"twitteR is an R package which provides access to the Twitter API. Most functionality of the API is supported, with a bias towards API calls that are more useful in data analysis as opposed to daily interaction.

Getting Started

  • Please read the user vignette, which admittedly can get a bit out of date
  • Create a Twitter application at http://dev.twitter.com. Make sure to give the app read, write and direct message authority.
  • Take note of the following values from the Twitter app page: "API key", "API secret", "Access token", and "Access token secret".
  • You can use the CRAN version (stable) via the standard install.packages("twitteR") or use the github version. To do the latter:
  • install.packages(c("devtools", "rjson", "bit64", "httr"))
  • Make sure to restart your R session at this point
  • library(devtools)
  • install_github("geoffjentry/twitteR")
  • At this point you should have twitteR installed and can proceed:
  • library(twitteR)
  • setup_twitter_oauth("API key", "API secret")
    • The API key and API secret are from the Twitter app page above. This will lead you through httr's OAuth authentication process. I recommend you look at the man page for Token in httr for an explanation of how it handles caching.
  • You should be ready to go!
  • If you have any questions or issues, check out the mailing list"

Step 2: Get your Twitter API key and API secret

Create a new App

Add an application name and a dummy URL, agree to the terms, and create your application.


Once it is created you can find the API key and API secret in the menu in blue. You can use these in your R script to authorize this app to be used from your account.


Step 3: Install required packages
# install the necessary packages
# install.packages("twitteR")  # already installed in Step 1, so skip it
install.packages("wordcloud")
install.packages("tm")

Step 4: Load the required libraries
library("twitteR")
library("wordcloud")
library("tm")

Step 5: Run the following lines

#necessary file for Windows
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
consumer_key <- 'your key'
consumer_secret <- 'your secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)
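
If you plan to share the script with a class, it may be safer not to paste the four values into the file at all. Here is a minimal sketch of reading them from environment variables instead. The variable names (TWITTER_CONSUMER_KEY and so on) are my own invention, not something twitteR or Twitter requires; you would set them yourself, for example in ~/.Renviron.

```r
# Hypothetical environment variable names (an assumption, pick your own).
vars <- c(consumer_key    = "TWITTER_CONSUMER_KEY",
          consumer_secret = "TWITTER_CONSUMER_SECRET",
          access_token    = "TWITTER_ACCESS_TOKEN",
          access_secret   = "TWITTER_ACCESS_SECRET")

# Look up each variable; anything that is not set comes back as "".
creds <- vapply(vars, Sys.getenv, character(1), unset = "")

# Names of the credentials that are still missing.
missing <- names(creds)[!nzchar(creds)]

if (length(missing) > 0) {
  message("Missing credentials: ", paste(missing, collapse = ", "))
} else if (requireNamespace("twitteR", quietly = TRUE)) {
  # Same call as above, but fed from the environment.
  twitteR::setup_twitter_oauth(creds[["consumer_key"]],
                               creds[["consumer_secret"]],
                               creds[["access_token"]],
                               creds[["access_secret"]])
}
```

If any of the four values is not set, the script only prints which ones are missing and never tries to authenticate.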


Step 6: Start your analysis
You can experiment by replacing the hashtag in the following call. n=500 asks Twitter for up to 500 tweets containing the hashtag #MUFC.


r_stats <- searchTwitter("#MUFC", n=500)
# should return up to 500 tweets (fewer if the search finds fewer)
length(r_stats)
# save text
r_stats_text <- sapply(r_stats, function(x) x$getText())


Step 7: Fixing the emoticons problem
The twitteR package seems a little old, so I was having trouble with emoticons in the tweets returned by the searchTwitter query. Some R functions could not handle the encoding of the emoticons, so I found a workaround online.

The solution is not neat: it simply converts the emoticons into hex values, which those functions can handle, but the hex values then get printed in the word cloud, so this needs more work. It would be interesting to plot an 'emoticloud' instead of a word cloud using Twitter hashtags. Maybe an assignment for the students 😈! Notice the use of an emoticon while discussing a problem with emoticons!

# Fixing the emoticons problem: sub = "byte" replaces each non-ASCII byte with its <xx> hex code
r_stats_text <- data.frame(text = iconv(r_stats_text, from = "latin1", to = "ASCII", sub = "byte"),
                           stringsAsFactors = FALSE)

Step 8: Clean up and Create the Word Cloud
# create corpus from the text column, so that each tweet becomes one document
r_stats_text_corpus <- Corpus(VectorSource(r_stats_text$text))

#clean up
r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower) )
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation)
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removeWords, stopwords())
wordcloud(r_stats_text_corpus)
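
To see which words actually dominate a cloud, it can help to look at the counts behind it. The following is a rough base R sketch of the same clean-up (lowercase, strip punctuation, drop stop words) followed by a frequency table. The sample tweets and the tiny stop word list are made up for illustration; with real data you would start from the tweet text and use the stopwords() list from tm.

```r
# Made-up sample tweets standing in for the real tweet text.
tweets <- c("Love the game #MUFC", "Love this team", "love the game")

# Tiny illustrative stop word list (an assumption; tm::stopwords() is much longer).
stop_words <- c("the", "this", "is", "on")

# Lowercase, strip punctuation, then split each tweet into words.
cleaned <- tolower(gsub("[[:punct:]]", "", tweets))
words <- unlist(strsplit(cleaned, "\\s+"))

# Drop empty strings and stop words.
words <- words[nzchar(words) & !(words %in% stop_words)]

# Count each word and sort so the most frequent comes first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# wordcloud() also accepts words and their frequencies directly.
# Only drawn in an interactive session with the package installed.
if (interactive() && requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.integer(freq), min.freq = 1)
}
```

The first entries of freq are the words that would be drawn largest in the cloud.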


Here is the result of my query. You can see some garbage because the emoticons are being shown as hex numbers. This is only a partial workaround; I need to explore further to remove them completely.
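
One possible way to get rid of the hex garbage is to drop the non-ASCII bytes altogether rather than hex-encode them. This is only a sketch under the assumption that the tweet text arrives as UTF-8; the sample tweets are made up, and I have not tried it against every kind of tweet.

```r
# Made-up sample tweets standing in for the real tweet text (assumed UTF-8).
tweets <- c("Glory Glory \U0001F608 #MUFC", "plain ascii tweet")

# sub = "" drops every byte that has no ASCII equivalent,
# instead of replacing it with a <xx> hex code.
ascii_only <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# Removing an emoticon can leave a double space behind, so squeeze
# runs of whitespace into one space and trim the ends.
clean_tweets <- trimws(gsub("\\s+", " ", ascii_only))
print(clean_tweets)
```

The trade-off is that accented letters and other non-ASCII characters are dropped along with the emoticons, which may or may not matter for a given hashtag.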


Another Result. Searching with the hashtag #muhammad reveals Love and prophet as the two most used words!


This is a lot of fun! Going to experiment some more later! Now gtg, as I have a lecture at 9 am and it's 4:30 am right now!

References:



