Text Analysis with R

Learn the basics of using the R programming language for text analysis

In this workshop you will learn the basics of text analysis with the R programming language.

A basic understanding of the R programming language is recommended for this workshop.

Estimated workshop length: 2 hours


Setup Instructions

In preparation for this workshop, you will need a Posit account (formerly an RStudio Cloud account) and a new RStudio project open on Posit Cloud. Follow the steps below to get set up.

Packages to Install

    
    # Install the packages (only needed once)
    install.packages("tidyverse")
    install.packages("tokenizers")
    
    # Load the packages (needed in every new R session)
    library(tidyverse)
    library(tokenizers)
    

Workshop Tasks

Task Set #1

  1. Create a variable called “text” using the following code:

     text <- paste("You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking")   
    
  2. Create a variable containing the above text tokenized into words (remember that the tokenizer returns a list, with the words stored in its first element)
  3. Find out how long your new list of words is using the length function
  4. Turn your list of words into a data frame (remember to make it into a table first)
  5. Arrange your data frame so the most common words are listed first
  6. When you are done type “GOT IT!!” into the chat
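
One possible way to work through the steps above is sketched here. It is not the only solution: the variable names `words`, `word_table`, and `word_counts` and the column names `word` and `count` are choices made for this sketch, and `paste()` is used only to split the long string across several lines.

```r
library(tokenizers)
library(dplyr)  # also attached by library(tidyverse)

text <- paste(
  "You will rejoice to hear that no disaster has accompanied the",
  "commencement of an enterprise which you have regarded with such evil",
  "forebodings. I arrived here yesterday, and my first task is to assure",
  "my dear sister of my welfare and increasing confidence in the success",
  "of my undertaking"
)

# tokenize_words() returns a list with one character vector per input string
words <- tokenize_words(text)

length(words)       # length of the outer list (one input string)
length(words[[1]])  # number of word tokens in the text

# Count each distinct word, then convert the table to a data frame
word_table <- table(word = words[[1]])
word_counts <- as.data.frame(word_table,
                             responseName = "count",
                             stringsAsFactors = FALSE)

# Most common words first
word_counts <- arrange(word_counts, desc(count))
head(word_counts)
```

Note that `tokenize_words()` lowercases the text and strips punctuation by default, so "You" and "you" are counted as the same word.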

Task Set #2

  1. Tokenize the paragraph in the “text” variable into sentences and pull out just the list
  2. Tokenize your sentences into lists of words
  3. Use the “sapply” function to find the length of each list of words
  4. When you are done type “TOKENIZED!!” into the chat
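
The sentence-level steps above might look like the sketch below. To keep it short, it uses a made-up two-sentence sample rather than the workshop paragraph; the names `sentences`, `sentence_words`, and `sentence_lengths` are assumptions of this sketch.

```r
library(tokenizers)

# Short made-up sample; in the workshop, `text` holds the full paragraph
text <- "I arrived here yesterday. My first task is to assure my dear sister of my welfare."

# tokenize_sentences() also returns a list, so [[1]] pulls out the
# character vector of sentences for our single input string
sentences <- tokenize_sentences(text)[[1]]

# Tokenizing a vector of sentences gives one vector of words per sentence
sentence_words <- tokenize_words(sentences)

# sapply() applies length() to each element and returns the word counts
sentence_lengths <- sapply(sentence_words, length)
sentence_lengths
```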

Task Set #3

  1. Use the code below to load in the full text of the book “Frankenstein”

     text <- paste(readLines("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/frankenstein.txt"),collapse = "\n")   
    
  2. Using your code from before, tokenize the book into words and then turn it into a data frame arranged by count
  3. Use the code below to load in the word frequency dataset

     wordfreq <- read_csv("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/wordfrequency.csv")   
    
  4. Join the two datasets together to get frequency values for each word in the book
  5. Filter your results to remove the stopwords. (Try out different frequency values to see more or less common words)
  6. Type “STOPWORDS ELIMINATED” into the chat when you are done
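
The join and filter steps above can be sketched offline with stand-in data. Everything about `toy_freq` is hypothetical: the column names `word` and `frequency`, the values, and the 0.1 cutoff are made up for illustration and may not match the real `wordfreq` dataset, so check its actual columns with `names(wordfreq)` before adapting this.

```r
library(tokenizers)
library(dplyr)

# Short made-up sample so the sketch runs without a download
text <- "The monster saw the creature and the monster fled"

word_counts <- as.data.frame(table(word = tokenize_words(text)[[1]]),
                             responseName = "count",
                             stringsAsFactors = FALSE) %>%
  arrange(desc(count))

# HYPOTHETICAL stand-in for wordfreq (invented columns and values)
toy_freq <- data.frame(
  word = c("the", "and", "saw", "fled"),
  frequency = c(5, 2.5, 0.02, 0.001),
  stringsAsFactors = FALSE
)

# inner_join() keeps only the words found in both data frames
joined <- inner_join(word_counts, toy_freq, by = "word")

# Treat very frequent words as stopwords and drop them
filtered <- filter(joined, frequency < 0.1)
filtered
```

Using `left_join()` instead would keep words that have no match in the frequency table, with `NA` in the frequency column.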

Task Set #4

  1. Make a function that takes in a variable containing text and outputs a data frame filtered to remove stopwords.
  2. Try out your new function by running it on the text variable.
  3. (optional) Try out the function on some of these other books using the code below to build the variables.

     dracula_text <- paste(readLines("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/dracula.txt"),collapse = "\n")  
    
     prideandprejudice_text <- paste(readLines("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/prideandprejudice.txt"),collapse = "\n")  
    
     gatsby_text <- paste(readLines("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/greatgatsby.txt"),collapse = "\n")  
    
  4. When you are done type “TEXT ANALYSIS MASTERED!!!” in the chat.
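
A function along the lines described above could be sketched as follows. The name `remove_stopwords`, its extra arguments `freq_table` and `max_frequency`, and the `word` and `frequency` columns are all assumptions of this sketch; the frequency table is passed in as an argument so the example runs on its own with a hypothetical stand-in table instead of the workshop dataset.

```r
library(tokenizers)
library(dplyr)

# Tokenize `text`, count the words, join on a frequency table, and drop
# words whose frequency is at or above `max_frequency`
remove_stopwords <- function(text, freq_table, max_frequency = 0.1) {
  words <- tokenize_words(text)[[1]]
  counts <- as.data.frame(table(word = words),
                          responseName = "count",
                          stringsAsFactors = FALSE)
  counts %>%
    inner_join(freq_table, by = "word") %>%
    filter(frequency < max_frequency) %>%
    arrange(desc(count))
}

# HYPOTHETICAL frequency table (invented columns and values)
toy_freq <- data.frame(
  word = c("the", "and", "saw", "fled"),
  frequency = c(5, 2.5, 0.02, 0.001),
  stringsAsFactors = FALSE
)

result <- remove_stopwords("The monster saw the creature and the monster fled",
                           toy_freq)
result
```

With the workshop data, the call would pass the book text and the loaded frequency table instead, once the column names inside the function are adjusted to match that table.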

Follow Up Material

If you are looking to continue enhancing your knowledge of R, check out our other R workshops or try out one of the options below!

YaRrr! The Pirate’s Guide to R

Programming Historian's R Text Analysis

W3Schools R Tutorial


This workshop is brought to you by the Brock University Digital Scholarship Lab. For a listing of our upcoming workshops, go to Experience BU if you are a Brock affiliate, or to our Eventbrite page if you are an external attendee.