This is the first report in the Data Science Capstone of the Data Science Specialization by Johns Hopkins University. The purpose is to showcase my progress towards building a text prediction model and a Shiny app that lets users type a phrase and predicts which word comes next.
The motivation for this project is to build a model that can predict the next word from what a user has typed so far, and to make it available through a simple Shiny app.

In this report, I download the corpus data provided, sample a small percentage of it, and clean it by removing stop words and other unnecessary text. I then perform a brief exploration of word frequencies using n-gram models to better understand the data and gain insights for building the model.
The data for this project was obtained from the course website through this link. It comes from HC Corpora, a collection of corpora for various languages that is freely available to download. See corpora-info.md for more information about the data.
There were several languages provided, but I only used the English files:

- en_US.blogs.txt: blogs text
- en_US.news.txt: news feed text
- en_US.twitter.txt: twitter text data

I start by loading the libraries, using the package check function from this article to check whether they are installed.
## list all packages
packages <- c(
  "tidyverse",    # a collection of R packages designed for data science
  "tidytext",     # text mining library that follows the tidy data principle
  "here",         # easy file referencing in project-oriented workflows
  "tm",           # a framework for text mining applications within R
  "RColorBrewer", # color palettes for R plots
  "wordcloud2",   # word clouds as html widgets
  "htmlwidgets",  # html widgets in R
  "webshot",      # take screenshots of web pages from R
  "kableExtra",   # simple table generator to make table outputs nicer
  "feather"       # a fast on-disk format for data frames powered by Apache Arrow
)

## If a package is installed, it will be loaded. If any are not, the missing
## package(s) will be installed from CRAN and then loaded.
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

webshot::install_phantomjs() # to save wordcloud plot

theme_set(theme_classic())

# Declare location of current report
here::i_am("report/milestone-report.Rmd")
I coded a helper function download_data, which downloads the data from the course website URL for the given locale, places it under the data folder in a subfolder with the specified name, and removes the other data files. Please refer to the appendix for the code.
# download_data("en_US", "original")

# save the data path
data_path <- here("data/original")
file_names <- list.files(data_path) # list all files

# save paths for our text data files
blogs_txt_path   <- here(data_path, file_names[1])
news_txt_path    <- here(data_path, file_names[2])
twitter_txt_path <- here(data_path, file_names[3])

file_names

## [1] "blogs.txt"   "news.txt"    "twitter.txt"
I also created a function file_info, which takes the file names as an argument and outputs a tibble with the following information: file name, file size, line count, word count, and the length of the longest line. Refer to the appendix for the code and for a bash alternative.
# show information of our text data.
file_info(file_names) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| file_name   | size     | line_count | word_count | max_line |
|-------------|----------|------------|------------|----------|
| blogs.txt   | 200.4 MB | 899,288    | 37,334,690 | 40,833   |
| news.txt    | 196.3 MB | 1,010,242  | 34,372,720 | 11,384   |
| twitter.txt | 159.4 MB | 2,360,148  | 30,374,206 | 140      |
From this, we see that the largest file is the blogs file, followed by news and then twitter.
Since the data is fairly large, as observed from the file information, I've decided to sample it to speed up my analysis and the development of the initial model. I have three sources: news, blogs and tweets. Since news text will logically follow better English standards of spelling and grammar, I will sample more from news (15%), 5% from blogs, and 1% from twitter. This can be done probabilistically with rbinom in R. The code for sample_file is in the appendix below.
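As a minimal illustration of the idea on a toy vector (not the project data), each line is kept independently with the given probability:

# toy example of rbinom-based sampling: keep each element with probability `prob`
set.seed(2021)
toy_lines <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
keep <- rbinom(n = length(toy_lines), size = 1, prob = 0.3) == 1
toy_lines[keep] # roughly 30% of the lines survive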
set.seed(2021)

# prob is probability of sampling
sample_file(file_names[1], blogs_txt_path, prob = 0.05)
sample_file(file_names[2], news_txt_path, prob = 0.15)
sample_file(file_names[3], twitter_txt_path, prob = 0.01)

sample_path <- here("data/sampled")
list.files(sample_path)

## [1] "sub_blogs.txt"   "sub_news.txt"    "sub_twitter.txt"
The sampled data is written to a new folder called sampled under the data folder, and the text files have the "sub_" prefix.
With our sampled data, we can now read in the text files individually using a helper function readTxtFile, which uses the readLines function. The text files are then combined.
sampled_file_names <- list.files(sample_path)

blogs_txt   <- readTxtFile(here(sample_path, sampled_file_names[1]))
news_txt    <- readTxtFile(here(sample_path, sampled_file_names[2]))
twitter_txt <- readTxtFile(here(sample_path, sampled_file_names[3]))

combined_txt <- c(twitter_txt, news_txt, blogs_txt)

# clear from memory
rm("twitter_txt", "news_txt", "blogs_txt")
Now it's time to clean the data to prepare it for analysis. Most of the standard cleaning tasks (such as lowercasing and stripping punctuation during tokenization) are already performed by the unnest_tokens function from tidytext, which makes my job easier.
# Load data for stop words
data(stop_words)
head(stop_words)

bad_words <- get_bad_words() %>%
  rename('word' = text)
head(bad_words, 3)

## save bad words file
write_feather(bad_words, here("data/bad_words.feather"))
With the profane and stop word data ready, we can begin building our n-gram models. Note that the cleaning happens as we build the n-gram tibbles. Below is a showcase of the functionality of unnest_tokens in tidytext:
<- "Alice was beginning to get very tired of sitting by her sister
string on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'"
tibble(line = 1:length(string), text=string) %>%
unnest_tokens(word, text) %>%
head()
Here is how we get bigrams:
tibble(line = 1:length(string), text = string) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  head()
To get higher-order n-grams, all you need to do is increase the value of n.
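For example, a quick check with trigrams on the same sample string:

# trigrams: same call, just n = 3
tibble(line = 1:length(string), text = string) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  head()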
ngram_tb <- tibble(line = 1:length(combined_txt), text = combined_txt)
head(ngram_tb)
unigram_tb <- ngram_tb %>%
  unnest_tokens(word, text) %>%                      # turn our text into individual words
  anti_join(stop_words, by = "word") %>%             # remove stop words
  anti_join(bad_words, by = "word") %>%              # remove profane words
  filter(!str_detect(word, "\\d+")) %>%              # filter digits
  mutate_at("word", str_replace, "[[:punct:]]", "")  # remove punctuation

head(unigram_tb)
bigram_tb <- ngram_tb %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  mutate_at("bigram", str_replace, "[[:punct:]]", "") %>%
  filter(!str_detect(bigram, "\\d+")) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word1 %in% bad_words$word,
         !word2 %in% bad_words$word)

head(bigram_tb)
trigram_tb <- ngram_tb %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  mutate_at("trigram", str_replace, "[[:punct:]]", "") %>%
  filter(!str_detect(trigram, "\\d+")) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  filter(!word1 %in% bad_words$word,
         !word2 %in% bad_words$word,
         !word3 %in% bad_words$word)

head(trigram_tb)
quadgram_tb <- ngram_tb %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>%
  mutate_at("quadgram", str_replace, "[[:punct:]]", "") %>%
  filter(!str_detect(quadgram, "\\d+")) %>%
  separate(quadgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word,
         !word4 %in% stop_words$word) %>%
  filter(!word1 %in% bad_words$word,
         !word2 %in% bad_words$word,
         !word3 %in% bad_words$word,
         !word4 %in% bad_words$word)

head(quadgram_tb)
if (!dir.exists(here("data/ngrams"))) {
  dir.create(here("data/ngrams"))
}

# save ngrams with feather
ngrams_path <- here("data/ngrams")
write_feather(unigram_tb,  here(ngrams_path, "unigrams.feather"))
write_feather(bigram_tb,   here(ngrams_path, "bigrams.feather"))
write_feather(trigram_tb,  here(ngrams_path, "trigrams.feather"))
write_feather(quadgram_tb, here(ngrams_path, "quadgram.feather"))
With the data cleaned, we can start analyzing word frequencies. To visualize the data, the helper functions plot_top and wordcloud_plot were created to plot the top n words and a word cloud (see the appendix for their code).
unigram_tb %>%
  plot_top(15)
ggsave(here("figs/unigram_bar.png"), width = 12, height = 8)

file_name <- "unigram_wc.png"
wordcloud_plot(unigram_tb, file_name, 150)
knitr::include_graphics(path.expand(here("figs", file_name)))
For single terms, the words time, people and day are very common. We also see the word 'rt', which likely stands for retweet on twitter and should be added to the stop words. We also see the words game, team, and night, which all point towards sports games, and the word 'lol', which I believe comes from twitter as well.

The word cloud shows us more words, and can prove useful for identifying stop words that we should add later on.
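As a rough sketch of how those corpus-specific terms could be handled later (my own addition, not part of the current pipeline), the stop word list can be extended before the anti_join step:

# extend the stop word lexicon with corpus-specific terms such as "rt" and "lol"
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("rt", "lol"), lexicon = "custom")
)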
bigram_tb <- bigram_tb %>%
  unite(bigram, word1, word2, sep = " ")

bigram_tb %>%
  plot_top(15)
ggsave(here("figs/bigram_bar.png"), width = 12, height = 8)

file_name <- "bigram_wc.png"
wordcloud_plot(bigram_tb, file_name, 100)
knitr::include_graphics(path.expand(here("figs", file_name)))
For bigrams, we see that some US locations are very common, along with terms related to the government (white house, vice president, supreme court), which probably came from the news corpus. We also see the terms happy birthday and mothers day, which likely belong to the twitter corpus.
trigram_tb <- trigram_tb %>%
  unite(trigram, word1, word2, word3, sep = " ")

trigram_tb %>%
  plot_top(15)
ggsave(here("figs/trigram_bar.png"), width = 12, height = 8)

file_name <- "trigram_wc.png"
wordcloud_plot(trigram_tb, file_name, 100)
knitr::include_graphics(path.expand(here("figs", file_name)))
Analysing the trigrams, we see that the earlier term "mothers day" is now part of "happy mothers day", which shows the relationship between the terms. Holidays like Cinco de Mayo and St. Patrick's Day are also popping up, along with names like Gov. Chris Christie, President Barack Obama, and Martin Luther King.
For computers to understand our data, we need to convert it into a machine-readable form. In natural language processing (NLP), one common technique is TF-IDF, which stands for term frequency-inverse document frequency.

Here, each line of text is treated as a document and the words in it as tokens. The counts can be stored as a DocumentTermMatrix (DTM) or a TermDocumentMatrix (TDM), depending on whether documents correspond to rows or columns. TF-IDF then provides a measure to weight the importance of different words.
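As a side illustration (not part of the pipeline below, and assuming the ngram_tb tibble built earlier), tidytext's bind_tf_idf() can compute these weights directly from word counts per line:

# count words per line (document) and add tf, idf and tf_idf columns
word_tfidf <- ngram_tb %>%
  unnest_tokens(word, text) %>%
  count(line, word) %>%
  bind_tf_idf(word, line, n)

head(word_tfidf)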
Using the tm package, I can cast my data frame into a DTM.
my_dtm <- ngram_tb %>%
  unnest_tokens(word, text) %>%
  count(line, word) %>%
  cast_dtm(line, word, n)

my_dtm
## <<DocumentTermMatrix (documents: 265350, terms: 153707)>>
## Non-/sparse entries: 5813631/40780338819
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
Our DTM has a total of 265,350 documents (lines) and 153,707 terms. It is also essentially 100% sparse, which can cause problems for our model and will have to be addressed later on.
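One likely way to address this later (a sketch only; the 0.999 threshold is a placeholder, not a tuned value) is to drop the rarest terms with tm's removeSparseTerms():

# keep only terms that appear in at least ~0.1% of documents
smaller_dtm <- removeSparseTerms(my_dtm, sparse = 0.999)
smaller_dtm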
The analysis helped me understand what kind of information my sampled data captures. With a DTM ready, the next step is to get more data for testing and validation, then build the model. After that, I will start building the Shiny app so users can try the data product. Throughout the process, I will be studying more from the Tidy Text Mining book and researching suitable algorithms to use.
The steps are summarized below:

- Prepare train, test, and validation datasets
- Train and evaluate the text prediction model
- Build the Shiny app
- Create the slide deck
I plan to generate another random sample from the news dataset to validate my prediction model. I chose the news dataset because it should contain the most proper English text.
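A minimal sketch of what that validation sample could look like, reusing the helpers above (the seed and the 2% rate are placeholders, not decided values):

# draw a fresh sample of news lines for validation
set.seed(2022)
news_lines <- readTxtFile(news_txt_path)
validation_idx <- rbinom(n = length(news_lines), size = 1, prob = 0.02) == 1
validation_txt <- news_lines[validation_idx]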
# locale options: en_US, de_DE, ru_RU and fi_FI
# outdir = directory name
download_data <- function(locale, outdir) {
  here::i_am("report/milestone-report.Rmd")
  data_path <- here("data")

  if (dir.exists(here(data_path, outdir))) {
    print("directory already exists")
  } else {
    options(timeout = 200) # to prevent timeout error
    # download data into temp file
    temp <- tempfile()
    download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  temp)
    # unzip tempfile and remove
    unzip(zipfile = temp, exdir = data_path)
    unlink(temp)
  }

  # save directory of extracted zip file
  final_path <- here(data_path, "final")

  # create outdir directory
  dir.create(here(data_path, outdir))

  # grab files matching the locale (e.g. en_US)
  old_names <- list.files(
    path = final_path,
    pattern = paste0("^", locale),
    recursive = TRUE
  )

  # provide new names
  new_names <- c("blogs.txt", "news.txt", "twitter.txt")

  # rename and remove old ones
  file.rename(from = file.path(final_path, old_names),
              to = file.path(here(data_path, outdir), new_names))

  # remove final folder from zip file
  unlink(here(data_path, "final"), recursive = TRUE)
}
# read a text file into a character vector
readTxtFile <- function(path) {
  con <- file(path, "r")
  text <- readLines(con, skipNul = T)
  close(con)
  return(text)
}
# Creates a table of information about the given text files
file_info <- function(names) {
  # file size in MB
  size <- file.info(here(data_path, names))$size / (2 ** 20)

  # word count (from the command-line wc utility)
  total_words_bash <-
    system("wc -w ../data/original/*.txt", intern = TRUE)
  regexp <- "[[:digit:]]+"
  word_count <-
    unlist(str_split(str_extract(total_words_bash, regexp), " ")[1:3])

  line_count <- c()
  max_line <- c()

  for (name in names) {
    file <- readTxtFile(here(data_path, name))
    num_lines <- length(file)
    longest_line <- as.numeric(summary(nchar(file))["Max."])

    line_count <- c(line_count, num_lines)
    max_line <- c(max_line, longest_line)
  }

  tb <- tibble(
    "file_name" = names,
    "size" = paste(round(size, 1), "MB"),
    "line_count" = line_count,
    "word_count" = as.integer(word_count),
    "max_line" = as.integer(max_line)
  ) %>%
    mutate_if(is.numeric, list(~ prettyNum(., big.mark = ",")))

  return(tb)
}
# -w gives word count
# -c gives byte count
# -l gives line count
echo " lines words bytes"
wc -l -w -c ../data/original/*
# sample lines from a text file with the given probability and write them out
sample_file <- function(filename, filepath, prob) {
  set.seed(2021)
  con <- file(filepath, "r")
  file <- readLines(con, skipNul = T)
  len <- length(file)
  sub_file <- file[rbinom(n = len, size = 1, prob = prob) == 1]
  close(con)

  sample_path <- here("data/sampled")
  if (!dir.exists(sample_path)) {
    dir.create(sample_path)
  }

  new_file_path <- paste0(sample_path, "/sub_", filename)
  if (!file.exists(new_file_path)) {
    out <- file(new_file_path, "w")
    writeLines(sub_file, con = out)
    close(out)
  }
}
# plots top n words
plot_top <- function(tibble, top_num) {
  tibble %>%
    rename(ngram = colnames(tibble)[2]) %>%
    count(ngram, sort = TRUE) %>%
    slice(1:top_num) %>%
    mutate(ngram = reorder(ngram, n)) %>%
    ggplot(aes(n, ngram)) +
    geom_col() +
    labs(y = NULL)
}
# word cloud of the top n words
wordcloud_plot <- function(tibble, file_name, top_num = 100) {
  wordcloud <- tibble %>%
    rename(ngram = colnames(tibble)[2]) %>%
    count(ngram, sort = TRUE) %>%
    slice(1:top_num) %>%
    wordcloud2(size = 0.7, color = 'random-dark', minRotation = 0, maxRotation = 0)

  saveWidget(wordcloud, "tmp.html", selfcontained = F)
  webshot("tmp.html", here("10_DataScienceCapstone/figs", file_name),
          delay = 5, vwidth = 1000, vheight = 800)
  unlink(here("10_DataScienceCapstone/report", "tmp_files"), recursive = TRUE)
  unlink(here("10_DataScienceCapstone/report", "tmp.html"))
}