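In this example, we first need to clean the unstructured data and then convert it to a data matrix so that topic modeling can be applied to it. In general, tweets retrieved from Twitter contain several characters we are not interested in, at least in the first stage of the data cleaning process.
For example, after fetching the tweets we get strange characters such as "<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably emoticons, so to clean the data we remove them with the following script. The code is also available in the file bda/part1/collect_data/cleaning_data.R.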
rm(list = ls(all = TRUE)); gc() # Clears the global environment
# Some tweets
# [1] "I’m not a big fan of turkey but baked Mac & cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
# [2] "@Jayoh30 Like no special sauce on a big mac. HOW"
### We are interested in the text - Let’s clean it!
# We first convert the encoding of the text from latin1 to ASCII;
# sub = "" drops every byte that cannot be converted (iconv is vectorized)
df$text <- iconv(df$text, "latin1", "ASCII", sub = "")
# Create a function to clean tweets
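clean.text <- function(tx) {
  # Drop links: "htt" plus up to 20 following characters
  tx <- gsub("htt.{1,20}", " ", tx, ignore.case = TRUE)
  # Drop punctuation except "#", plus "@" and "RT" (ignore.case also hits "rt" inside words)
  tx <- gsub("[^#[:^punct:]]|@|RT", " ", tx, perl = TRUE, ignore.case = TRUE)
  # Drop digits
  tx <- gsub("[[:digit:]]", " ", tx)
  # Collapse runs of spaces into a single space
  tx <- gsub(" {1,}", " ", tx)
  # Reduce leading and trailing whitespace to a single space
  tx <- gsub("^\\s+|\\s+$", " ", tx)
  return(tx)
}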
clean_tweets <- lapply(df$text, clean.text)
# Cleaned tweets
# [1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
# [1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "
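The cleaning steps, and the conversion to a data matrix mentioned at the start, can be tried out on a small scale without the Twitter data. The following is a minimal sketch under stated assumptions, not the tutorial's own script: the two sample tweets are made up and already ASCII-only, the name clean_text_sketch is illustrative, and the function is a variant of clean.text that removes whole URLs, removes "RT" only as a standalone word, and trims the ends completely. The document-term matrix is built with base R only, as one simple way to get a matrix of word counts per tweet.

```r
# Hypothetical ASCII-only sample tweets (not taken from the collected data).
tweets <- c(
  "@Jayoh30 Like no special sauce on a big mac. HOW",
  "RT @user: Big Mac sauce recipe http://t.co/abc123 #bigmac 2016"
)

# Variant of the cleaning function; the name clean_text_sketch is illustrative.
clean_text_sketch <- function(tx) {
  tx <- gsub("http\\S+", " ", tx, perl = TRUE)        # whole URL, up to the next whitespace
  tx <- gsub("[^#[:^punct:]]", " ", tx, perl = TRUE)  # punctuation except "#"
  tx <- gsub("\\bRT\\b", " ", tx, perl = TRUE)        # standalone retweet marker only
  tx <- gsub("[[:digit:]]", " ", tx)                  # digits
  tx <- gsub(" +", " ", tx)                           # collapse repeated spaces
  trimws(tx)                                          # strip leading and trailing whitespace
}

cleaned <- clean_text_sketch(tweets)
print(cleaned)

# Build a small document-term matrix: one row per tweet, one column per word.
tokens <- strsplit(tolower(cleaned), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab
print(dtm)
```

Each row of dtm counts how often every vocabulary word occurs in one cleaned tweet, which is the kind of input a topic model typically expects. For real data, a dedicated text mining package would likely be a better fit than this hand-rolled version.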