# Week 2 Demo
## Setup
In this section, we'll take a quick look at how we process text data when conducting analyses of word frequency. We'll be using some randomly simulated text.
First we load the packages that we'll be using:
```{r, warning = F, message = F}
library(stringi) # to generate random text
library(stringr) # to facilitate working with strings
library(dplyr) # tidyverse package for wrangling data
library(tidytext) # package for 'tidy' manipulation of text data
library(ggplot2) # package for visualizing data
library(scales) # additional package for formatting plot axes
library(kableExtra) # package for displaying data in html format (relevant for formatting this worksheet mainly)
```
## Tokenizing
We'll first generate some random text to see what happens when we tokenize it.
```{r}
lipsum_text <- data.frame(text = stri_rand_lipsum(1, start_lipsum = TRUE))
head(lipsum_text$text)
```
We can then tokenize with the `unnest_tokens()` function in `tidytext`.
```{r}
tokens <- lipsum_text %>%
  unnest_tokens(word, text)
head(tokens)
```
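Note that, by default, `unnest_tokens()` also lowercases the text and strips punctuation. A tiny made-up example (the sentence below is just for illustration) makes this visible:
```{r}
# a made-up sentence: unnest_tokens() lowercases and drops punctuation by default
toy <- data.frame(text = "Lorem Ipsum, DOLOR sit amet!")
toy %>%
  unnest_tokens(word, text)
```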
Now we'll generate a larger dataset, simulating 5000 observations (rows) of random Latin text strings.
```{r}
## Varying total words example
lipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))
```
We'll then add another column called "week". This will be our unit of analysis.
```{r}
# make some weeks one to ten
lipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))
```
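We can quickly check that the weeks are evenly distributed; each of the ten weeks should contain 500 rows:
```{r}
table(lipsum_text$week)
```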
Now we'll simulate a trend where we see an increasing number of words as the weeks go by. Don't worry too much about this code, which is a little more complex; I share it here in case it's of interest.
```{r}
for(i in 1:nrow(lipsum_text)) {
  week <- lipsum_text[i, 2]
  # pad the text with extra words; the amount of padding scales with the week number
  morewords <-
    paste(rep("more lipsum words", times = sample(1:100, 1) * week), collapse = " ")
  lipsum_words <- lipsum_text[i, 1]
  # paste() (rather than paste0()) keeps a space between the padding and the original text
  new_lipsum_text <- paste(morewords, lipsum_words)
  lipsum_text[i, 1] <- new_lipsum_text
}
```
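The loop above is all we need for this demo. Purely as an aside, roughly the same thing could be written without an explicit loop using `dplyr`'s `rowwise()`; this is just a sketch of an alternative and isn't evaluated here:
```{r, eval = FALSE}
# sketch of a loop-free alternative (not run): pad each row's text with extra words,
# with the amount of padding increasing with the week number
lipsum_text <- lipsum_text %>%
  rowwise() %>%
  mutate(text = paste(
    paste(rep("more lipsum words", times = sample(1:100, 1) * week), collapse = " "),
    text
  )) %>%
  ungroup()
```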
And we can see that as each week goes by, we have more and more text.
```{r}
lipsum_text %>%
  unnest_tokens(word, text) %>%
  group_by(week) %>%
  dplyr::count(word) %>%
  select(week, n) %>%
  distinct() %>%
  ggplot() +
  geom_bar(aes(week, n), stat = "identity") +
  labs(x = "Week", y = "n words") +
  scale_x_continuous(breaks = pretty_breaks())
```
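As an aside, a more direct way to get the total number of tokens per week is simply to count rows after tokenizing; a minimal sketch of that check:
```{r}
# total tokens per week, counted directly from the tokenized data
lipsum_text %>%
  unnest_tokens(word, text) %>%
  dplyr::count(week, name = "total_words")
```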
We can then do the same but with a trend where each week sees a decreasing number of words.
```{r}
# simulate decreasing words trend
lipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))
# make some weeks one to ten
lipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))
for(i in 1:nrow(lipsum_text)) {
  week <- lipsum_text[i, 2]
  # this time the amount of padding shrinks as the week number increases
  morewords <- paste(rep("more lipsum words", times = sample(1:100, 1) * 1/week), collapse = " ")
  lipsum_words <- lipsum_text[i, 1]
  new_lipsum_text <- paste(morewords, lipsum_words)
  lipsum_text[i, 1] <- new_lipsum_text
}
lipsum_text %>%
  unnest_tokens(word, text) %>%
  group_by(week) %>%
  dplyr::count(word) %>%
  select(week, n) %>%
  distinct() %>%
  ggplot() +
  geom_bar(aes(week, n), stat = "identity") +
  labs(x = "Week", y = "n words") +
  scale_x_continuous(breaks = pretty_breaks())
```
Now let's check out the top frequency words in this text.
```{r}
lipsum_text %>%
  unnest_tokens(word, text) %>%
  dplyr::count(word, sort = T) %>%
  top_n(5) %>%
  knitr::kable(format = "html") %>%
  kable_styling("striped", full_width = F)
```
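Note that `top_n()` still works but has been superseded in more recent versions of <tt>dplyr</tt>; `slice_max()` does the same job. A quick sketch of the equivalent:
```{r}
# the same top-five words using slice_max() instead of top_n()
lipsum_text %>%
  unnest_tokens(word, text) %>%
  dplyr::count(word, sort = TRUE) %>%
  slice_max(n, n = 5)
```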
We're going to look at the frequencies for the word "sed" and then normalize these by dividing by the total word frequencies for each week.
First we need to get total word frequencies for each week.
```{r}
lipsum_totals <- lipsum_text %>%
  group_by(week) %>%
  unnest_tokens(word, text) %>%
  dplyr::count(word) %>%
  mutate(total = sum(n)) %>%
  distinct(week, total)
```
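A quick look at the first few rows confirms we have one row per week containing the total number of words for that week:
```{r}
head(lipsum_totals)
```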
```{r}
# let's look for "sed"
lipsum_sed <- lipsum_text %>%
  group_by(week) %>%
  unnest_tokens(word, text) %>%
  filter(word == "sed") %>%
  dplyr::count(word) %>%
  mutate(total_sed = sum(n)) %>%
  distinct(week, total_sed)
```
Then we can join these two dataframes with the `left_join()` function, joining by the "week" column. We can then pipe the joined data into a plot.
```{r}
lipsum_sed %>%
  left_join(lipsum_totals, by = "week") %>%
  mutate(sed_prop = total_sed/total) %>%
  ggplot() +
  geom_line(aes(week, sed_prop)) +
  labs(x = "Week", y = "Proportion sed words") +
  scale_x_continuous(breaks = pretty_breaks())
```
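One caveat worth flagging: because `lipsum_sed` is the left-hand table in the join, any week in which "sed" never appeared would drop out of the plot entirely. That isn't an issue with this simulated text, but if it were, you could join in the other direction and fill missing counts with zero. A sketch of that alternative (not run, and assuming the <tt>tidyr</tt> package is installed):
```{r, eval = FALSE}
# keep every week, filling weeks with no "sed" occurrences with zero
lipsum_totals %>%
  left_join(lipsum_sed, by = "week") %>%
  mutate(total_sed = tidyr::replace_na(total_sed, 0),
         sed_prop = total_sed / total)
```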
## Regexing
You'll notice in the worksheet on word frequencies that, at one point, the call to `str_detect()` contains the string "[a-z]". This is called a __character class__, and character classes are written with square brackets like `[]`.
Other character classes are helpfully listed in the [vignette](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) for the <tt>stringr</tt> package; what follows is adapted from those materials on regular expressions. A short example appears after the list.
* `[abc]`: matches a, b, or c.
* `[a-z]`: matches every character between a and z (in Unicode code point order).
* `[^abc]`: matches anything except a, b, or c.
* `[\^\-]`: matches `^` or `-`.
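A quick illustration of character classes in action (the strings here are made up for demonstration):
```{r}
x <- c("apple", "pear", "fig")
str_detect(x, "[abc]")                   # does the string contain an a, b, or c?
str_extract_all("p-e-a-r 99", "[^a-z]")  # everything that is not a lowercase letter
```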
Several other patterns match multiple characters. These include:
* `\d`: matches any digit; the opposite of this is `\D`, which matches any character that is not a decimal digit.
```{r}
str_extract_all("1 + 2 = 3", "\\d+")
str_extract_all("1 + 2 = 3", "\\D+")
```
* `\s`: matches any whitespace; its opposite is `\S`, which matches any non-whitespace character.
```{r}
(text <- "Some \t badly\n\t\tspaced \f text")
str_replace_all(text, "\\s+", " ")
```
* `^`: matches start of the string
```{r}
x <- c("apple", "banana", "pear")
str_extract(x, "^a")
```
* `$`: matches end of the string
```{r}
x <- c("apple", "banana", "pear")
str_extract(x, "^a$")
```
* `^` then `$`: exact string match
```{r}
x <- c("apple", "banana", "pear")
str_extract(x, "^apple$")
```
Hold up: what do the plus signs etc. mean?
* `+`: 1 or more.
* `*`: 0 or more.
* `?`: 0 or 1.
So if you can tell me why this output makes sense, you're getting there!
```{r}
str_extract_all("1 + 2 = 3", "\\d+")[[1]]
str_extract_all("1 + 2 = 3", "\\D+")[[1]]
str_extract_all("1 + 2 = 3", "\\d*")[[1]]
str_extract_all("1 + 2 = 3", "\\D*")[[1]]
str_extract_all("1 + 2 = 3", "\\d?")[[1]]
str_extract_all("1 + 2 = 3", "\\D?")[[1]]
```
### Some more regex resources:
1. Regex crossword: [https://regexcrossword.com/](https://regexcrossword.com/)
2. RegexOne: [https://regexone.com/](https://regexone.com/)
3. R4DS, [chapter 14](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions)