# Week 4 Demo
## Setup
First, we'll load the packages we'll be using in this week's brief demo.
```{r, message = F, warning = F}
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(tidytext)
library(stringdist)
library(corrplot)
library(janeaustenr)
```
## Character-based similarity
A first measure of text similarity is at the level of characters. We can look *for the last time* (I promise) at the example from the lecture and see how similarity compares.
We'll store two sentences as character objects. Both are thoughts imagined from our classes.
```{r}
a <- "We are all very happy to be at a lecture at 11AM"
b <- "We are all even happier that we don’t have two lectures a week"
```
According to the [stringdist](https://cran.r-project.org/web/packages/stringdist/stringdist.pdf) package documentation, the longest common substring is "the longest string that can be obtained by pairing characters from *a* and *b* while keeping the order of characters intact." The corresponding distance is the number of characters left unpaired.
And we can easily get different distance measures by comparing our character objects `a` and `b` like so.
```{r}
## longest common substring distance
stringdist(a, b,
           method = "lcs")
## levenshtein distance
stringdist(a, b,
           method = "lv")
## jaro distance (Jaro-Winkler with p = 0 reduces to plain Jaro)
stringdist(a, b,
           method = "jw", p = 0)
```
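To see what the `"lcs"` method is counting, here is a minimal base R sketch. It is not the `stringdist` implementation, and `lcs_length()` and `lcs_distance()` are hypothetical helper names of our own. It assumes, following the package documentation, that the distance is the number of characters left unpaired once the longest common substring has been found.

```{r}
## lcs_length() and lcs_distance() are illustrative helpers, not stringdist functions
lcs_length <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  ## tab[i + 1, j + 1] holds the length of the longest paired string for
  ## the first i characters of x and the first j characters of y
  tab <- matrix(0, nrow = length(xs) + 1, ncol = length(ys) + 1)
  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      if (xs[i] == ys[j]) {
        tab[i + 1, j + 1] <- tab[i, j] + 1
      } else {
        tab[i + 1, j + 1] <- max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[length(xs) + 1, length(ys) + 1]
}

## the distance counts the characters left unpaired in both strings
lcs_distance <- function(x, y) {
  nchar(x) + nchar(y) - 2 * lcs_length(x, y)
}

lcs_distance("kitten", "sitting")
```

For "kitten" and "sitting" the longest paired string is "ittn" (4 characters), so the distance is 6 + 7 - 2 * 4 = 5.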
## Term-based similarity
In this second example from the lecture, we're taking the opening line of *Pride and Prejudice* alongside my own versions of this same famous opening line.
We can get the text of Jane Austen very easily thanks to the `janeaustenr` package.
```{r}
## similarity and distance example
text <- janeaustenr::prideprejudice
sentences <- text[10:11]
sentence1 <- paste(sentences[1], sentences[2], sep = " ")
sentence1
```
We're then going to specify our alternative versions of this same sentence.
```{r}
sentence2 <- "Everyone knows that a rich man without wife will want a wife"
sentence3 <- "He's loaded so he wants to get married. Everyone knows that's what happens."
```
Finally, we're going to convert these into a document-feature matrix. We're doing this with the `quanteda` package, which we'll use more and more over the coming weeks as our analyses get more technical.
```{r, warning=F}
## tokenise, dropping punctuation, then remove English stopwords
toks <- tokens(c(sentence1,
                 sentence2,
                 sentence3),
               remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
dfmat <- dfm(toks)
dfmat
```
What do we see here?
Well, it's clear that `text2` and `text3` are not very similar to `text1` at all---they share few words. But we also see that `text2` does at least contain some words that are shared with `text1`, which is the original opening line of Jane Austen's *Pride and Prejudice*.
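To make the layout of a document-feature matrix concrete, here is a minimal base R sketch that builds one by hand. The two toy documents and the object names (`docs`, `toks_list`, `vocab`, `toy_dfm`) are invented for illustration. It does no stopword or punctuation removal, so it is not a replacement for `quanteda`.

```{r}
## toy documents, invented for illustration
docs <- c(d1 = "a rich man wants a wife",
          d2 = "everyone knows a rich man")

## split each document on spaces and collect the vocabulary
toks_list <- strsplit(tolower(docs), " ", fixed = TRUE)
vocab <- sort(unique(unlist(toks_list)))

## one row per document, one column per word, cells are counts
toy_dfm <- t(vapply(toks_list,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
rownames(toy_dfm) <- names(docs)
colnames(toy_dfm) <- vocab
toy_dfm
```

The `dfmat` object above has the same layout, with `text1`, `text2` and `text3` as its rows.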
So, how do we then measure the similarity or distance between these texts?
The first way is simply by correlating the documents' vectors of word counts (mostly ones and zeroes here). We can do this with the `quanteda.textstats` package like so.
```{r}
## correlation
textstat_simil(dfmat, margin = "documents", method = "correlation")
```
And you'll see that this is the same as what we would get if we manipulated the data into tidy format (rows for words and one column of counts per document).
```{r, warning = F}
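## tidy the dfm, then recast it with terms as rows and documents as columns
test <- tidy(dfmat)
test <- cast_dfm(test, term, document, count)
test <- convert(test, to = "data.frame")
## columns 2 to 4 hold the counts for text1, text2 and text3
res <- cor(test[, 2:4])
res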
```
And we see that as expected `text2` is more highly correlated with `text1` than is `text3`.
```{r}
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
```
As for Euclidean distances, we can again use `quanteda.textstats` like so.
```{r}
textstat_dist(dfmat, margin = "documents", method = "euclidean")
```
Or we could define our own function just so we see what's going on behind the scenes.
```{r}
# function for Euclidean distance
euclidean <- function(a,b) sqrt(sum((a - b)^2))
# estimating the distance
euclidean(test$text1, test$text2)
euclidean(test$text1, test$text3)
euclidean(test$text2, test$text3)
```
For Manhattan distance, we could use `quanteda` again.
```{r}
textstat_dist(dfmat, margin = "documents", method = "manhattan")
```
Or we could again define our own function.
```{r}
## manhattan
manhattan <- function(a, b){
  dist <- abs(a - b)
  dist <- sum(dist)
  return(dist)
}
manhattan(test$text1, test$text2)
manhattan(test$text1, test$text3)
manhattan(test$text2, test$text3)
```
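As a cross-check on hand-written functions like these, base R's `dist()` computes the same two distances. Here is a minimal sketch on toy count vectors. The vectors `x` and `y` and the helper names `euclid_toy()` and `manhat_toy()` are invented for illustration, and the definitions are repeated so the chunk runs on its own.

```{r}
## toy count vectors, invented for illustration
x <- c(1, 0, 2, 1)
y <- c(0, 1, 1, 1)

## same definitions as in the text, under different names
euclid_toy <- function(a, b) sqrt(sum((a - b)^2))
manhat_toy <- function(a, b) sum(abs(a - b))

euclid_toy(x, y)
manhat_toy(x, y)

## dist() compares the rows of a matrix
m <- rbind(x, y)
dist(m, method = "euclidean")
dist(m, method = "manhattan")
```

The element-wise differences are 1, -1, 1 and 0, so the Manhattan distance is 3 and the Euclidean distance is the square root of 3.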
And for the cosine similarity, `quanteda` again makes this straightforward.
```{r}
textstat_simil(dfmat, margin = "documents", method = "cosine")
```
But to make clear what's going on here, we could again write our own function.
```{r}
## cosine
cos.sim <- function(a, b)
{
  return(sum(a * b) / sqrt(sum(a^2) * sum(b^2)))
}
cos.sim(test$text1, test$text2)
cos.sim(test$text1, test$text3)
cos.sim(test$text2, test$text3)
```
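One reason to prefer cosine similarity is that it ignores document length and looks only at the mix of words. The sketch below shows this on toy count vectors. The vectors and the helper names `cosine_toy()` and `euclid_len()` are invented for illustration.

```{r}
## toy count vectors, invented for illustration
short_doc <- c(1, 2, 0, 1)
long_doc  <- 3 * short_doc   # same word proportions, three times as long

cosine_toy <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
euclid_len <- function(a, b) sqrt(sum((a - b)^2))

## cosine similarity is 1: the vectors point in the same direction
cosine_toy(short_doc, long_doc)
## Euclidean distance is well above 0, driven by length alone
euclid_len(short_doc, long_doc)
```

So two documents with identical word proportions count as maximally similar under cosine, even though the Euclidean distance between them grows with the difference in length.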
## Complexity
Note: this section borrows notation from the materials for the [`textstat_readability()` function](https://quanteda.io/reference/textstat_readability.html).
We also talked about different document-level measures of text characteristics. One of these is the "complexity" or readability of a text. One of the most frequently used is Flesch's Reading Ease Score (Flesch 1948).
This is computed as:
$$
206.835 - (1.015 \times ASL) - \left(84.6 \times \frac{n_{sy}}{n_{w}}\right)
$$
where $ASL$ is the average sentence length (number of words divided by number of sentences), $n_{sy}$ is the number of syllables, and $n_{w}$ is the number of words.
We can estimate a readability score for our respective sentences as such. The Flesch score from 1948 is the default.
```{r}
textstat_readability(sentence1)
textstat_readability(sentence2)
textstat_readability(sentence3)
```
What do we see here? Higher Flesch scores mean easier text, so the original Austen opening line is marked lower in readability than our more colloquial alternatives.
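To see how the Flesch formula responds to sentence and word length, here is a minimal sketch that applies it to invented counts. `flesch_score()` is a hypothetical helper of our own, not a `quanteda` function. It will not reproduce the scores above, which depend on how `quanteda` counts syllables and sentences.

```{r}
## flesch_score() is an illustrative helper, not part of quanteda
flesch_score <- function(n_words, n_sentences, n_syllables) {
  asl <- n_words / n_sentences   # average sentence length
  206.835 - (1.015 * asl) - (84.6 * (n_syllables / n_words))
}

## invented counts: short sentences and short words
flesch_score(n_words = 20, n_sentences = 2, n_syllables = 25)
## invented counts: one long sentence with longer words
flesch_score(n_words = 20, n_sentences = 1, n_syllables = 35)
```

The first call gives 90.935 and the second 38.485: longer sentences and more syllables per word both pull the score down.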
But there are other measures we might use. You can check these out in the documentation for the function `textstat_readability()`. Below I display a few of these.
One such is the McLaughlin (1969) "Simple Measure of Gobbledygook", which is based on the recurrence of words with 3 syllables or more and is calculated as:
$$
1.043 \times \sqrt{n_{wsy \geq 3} \times \frac{30}{n_{st}}} + 3.1291
$$
where $n_{wsy \geq 3}$ is the number of words with 3 syllables or more and $n_{st}$ is the number of sentences. This measure is regression equation D in McLaughlin's original paper.
We can calculate this for our three sentences as below.
```{r}
textstat_readability(sentence1, measure = "SMOG")
textstat_readability(sentence2, measure = "SMOG")
textstat_readability(sentence3, measure = "SMOG")
```
Here, again, we see that the original Austen sentence has a higher level of complexity (or gobbledygook!).
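To see how SMOG reacts to long words, here is a minimal sketch that applies the formula to invented counts. `smog_score()` is a hypothetical helper of our own, not a `quanteda` function, and it assumes the square root covers the whole product of the polysyllable count and 30 over the sentence count.

```{r}
## smog_score() is an illustrative helper, not part of quanteda
smog_score <- function(n_polysyllables, n_sentences) {
  1.043 * sqrt(n_polysyllables * 30 / n_sentences) + 3.1291
}

## invented counts: one sentence with no words of 3+ syllables
smog_score(n_polysyllables = 0, n_sentences = 1)
## invented counts: one sentence with three words of 3+ syllables
smog_score(n_polysyllables = 3, n_sentences = 1)
```

With no polysyllabic words the score sits at its floor of 3.1291, and it rises with every additional long word per sentence.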