“Computational Text Analysis” (PGSP11584)
-This is the dedicated webpage for the course “Computational Text Analysis” (PGSP11584) at the University of Edinburgh, taught by Christopher Barrie. Go to the Course Overview and Introduction tabs for a course overview and introduction to R.
+This is the dedicated webpage for the course “Computational Text Analysis” (PGSP11584) at the University of Edinburgh, taught by Christopher Barrie. Go to the Course Overview and Introduction tabs for a course overview and introduction to R.
We will be using this online book throughout the course. Each week has a set of essential and recommended readings. The essential readings must be consulted in full prior to the Lecture and Seminar for that week. In addition, you will find online Exercises and examples written in R. This is a “live” book that will be amended and updated during the course itself.
diff --git a/docs/main_files/figure-html/unnamed-chunk-170-1.png b/docs/main_files/figure-html/unnamed-chunk-170-1.png
index f29d2f3..48d04f1 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-170-1.png and b/docs/main_files/figure-html/unnamed-chunk-170-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-171-1.png b/docs/main_files/figure-html/unnamed-chunk-171-1.png
index e19915b..34279eb 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-171-1.png and b/docs/main_files/figure-html/unnamed-chunk-171-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-19-1.png b/docs/main_files/figure-html/unnamed-chunk-19-1.png
index 6bb7478..457941a 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-19-1.png and b/docs/main_files/figure-html/unnamed-chunk-19-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-20-1.png b/docs/main_files/figure-html/unnamed-chunk-20-1.png
index 8ce27ec..7b867aa 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-20-1.png and b/docs/main_files/figure-html/unnamed-chunk-20-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-24-1.png b/docs/main_files/figure-html/unnamed-chunk-24-1.png
index c290dd9..abae781 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-24-1.png and b/docs/main_files/figure-html/unnamed-chunk-24-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-270-1.png b/docs/main_files/figure-html/unnamed-chunk-270-1.png
index dfd0a4a..d3e1b85 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-270-1.png and b/docs/main_files/figure-html/unnamed-chunk-270-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-36-1.png b/docs/main_files/figure-html/unnamed-chunk-36-1.png
index db097bf..e3195b1 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-36-1.png and b/docs/main_files/figure-html/unnamed-chunk-36-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-37-1.png b/docs/main_files/figure-html/unnamed-chunk-37-1.png
index 42ac18a..9db1762 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-37-1.png and b/docs/main_files/figure-html/unnamed-chunk-37-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-38-1.png b/docs/main_files/figure-html/unnamed-chunk-38-1.png
index 9aff76a..2b49b0f 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-38-1.png and b/docs/main_files/figure-html/unnamed-chunk-38-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-42-1.png b/docs/main_files/figure-html/unnamed-chunk-42-1.png
index 88d57ac..98a1ab3 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-42-1.png and b/docs/main_files/figure-html/unnamed-chunk-42-1.png differ
diff --git a/docs/main_files/figure-html/unnamed-chunk-75-1.png b/docs/main_files/figure-html/unnamed-chunk-75-1.png
index 2e2f985..3935aaa 100644
Binary files a/docs/main_files/figure-html/unnamed-chunk-75-1.png and b/docs/main_files/figure-html/unnamed-chunk-75-1.png differ
diff --git a/docs/references-2.html b/docs/references-2.html
index f196683..c45a4c1 100644
--- a/docs/references-2.html
+++ b/docs/references-2.html
@@ -157,6 +157,9 @@
Brier, Alan, and Bruno Hopp. 2011. “Computer Assisted Text Analysis in the Social Sciences.” Quality & Quantity 45 (1): 103–28. https://doi.org/10.1007/s11135-010-9350-8.
+
+Brooke, S. J. 2021. “Trouble in Programmer’s Paradise: Gender-Biases in Sharing and Recognising Technical Knowledge on Stack Overflow.” Information, Communication & Society 24 (14): 2091–2112.
+
Bunea, Adriana, and Raimondas Ibenskas. 2015. “Quantitative Text Analysis and the Study of EU Lobbying and Interest Groups.” European Union Politics 16 (3): 429–55. https://doi.org/10.1177/1465116515577821.
@@ -196,6 +199,9 @@
———. 2013b. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
+
+Haroon, Muhammad, Anshuman Chhabra, Xin Liu, Prasant Mohapatra, Zubair Shafiq, and Magdalena Wojcieszak. 2022. “YouTube, the Great Radicalizer? Auditing and Mitigating Ideological Biases in YouTube Recommendations.” https://doi.org/10.48550/ARXIV.2203.10666.
+
Hopkins, Daniel J., and Gary King. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54 (1): 229–47. https://doi.org/10.1111/j.1540-5907.2009.00428.x.
diff --git a/docs/search.json b/docs/search.json
index 9a569f4..db31413 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -1 +1 @@
-[{"path":"index.html","id":"computational-text-analysis-pgsp11584","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"“Computational Text Analysis” (PGSP11584)","text":" dedicated webpage course Computational Text Analysis” (PGSP11584) University Edinburgh, taught Christopher Barrie. Go Course Overview Introduction tabs course overview introduction R.using online book throughout course. week set essential recommended readings. essential readings must consulted full prior Lecture Seminar week. addition, find online Exercises examples written R. “live” book amended updated course .","code":""},{"path":"index.html","id":"structure","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"0.1 Structure","text":"course structured alternating weeks substantive technical instruction.","code":""},{"path":"index.html","id":"acknowledgments","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"Acknowledgments","text":"compiling course, benefited syllabus materials shared online Margaret Roberts, Alexandra Siegel, Arthur Spirling. Thanks also Justin Grimmer, Margaret Roberts, Brandon Stewart providing early view access forthcoming Text Data book.","code":""},{"path":"course-overview.html","id":"course-overview","chapter":"Course Overview","heading":"Course Overview","text":"recent years, use computational techniques quantitative analysis text exploded. volume quantity text data now access digital age enormous. led social scientists seek new means analyzing text data scale.see text records, form digital traces left social media platforms, archived works literature, parliamentary speeches, video transcripts, print news, can help us answer huge range important questions.","code":""},{"path":"course-overview.html","id":"learning-outcomes","chapter":"Course Overview","heading":"Learning outcomes","text":"course give students training use computational text analysis techniques. course prepare students dissertation work uses textual data provide hands-training use R programming language () Python.course provide venue seminar discussion examples using methods empirical social sciences well lectures technical /statistical dimensions application.","code":""},{"path":"course-overview.html","id":"course-structure","chapter":"Course Overview","heading":"Course structure","text":"using online book ten-week course “Computational Text Analysis” (PGSP11584). chapter contains readings week. book also includes worksheets example code conduct text analysis techniques discuss week.week (partial exception week 1), discussing, alternately, substantive technical dimensions published research empirical social sciences. readings week generally contain two “substantive” readings—, examples application text analysis techniques empirical data—one “technical” reading focuses mainly statistical computational aspects given technique.study first technical aspects analytical approaches , second, substantive dimensions applications. means , discussing readings, able discuss satisfactory given approach illuminating question topic hand.Lectures primarily focused technical dimensions given technique. seminar (Q&) follows give us opportunity study discuss questions social scientific interest, computational text analysis used answer .","code":""},{"path":"course-overview.html","id":"course-pre-preparation","chapter":"Course Overview","heading":"Course pre-preparation","text":"NOTE: lecture Week 2, students complete two introductory R exercises. 
students already done courses Semester 1 need .haven’t done pre-preparation tasks already, , first, consult worksheet , introduction setting understanding basics working R. Second, Ugur Ozdemir provided comprehensive introductory R course Research Training Centre University Edinburgh can follow instructions access .","code":""},{"path":"course-overview.html","id":"reference-sources","chapter":"Course Overview","heading":"Reference sources","text":"several reference texts use course:Wickham, Hadley Garrett Grolemund. R Data Science: https://r4ds..co.nz/Silge, Julia David Robinson. Text Mining R: https://www.tidytextmining.com/\nlearning tidytext, online tutorial used: https://juliasilge.shinyapps.io/learntidytext/\nlearning tidytext, online tutorial used: https://juliasilge.shinyapps.io/learntidytext/(later course) Hvitfelft, Emil Julia Silge. Supervised Machine Learning Text Analysis R: https://smltar.com/several weeks, also referring two textbooks, available online, information retrieval text processing. :Jurafsky, Dan James H. Martin. Speech Language Processing (3rd ed. draft): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Christopher D.,Prabhakar Raghavan, Hinrich Schütze. Introduction Information Retrieval: https://nlp.stanford.edu/IR-book/information-retrieval-book.html","code":""},{"path":"course-overview.html","id":"assessment","chapter":"Course Overview","heading":"Assessment","text":"","code":""},{"path":"course-overview.html","id":"fortnightly-worksheets","chapter":"Course Overview","heading":"Fortnightly worksheets","text":"fortnight, provide one worksheet walks implement different text analysis technique. end worksheets find set questions. buddy someone else class go together.called “pair programming” ’s reason . Firstly, coding can isolating difficult thing—’s good bring friend along ride! Secondly, ’s something don’t know, maybe buddy . saves time. Thirdly, buddy can check code write , vice versa. , means working together produce check something go along.subsequent week’s lecture, pick pair random answer one worksheet’s questions (.e., ~1/3 chance ’re going get picked week). ask walk us code. remember: ’s also fine struggled didn’t get end! encountered obstacle, can work together. matters try.remainder seminar worksheet weeks dedicated seminar discussion discuss readings together.","code":""},{"path":"course-overview.html","id":"fortnightly-flash-talks","chapter":"Course Overview","heading":"Fortnightly flash talks","text":"weeks going tasked coding assignment, ’re hook… selecting pair random (coding pair) talk one readings. pick different pair reading (.e., ~ 1/3 chance ).Don’t let cause great anguish: just want thirty seconds minutes lay least one—preferably two three—criticisms articles required reading week,, want think whether article really answered research question, whether data appropriate answering question, whether method appropriate answering question, whether results show author claims show.remainder seminar flash talk weeks dedicated group work go coding Worksheet together.","code":""},{"path":"course-overview.html","id":"final-assessment","chapter":"Course Overview","heading":"Final assessment","text":"Assessment takes form one summative assessment. 4000 word essay subject choosing (prior approval ). , required select range data sources provide. 
may also suggest data source.asked : ) formulate research question; b) use least one computational text analysis technique studied; c) conduct analysis data source provided; d) write initial findings; e) outline potential extensions analysis.provide code used reproducible (markdown) format assessed substantive content essay contribution (social science part) well demonstrated competency coding text analysis (computational part).","code":""},{"path":"introduction-to-r.html","id":"introduction-to-r","chapter":"Introduction to R","heading":"Introduction to R","text":"section designed ensure familiar R environment.","code":""},{"path":"introduction-to-r.html","id":"getting-started-with-r-at-home","chapter":"Introduction to R","heading":"0.2 Getting started with R at home","text":"Given ’re working home days, ’ll need download R RStudio onto devices. R name programming language ’ll using coding exercises; RStudio IDE (“Integrated Development Environment”), .e., piece software almost everyone uses working R.can download Windows Mac easily free. one first reasons use “open-source” programming language: ’s free everyone can contribute!Services University Edinburgh provided walkthrough needed get started. also break :Install R Mac : https://cran.r-project.org/bin/macosx/. Install R Windows : https://cran.r-project.org/bin/windows/base/.Install R Mac : https://cran.r-project.org/bin/macosx/. Install R Windows : https://cran.r-project.org/bin/windows/base/.Download RStudio Windows Mac : https://rstudio.com/products/rstudio/download/, choosing Free version: people use enough needs.Download RStudio Windows Mac : https://rstudio.com/products/rstudio/download/, choosing Free version: people use enough needs.programs free. Make sure load everything listed operating system R work properly!","code":""},{"path":"introduction-to-r.html","id":"some-basic-information","chapter":"Introduction to R","heading":"0.3 Some basic information","text":"script text file write commands (code) comments.script text file write commands (code) comments.put # character front line text line executed; useful add comments script!put # character front line text line executed; useful add comments script!R case sensitive, careful typing.R case sensitive, careful typing.send code script console, highlight relevant line code script click Run, select line hit ctrl+enter PCR cmd+enter MacTo send code script console, highlight relevant line code script click Run, select line hit ctrl+enter PCR cmd+enter MacAccess help files R functions preceding name function ? (e.g., ?table)Access help files R functions preceding name function ? (e.g., ?table)pressing key, can go back commands used beforeBy pressing key, can go back commands used beforePress tab key auto-complete variable names commandsPress tab key auto-complete variable names commands","code":""},{"path":"introduction-to-r.html","id":"getting-started-in-rstudio","chapter":"Introduction to R","heading":"0.4 Getting Started in RStudio","text":"Begin opening RStudio (located desktop). first task create new script (write commands). 
, click:screen now four panes:Script (top left)Script (top left)Console (bottom left)Console (bottom left)Environment/History (top right)Environment/History (top right)Files/Plots/Packages/Help/Viewer (bottom right)Files/Plots/Packages/Help/Viewer (bottom right)","code":"File --> NewFile --> RScript"},{"path":"introduction-to-r.html","id":"a-simple-example","chapter":"Introduction to R","heading":"0.5 A simple example","text":"Script (top left) write commands R. can try first time writing small snipped code follows:tell R run command, highlight relevant row script click Run button (top right Script) - hold ctrl+enter Windows cmd+enter Mac - send command Console (bottom left), actual evaluation calculations taking place. shortcut keys become familiar quickly!Running command creates object named ‘x’, contains words message.can now see ‘x’ Environment (top right). view contained x, type Console (bottom left):","code":"\nx <- \"I can't wait to learn Computational Text Analysis\" #Note the quotation marks!\nprint(x)## [1] \"I can't wait to learn Computational Text Analysis\"\n# or alternatively you can just type:\n\nx## [1] \"I can't wait to learn Computational Text Analysis\""},{"path":"introduction-to-r.html","id":"loading-packages","chapter":"Introduction to R","heading":"0.6 Loading packages","text":"‘base’ version R powerful able everything , least ease. technical specialized forms analysis, need load new packages.need install -called ‘package’—program includes new tools (.e., functions) carry specific tasks. can think ‘extensions’ enhancing R’s capacities.take one example, might want something little exciting print excited course. Let’s make map instead.might sound technical. beauty packaged extensions R contain functions perform specialized types analysis ease.’ll first need install one packages, can :package installed, need load environment typing library(). Note , , don’t need wrap name package quotation marks. trick:now? Well, let’s see just easy visualize data using ggplot package comes bundled larger tidyverse package.wanted save ’d got making plots, want save scripts, maybe data used well, return later stage.","code":"\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy))"},{"path":"introduction-to-r.html","id":"saving-your-objects-plots-and-scripts","chapter":"Introduction to R","heading":"0.7 Saving your objects, plots and scripts","text":"Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.save individual objects (example x ) environment, run following command (choosing suitable filename):save individual objects (example x ) environment, run following command (choosing suitable filename):save objects (.e. everything top right panel) , run following command (choosing suitable filename):objects can re-loaded R next session running:many file formats might use save output. 
encounter course progresses.","code":"\nsave(x,file=\"myobject.RData\")\nload(file=\"myobject.RData\")\nsave.image(file=\"myfilname.RData\")\nload(file=\"myfilename.RData\")"},{"path":"introduction-to-r.html","id":"knowing-where-r-saves-your-documents","chapter":"Introduction to R","heading":"0.8 Knowing where R saves your documents","text":"home, open new script make sure check set working directory (.e. folder files create saved). check working directory use getwd() command (type Console write script Source Editor):set working directory, run following command, substituting file directory choice. Remember anything following `#’ symbol simply clarifying comment R process .","code":"\ngetwd()\n## Example for Mac \nsetwd(\"/Users/Documents/mydir/\") \n## Example for PC \nsetwd(\"c:/docs/mydir\") "},{"path":"introduction-to-r.html","id":"practicing-in-r","chapter":"Introduction to R","heading":"0.9 Practicing in R","text":"best way learn R use . workshops text analysis place become fully proficient R. , however, chance conduct hands-analysis applied examples fast-expanding field. best way learn . give shot!practice R programming language, look Wickham Grolemund (2017) , tidy text analysis, Silge Robinson (2017).free online book Hadley Wickham “R Data Science” available hereThe free online book Hadley Wickham “R Data Science” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereFor practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:","code":"\nlibrary(learnr)\n\navailable_tutorials() # this will tell you the names of the tutorials available\n\nrun_tutorial(name = \"ex-data-basics\", package = \"learnr\") #this will launch the interactive tutorial in a new Internet browser window"},{"path":"introduction-to-r.html","id":"one-final-note","chapter":"Introduction to R","heading":"0.10 One final note","text":"’ve dipped “R Data Science” book ’ll hear lot -called tidyverse R. essentially set packages use alternative, intuitive, way interacting data.main difference ’ll notice , instead separate lines function want run, wrapping functions inside functions, sets functions “piped” using “pipe” functions, look appearance: %>%.using “tidy” syntax weekly exercises computational text analysis workshops. anything unclear, can provide equivalents “base” R . lot useful text analysis packages now composed ‘tidy’ syntax.","code":""},{"path":"week-1-retrieving-and-analyzing-text.html","id":"week-1-retrieving-and-analyzing-text","chapter":"1 Week 1: Retrieving and analyzing text","heading":"1 Week 1: Retrieving and analyzing text","text":"first task conducting large-scale text analyses gathering curating text information . focus chapters Manning, Raghavan, Schtze (2007) listed . , ’ll find introduction different ways can reformat ‘query’ text data order begin asking questions . often referred computer science natural language processing contexts “information retrieval” foundation many search, including web search, processes.articles Tatman (2017) Pechenick, Danforth, Dodds (2015) focus seminar (Q&). articles get us thinking fundamentals text discovery sampling. 
reading articles think locating texts, sampling , biases might inhere sampling process, texts represent; .e., population phenomenon interest might provide inferences.Questions seminar:access text? need consider ?sample texts?biases need keep mind?Required reading:Tatman (2017)Tatman (2017)Pechenick, Danforth, Dodds (2015)Pechenick, Danforth, Dodds (2015)Manning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlKlaus Krippendorff (2004) (ch. 6)Klaus Krippendorff (2004) (ch. 6)reading:Olteanu et al. (2019)Biber (1993)Barberá Rivero (2015)Slides:Week 1 Slides","code":""},{"path":"week-2-tokenization-and-word-frequencies.html","id":"week-2-tokenization-and-word-frequencies","chapter":"2 Week 2: Tokenization and word frequencies","heading":"2 Week 2: Tokenization and word frequencies","text":"approaching large-scale quantiative analyses text, key task identify capture unit analysis. One commonly used approaches, across diverse analytical contexts, text tokenization. , splitting text word units: unigrams, bigrams, trigrams etc.chapters Manning, Raghavan, Schtze (2007), listed , provide technical introduction task “querying” text according different word-based queries. task studying hands-assignment week.seminar discussion, focusing widely-cited examples research applied social sciences employing token-based, word frequency, analyses large corpora. first, Michel et al. (2011) uses enormous Google books corpus measure cultural linguistic trends. second, Bollen et al. (2021a) uses corpus demonstrate specific change time—-called “cognitive distortion.” examples, attentive questions sampling covered previous weeks. question central back--forths short responses replies articles Michel et al. (2011) Bollen et al. (2021a).Questions:Tokenizing counting: capture?Corpus-based sampling: biases might threaten inference?write critique either Michel et al. (2011) Bollen et al. (2021a), focus ?Required reading:Michel et al. (2011)\nSchwartz (2011)\nMorse-Gagné (2011)\nAiden, Pickett, Michel (2011)\nSchwartz (2011)Morse-Gagné (2011)Aiden, Pickett, Michel (2011)Bollen et al. (2021a)\nSchmidt, Piantadosi, Mahowald (2021)\nBollen et al. (2021b)\nSchmidt, Piantadosi, Mahowald (2021)Bollen et al. (2021b)Manning, Raghavan, Schtze (2007) (ch. 2): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]Klaus Krippendorff (2004) (ch. 5)reading:Rozado, Al-Gharbi, Halberstadt (2021)Alshaabi et al. (2021)Campos et al. (2015)Greenfield (2013)Slides:Week 2 Slides","code":""},{"path":"week-2-demo.html","id":"week-2-demo","chapter":"3 Week 2 Demo","heading":"3 Week 2 Demo","text":"","code":""},{"path":"week-2-demo.html","id":"setup","chapter":"3 Week 2 Demo","heading":"3.1 Setup","text":"section, ’ll quick overview ’re processing text data conducting analyses word frequency. 
’ll using randomly simulated text.First load packages ’ll using:","code":"\nlibrary(stringi) #to generate random text\nlibrary(dplyr) #tidyverse package for wrangling data\nlibrary(tidytext) #package for 'tidy' manipulation of text data\nlibrary(ggplot2) #package for visualizing data\nlibrary(scales) #additional package for formatting plot axes\nlibrary(kableExtra) #package for displaying data in html format (relevant for formatting this worksheet mainly)"},{"path":"week-2-demo.html","id":"tokenizing","chapter":"3 Week 2 Demo","heading":"3.2 Tokenizing","text":"’ll first get random text see looks like ’re tokenizing text.can tokenize unnest_tokens() function tidytext.Now ’ll get larger data, simulating 5000 observations (rows) random Latin text strings.’ll add another column call “weeks.” unit analysis.Now ’ll simulate trend see increasing number words weeks go . Don’t worry much code little complex, share case interest.can see week goes , text.can trend week sees decreasing number words.Now let’s check top frequency words text.’re going check frequencies word “sed” ’re gonna normalize denominating total word frequencies week.First need get total word frequencies week.can join two dataframes together left_join() function ’re joining “week” column. can pipe joined data plot.","code":"\nlipsum_text <- data.frame(text = stri_rand_lipsum(1, start_lipsum = TRUE))\n\nhead(lipsum_text$text)## [1] \"Lorem ipsum dolor sit amet, consectetur dictum ante id urna, quis convallis. Eros ut magnis mauris, eros auctor! Auctor ipsum eu himenaeos interdum. Dictum, litora urna sapien ut morbi, dui at ante at. Lorem vitae ac ut commodo. Id non ridiculus leo erat, tristique inceptos mauris faucibus consectetur erat et. Ex sed at accumsan. Molestie ultricies eu nisl congue duis volutpat ac. 
Lectus, est ornare sed vel dignissim ac parturient nisl vivamus.\"\ntokens <- lipsum_text %>%\n unnest_tokens(word, text)\n\nhead(tokens)## word\n## 1 lorem\n## 2 ipsum\n## 3 dolor\n## 4 sit\n## 5 amet\n## 6 consectetur\n## Varying total words example\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i, 2]\n morewords <-\n paste(rep(\"more lipsum words\", times = sample(1:100, 1) * week), collapse = \" \")\n lipsum_words <- lipsum_text[i, 1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i, 1] <- new_lipsum_text\n}\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\n# simulate decreasing words trend\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\n\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i,2]\n morewords <- paste(rep(\"more lipsum words\", times = sample(1:100, 1)* 1/week), collapse = \" \")\n lipsum_words <- lipsum_text[i,1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i,1] <- new_lipsum_text\n}\n\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word, sort = T) %>%\n top_n(5) %>%\n knitr::kable(format=\"html\")%>% \n kable_styling(\"striped\", full_width = F)## Selecting by n\nlipsum_totals <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word) %>%\n mutate(total = sum(n)) %>%\n distinct(week, total)\n# let's look for \"sed\"\nlipsum_sed <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n filter(word == \"sed\") %>%\n dplyr::count(word) %>%\n mutate(total_sed = sum(n)) %>%\n distinct(week, total_sed)\nlipsum_sed %>%\n left_join(lipsum_totals, by = \"week\") %>%\n mutate(sed_prop = total_sed/total) %>%\n ggplot() +\n geom_line(aes(week, sed_prop)) +\n labs(x = \"Week\", y = \"\n Proportion sed word\") +\n scale_x_continuous(breaks= pretty_breaks())"},{"path":"week-2-demo.html","id":"regexing","chapter":"3 Week 2 Demo","heading":"3.3 Regexing","text":"’ll notice worksheet word frequencies one point set parentheses str_detect() string “[-z]”. called character class use square brackets like [].character classes include, helpfully listed vignette stringr package. follows adapted materials regular expressions.[abc]: matches , b, c.[-z]: matches every character z\n(Unicode code point order).[^abc]: matches anything except , b, c.[\\^\\-]: matches ^ -.Several patterns match multiple characters. include:\\d: matches digit; opposite \\D, matches character \ndecimal digit.\\s: matches whitespace; opposite \\S^: matches start string$: matches end string^ $: exact string matchHold : plus signs etc. 
mean?+: 1 .*: 0 .?: 0 1.can tell output makes sense, ’re getting !","code":"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")## [[1]]\n## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")## [[1]]\n## [1] \" + \" \" = \"\n(text <- \"Some \\t badly\\n\\t\\tspaced \\f text\")## [1] \"Some \\t badly\\n\\t\\tspaced \\f text\"\nstr_replace_all(text, \"\\\\s+\", \" \")## [1] \"Some badly spaced text\"\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a\")## [1] \"a\" NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a$\")## [1] NA NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^apple$\")## [1] \"apple\" NA NA\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")[[1]]## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")[[1]]## [1] \" + \" \" = \"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d*\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D*\")[[1]]## [1] \"\" \" + \" \"\" \" = \" \"\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d?\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D?\")[[1]]## [1] \"\" \" \" \"+\" \" \" \"\" \" \" \"=\" \" \" \"\" \"\""},{"path":"week-2-demo.html","id":"some-more-regex-resources","chapter":"3 Week 2 Demo","heading":"3.3.1 Some more regex resources:","text":"Regex crossword: https://regexcrossword.com/.Regexone: https://regexone.com/R4DS chapter 14","code":""},{"path":"week-3-dictionary-based-techniques.html","id":"week-3-dictionary-based-techniques","chapter":"4 Week 3: Dictionary-based techniques","heading":"4 Week 3: Dictionary-based techniques","text":"extension word frequency analyses, covered last week, -called “dictionary-based” techniques. basic form, analyses use index target terms classify corpus interest based presence absence. technical dimensions type analysis covered chapter section Klaus Krippendorff (2004), issues attending article - Loughran Mcdonald (2011).also reading two examples application techniques Martins Baumard (2020) Young Soroka (2012). , discussing successful authors measuring phenomenon interest (“prosociality” “tone” respectively). Questions sampling representativeness relevant , naturally inform assessments work.Questions:general dictionaries possible; domain-specific?know dictionary accurate?enhance/supplement dictionary-based techniques?Required reading:Martins Baumard (2020)Voigt et al. (2017)reading:Tausczik Pennebaker (2010)Klaus Krippendorff (2004) (pp.283-289)Brier Hopp (2011)Bonikowski Gidron (2015)Barberá et al. (2021)Young Soroka (2012)Slides:Week 3 Slides","code":""},{"path":"week-3-demo.html","id":"week-3-demo","chapter":"5 Week 3 Demo","heading":"5 Week 3 Demo","text":"section, ’ll quick overview ’re processing text data conducting basic sentiment analyses.","code":""},{"path":"week-3-demo.html","id":"setup-1","chapter":"5 Week 3 Demo","heading":"5.1 Setup","text":"’ll first load packages need.","code":"\nlibrary(stringi)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(scales)"},{"path":"week-3-demo.html","id":"happy-words","chapter":"5 Week 3 Demo","heading":"5.2 Happy words","text":"discussed lectures, might find text class’s collective thoughts increase “happy” words time.simulated dataset text split weeks, students, words plus whether word word “happy” 0 means word “happy” 1 means .three datasets: one constant number “happy” words; one increasing number “happy” words; one decreasing number “happy” words. 
called: happyn, happyu, happyd respectively.can see trend “happy” words week student.First, dataset constant number happy words time.now simulated data increasing number happy words.finally decreasing number happy words.","code":"\nhead(happyn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0\nhead(happyu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0\nhead(happyd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'"},{"path":"week-3-demo.html","id":"normalizing-sentiment","chapter":"5 Week 3 Demo","heading":"5.3 Normalizing sentiment","text":"discussed lecture, also know just total number happy words increases, isn’t indication ’re getting happier class time.can begin make inference, need normalize total number words week., simulate data number happy words actually week (happyn dataset ).join data three datasets: happylipsumn, happylipsumu, happylipsumd. datasets random text, number happy words.first also number total words week. 
second two, however, differing number total words week: happylipsumu increasing number total words week; happylipsumd decreasing number total words week., see , ’re splitting week, student, word, whether “happy” word.plot number happy words divided number total words week student datasets, get .get normalized sentiment score–“happy” score–need create variable (column) dataframe sum happy words divided total number words dataframe.can following way.repeat datasets plot see following.plots look like ?Well, first, number total words week number happy words week. divided latter former, get proportion also stable time.second, however, increasing number total words week, number happy words time. means dividing ever larger number, giving ever smaller proportions. , trend decreasing time.third, decreasing number total words week, number happy words time. means dividing ever smaller number, giving ever larger proportions. , trend increasing time.","code":"\nhead(happylipsumn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 taciti 0\nhead(happylipsumu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 maecenas 0\nhead(happylipsumd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 et 0\nhappylipsumn %>%\n group_by(week, student) %>%\n mutate(index_total = n()) %>%\n filter(happy==1) %>%\n summarise(sum_hap = sum(happy),\n index_total = index_total,\n prop_hap = sum_hap/index_total) %>%\n distinct()## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.## # A tibble: 300 × 5\n## # Groups: week, student [300]\n## week student sum_hap index_total prop_hap\n## \n## 1 1 1 904 4471 0.202\n## 2 1 2 977 3970 0.246\n## 3 1 3 974 4452 0.219\n## 4 1 4 1188 5644 0.210\n## 5 1 5 962 4468 0.215\n## 6 1 6 686 2758 0.249\n## 7 1 7 1105 4493 0.246\n## 8 1 8 1182 5373 0.220\n## 9 1 9 733 3578 0.205\n## 10 1 10 1235 4537 0.272\n## # ℹ 290 more rows"},{"path":"week-4-natural-language-complexity-and-similarity.html","id":"week-4-natural-language-complexity-and-similarity","chapter":"6 Week 4: Natural language, complexity, and similarity","heading":"6 Week 4: Natural language, complexity, and similarity","text":"week delving deeply language used text. previous weeks, tried two main techniques rely, different ways, counting words. week, thinking sophisticated techniques identify measure language use, well compare texts . article Gomaa Fahmy (2013) provides overview different approaches. covering technical dimensions lecture.article Urman, Makhortykh, Ulloa (2021) investigates key question contemporary communications research—information exposed online—shows might compare web search results using similarity measures. Schoonvelde et al. 
(2019) article, hand, looks “complexity” texts, compares politicians different ideological stripes communicate.Questions:measure linguistic complexity/sophistication?biases might involved measuring sophistication?applications might similarity measures?Required reading:Urman, Makhortykh, Ulloa (2021)Schoonvelde et al. (2019)Gomaa Fahmy (2013)reading:Voigt et al. (2017)Peng Hengartner (2002)Lowe (2008)Bail (2012)Ziblatt, Hilbig, Bischof (2020)Benoit, Munger, Spirling (2019)Slides:Week 4 Slides","code":""},{"path":"week-4-demo.html","id":"week-4-demo","chapter":"7 Week 4 Demo","heading":"7 Week 4 Demo","text":"","code":""},{"path":"week-4-demo.html","id":"setup-2","chapter":"7 Week 4 Demo","heading":"7.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\nlibrary(quanteda)\nlibrary(quanteda.textstats)\nlibrary(quanteda.textplots)\nlibrary(tidytext)\nlibrary(stringdist)\nlibrary(corrplot)\nlibrary(janeaustenr)"},{"path":"week-4-demo.html","id":"character-based-similarity","chapter":"7 Week 4 Demo","heading":"7.2 Character-based similarity","text":"first measure text similarity level characters. can look last time (promise) example lecture see similarity compares.’ll make two sentences create two character objects . two thoughts imagined classes.know “longest common substring measure” , according stringdist package documentation, “longest string can obtained pairing characters b keeping order characters intact.”can easily get different distance/similarity measures comparing character objects b .","code":"\na <- \"We are all very happy to be at a lecture at 11AM\"\nb <- \"We are all even happier that we don’t have two lectures a week\"\n## longest common substring distance\nstringdist(a, b,\n method = \"lcs\")## [1] 36\n## levenshtein distance\nstringdist(a, b,\n method = \"lv\")## [1] 27\n## jaro distance\nstringdist(a, b,\n method = \"jw\", p =0)## [1] 0.2550103"},{"path":"week-4-demo.html","id":"term-based-similarity","chapter":"7 Week 4 Demo","heading":"7.3 Term-based similarity","text":"second example lecture, ’re taking opening line Pride Prejudice alongside versions famous opening line.can get text Jane Austen easily thanks janeaustenr package.’re going specify alternative versions sentence.Finally, ’re going convert document feature matrix. ’re quanteda package, package ’ll begin using coming weeks analyses ’re performing get gradually technical.see ?Well, ’s clear text2 text3 similar text1 —share words. also see text2 least contain words shared text1, original opening line Jane Austen’s Pride Prejudice., measure similarity distance texts?first way simply correlating two sets ones zeroes. can quanteda.textstats package like .’ll see get manipulated data tidy format (rows words columns 1s 0s).see expected text2 highly correlated text1 text3.\nEuclidean distances, can use quanteda .define function just see ’s going behind scenes.Manhattan distance, use quanteda .define function.cosine similarity, quanteda makes straightforward.make clear ’s going , write function.","code":"\n## similarity and distance example\n\ntext <- janeaustenr::prideprejudice\n\nsentences <- text[10:11]\n\nsentence1 <- paste(sentences[1], sentences[2], sep = \" \")\n\nsentence1## [1] \"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\"\nsentence2 <- \"Everyone knows that a rich man without wife will want a wife\"\n\nsentence3 <- \"He's loaded so he wants to get married. 
Everyone knows that's what happens.\"\ndfmat <- dfm(tokens(c(sentence1,\n sentence2,\n sentence3)),\n remove_punct = TRUE, remove = stopwords(\"english\"))\n\ndfmat## Document-feature matrix of: 3 documents, 21 features (58.73% sparse) and 0 docvars.\n## features\n## docs truth universally acknowledged single man possession good fortune must\n## text1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 1 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0\n## features\n## docs want\n## text1 1\n## text2 1\n## text3 0\n## [ reached max_nfeat ... 11 more features ]\n## correlation\ntextstat_simil(dfmat, margin = \"documents\", method = \"correlation\")## textstat_simil object; method = \"correlation\"\n## text1 text2 text3\n## text1 1.000 -0.117 -0.742\n## text2 -0.117 1.000 -0.173\n## text3 -0.742 -0.173 1.000\ntest <- tidy(dfmat)\ntest <- test %>%\n cast_dfm(term, document, count)\ntest <- as.data.frame(test)\n\nres <- cor(test[,2:4])\nres## text1 text2 text3\n## text1 1.0000000 -0.1167748 -0.7416198\n## text2 -0.1167748 1.0000000 -0.1732051\n## text3 -0.7416198 -0.1732051 1.0000000\ncorrplot(res, type = \"upper\", order = \"hclust\", \n tl.col = \"black\", tl.srt = 45)\ntextstat_dist(dfmat, margin = \"documents\", method = \"euclidean\")## textstat_dist object; method = \"euclidean\"\n## text1 text2 text3\n## text1 0 3.74 4.24\n## text2 3.74 0 3.74\n## text3 4.24 3.74 0\n# function for Euclidean distance\neuclidean <- function(a,b) sqrt(sum((a - b)^2))\n# estimating the distance\neuclidean(test$text1, test$text2)## [1] 3.741657\neuclidean(test$text1, test$text3)## [1] 4.242641\neuclidean(test$text2, test$text3)## [1] 3.741657\ntextstat_dist(dfmat, margin = \"documents\", method = \"manhattan\")## textstat_dist object; method = \"manhattan\"\n## text1 text2 text3\n## text1 0 14 18\n## text2 14 0 12\n## text3 18 12 0\n## manhattan\nmanhattan <- function(a, b){\n dist <- abs(a - b)\n dist <- sum(dist)\n return(dist)\n}\n\nmanhattan(test$text1, test$text2)## [1] 14\nmanhattan(test$text1, test$text3)## [1] 18\nmanhattan(test$text2, test$text3)## [1] 12\ntextstat_simil(dfmat, margin = \"documents\", method = \"cosine\")## textstat_simil object; method = \"cosine\"\n## text1 text2 text3\n## text1 1.000 0.364 0\n## text2 0.364 1.000 0.228\n## text3 0 0.228 1.000\n## cosine\ncos.sim <- function(a, b) \n{\n return(sum(a*b)/sqrt(sum(a^2)*sum(b^2)) )\n} \n\ncos.sim(test$text1, test$text2)## [1] 0.3636364\ncos.sim(test$text1, test$text3)## [1] 0\ncos.sim(test$text2, test$text3)## [1] 0.2279212"},{"path":"week-4-demo.html","id":"complexity","chapter":"7 Week 4 Demo","heading":"7.4 Complexity","text":"Note: section borrows notation materials texstat_readability() function.also talked different document-level measures text characteristics. One “complexity” readability text. One frequently used Flesch’s Reading Ease Score (Flesch 1948).computed :{:}{Flesch’s Reading Ease Score (Flesch 1948).\n}can estimate readability score respective sentences . Flesch score 1948 default.see ? original Austen opening line marked lower readability colloquial alternatives.alternatives measures might use. can check clicking links function textstat_readability(). display .One McLaughlin (1969) “Simple Measure Gobbledygook, based recurrence words 3 syllables calculated :{:}{Simple Measure Gobbledygook (SMOG) (McLaughlin 1969). 
= Nwmin3sy = number words 3 syllables .\nmeasure regression equation D McLaughlin’s original paper.}can calculate three sentences ., , see original Austen sentence higher level complexity (gobbledygook!).","code":"\ntextstat_readability(sentence1)## document Flesch\n## 1 text1 62.10739\ntextstat_readability(sentence2)## document Flesch\n## 1 text1 88.905\ntextstat_readability(sentence3)## document Flesch\n## 1 text1 83.09904\ntextstat_readability(sentence1, measure = \"SMOG\")## document SMOG\n## 1 text1 13.02387\ntextstat_readability(sentence2, measure = \"SMOG\")## document SMOG\n## 1 text1 8.841846\ntextstat_readability(sentence3, measure = \"SMOG\")## document SMOG\n## 1 text1 7.168622"},{"path":"week-5-scaling-techniques.html","id":"week-5-scaling-techniques","chapter":"8 Week 5: Scaling techniques","heading":"8 Week 5: Scaling techniques","text":"begin thinking automated techniques analyzing texts. bunch additional considerations now need bring mind. considerations sparked significant debates… matter means settled.stake ? weeks come, studying various techniques ‘classify,’ ‘position’ ‘score’ texts based features. success techniques depends suitability question hand also higher-level questions meaning. short, ask : way can access underlying processes governing generation text? meaning governed set structural processes? can derive ‘objective’ measures contents given text?readings Justin Grimmer, Roberts, Stewart (2021), Denny Spirling (2018), Goldenstein Poschmann (2019b) (well response replies Nelson (2019) Goldenstein Poschmann (2019a)) required reading Flexible Learning Week.Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer Stewart (2013a)Justin Grimmer Stewart (2013a)Denny Spirling (2018)Denny Spirling (2018)Goldenstein Poschmann (2019b)\nNelson (2019)\nGoldenstein Poschmann (2019a)\nGoldenstein Poschmann (2019b)Nelson (2019)Goldenstein Poschmann (2019a)substantive focus week set readings employ different types “scaling” “low-dimensional document embedding” techniques. article Lowe (2008) provides technical overview “wordfish” algorithm uses political science contexts. article Klüver (2009) also uses “wordfish” different way—measure “influence” interest groups. response article Bunea Ibenskas (2015) subsequent reply Klüver (2015) helps illuminate debates around questions. work Kim, Lelkes, McCrain (2022) gives insight ability text-scaling techniques capture key dimensions political communication bias.Questions:assumptions underlie scaling models text?; latent text decides?might scaling useful outside estimating ideological position/bias text?Required reading:Lowe (2008)Kim, Lelkes, McCrain (2022)Klüver (2009)\nBunea Ibenskas (2015)\nKlüver (2015)\nBunea Ibenskas (2015)Klüver (2015)reading:Benoit et al. 
(2016)Laver, Benoit, Garry (2003)Slapin Proksch (2008)Schwemmer Wieczorek (2020)Slides:Week 5 Slides","code":""},{"path":"week-5-demo.html","id":"week-5-demo","chapter":"9 Week 5 Demo","heading":"9 Week 5 Demo","text":"","code":""},{"path":"week-5-demo.html","id":"setup-3","chapter":"9 Week 5 Demo","heading":"9.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\ndevtools::install_github(\"conjugateprior/austin\")\nlibrary(austin)\nlibrary(quanteda)\nlibrary(quanteda.textstats)"},{"path":"week-5-demo.html","id":"wordscores","chapter":"9 Week 5 Demo","heading":"9.2 Wordscores","text":"can inspect function wordscores model Laver, Benoit, Garry (2003) following way:can take example data included austin package.reference documents documents marked “R” reference; .e., columns one five.matrix simply series words (: letters) reference texts word counts .can look wordscores words, calculated using reference dimensions reference documents.see thetas contained wordscores object, .e., reference dimensions reference documents pis, .e., estimated wordscores word.can now use score -called “virgin” texts follows.","code":"\nclassic.wordscores## function (wfm, scores) \n## {\n## if (!is.wfm(wfm)) \n## stop(\"Function not applicable to this object\")\n## if (length(scores) != length(docs(wfm))) \n## stop(\"There are not the same number of documents as scores\")\n## if (any(is.na(scores))) \n## stop(\"One of the reference document scores is NA\\nFit the model with known scores and use 'predict' to get virgin score estimates\")\n## thecall <- match.call()\n## C.all <- as.worddoc(wfm)\n## C <- C.all[rowSums(C.all) > 0, ]\n## F <- scale(C, center = FALSE, scale = colSums(C))\n## ws <- apply(F, 1, function(x) {\n## sum(scores * x)\n## })/rowSums(F)\n## pi <- matrix(ws, nrow = length(ws))\n## rownames(pi) <- rownames(C)\n## colnames(pi) <- c(\"Score\")\n## val <- list(pi = pi, theta = scores, data = wfm, call = thecall)\n## class(val) <- c(\"classic.wordscores\", \"wordscores\", class(val))\n## return(val)\n## }\n## \n## \ndata(lbg)\nref <- getdocs(lbg, 1:5)\nref## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\nws <- classic.wordscores(ref, scores=seq(-1.5,1.5,by=0.75))\nws## $pi\n## Score\n## A -1.5000000\n## B -1.5000000\n## C -1.5000000\n## D -1.5000000\n## E -1.5000000\n## F -1.4812500\n## G -1.4809322\n## H -1.4519231\n## I -1.4083333\n## J -1.3232984\n## K -1.1846154\n## L -1.0369898\n## M -0.8805970\n## N -0.7500000\n## O -0.6194030\n## P -0.4507576\n## Q -0.2992424\n## R -0.1305970\n## S 0.0000000\n## T 0.1305970\n## U 0.2992424\n## V 0.4507576\n## W 0.6194030\n## X 0.7500000\n## Y 0.8805970\n## Z 1.0369898\n## ZA 1.1846154\n## ZB 1.3232984\n## ZC 1.4083333\n## ZD 1.4519231\n## ZE 1.4809322\n## ZF 1.4812500\n## ZG 1.5000000\n## ZH 1.5000000\n## ZI 1.5000000\n## ZJ 1.5000000\n## ZK 1.5000000\n## \n## $theta\n## 
[1] -1.50 -0.75 0.00 0.75 1.50\n## \n## $data\n## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\n## \n## $call\n## classic.wordscores(wfm = ref, scores = seq(-1.5, 1.5, by = 0.75))\n## \n## attr(,\"class\")\n## [1] \"classic.wordscores\" \"wordscores\" \"list\"\n#get \"virgin\" documents\nvir <- getdocs(lbg, 'V1')\nvir## docs\n## words V1\n## A 0\n## B 0\n## C 0\n## D 0\n## E 0\n## F 0\n## G 0\n## H 2\n## I 3\n## J 10\n## K 22\n## L 45\n## M 78\n## N 115\n## O 146\n## P 158\n## Q 146\n## R 115\n## S 78\n## T 45\n## U 22\n## V 10\n## W 3\n## X 2\n## Y 0\n## Z 0\n## ZA 0\n## ZB 0\n## ZC 0\n## ZD 0\n## ZE 0\n## ZF 0\n## ZG 0\n## ZH 0\n## ZI 0\n## ZJ 0\n## ZK 0\n# predict textscores for the virgin documents\npredict(ws, newdata=vir)## 37 of 37 words (100%) are scorable\n## \n## Score Std. Err. Rescaled Lower Upper\n## V1 -0.448 0.0119 -0.448 -0.459 -0.437"},{"path":"week-5-demo.html","id":"wordfish","chapter":"9 Week 5 Demo","heading":"9.3 Wordfish","text":"wish, can inspect function wordscores model Slapin Proksch (2008) following way. much complex algorithm, printed , can inspect devices.can simulate data, formatted appropriately wordfiash estimation following way:can see document word-level FEs, well specified range thetas estimates.estimating document positions simply matter implementing algorithm.","code":"\nwordfish\ndd <- sim.wordfish()\n\ndd## $Y\n## docs\n## words D01 D02 D03 D04 D05 D06 D07 D08 D09 D10\n## W01 19 24 23 18 14 12 8 13 6 4\n## W02 25 11 22 22 12 11 6 10 4 4\n## W03 14 21 18 19 13 16 17 10 3 11\n## W04 34 23 25 11 19 16 10 6 13 7\n## W05 25 19 20 20 16 10 10 12 7 2\n## W06 4 5 12 7 13 20 19 19 23 31\n## W07 6 6 15 7 13 16 14 15 19 28\n## W08 5 4 12 14 15 13 18 19 19 20\n## W09 6 7 7 9 8 17 19 20 17 20\n## W10 6 6 9 6 13 13 13 19 17 27\n## W11 59 59 46 38 39 28 26 25 15 17\n## W12 58 52 53 58 36 38 26 19 26 19\n## W13 59 55 49 44 41 27 24 18 21 10\n## W14 59 59 45 45 32 30 31 15 17 12\n## W15 65 54 43 34 44 39 21 36 13 14\n## W16 12 13 22 36 31 34 49 40 55 53\n## W17 9 23 19 24 31 39 59 50 51 46\n## W18 7 21 10 29 36 34 52 58 57 58\n## W19 14 21 22 27 41 45 42 59 49 58\n## W20 14 17 28 32 33 42 36 37 68 59\n## \n## $theta\n## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446 0.1651446 0.4954337\n## [8] 0.8257228 1.1560120 1.4863011\n## \n## $doclen\n## D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 \n## 500 500 500 500 500 500 500 500 500 500 \n## \n## $psi\n## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1\n## \n## $beta\n## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1\n## \n## attr(,\"class\")\n## [1] \"wordfish.simdata\" \"list\"\nwf <- wordfish(dd$Y)\nsummary(wf)## Call:\n## wordfish(wfm = dd$Y)\n## \n## Document Positions:\n## Estimate Std. 
Error Lower Upper\n## D01 -1.6378 0.12078 -1.87454 -1.40109\n## D02 -1.0988 0.10363 -1.30193 -0.89571\n## D03 -0.7959 0.09716 -0.98635 -0.60548\n## D04 -0.4694 0.09256 -0.65084 -0.28802\n## D05 -0.1188 0.09023 -0.29565 0.05807\n## D06 0.2096 0.09047 0.03232 0.38695\n## D07 0.6201 0.09404 0.43578 0.80442\n## D08 0.7459 0.09588 0.55795 0.93381\n## D09 1.1088 0.10322 0.90646 1.31108\n## D10 1.4366 0.11257 1.21598 1.65725"},{"path":"week-5-demo.html","id":"using-quanteda","chapter":"9 Week 5 Demo","heading":"9.4 Using quanteda","text":"can also use quanteda implement scaling techniques, demonstrated Exercise 4.","code":""},{"path":"week-6-unsupervised-learning-topic-models.html","id":"week-6-unsupervised-learning-topic-models","chapter":"10 Week 6: Unsupervised learning (topic models)","heading":"10 Week 6: Unsupervised learning (topic models)","text":"week builds upon past scaling techniques explored Week 5 instead turns another form unsupervised approach—topic modelling.substantive articles Nelson (2020) Alrababa’h Blaydes (2020) provide, turn, illuminating insights using topic models categorize thematic content text information.article Ying, Montgomery, Stewart (2021) provides valuable overview accompaniment earlier work Denny Spirling (2018) thinking validate findings test robustness inferences make models.Questions:assumptions underlie topic modelling approaches?Can develop structural models text?topic modelling discovery measurement strategy?validate model?Required reading:Nelson (2020)PARTHASARATHY, RAO, PALANISWAMY (2019)Ying, Montgomery, Stewart (2021)reading:Chang et al. (2009)Alrababa’h Blaydes (2020)J. Grimmer King (2011)Denny Spirling (2018)Smith et al. (2021)Boyd et al. (2018)Slides:Week 6 Slides","code":""},{"path":"week-6-demo.html","id":"week-6-demo","chapter":"11 Week 6 Demo","heading":"11 Week 6 Demo","text":"","code":""},{"path":"week-6-demo.html","id":"setup-4","chapter":"11 Week 6 Demo","heading":"11.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.Estimating topic model requires us first data form document-term-matrix. another term referred previous weeks document-feature-matrix.can take example data topicmodels package. text news releases Associated Press. consists around 2,200 articles (documents) 10,000 terms (words).estimate topic model need specify document-term-matrix using, number (k) topics estimating. speed estimation, estimating 100 articles.can inspect contents topic follows.can use tidy() function tidytext gather relevant parameters ’ve estimated. 
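Looking back at Section 9.4 above, which notes that we can also use quanteda to implement these scaling techniques but gives no accompanying code: below is a minimal sketch, assuming the quanteda.textmodels package is installed (it is not loaded in the Week 5 setup), that refits Wordscores and Wordfish on the austin `lbg` data used earlier.

```r
# A sketch only, under the assumption that quanteda.textmodels is available;
# it is not loaded in the Week 5 demo setup.
library(quanteda)
library(quanteda.textmodels)

# convert the austin word-frequency matrix (words x docs) to a quanteda dfm
lbg_dfm <- as.dfm(t(as.matrix(lbg)))

# reference scores for R1-R5; the "virgin" document V1 stays NA
refscores <- rep(NA_real_, ndoc(lbg_dfm))
refscores[match(paste0("R", 1:5), docnames(lbg_dfm))] <- seq(-1.5, 1.5, by = 0.75)

ws_q <- textmodel_wordscores(lbg_dfm, y = refscores)
predict(ws_q, newdata = lbg_dfm["V1", ], se.fit = TRUE)

# Wordfish needs no reference scores; dir only fixes the direction of the scale
wf_q <- textmodel_wordfish(lbg_dfm, dir = c(1, 5))
summary(wf_q)
```

If the conversion preserves the counts, the predicted score for V1 should be close to the -0.448 obtained with classic.wordscores() above.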
get \\(\\beta\\) per-topic-per-word probabilities (.e., probability given term belongs given topic) can following.get \\(\\gamma\\) per-document-per-topic probabilities (.e., probability given document (: article) belongs particular topic) following.can easily plot \\(\\beta\\) estimates follows.shows us words associated topic, size associated \\(\\beta\\) coefficient.","code":"\nlibrary(topicmodels)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(ggthemes)\ndata(\"AssociatedPress\", \n package = \"topicmodels\")\nlda_output <- LDA(AssociatedPress[1:100,], k = 10)\nterms(lda_output, 10)## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 \n## [1,] \"soviet\" \"government\" \"i\" \"dukakis\" \"new\" \n## [2,] \"roberts\" \"congress\" \"administration\" \"bush\" \"immigration\"\n## [3,] \"years\" \"jewish\" \"people\" \"rating\" \"central\" \n## [4,] \"gorbachev\" \"million\" \"bush\" \"new\" \"year\" \n## [5,] \"million\" \"soviet\" \"president\" \"president\" \"company\" \n## [6,] \"year\" \"jews\" \"noriega\" \"i\" \"greyhound\" \n## [7,] \"officers\" \"new\" \"thats\" \"day\" \"snow\" \n## [8,] \"gas\" \"people\" \"american\" \"told\" \"southern\" \n## [9,] \"polish\" \"church\" \"peres\" \"blackowned\" \"union\" \n## [10,] \"study\" \"east\" \"official\" \"rose\" \"contact\" \n## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 \n## [1,] \"fire\" \"percent\" \"i\" \"percent\" \"police\" \n## [2,] \"barry\" \"state\" \"people\" \"new\" \"two\" \n## [3,] \"warming\" \"year\" \"new\" \"bank\" \"mrs\" \n## [4,] \"global\" \"man\" \"duracell\" \"prices\" \"i\" \n## [5,] \"moore\" \"years\" \"soviet\" \"york\" \"last\" \n## [6,] \"summit\" \"last\" \"waste\" \"year\" \"school\" \n## [7,] \"mundy\" \"north\" \"agents\" \"california\" \"get\" \n## [8,] \"saudi\" \"government\" \"children\" \"economy\" \"liberace\"\n## [9,] \"asked\" \"national\" \"like\" \"oil\" \"shot\" \n## [10,] \"monday\" \"black\" \"company\" \"report\" \"man\"\nlda_beta <- tidy(lda_output, matrix = \"beta\")\n\nlda_beta %>%\n arrange(-beta)## # A tibble: 104,730 × 3\n## topic term beta\n## \n## 1 9 percent 0.0207\n## 2 7 percent 0.0184\n## 3 1 soviet 0.0143\n## 4 9 new 0.0143\n## 5 4 dukakis 0.0131\n## 6 7 state 0.0125\n## 7 10 police 0.0124\n## 8 8 i 0.0124\n## 9 4 bush 0.0120\n## 10 6 fire 0.0120\n## # ℹ 104,720 more rows\nlda_gamma <- tidy(lda_output, matrix = \"gamma\")\n\nlda_gamma %>%\n arrange(-gamma)## # A tibble: 1,000 × 3\n## document topic gamma\n## \n## 1 76 3 1.00\n## 2 81 2 1.00\n## 3 6 3 1.00\n## 4 43 1 1.00\n## 5 31 9 1.00\n## 6 95 9 1.00\n## 7 77 4 1.00\n## 8 29 1 1.00\n## 9 80 9 1.00\n## 10 57 2 1.00\n## # ℹ 990 more rows\nlda_beta %>%\n group_by(topic) %>%\n top_n(10, beta) %>%\n ungroup() %>%\n arrange(topic, -beta) %>%\n mutate(term = reorder_within(term, beta, topic)) %>%\n ggplot(aes(beta, term, fill = factor(topic))) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~ topic, scales = \"free\", ncol = 4) +\n scale_y_reordered() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"week-7-unsupervised-learning-word-embedding.html","id":"week-7-unsupervised-learning-word-embedding","chapter":"12 Week 7: Unsupervised learning (word embedding)","heading":"12 Week 7: Unsupervised learning (word embedding)","text":"week discussing second form “unsupervised” learning—word embeddings. previous weeks allowed us characterize complexity text, cluster text potential topical focus, word embeddings permit us expansive form measurement. essence, producing matrix representation entire corpus.reading Pedro L. 
Rodriguez Spirling (2022) provides effective overview technical dimensions technique. articles Garg et al. (2018) Kozlowski, Taddy, Evans (2019) two substantive articles use word embeddings provide insights prejudice bias manifested language time.Required reading:Garg et al. (2018)Kozlowski, Taddy, Evans (2019)Waller Anderson (2021)reading:P. Rodriguez Spirling (2021)Pedro L. Rodriguez Spirling (2022)Osnabrügge, Hobolt, Rodon (2021)Rheault Cochrane (2020)Jurafsky Martin (2021, ch.6): https://web.stanford.edu/~jurafsky/slp3/]Slides:Week 7 Slides","code":""},{"path":"week-7-demo.html","id":"week-7-demo","chapter":"13 Week 7 Demo","heading":"13 Week 7 Demo","text":"","code":""},{"path":"week-7-demo.html","id":"setup-5","chapter":"13 Week 7 Demo","heading":"13.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo. pre-loading already-estimated PMI matrix results singular value decomposition approach.work?Various approaches, including:\nSVD\n\nNeural network-based techniques like GloVe Word2Vec\n\nSVD\nSVDNeural network-based techniques like GloVe Word2Vec\nNeural network-based techniques like GloVe Word2VecIn approaches, :Defining context window (see figure )Looking probabilities word appearing near another wordsThe implementation technique using singular value decomposition approach requires following data structure:Word pair matrix PMI (Pairwise mutual information)PMI = log(P(x,y)/P(x)P(y))P(x,y) probability word x appearing within six-word window word yand P(x) probability word x appearing whole corpusand P(y) probability word y appearing whole corpusAnd resulting matrix object take following format:use “Singular Value Decomposition” (SVD) techique. another multidimensional scaling technique, first axis resulting coordinates captures variance, second second-etc…, simply need following.can collect vectors word inspect .","code":"\nlibrary(Matrix) #for handling matrices\nlibrary(tidyverse)\nlibrary(irlba) # for SVD\nlibrary(umap) # for dimensionality reduction\n\nload(\"data/wordembed/pmi_svd.RData\")\nload(\"data/wordembed/pmi_matrix.RData\")## 6 x 6 sparse Matrix of class \"dgCMatrix\"\n## the to and of https a\n## the 0.653259169 -0.01948121 -0.006446459 0.27136395 -0.5246159 -0.32557524\n## to -0.019481205 0.75498084 -0.065170433 -0.25694210 -0.5731182 -0.04595798\n## and -0.006446459 -0.06517043 1.027782342 -0.03974904 -0.4915159 -0.05862969\n## of 0.271363948 -0.25694210 -0.039749043 1.02111517 -0.5045067 0.09829389\n## https -0.524615878 -0.57311817 -0.491515918 -0.50450674 0.5451841 -0.57956404\n## a -0.325575239 -0.04595798 -0.058629689 0.09829389 -0.5795640 1.03048355## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. 
..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65326 -0.01948 -0.00645 0.27136 -0.52462 ...\n## ..@ factors : list()\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\ndim(word_vectors)## [1] 21172 256\nhead(word_vectors[1:5, 1:5])## [,1] [,2] [,3] [,4] [,5]\n## the 0.007810973 0.07024009 0.06377615 0.03139044 -0.12362108\n## to 0.006889381 -0.03210269 0.10665925 0.03537632 0.10104552\n## and -0.050498380 0.09131495 0.19658197 -0.08136253 -0.01605705\n## of -0.015628371 0.16306386 0.13296127 -0.04087709 -0.23175976\n## https 0.301718525 0.07658843 -0.01720398 0.26219147 0.07930941"},{"path":"week-7-demo.html","id":"using-glove-or-word2vec","chapter":"13 Week 7 Demo","heading":"13.2 Using GloVe or word2vec","text":"neural network approach considerably involved, figure gives overview picture differing algorithmic approaches might use.","code":""},{"path":"week-8-sampling-text-information.html","id":"week-8-sampling-text-information","chapter":"14 Week 8: Sampling text information","heading":"14 Week 8: Sampling text information","text":"week ’ll thinking best sample text information, thinking different biases might inhere data-generating process, well representativeness generalizability text corpus construct.reading Barberá Rivero (2015) invesitgates representativeness Twitter data, give us pause thinking using digital trace data general barometer public opinion.reading Michalopoulos Xue (2021) takes entirely different tack, illustrates can think systematically text information broadly representative societies general.Required reading:Barberá Rivero (2015)Michalopoulos Xue (2021)Klaus Krippendorff (2004, chs. 5 6)reading:Martins Baumard (2020)Baumard et al. (2022)Slides:Week 8 Slides","code":""},{"path":"week-9-supervised-learning.html","id":"week-9-supervised-learning","chapter":"15 Week 9: Supervised learning","heading":"15 Week 9: Supervised learning","text":"Required reading:Hopkins King (2010)King, Pan, Roberts (2017)Siegel et al. (2021)Yu, Kaufmann, Diermeier (2008)Manning, Raghavan, Schtze (2007, chs. 13,14, 15): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]reading:Denny Spirling (2018)King, Lam, Roberts (2017)","code":""},{"path":"week-10-validation.html","id":"week-10-validation","chapter":"16 Week 10: Validation","heading":"16 Week 10: Validation","text":"week ’ll thinking validate techniques ’ve used preceding weeks. Validation necessary important part text analysis technique.Often speak validation context machine labelling large text data. validation need ——restricted automated classification tasks. articles Ying, Montgomery, Stewart (2021) Pedro L. Rodriguez, Spirling, Stewart (2021) describe ways approach validation unsupervised contexts. Finally, article Peterson Spirling (2018) shows validation accuracy might provide measure substantive significance.Required reading:Ying, Montgomery, Stewart (2021)Pedro L. Rodriguez, Spirling, Stewart (2021)Peterson Spirling (2018)Manning, Raghavan, Schtze (2007, ch.2: https://nlp.stanford.edu/IR-book/information-retrieval-book.html)reading:K. Krippendorff (2004)Denny Spirling (2018)Justin Grimmer Stewart (2013b)Barberá et al. 
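Returning to the Week 7 demo: the "Using GloVe or word2vec" subsection above describes the neural-network route but includes no code. Below is a minimal sketch using the text2vec package, which is an assumption (it is not loaded in the demo setup); `docs` stands in for any character vector of documents (for example, the tweets used to build the PMI matrix), and the target word is illustrative only.

```r
# A sketch only: GloVe embeddings via text2vec (an assumed dependency),
# followed by a nearest-neighbour check using cosine similarity.
library(text2vec)

tokens <- word_tokenizer(tolower(docs))   # `docs`: a character vector of documents
it     <- itoken(tokens, progressbar = FALSE)

vocab      <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)

# term co-occurrence matrix with a six-word window, mirroring the PMI setup above
tcm <- create_tcm(it, vectorizer, skip_grams_window = 6)

# fit GloVe; note that `rank` was called `word_vectors_size` in older text2vec releases
glove   <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
wv      <- wv_main + t(glove$components)   # combine main and context vectors

# nearest neighbours of an (assumed present) target word by cosine similarity;
# the same inspection works on the SVD-based `word_vectors` matrix estimated above
target <- wv["economy", , drop = FALSE]
head(sort(sim2(wv, target, method = "cosine", norm = "l2")[, 1],
          decreasing = TRUE), 10)
```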
(2021)Schiller, Daxenberger, Gurevych (2021)Slides:Week 10 Slides","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercise-1-word-frequency-analysis","chapter":"17 Exercise 1: Word frequency analysis","heading":"17 Exercise 1: Word frequency analysis","text":"","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"introduction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.1 Introduction","text":"tutorial, learn summarise, aggregate, analyze text R:tokenize filter textHow clean preprocess textHow visualize results ggplotHow perform automated gender assignment name data (think possible biases methods may enclose)","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"setup-6","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.2 Setup","text":"practice skills, use dataset already collected Edinburgh Fringe Festival website.can try : obtain data, must first obtain API key. Instructions available Edinburgh Fringe API page:","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"load-data-and-packages","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.3 Load data and packages","text":"proceeding, ’ll load remaining packages need tutorial.tutorial, using data pre-cleaned provided .csv format. data come Edinburgh Book Festival API, provide data every event taken place Edinburgh Book Festival, runs every year month August, nine years: 2012-2020. many questions might ask data. tutorial, investigate contents event, speakers event, determine trends gender representation time.first task, , read data. can read_csv() function.read_csv() function takes .csv file loads working environment data frame object called “edbfdata.” can call object anything though. Try changing name object <- arrow. Note R allow names spaces , however. also good idea name object something beginning numbers, means call object within ` marks.’re working document computer (“locally”) can download Edinburgh Fringe data following way:","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(babynames) #for gender predictions\nedbfdata <- read_csv(\"data/wordfreq/edbookfestall.csv\")## New names:\n## Rows: 5938 Columns: 12\n## ── Column specification\n## ───────────────────────────────────────────────────────── Delimiter: \",\" chr\n## (8): festival_id, title, sub_title, artist, description, genre, age_categ... dbl\n## (4): ...1, year, latitude, longitude\n## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ\n## Specify the column types or set `show_col_types = FALSE` to quiet this message.\n## • `` -> `...1`\nedbfdata <- read_csv(\"https://raw.githubusercontent.com/cjbarrie/RDL-Ed/main/02-text-as-data/data/edbookfestall.csv\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"inspect-and-filter-data","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.4 Inspect and filter data","text":"next job cut dataset size, including columns need. first can inspect see existing column names , variable coded. can first call::can see description event included column named “description” year event “year.” now ’ll just keep two. 
Remember: ’re interested tutorial firstly representation gender feminism forms cultural production given platform Edinburgh International Book Festival. Given , first foremost interested reported content artist’s event.use pipe %>% functions tidyverse package quickly efficiently select columns want edbfdata data.frame object. pass data new data.frame object, call “evdes.”let’s take quick look many events time festival. , first calculate number individual events (row observations) year (column variable).can plot using ggplot!Perhaps unsurprisingly, context pandemic, number recorded bookings 2020 Festival drastically reduced.","code":"\ncolnames(edbfdata)## [1] \"...1\" \"festival_id\" \"title\" \"sub_title\" \"artist\" \n## [6] \"year\" \"description\" \"genre\" \"latitude\" \"longitude\" \n## [11] \"age_category\" \"ID\"\nglimpse(edbfdata)## Rows: 5,938\n## Columns: 12\n## $ ...1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…\n## $ festival_id \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"b…\n## $ title \"Denise Mina\", \"Alex T Smith\", \"Challenging Expectations w…\n## $ sub_title \"HARD MEN AND CARDBOARD GANGSTERS\", NA, NA, \"WHAT CAUSED T…\n## $ artist \"Denise Mina\", \"Alex T Smith\", \"Peter Cocks\", \"Paul Mason\"…\n## $ year 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012…\n## $ description \"\\n\\tAs the grande dame of Scottish crime fiction, Deni…\n## $ genre \"Literature\", \"Children\", \"Children\", \"Literature\", \"Child…\n## $ latitude 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9…\n## $ longitude -3.206913, -3.206913, -3.206913, -3.206913, -3.206913, -3.…\n## $ age_category NA, \"AGE 4 - 7\", \"AGE 11 - 14\", NA, \"AGE 10 - 14\", \"AGE 6 …\n## $ ID \"Denise Mina2012\", \"Alex T Smith2012\", \"Peter Cocks2012\", …\n# get simplified dataset with only event contents and year\nevdes <- edbfdata %>%\n select(description, year)\n\nhead(evdes)## # A tibble: 6 × 2\n## description year\n## \n## 1 \"\\n\\tAs the grande dame of Scottish crime fiction, Denise Mina places… 2012\n## 2 \"
\\n\\tWhen Alex T Smith was a little boy he wanted to be a chef, a rab… 2012\n## 3 \"
\\n\\tPeter Cocks is known for his fantasy series Triskellion written … 2012\n## 4 \"
\\n\\tTwo books by influential journalists are among the first to look… 2012\n## 5 \"
\\n\\tChris d’Lacey tells you all about The Fire Ascending, the … 2012\n## 6 \"
\\n\\tIt’s time for the honourable, feisty and courageous young … 2012\nevtsperyr <- evdes %>%\n mutate(obs=1) %>%\n group_by(year) %>%\n summarise(sum_events = sum(obs))\nggplot(evtsperyr) +\n geom_line(aes(year, sum_events)) +\n theme_tufte(base_family = \"Helvetica\") + \n scale_y_continuous(expand = c(0, 0), limits = c(0, NA))"},{"path":"exercise-1-word-frequency-analysis.html","id":"tidy-the-text","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.5 Tidy the text","text":"Given data obtained API outputs data originally HTML format, text still contains HTML PHP encodings e.g. bold font paragraphs. ’ll need get rid , well punctuation analyzing data.set commands takes event descriptions, extracts individual words, counts number times appear years covered book festival data.","code":"\n#get year and word for every word and date pair in the dataset\ntidy_des <- evdes %>% \n mutate(desc = tolower(description)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))"},{"path":"exercise-1-word-frequency-analysis.html","id":"back-to-the-fringe","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.6 Back to the Fringe","text":"see resulting dataset large (~446k rows). commands first taken events text, “mutated” set lower case character string. “unnest_tokens” function taken individual string create new column called “word” contains individual word contained event description texts.terminology also appropriate . tidy text format, often refer data structures consisting “documents” “terms.” “tokenizing” text “unnest_tokens” functions generating dataset one term per row., “documents” collection descriptions events year Edinburgh Book Festival. way sort text “documents” depends choice individual researcher.Instead year, might wanted sort text “genre.” , two genres: “Literature” “Children.” done , two “documents,” contained words included event descriptions genre.Alternatively, might interested contributions individual authors time. case, sorted text documents author. case, “document” represent words included event descriptions events given author (many multiple appearances time festival given year).can yet tidy , though. First ’ll remove stop words ’ll remove apostrophes:see number rows dataset reduces half ~223k rows. natural since large proportion string contain many -called “stop words”. can see stop words typing:lexicon (list words) included tidytext package produced Julia Silge David Robinson (see ). see contains 1000 words. remove informative interested substantive content text (rather , say, grammatical content).Now let’s look common words data:can see one common words “rsquo,” HTML encoding apostrophe. Clearly need clean data bit . common issue large-n text analysis key step want conduct reliably robust forms text analysis. ’ll another go using filter command, specifying keep words included string words rsquo, em, ndash, nbsp, lsquo.’s like ! words feature seem make sense now (actual words rather random HTML UTF-8 encodings).Let’s now collect words data.frame object, ’ll call edbf_term_counts:year, see “book” common word… perhaps surprises . evidence ’re properly pre-processing cleaning data. Cleaning text data important element preparing text analysis. often process trial error text data looks alike, may come e.g. webpages HTML encoding, unrecognized fonts unicode, potential cause issues! 
finding errors also chance get know data…","code":"\ntidy_des <- tidy_des %>%\n filter(!word %in% stop_words$word)\nstop_words## # A tibble: 1,149 × 2\n## word lexicon\n## \n## 1 a SMART \n## 2 a's SMART \n## 3 able SMART \n## 4 about SMART \n## 5 above SMART \n## 6 according SMART \n## 7 accordingly SMART \n## 8 across SMART \n## 9 actually SMART \n## 10 after SMART \n## # ℹ 1,139 more rows\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,995 × 2\n## word n\n## \n## 1 rsquo 5638\n## 2 book 2088\n## 3 event 1356\n## 4 author 1332\n## 5 world 1240\n## 6 story 1159\n## 7 join 1095\n## 8 em 1064\n## 9 life 879\n## 10 strong 864\n## # ℹ 24,985 more rows\nremove_reg <- c(\"&\",\"<\",\">\",\"\", \"<\/p>\",\"&rsquo\", \"‘\", \"'\", \"\", \"<\/strong>\", \"rsquo\", \"em\", \"ndash\", \"nbsp\", \"lsquo\", \"strong\")\n \ntidy_des <- tidy_des %>%\n filter(!word %in% remove_reg)\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,989 × 2\n## word n\n## \n## 1 book 2088\n## 2 event 1356\n## 3 author 1332\n## 4 world 1240\n## 5 story 1159\n## 6 join 1095\n## 7 life 879\n## 8 stories 860\n## 9 chaired 815\n## 10 books 767\n## # ℹ 24,979 more rows\nedbf_term_counts <- tidy_des %>% \n group_by(year) %>%\n count(word, sort = TRUE)\nhead(edbf_term_counts)## # A tibble: 6 × 3\n## # Groups: year [6]\n## year word n\n## \n## 1 2016 book 295\n## 2 2018 book 283\n## 3 2019 book 265\n## 4 2012 book 254\n## 5 2013 book 241\n## 6 2015 book 239"},{"path":"exercise-1-word-frequency-analysis.html","id":"analyze-keywords","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.7 Analyze keywords","text":"Okay, now list words, number times appear, can tag words think might related issues gender inequality sexism. may decide list imprecise inexhaustive. , feel free change terms including grepl() function.","code":"\nedbf_term_counts$womword <- as.integer(grepl(\"women|feminist|feminism|gender|harassment|sexism|sexist\", \n x = edbf_term_counts$word))\nhead(edbf_term_counts)## # A tibble: 6 × 4\n## # Groups: year [6]\n## year word n womword\n## \n## 1 2016 book 295 0\n## 2 2018 book 283 0\n## 3 2019 book 265 0\n## 4 2012 book 254 0\n## 5 2013 book 241 0\n## 6 2015 book 239 0"},{"path":"exercise-1-word-frequency-analysis.html","id":"compute-aggregate-statistics","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.8 Compute aggregate statistics","text":"Now tagged individual words relating gender inequality feminism, can sum number times words appear year denominate total number words event descriptions.intuition increase decrease percentage words relating issues capturing substantive change representation issues related sex gender.think measure? adequate measure representation issues cultural sphere?keywords used precise enough? , change?","code":"\n#get counts by year and word\nedbf_counts <- edbf_term_counts %>%\n group_by(year) %>%\n mutate(year_total = sum(n)) %>%\n filter(womword==1) %>%\n summarise(sum_wom = sum(n),\n year_total= min(year_total))\nhead(edbf_counts)## # A tibble: 6 × 3\n## year sum_wom year_total\n## \n## 1 2012 22 23146\n## 2 2013 40 23277\n## 3 2014 30 25366\n## 4 2015 24 22158\n## 5 2016 34 24356\n## 6 2017 55 27602"},{"path":"exercise-1-word-frequency-analysis.html","id":"plot-time-trends","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.9 Plot time trends","text":"see? Let’s take count words relating gender dataset, denominate total number words data per year.can add visual guides draw attention apparent changes data. 
, might wish signal year #MeToo movement 2017.label highlighting year 2017 including text label along vertical line.","code":"\nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\") \nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n geom_vline(xintercept = 2017, col=\"red\") +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\")\nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n geom_vline(xintercept = 2017, col=\"red\") +\n geom_text(aes(x=2017.1, label=\"#metoo year\", y=.0015), \n colour=\"black\", angle=90, text=element_text(size=8)) +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"bonus-gender-prediction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.10 Bonus: gender prediction","text":"might decide measure inadequate expansive answer question hand. Another way measuring representation cultural production measure gender authors spoke events.course, take quite time individually code approximately 6000 events included dataset.exist alternative techniques imputing gender based name individual.first create new data.frame object, selecting just columns artist name year. generate new column containing just artist’s (author’s) first name:set packages called gender genderdata used make process predicting gender based given individual’s name pretty straightforward. technique worked reference U.S. Social Security Administration baby name data.Given common gender associated given name changes time, function also allows us specify range years cohort question whose gender inferring. Given don’t know wide cohort artists , specify broad range 1920-2000.Unfortunately, package longer works newer versions R; fortunately, recreated using original “babynames” data, comes bundled babynames package.don’t necessarily follow step done —include information sake completeness.babynames package. contains, year, number children born given name, well sex. information, can calculate total number individuals given name born sex given year.Given also total number babies born total cross records, can denominate (divide) sums name total number births sex year. can take proportion representing probability given individual Edinburgh Fringe dataset male female.information babynames package can found .first load babynames package R environment data.frame object. data.frame “babynames” contained babynames package can just call object store .dataset contains names years period 1800–2019. variable “n” represents number babies born given name sex year, “prop” represents, according package materials accessible , “n divided total number applicants year, means proportions people gender name born year.”calculate total number babies female male sex born year. merge get combined dataset male female baby names year. merge information back original babynames data.frame object.can calculate, babies born 1920, number babies born name sex. 
information, can get proportion babies given name particular sex. example, 92% babies born name “Mary” female, give us .92 probability individual name “Mary” female.every name dataset, excluding names proportion equal .5; .e., names adjudicate whether less likely male female.proportions names, can merge back names artists Edinburgh Fringe Book Festival. can easily plot proportion artists Festival male versus female year Festival.can conclude form graph?Note merged proportions th “babynames” data Edinburgh Fringe data lost observations. names Edinburgh Fringe data match “babynames” data. Let’s look names match:notice anything names? tell us potential biases using sources US baby names data foundation gender prediction? alternative ways might go task?","code":"\n# get columns for artist name and year, omitting NAs\ngendes <- edbfdata %>%\n select(artist, year) %>%\n na.omit()\n\n# generate new column with just the artist's (author's) first name\ngendes$name <- sub(\" .*\", \"\", gendes$artist)\ngenpred <- gender(gendes$name,\n years = c(1920, 2000))\nbabynames <- babynames\nhead(babynames)## # A tibble: 6 × 5\n## year sex name n prop\n## \n## 1 1880 F Mary 7065 0.0724\n## 2 1880 F Anna 2604 0.0267\n## 3 1880 F Emma 2003 0.0205\n## 4 1880 F Elizabeth 1939 0.0199\n## 5 1880 F Minnie 1746 0.0179\n## 6 1880 F Margaret 1578 0.0162\ntotals_female <- babynames %>%\n filter(sex==\"F\") %>%\n group_by(year) %>%\n summarise(total_female = sum(n))\n\ntotals_male <- babynames %>%\n filter(sex==\"M\") %>%\n group_by(year) %>%\n summarise(total_male = sum(n))\n\ntotals <- merge(totals_female, totals_male)\n\ntotsm <- merge(babynames, totals, by = \"year\")\nhead(totsm)## year sex name n prop total_female total_male\n## 1 1880 F Mary 7065 0.07238359 90993 110491\n## 2 1880 F Anna 2604 0.02667896 90993 110491\n## 3 1880 F Emma 2003 0.02052149 90993 110491\n## 4 1880 F Elizabeth 1939 0.01986579 90993 110491\n## 5 1880 F Minnie 1746 0.01788843 90993 110491\n## 6 1880 F Margaret 1578 0.01616720 90993 110491\ntotprops <- totsm %>%\n filter(year >= 1920) %>%\n group_by(name, year) %>%\n mutate(sumname = sum(n),\n prop = ifelse(sumname==n, 1,\n n/sumname)) %>%\n filter(prop!=.5) %>%\n group_by(name) %>%\n slice(which.max(prop)) %>%\n summarise(prop = max(prop),\n totaln = sum(n),\n name = max(name),\n sex = unique(sex))\n\nhead(totprops)## # A tibble: 6 × 4\n## name prop totaln sex \n## \n## 1 Aaban 1 5 M \n## 2 Aabha 1 7 F \n## 3 Aabid 1 5 M \n## 4 Aabir 1 5 M \n## 5 Aabriella 1 5 F \n## 6 Aada 1 5 F\nednameprops <- merge(totprops, gendes, by = \"name\")\n\nggplot(ednameprops, aes(x=year, fill = factor(sex))) +\n geom_bar(position = \"fill\") +\n xlab(\"Year\") +\n ylab(\"% women authors\") +\n labs(fill=\"\") +\n scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +\n theme_tufte(base_family = \"Helvetica\") +\n geom_abline(slope=0, intercept=0.5, col = \"black\",lty=2)\nnames1 <- ednameprops$name\nnames2 <- gendes$name\ndiffs <- setdiff(names2, names1)\ndiffs## [1] \"L\" \"Kapka\" \"Menzies\" \"Ros\" \n## [5] \"G\" \"Pankaj\" \"Uzodinma\" \"Rodge\" \n## [9] \"A\" \"Zoë\" \"László\" \"Sadakat\" \n## [13] \"Michèle\" \"Maajid\" \"Yrsa\" \"Ahdaf\" \n## [17] \"Noo\" \"Dilip\" \"Sjón\" \"François\" \n## [21] \"J\" \"K\" \"Aonghas\" \"S\" \n## [25] \"Bashabi\" \"Kjartan\" \"Romesh\" \"T\" \n## [29] \"Chibundu\" \"Yiyun\" \"Fiammetta\" \"W\" \n## [33] \"Sindiwe\" \"Cat\" \"Jez\" \"Fi\" \n## [37] \"Sunder\" \"Saci\" \"C.J\" \"Halik\" \n## [41] \"Niccolò\" \"Sifiso\" \"C.S.\" \"DBC\" \n## [45] 
\"Phyllida\" \"R\" \"Struan\" \"C.J.\" \n## [49] \"SF\" \"Nadifa\" \"Jérome\" \"D\" \n## [53] \"Xiaolu\" \"Ramita\" \"John-Paul\" \"Ha-Joon\" \n## [57] \"Niq\" \"Andrés\" \"Sasenarine\" \"Frane\" \n## [61] \"Alev\" \"Gruff\" \"Line\" \"Zakes\" \n## [65] \"Pip\" \"Witi\" \"Halsted\" \"Ziauddin\" \n## [69] \"J.\" \"Åsne\" \"Alecos\" \".\" \n## [73] \"Julián\" \"Sunjeev\" \"A.C.S\" \"Etgar\" \n## [77] \"Hyeonseo\" \"Jaume\" \"A.\" \"Jesús\" \n## [81] \"Jón\" \"Helle\" \"M\" \"Jussi\" \n## [85] \"Aarathi\" \"Shappi\" \"Macastory\" \"Odafe\" \n## [89] \"Chimwemwe\" \"Hrefna\" \"Bidisha\" \"Packie\" \n## [93] \"Tahmima\" \"Sara-Jane\" \"Tahar\" \"Lemn\" \n## [97] \"Neu!\" \"Jürgen\" \"Barroux\" \"Jan-Philipp\" \n## [101] \"Non\" \"Metaphrog\" \"Wilko\" \"Álvaro\" \n## [105] \"Stef\" \"Erlend\" \"Grinagog\" \"Norma-Ann\" \n## [109] \"Fuchsia\" \"Giddy\" \"Joudie\" \"Sav\" \n## [113] \"Liu\" \"Jayne-Anne\" \"Wioletta\" \"Sinéad\" \n## [117] \"Katherena\" \"Siân\" \"Dervla\" \"Teju\" \n## [121] \"Iosi\" \"Daša\" \"Cosey\" \"Bettany\" \n## [125] \"Thordis\" \"Uršuľa\" \"Limmy\" \"Meik\" \n## [129] \"Zindzi\" \"Dougie\" \"Ngugi\" \"Inua\" \n## [133] \"Ottessa\" \"Bjørn\" \"Novuyo\" \"Rhidian\" \n## [137] \"Sibéal\" \"Hsiao-Hung\" \"Audur\" \"Sadek\" \n## [141] \"Özlem\" \"Zaffar\" \"Jean-Pierre\" \"Lalage\" \n## [145] \"Yaba\" \"H\" \"DJ\" \"Sigitas\" \n## [149] \"Clémentine\" \"Celeste-Marie\" \"Marawa\" \"Ghillie\" \n## [153] \"Ahdam\" \"Suketu\" \"Goenawan\" \"Niviaq\" \n## [157] \"Steinunn\" \"Shoo\" \"Ibram\" \"Venki\" \n## [161] \"DeRay\" \"Diarmaid\" \"Serhii\" \"Harkaitz\" \n## [165] \"Adélaïde\" \"Agustín\" \"Jérôme\" \"Siobhán\" \n## [169] \"Nesrine\" \"Jokha\" \"Gulnar\" \"Uxue\" \n## [173] \"Taqralik\" \"Tayi\" \"E\" \"Dapo\" \n## [177] \"Dunja\" \"Maaza\" \"Wayétu\" \"Shokoofeh\""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercises","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.11 Exercises","text":"Filter books genre (selecting e.g., “Literature” “Children”) plot frequency women-related words time.Choose another set terms filter (e.g., race-related words) plot frequency time.","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"references","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.12 References","text":"","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"exercise-2-dictionary-based-methods","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18 Exercise 2: Dictionary-based methods","text":"","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"introduction-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.1 Introduction","text":"tutorial, learn :Use dictionary-based techniques analyze textUse common sentiment dictionariesCreate “dictionary”Use Lexicoder sentiment dictionary Young Soroka (2012)","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"setup-7","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.2 Setup","text":"hands-exercise week uses dictionary-based methods filtering scoring words. Dictionary-based methods use pre-generated lexicons, list words associated scores variables measuring valence particular word. sense, exercise unlike analysis Edinburgh Book Festival event descriptions. , filtering descriptions based presence absence word related women gender. can understand approach particularly simple type “dictionary-based” method. 
, “dictionary” “lexicon” contained just words related gender.","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"load-data-and-packages-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.3 Load data and packages","text":"proceeding, ’ll load remaining packages need tutorial.exercise ’ll using another new dataset. data collected Twitter accounts top eight newspapers UK circulation. can see names newspapers code :details access Twitter data academictwitteR, check details package .can download final dataset :’re working document computer (“locally”) can download tweets data following way:","code":"\nlibrary(academictwitteR) # for fetching Twitter data\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(readr) # more informative and easy way to import data\nlibrary(stringr) # to handle text elements\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(quanteda) # includes functions to implement Lexicoder\nlibrary(textdata)\nnewspapers = c(\"TheSun\", \"DailyMailUK\", \"MetroUK\", \"DailyMirror\", \n \"EveningStandard\", \"thetimes\", \"Telegraph\", \"guardian\")\n\ntweets <-\n get_all_tweets(\n users = newspapers,\n start_tweets = \"2020-01-01T00:00:00Z\",\n end_tweets = \"2020-05-01T00:00:00Z\",\n data_path = \"data/sentanalysis/\",\n n = Inf,\n )\n\ntweets <- \n bind_tweets(data_path = \"data/sentanalysis/\", output_format = \"tidy\")\n\nsaveRDS(tweets, \"data/sentanalysis/newstweets.rds\")\ntweets <- readRDS(\"data/sentanalysis/newstweets.rds\")\ntweets <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true\")))"},{"path":"exercise-2-dictionary-based-methods.html","id":"inspect-and-filter-data-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.4 Inspect and filter data","text":"Let’s look data:row tweets produced one news outlets detailed five month period, January–May 2020. Note also tweets particular date. 
can therefore use look time changes.won’t need variables let’s just keep interest us:manipulate data tidy format , unnesting token (: words) tweet text.’ll tidy , previous example, removing stop words:","code":"\nhead(tweets)## # A tibble: 6 × 31\n## tweet_id user_username text lang author_id source possibly_sensitive\n## \n## 1 1212334402266521… DailyMirror \"Sec… en 16887175 Tweet… FALSE \n## 2 1212334169457676… DailyMirror \"RT … en 16887175 Tweet… FALSE \n## 3 1212333195879993… thetimes \"A c… en 6107422 Echob… FALSE \n## 4 1212333194864988… TheSun \"Way… en 34655603 Echob… FALSE \n## 5 1212332920507191… DailyMailUK \"Stu… en 111556423 Socia… FALSE \n## 6 1212332640570875… TheSun \"Dad… en 34655603 Twitt… FALSE \n## # ℹ 24 more variables: conversation_id , created_at , user_url ,\n## # user_location , user_protected , user_verified ,\n## # user_name , user_profile_image_url , user_description ,\n## # user_created_at , user_pinned_tweet_id , retweet_count ,\n## # like_count , quote_count , user_tweet_count ,\n## # user_list_count , user_followers_count ,\n## # user_following_count , sourcetweet_type , sourcetweet_id , …\ncolnames(tweets)## [1] \"tweet_id\" \"user_username\" \"text\" \n## [4] \"lang\" \"author_id\" \"source\" \n## [7] \"possibly_sensitive\" \"conversation_id\" \"created_at\" \n## [10] \"user_url\" \"user_location\" \"user_protected\" \n## [13] \"user_verified\" \"user_name\" \"user_profile_image_url\"\n## [16] \"user_description\" \"user_created_at\" \"user_pinned_tweet_id\" \n## [19] \"retweet_count\" \"like_count\" \"quote_count\" \n## [22] \"user_tweet_count\" \"user_list_count\" \"user_followers_count\" \n## [25] \"user_following_count\" \"sourcetweet_type\" \"sourcetweet_id\" \n## [28] \"sourcetweet_text\" \"sourcetweet_lang\" \"sourcetweet_author_id\" \n## [31] \"in_reply_to_user_id\"\ntweets <- tweets %>%\n select(user_username, text, created_at, user_name,\n retweet_count, like_count, quote_count) %>%\n rename(username = user_username,\n newspaper = user_name,\n tweet = text)\ntidy_tweets <- tweets %>% \n mutate(desc = tolower(tweet)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))\ntidy_tweets <- tidy_tweets %>%\n filter(!word %in% stop_words$word)"},{"path":"exercise-2-dictionary-based-methods.html","id":"get-sentiment-dictionaries","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.5 Get sentiment dictionaries","text":"Several sentiment dictionaries come bundled tidytext package. :AFINN Finn Årup Nielsen,bing Bing Liu collaborators, andnrc Saif Mohammad Peter TurneyWe can look see relevant dictionaries stored.see . First, AFINN lexicon gives words score -5 +5, negative scores indicate negative sentiment positive scores indicate positive sentiment. nrc lexicon opts binary classification: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust, word given score 1/0 sentiments. words, nrc lexicon, words appear multiple times enclose one emotion (see, e.g., “abandon” ). bing lexicon minimal, classifying words simply binary “positive” “negative” categories.Let’s see might filter texts selecting dictionary, subset dictionary, using inner_join() filter tweet data. might, example, interested fear words. Maybe, might hypothesize, uptick fear toward beginning coronavirus outbreak. First, let’s look words tweet data nrc lexicon codes fear-related words.total 1,174 words fear valence tweet data according nrc classification. 
Several seem reasonable (e.g., “death,” “pandemic”); others seems less (e.g., “mum,” “fight”).","code":"\nget_sentiments(\"afinn\")## # A tibble: 2,477 × 2\n## word value\n## \n## 1 abandon -2\n## 2 abandoned -2\n## 3 abandons -2\n## 4 abducted -2\n## 5 abduction -2\n## 6 abductions -2\n## 7 abhor -3\n## 8 abhorred -3\n## 9 abhorrent -3\n## 10 abhors -3\n## # ℹ 2,467 more rows\nget_sentiments(\"bing\")## # A tibble: 6,786 × 2\n## word sentiment\n## \n## 1 2-faces negative \n## 2 abnormal negative \n## 3 abolish negative \n## 4 abominable negative \n## 5 abominably negative \n## 6 abominate negative \n## 7 abomination negative \n## 8 abort negative \n## 9 aborted negative \n## 10 aborts negative \n## # ℹ 6,776 more rows\nget_sentiments(\"nrc\")## # A tibble: 13,875 × 2\n## word sentiment\n## \n## 1 abacus trust \n## 2 abandon fear \n## 3 abandon negative \n## 4 abandon sadness \n## 5 abandoned anger \n## 6 abandoned fear \n## 7 abandoned negative \n## 8 abandoned sadness \n## 9 abandonment anger \n## 10 abandonment fear \n## # ℹ 13,865 more rows\nnrc_fear <- get_sentiments(\"nrc\") %>% \n filter(sentiment == \"fear\")\n\ntidy_tweets %>%\n inner_join(nrc_fear) %>%\n count(word, sort = TRUE)## Joining with `by = join_by(word)`## # A tibble: 1,173 × 2\n## word n\n## \n## 1 mum 4509\n## 2 death 4073\n## 3 police 3275\n## 4 hospital 2240\n## 5 government 2179\n## 6 pandemic 1877\n## 7 fight 1309\n## 8 die 1199\n## 9 attack 1099\n## 10 murder 1064\n## # ℹ 1,163 more rows"},{"path":"exercise-2-dictionary-based-methods.html","id":"sentiment-trends-over-time","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.6 Sentiment trends over time","text":"see time trends? First let’s make sure data properly arranged ascending order date. ’ll add column, ’ll call “order,” use become clear sentiment analysis.Remember structure tweet data one token (word) per document (tweet) format. order look sentiment trends time, ’ll need decide many words estimate sentiment., first add sentiment dictionary inner_join(). use count() function, specifying want count dates, words indexed order (.e., row number) every 1000 rows (.e., every 1000 words).means one date many tweets totalling >1000 words, multiple observations given date; one two tweets might just one row associated sentiment score date.calculate sentiment scores sentiment types (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust) use spread() function convert separate columns (rather rows). Finally calculate net sentiment score subtracting score negative sentiment positive sentiment.different sentiment dictionaries look compared ? 
can plot sentiment scores time sentiment dictionaries like :see look pretty similar… interestingly seems overall sentiment positivity increases pandemic breaks.","code":"\n#gen data variable, order and format date\ntidy_tweets$date <- as.Date(tidy_tweets$created_at)\n\ntidy_tweets <- tidy_tweets %>%\n arrange(date)\n\ntidy_tweets$order <- 1:nrow(tidy_tweets)\n#get tweet sentiment by date\ntweets_nrc_sentiment <- tidy_tweets %>%\n inner_join(get_sentiments(\"nrc\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative)## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"nrc\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 2 of `x` matches multiple rows in `y`.\n## ℹ Row 7712 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.\ntweets_nrc_sentiment %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25)## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"bing\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative) %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"bing sentiment\")## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"bing\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 54114 of `x` matches multiple rows in `y`.\n## ℹ Row 3848 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"nrc\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative) %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"nrc sentiment\")## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"nrc\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 2 of `x` matches multiple rows in `y`.\n## ℹ Row 7712 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"afinn\")) %>%\n group_by(date, index = order %/% 1000) %>% \n summarise(sentiment = sum(value)) %>% \n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"afinn sentiment\")## Joining with `by = join_by(word)`\n## `summarise()` has grouped output by 'date'. You can override using the `.groups`\n## argument.\n## `geom_smooth()` using formula = 'y ~ x'"},{"path":"exercise-2-dictionary-based-methods.html","id":"domain-specific-lexicons","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.7 Domain-specific lexicons","text":"course, list- dictionary-based methods need focus sentiment, even one common uses. essence, ’ll seen sentiment analysis techniques rely given lexicon score words appropriately. nothing stopping us making dictionaries, whether measure sentiment . 
data , might interested, example, prevalence mortality-related words news. , might choose make dictionary terms. look like?minimal example choose, example, words like “death” synonyms score 1. combine dictionary, ’ve called “mordict” .use technique bind data look incidence words time. Combining sequence scripts following:simply counts number mortality words time. might misleading , example, longer tweets certain points time; .e., length quantity text time-constant.matter? Well, just mortality words later just tweets earlier . just counting words, taking account denominator.alternative, preferable, method simply take character string relevant words. sum total number words across tweets time. filter tweet words whether mortality word , according dictionary words constructed. words, summing number times appear date., join data frame total words date. Note using full_join() want include dates appear “totals” data frame appear filter mortality words; .e., days mortality words equal 0. go plotting .","code":"\nword <- c('death', 'illness', 'hospital', 'life', 'health',\n 'fatality', 'morbidity', 'deadly', 'dead', 'victim')\nvalue <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)\nmordict <- data.frame(word, value)\nmordict## word value\n## 1 death 1\n## 2 illness 1\n## 3 hospital 1\n## 4 life 1\n## 5 health 1\n## 6 fatality 1\n## 7 morbidity 1\n## 8 deadly 1\n## 9 dead 1\n## 10 victim 1\ntidy_tweets %>%\n inner_join(mordict) %>%\n group_by(date, index = order %/% 1000) %>% \n summarise(morwords = sum(value)) %>% \n ggplot(aes(date, morwords)) +\n geom_bar(stat= \"identity\") +\n ylab(\"mortality words\")## Joining with `by = join_by(word)`\n## `summarise()` has grouped output by 'date'. You can override using the `.groups`\n## argument.\nmordict <- c('death', 'illness', 'hospital', 'life', 'health',\n 'fatality', 'morbidity', 'deadly', 'dead', 'victim')\n\n#get total tweets per day (no missing dates so no date completion required)\ntotals <- tidy_tweets %>%\n mutate(obs=1) %>%\n group_by(date) %>%\n summarise(sum_words = sum(obs))\n\n#plot\ntidy_tweets %>%\n mutate(obs=1) %>%\n filter(grepl(paste0(mordict, collapse = \"|\"),word, ignore.case = T)) %>%\n group_by(date) %>%\n summarise(sum_mwords = sum(obs)) %>%\n full_join(totals, word, by=\"date\") %>%\n mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),\n pctmwords = sum_mwords/sum_words) %>%\n ggplot(aes(date, pctmwords)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n xlab(\"Date\") + ylab(\"% mortality words\")## `geom_smooth()` using formula = 'y ~ x'"},{"path":"exercise-2-dictionary-based-methods.html","id":"using-lexicoder","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.8 Using Lexicoder","text":"approaches use general dictionary-based techniques designed domain-specific text news text. Lexicoder Sentiment Dictionary, Young Soroka (2012) designed specifically examining affective content news text. follows, see implement analysis using dictionary.conduct analysis using quanteda package. see can tokenize text similar way using functions included quanteda package.quanteda package first need create “corpus” object, declaring tweets corpus object. , make sure date column correctly stored create corpus object corpus() function. Note specifying text_field “tweet” text data interest , including information date tweet published. information specified docvars argument. ’ll see tthen corpus consists text -called “docvars,” just variables (columns) original dataset. 
, included date column.tokenize text using tokens() function quanteda, removing punctuation along way:take data_dictionary_LSD2015 comes bundled quanteda select positive negative categories, excluding words deemed “neutral.” , ready “look ” dictionary tokens corpus scored tokens_lookup() function.creates long list texts (tweets) annotated series ‘positive’ ‘negative’ annotations depending valence words text. creators quanteda recommend generate document feature matric . Grouping date, get dfm object, quite convoluted list object can plot using base graphics functions plotting matrices.Alternatively, can recreate tidy format follows:plot accordingly:","code":"\ntweets$date <- as.Date(tweets$created_at)\n\ntweet_corpus <- corpus(tweets, text_field = \"tweet\", docvars = \"date\")## Warning: docvars argument is not used.\ntoks_news <- tokens(tweet_corpus, remove_punct = TRUE)\n# select only the \"negative\" and \"positive\" categories\ndata_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]\n\ntoks_news_lsd <- tokens_lookup(toks_news, dictionary = data_dictionary_LSD2015_pos_neg)\n# create a document document-feature matrix and group it by date\ndfmat_news_lsd <- dfm(toks_news_lsd) %>% \n dfm_group(groups = date)\n\n# plot positive and negative valence over time\nmatplot(dfmat_news_lsd$date, dfmat_news_lsd, type = \"l\", lty = 1, col = 1:2,\n ylab = \"Frequency\", xlab = \"\")\ngrid()\nlegend(\"topleft\", col = 1:2, legend = colnames(dfmat_news_lsd), lty = 1, bg = \"white\")\n# plot overall sentiment (positive - negative) over time\n\nplot(dfmat_news_lsd$date, dfmat_news_lsd[,\"positive\"] - dfmat_news_lsd[,\"negative\"], \n type = \"l\", ylab = \"Sentiment\", xlab = \"\")\ngrid()\nabline(h = 0, lty = 2)\nnegative <- dfmat_news_lsd@x[1:121]\npositive <- dfmat_news_lsd@x[122:242]\ndate <- dfmat_news_lsd@Dimnames$docs\n\n\ntidy_sent <- as.data.frame(cbind(negative, positive, date))\n\ntidy_sent$negative <- as.numeric(tidy_sent$negative)\ntidy_sent$positive <- as.numeric(tidy_sent$positive)\ntidy_sent$sentiment <- tidy_sent$positive - tidy_sent$negative\ntidy_sent$date <- as.Date(tidy_sent$date)\ntidy_sent %>%\n ggplot() +\n geom_line(aes(date, sentiment))"},{"path":"exercise-2-dictionary-based-methods.html","id":"exercises-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.9 Exercises","text":"Take subset tweets data “user_name” names describe name newspaper source Twitter account. see different sentiment dynamics look different newspaper sources?Build (minimal) dictionary-based filter technique plot resultApply Lexicoder Sentiment Dictionary news tweets, break analysis newspaper","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"references-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.10 References","text":"","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"exercise-3-comparison-and-complexity","chapter":"19 Exercise 3: Comparison and complexity","heading":"19 Exercise 3: Comparison and complexity","text":"","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"introduction-2","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.1 Introduction","text":"hands-exercise week focuses : 1) comparing texts; 2) measuring document-level characteristics text—, complexity.tutorial, learn :Compare texts using character-based measures similarity distanceCompare texts using term-based measures similarity distanceCalculate complexity textsReplicate analyses Schoonvelde et al. 
(2019)","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"setup-8","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.2 Setup","text":"proceeding, ’ll load remaining packages need tutorial.example ’ll using data 2017-2018 Theresa May Cabinet UK. data tweets members cabinet.can load data follows.’re working document computer (“locally”) can download tweets data following way:see data contain three variables: “username,” username MP question; “tweet,” text given tweet, “date” days yyyy-mm-dd format.24 MPs whose tweets ’re examining.","code":"\nlibrary(readr) # more informative and easy way to import data\nlibrary(quanteda) # includes functions to implement Lexicoder\nlibrary(quanteda.textstats) # for estimating similarity and complexity measures\nlibrary(stringdist) # for basic character-based distance measures\nlibrary(dplyr) #for wrangling data\nlibrary(tibble) #for wrangling data\nlibrary(ggplot2) #for visualization\ntweets <- readRDS(\"data/comparison-complexity/cabinet_tweets.rds\")\ntweets <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/cabinet_tweets.rds?raw=true\")))\nhead(tweets)## # A tibble: 6 × 3\n## username tweet date \n## \n## 1 aluncairns \"A good luck message to Chris Coleman’s squad @FAWales a… 2017-10-09\n## 2 aluncairns \".@AlunCairns “The close relationship between industry a… 2017-10-09\n## 3 aluncairns \"@BarclaysCorp & @SPTS_Tech \\\"voice of Welsh Manufac… 2017-10-09\n## 4 aluncairns \"Today we announced plans to ban the sale of ivory in th… 2017-10-06\n## 5 aluncairns \"Unbeaten Wales overcome Georgia to boost their @FIFAWor… 2017-10-06\n## 6 aluncairns \".@GutoAberconwy marks 25 years of engine production @to… 2017-10-06\nunique(tweets$username)## [1] \"aluncairns\" \"amberrudduk\" \"andrealeadsom\" \"borisjohnson\" \n## [5] \"brandonlewis\" \"damiangreen\" \"damianhinds\" \"daviddavismp\" \n## [9] \"davidgauke\" \"davidmundelldct\" \"dlidington\" \"gavinwilliamson\"\n## [13] \"gregclarkmp\" \"jbrokenshire\" \"jeremy_hunt\" \"juliansmithuk\" \n## [17] \"justinegreening\" \"liamfox\" \"michaelgove\" \"pennymordaunt\" \n## [21] \"philiphammonduk\" \"sajidjavid\" \"theresa_may\" \"trussliz\"\nlength(unique(tweets$username))## [1] 24"},{"path":"exercise-3-comparison-and-complexity.html","id":"generate-document-feature-matrix","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.3 Generate document feature matrix","text":"order use quanteda package accompanying quanteda.textstats package, need reformat data quanteda “corpus” object. just need specify text ’re interested well associated document-level variables ’re interested.can follows.now ready reformat data document feature matrix.Note need tokenized corpus object first. can wrapping tokens function inside dfm() function .object? Well documents tweets. matrix sparse (.e., mostly zeroes) matrix 1s 0s whether given word appears document (tweet) question.vertical elements (columns) vector made words used tweets combined. 
","code":"\n#make corpus object, specifying tweet as text field\ntweets_corpus <- corpus(tweets, text_field = \"tweet\")\n\n#add in username document-level information\ndocvars(tweets_corpus, \"username\") <- tweets$username\n\ntweets_corpus## Corpus consisting of 10,321 documents and 2 docvars.\n## text1 :\n## \"A good luck message to Chris Coleman’s squad @FAWales ahead ...\"\n## \n## text2 :\n## \".@AlunCairns “The close relationship between industry and go...\"\n## \n## text3 :\n## \"@BarclaysCorp & @SPTS_Tech \"voice of Welsh Manufacturing...\"\n## \n## text4 :\n## \"Today we announced plans to ban the sale of ivory in the UK....\"\n## \n## text5 :\n## \"Unbeaten Wales overcome Georgia to boost their @FIFAWorldCup...\"\n## \n## text6 :\n## \".@GutoAberconwy marks 25 years of engine production @toyotaf...\"\n## \n## [ reached max_ndoc ... 10,315 more documents ]\ndfmat <- tokens(tweets_corpus, remove_punct = TRUE) %>%\n dfm() %>%\n dfm_remove(stopwords(\"english\"))\ndfmat## Document-feature matrix of: 10,321 documents, 26,956 features (99.95% sparse) and 2 docvars.\n## features\n## docs good luck message chris coleman’s squad @fawales ahead tonight’s crucial\n## text1 1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 0 0 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0 0\n## text4 0 0 0 0 0 0 0 0 0 0\n## text5 0 0 0 0 0 0 0 0 0 0\n## text6 0 0 0 0 0 0 0 0 0 0\n## [ reached max_ndoc ... 10,315 more documents, reached max_nfeat ... 26,946 more features ]"},{"path":"exercise-3-comparison-and-complexity.html","id":"compare-between-mps","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.4 Compare between MPs","text":"Now that the data are in this format, we are ready to compare the text produced by members of Theresa May’s Cabinet.Here’s an example of the correlations between the combined tweets of 5 of the MPs.Note that we’re using the dfm_group() function, which allows us to take a document feature matrix and make calculations while grouping by one of the document-level variables we specified above.There are many different measures of similarity, however, that we might think of using.In the below, we combine four different measures of similarity and see how they compare across MPs. 
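Before comparing measures, one quick way to read these correlations relative to the Prime Minister specifically is sketched below; it assumes the corrmat object estimated in the code for this section:

# convert the textstat_simil output to a plain matrix and pull out the
# column of correlations with Theresa May's grouped tweets
corr_to_may <- as.matrix(corrmat)[, "theresa_may"]

# sort from most to least similar; the top entry is Theresa May herself (= 1)
head(sort(corr_to_may, decreasing = TRUE))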
Note that we’re looking at the similarity between each MP’s tweets and the tweets of the Prime Minister, Theresa May.","code":"\ncorrmat <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = \"correlation\")\n\ncorrmat[1:5,1:5]## 5 x 5 Matrix of class \"dspMatrix\"\n## aluncairns amberrudduk andrealeadsom borisjohnson brandonlewis\n## aluncairns 1.0000000 0.3610579 0.4717627 0.4137785 0.4815319\n## amberrudduk 0.3610579 1.0000000 0.4746674 0.4657415 0.5866139\n## andrealeadsom 0.4717627 0.4746674 1.0000000 0.5605795 0.6905958\n## borisjohnson 0.4137785 0.4657415 0.5605795 1.0000000 0.6685258\n## brandonlewis 0.4815319 0.5866139 0.6905958 0.6685258 1.0000000"},{"path":"exercise-3-comparison-and-complexity.html","id":"compare-between-measures","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.5 Compare between measures","text":"Let’s see what this looks like for one of these measures—cosine similarity.We first get the similarities between the text of each MP’s tweets and every other MP’s tweets.But remember that we are only interested in the comparison with what Theresa May is saying.So we need to take these cosine similarities and retain only the similarity measures corresponding to the text of Theresa May’s tweets.We first convert the textstat_simil() output to a matrix.We can see that the 23rd row of this matrix contains the similarity measures for the Theresa May tweets.We take this row, removing the similarity of Theresa May with herself (which is always = 1), and convert it to a data frame object.We then rename the cosine similarity column with an appropriate name, and convert the row names into a column variable whose cells contain information on the MP to which each cosine similarity measure refers.With the data in tidy format like this, we can plot it as below.Combining all of these steps into a single loop, we can see how the different similarity measures of interest compare.","code":"\n#estimate similarity, grouping by username\n\ncos_sim <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = \"cosine\") #specify method here as character object\ncosmat <- as.matrix(cos_sim) #convert to a matrix\n#generate data frame keeping only the row for Theresa May\ncosmatdf <- as.data.frame(cosmat[23, c(1:22, 24)])\n#rename column\ncolnames(cosmatdf) <- \"corr_may\"\n \n#create column variable from rownames\ncosmatdf <- tibble::rownames_to_column(cosmatdf, \"username\")\nggplot(cosmatdf) +\n geom_point(aes(x=reorder(username, -corr_may), y= corr_may)) + \n coord_flip() +\n xlab(\"MP username\") +\n ylab(\"Cosine similarity score\") + \n theme_minimal()\n#specify different similarity measures to explore\nmethods <- c(\"correlation\", \"cosine\", \"dice\", \"edice\")\n\n#create empty dataframe\ntestdf_all <- data.frame()\n\n#gen for loop across methods types\nfor (i in seq_along(methods)) {\n \n #pass method to character string object\n sim_method <- methods[[i]]\n \n #estimate similarity, grouping by username\n test <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = sim_method) #specify method here as character object created above\n \n testm <- as.matrix(test) #convert to a matrix\n \n #generate data frame keeping only the row for Theresa May\n testdf <- as.data.frame(testm[23, c(1:22, 24)])\n \n #rename column\n colnames(testdf) <- \"corr_may\"\n \n #create column variable from rownames\n testdf <- tibble::rownames_to_column(testdf, \"username\")\n \n #record method in new column variable\n testdf$method <- sim_method\n\n #bind all together\n testdf_all <- rbind(testdf_all, testdf) \n \n}\n\n#create variable (for viz only) that is mean of similarity scores for each MP\ntestdf_all <- testdf_all %>%\n group_by(username) %>%\n mutate(mean_sim = mean(corr_may))\n\nggplot(testdf_all) +\n geom_point(aes(x=reorder(username, -mean_sim), y= corr_may, color = 
method)) + \n coord_flip() +\n xlab(\"MP username\") +\n ylab(\"Similarity score\") + \n theme_minimal()"},{"path":"exercise-3-comparison-and-complexity.html","id":"complexity-1","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.6 Complexity","text":"now move document-level measures text characteristics. focus paper Schoonvelde et al. (2019).using subset data, taken EU speeches given four politicians. provided authors https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/S4IZ8K.can load data follows.’re working document computer (“locally”) can download tweets data following way:can take look data contains .data contain speeches four different politicians, positioned different points liberal-conservative scale.can calculate Flesch-Kincaid readability/complexity score quanteda.textstats package like .want information aggregated politicians: Gordon Brown, Jose Zapatero”, David Cameron, Mariano Rajoy. recorded data column called “speaker.”gives us data tidy format looks like .can plot—see results look like Figure 1 published article Schoonvelde et al. (2019).","code":"\nspeeches <- readRDS(\"data/comparison-complexity/speeches.rds\")\nspeeches <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/speeches.rds?raw=true\")))\nhead(speeches)## speaker\n## 1 J.L.R. Zapatero\n## 2 J.L.R. Zapatero\n## 3 J.L.R. Zapatero\n## 4 J.L.R. Zapatero\n## 5 J.L.R. Zapatero\n## 6 J.L.R. Zapatero\n## text\n## 1 Dear friends, good morning to you all, both men and women. Thank you very much, Enriqueta, for you work, for your attitude and for your e-mails, for your messages of affection -some of them, so beautiful- and for making the Federation of Progressit Women possible, creating it and providing it with dignity. Thanks to all those women who form part of this Federation for their work, for their spirit and for their temperament. I haven't heard anything resembling a cry or an insult in all the voices that have spoken out from this tribune, because we are talking about equality and equality is the deepest expression of the dignity of men, of the rights of men, of citizens; it is not compatible with looking down on anyone, with shouting or insulting. Equality and dignity are a form of respect. Thanks for working with respect. Thanks to the Federation of Progressist Women. I have been awarded this prize - and I am going to use the stereotype - and apart from picking it up with affection, which I do, I pick up interpreting that it is a prize for most of the Spanish society, which has made it possible for us to do what we can do from the Government; or better, to do what must be done from the Government. The Spanish society is not afraid of equality. The Spanish society defends and believes in equality, in the equality between men and women, of course; in the equality that still has a long way to go. Thus, for me, this prize represents a firm, committed souvenir of the way we still have to go, rather than the recognition to the task of a Government, so that not a single woman is dominated, discriminated, mistreated, forgotten in this society. And there are still too many. I would like to tell you that the equality and the dignity of men and women is the great motor, the great horizon of my political project. 
I can assure that there are persons, women and, fortunately enough, more men each time, who are going to keep moving on in favour of equality and dignity, and I can assure that I am going to lead those men and women in this country who want more equality and more rights and freedoms for women and for all citizens. You can trust my commitment. This prize has a great value for it was awarded by the associational movement, which I always try to praise and value and to which I try to show my gratitude; an associational movement that has taught and is still teaching us a lesson, that has spotted the problem, that has called the attention of long-forgotten problems and that has lent its voice to those who had been deprived of it, to those who had been silenced. Thanks to the associational movement, to all the organisations that step by step, with their constant effort, give our society a horizon with more rights, with more freedoms and with more equality. I have said it more than once and I must repeat it again here: I am a convinced feminist and I am proud of it. And I would like to tell you something that is very important: I think I have passed this spirit on to the Vice President, to practically the whole Government and to practically the whole Socialist Party. I can tell you that your twenty years of work in favour of the equality of rights and opportunities have been worth it. You may feel proud of yourselves. You have broken conventions and stereotypes that, as Pedro said when he referred to his career from his childhood and to his mother, when one looks back and reflects upon the meaning of the treatment, the consideration and the dominance exerted upon women for centuries in our society, one is tempted to state - and I think that it is a fair thing to do - that we cannot feel proud of History and we cannot feel proud of the history of civilisation, because in most societies it is blotted by the fact that it has dominated, forgotten, marginalized and discriminated women. The 20th century was basically the first time in History when the dawn of equality brought hope to the 21st century, not only to the Spanish society, but to all societies, and there are many societies characterised by painful, intolerable and unbearable marginalization and discrimination and this must become a turning point in History. I would also like to congratulate all those who have received the prize. Congratulations to Mr. Arsenio Escolar, director of \"\"20 Minutos\"\". Life always forces us to chose, to decide among different dilemmas and you chose to support the dignity of women when you had to choose among publicity benefits and the dignity of women, taking out those short ads referring to prostitution. This honours you and this honours us. Thank you very much, Arsenio. Ms. Paz Hernandez Felgueroso, Ms. Maria Jose Ramos and Ms. Begona Fernandez, with their Casa Malva, a welcoming, nice name, have given an exemplary public answer to a social blot and its consequences, namely, gender violence. Congratulations on this deserved prize and I wish you will keep on going, with courage, working in that direction, so that there may be each time more centres, more homes like the Casa Malva, in order to make us think about who need investments and public resources and also think how can a society be dignified, a common task, such as the task that leaders carry out in a democracy. A Casa Malva to restore the dignity of mistreated women. 
It has been mentioned here that we have observed the compromise of making a first law on gender violence in this Legislature. This was the first one elaborated and taken to the Parliament by the Government; a law that is the beginning; a law without which we would not have had any hope and a law that is not going to do away with the blot of criminal male chauvinism on its own. But this Law must be added our will and yours; this Law must be added measures, resources and a calling to the general awareness, which I would like to repeat today, from all the public operators, from all the public Administrations of Justice, of the State Security Forces and Bodies, and from all the social, support network, so that they may know and remember that the Government is asking them, as their first duty as public servers, to pay attention and be near those women who might suffer or have already suffered gender violence. You know that our Government has placed equality at the first position among the values that any country should defend; an equality that can only be efficient and real if it is based upon rights and rights are included in Laws. Thus, the Law on equality has opened up a space at work, harmonising work and family life, in the public arena, and also in the big companies, in the area of economic power, a determinant, decisive space for equality. I am very sorry that this Law has not been supported by all the political parties and I am even sorrier about the fact that it has been appealed before the Constitutional Court. It is not possible to appeal equality between men and women, no amendment is possible; no amendment or petition. Pedro, I start with you now. You know I admire you and I can tell you, because people sometimes say things about you, as President of the Government of Spain, that this country is proud of you. What else could a country offer to the rest of the world? Its culture and the different artistic creations. It is there that talent, creativity, the seed of freedom and the values of equality lie. It is there. It is there that the origin lies and having a well-known, internationally praised director is one of the best occasions to feel a patriot, to feel Spanish in the broad sense of the word.Thank you, Pedro. The whole work of Pedro, especially as far as his female characters are concerned, is a move in favour of equality, because his female characters break the conventions and the stereotypes, and they are a great proof of the hard, everyday life of working women, who devote all their efforts to a family that does not thank them for it or recognise their work. He shows us the profiles of that extraordinary strength that Pedro was mentioning a while ago, with great realism; strength, mainly in sweetness, mainly in love and mainly in courage, because only courage makes it possible to attain equality, that type of equality that is present in all your films, defending women in your films. Yet, women are the best characters in your films, in my opinion, Pedro, the most solid ones, and sometimes the most tormented and the best defined ones. You have been able to outline them in your films and several generations of Spanish women have seen themselves reflected in your films. We saw this when we were having a look at some of the photographs. 
I would also like to praise, through this prize, in a special way, in a very special way, all those generations of Spanish women who have had to live without being able to speak up, without being able to study, to ask; having to remain silent, to obey; having to assume that they were different and inferior. All that generation of women who left many things behind in their lives, because they were not allowed to have a life; all those women deserve my deepest praise, those Spanish women who have not been able to live in freedom. I would like to conclude with two ideas: one has to do with democracy and politics, and the other one with emotions. I will start with politics and democracy. When one arrives enters power -I am not going to comment on those matters pointed out by Pedro Almodovar, for as you know I am not a specialist in that- one contemplates, knows the social reality even better than before being in power, of course. The conclusion I would like to express, the one I have always expressed and I will always keep expressing from my experience as President of the Government. I am convinced that in those countries where there is equality between men and women there is more freedom, more life, more creativity, more respect and more democracy; that wherever there is more power - and let's analyse and see through the eyes of a social reality - wherever there is more power in the hardest sense of the word, and also in the less democratic sense of the word, there are less women. Therefore, for a progressist, for someone who believes in changes, in reform, in transformation and in equality, changing a society implies having more women in those places in which they were not in the past and it also implies that men will have to assume that they do it just like us and, in most cases, better than us. The second thing I would like to say, just to conclude, has to do with emotions, with the field of emotions. This celebration is for me one of the most dearly ones that I have attended so far as President of the Government. Nothing excites me more than contemplating a country with a clear horizon in favour of freedom and equality. Nothing excites me more than being able to contribute with my grain of sand or with many grains of sand so that every day in Spain women may have more freedom, dignity and equality. Today is a good day to say this. I feel very proud of being able to lead the values that you represent. Thank you very much.\n## 2 Honourable President, Honourable Deputies, I want to start with a moving mention to the six \"\"Blue Helmets\"\", soldiers of the Spanish army who died tragically last Sunday, 24 th of June, in Southern Lebanon, and transmit to their families our deepest condolences for their irreparable loss. The Minister of Defence will attend, at his own request, the corresponding Commission of these Chambers in order to give a detailed explanation on the research that is currently being carried out concerning the circumstances and consequences of the attack. I know that I word the feelings of Your Honours by expressing in this very moment the recognition and support for the valuable, heroic task of our contingent in Southern Lebanon. And I also know that I word the feelings of the Spanish citizens by stating that Manuel David Portas, Jonathan Galea, Jefferson Vargas, Yeison Alejandro Castaño, Yhon Edisson Posada and Juan Carlos Villoria, either born in Spain or in Colombia, will always be one of us and will always be with us. 
They lived together, patrolled together and gave their lives together for the same cause. Their families will always feel the encouraging support of the Spanish society, the support of the institutions and the proximity of many citizens that share their immense pain today. Their cause was the cause of peace and solidarity. The cause for which our Army and our Civil Guard are there, in Lebanon, with a triple backing: legal, politic and moral. They are there following the explicit demand contained in Resolution 1071 of the United Nations Security Council; they are there with the support of all the Parliamentary Groups of these Chambers, at the suggestion of the Government, as expressed last September 2006; they are there with the moral urge of contributing to maintaining the cease-fire in a very dangerous zone, helping the locals dismantle the mines, and backing the reconstruction tasks; but they are there, mainly, on a peace operation of the United Nations, contributing with their efforts and sacrifice and even giving their lives in order to help to establish stability in an area where most of the events that place world peace at risk are elucidated. Thus, apart from us, the Heads of State and Government and the heads of international organisms, and in particular, the General Secretary of the United Nations, Mr. Ban-Ki-Moon have also paid homage to them. We have paid a very high price, but our commitment with peace in the Middle East will not be altered, nor shall we stop offering our support to the United Nations as the main factor in order to reach peace. We will also persist in our determination so that those who are guilty for this deadly blast assume it and pay for their felony and, of course, so that they never attain their aims. Honourable President, Honourable Deputies, I will now go on to analyse de results of the European Council held in Brussels last 21 st and 22 nd of June, this year, which makes me feel great satisfaction. We had gone through a two-year blocking, which on many occasions revealed itself as a form of paralysis, and we risked to continue in this situation, damaging the very consistency of the European project. We could have got lost in the inextricable labyrinth of the particular demands of twenty seven States and we could have been defeated by the temptation to postpone the progression to a further attempt. Nothing of this would have been useful for the European Union. Yet, Honourable Deputies, we have made it. We have an agreement that will reactivate the process of European Integration. It is a huge step forward for Europe and a good move for Spain. We, the Heads of State and Government, have agreed upon an order to carry out an Intergovernmental Conference that will adopt a new Treaty for the Reform of the European Union. This order has an extraordinary political meaning, because it develops all the relevant aspects of the future Reform Treaty. Thus, the commitment we have reached represents in fact the future Reform Treaty. Therefore, this commitment represents an agreement de facto and an initial agreement that affects both the form and the content of the new Treaty. It has been, Honourable Deputies, a long, complex negotiation. It was not easy to reach an agreement after such an important, varied political breach among the member States. In some cases, such as in our case, the Governments had received a clear order from the people and from the parliament in favour of the text of the Constitutional Treaty. 
In some other cases, such as in the case of France and Holland, the citizens had positioned themselves against it. Two years and deep reflection, comprehension and political willpower have been necessary in order to overcome this situation. The Government pointed out from the very first moment that its main aim in the negotiation was to take Europe out from the state of stagnation into which it had fallen; of course, preserving the essential contents and the balance of the Constitutional Treaty, at all times. We believed that Europe needed a solution as soon as possible and that this European Council was the chance to get it. And we have made proposals; we have been active, available and we have worked for it. We offered full support to the German Presidency and we backed its efforts through direct contacts with the member States that posed greater difficulties. We explained clearly and in due time the main points of our position and we pointed out the limits that would never be given up. In that framework we proved ourselves flexible enough to understand and add the coherent proposals to an adequate solution agreed upon through consensus. Thus, we increased confidence in the relationship with our partners. All this has been essential for these days' negotiation in Spain to contribute directly to fix the terms of the agreement. Honourable Deputies, the success of the European Council is our own success. We all had risked a lot in this negotiation. It is a success for Europe and for us as European citizens. It is a success for Spain and for the interests of Spain. All the contents of the Constitutional Treaty that we considered essential are included in the new Treaty. This means exactly that the most efficient and democratic Europe the Spanish voted for at the referendum will soon become a reality as soon as the new text of the Treaty comes into force. It is true that in order to achieve this agreement we have had to make concessions too. Spain would have preferred to have come further, with a single Treaty simplifying the European legislation, keeping the term \"\"Constitution\"\" and the reference to the symbols of the Union. Those seemed to us positive contributions, but we also knew that those were not the substantial aspects of the Treaty. It was not those aspects that placed the future of Europe at risk. Thus, it was decided that if this terminology posed difficulties to the other States concerning the agreement, we could eventually accept its modification. The final result is an excellent one. If the previous Treaty was said not to be a proper Constitution, the new one will doubtlessly have to be recognized as much more than a Treaty, from a political point of view. It is a project with a foundational character, a Treaty for the new Europe. The new Treaty establishes in a clear way the binding juridical value of the Charter of Fundamental Rights and Duties. This recognition is essential in order to bring into force our shared value system. Besides, the Treaty introduces a substantial advance for the efficient functioning of the European Union. The subjects that may be voted by qualified majority will increase from 36 to 85, which sets a significant limit to the principle of unanimity that slows down or blocks decision-making in Europe so many times. Once it comes into force, the qualified majority will be the regime applicable to other delicate questions for Spain such as immigration, energy and cooperation in the areas of justice and internal affairs. 
These areas have a great potential inside the European Union and they require a more agile system in order to be developed. Our citizens, the Spanish citizens, will be the first to experience the benefits of these measures. As the Honourable Deputies already know, the definition of the qualified majority voting system has been one of the most discussed questions at this Council. We finally have reached an agreement which consists in keeping the existing system until the 1 st of November 2014, with a further transition period until the 31 st of March 2017, during which the blocking minorities may constitute themselves either upon the basis of the system in force, the one known as \"\"Nice system\"\", or upon the basis of the double majority system, following the decision of the interested States. Both in the case of the existing system and in the case of the double majority system, Spain has an adequate representation, according to its population; but Spain wishes to have a superior influence, as compared to its own number of votes or to what its inhabitants represent, since it knows, by its own experience, that real power in the Union does not depend on more or less votes, but on the capacity of the member States to generate confidence, attract involvement, make alliances and defend its national positions from a European perspective. Against the option of the blocking minorities, the Treaty proposes reinforced cooperation and establishes that these may be promoted by nine States minimum. This is also especially relevant for a country such as Spain, which wants to be at the avant-garde of the integration process in almost all the fields of action of the Union. And there is something more: before the end of October this year, a proposal on the new composition of the European Parliament will have to be put forward, and with regards to this proposal, it guarantees an increase in the number of seats corresponding to Spain during the elections to the Parliament that will be celebrated in 200 As far as the institutional field is concerned, with the creation of the new figures, namely, the President of the Council of the European Union and High Representative for the Common Foreign and Security Policy, Europe will progressively reinforce its efficiency, its visibility and its importance as an authentic European Government. Thanks to these figures it will be easier to identify the personality of the Union and speak on its behalf with a single voice in the international area. It is a very important step in the process of political integration in the European Union, which will give institutional coherence to the functioning of the Council and to the direction and development of the Common Foreign and Security Policy, besides, thanks to the Treaty it will also have an external European service so that it can enter into force. Moreover, the Treaty is a great advance as far as the creation of a Space for Freedom, Security and Justice is concerned, these depend completely on the qualified majority after the introduction in this category of the areas of police and criminal cooperation. These are very good news for Spain and promoting such policies at a European level is a reward to our efforts, and it is also a very significant change for our citizens, as it reinforces the protection of their interests and of their security. With this new framework for actuation, the European policy on immigration promoted by the Spanish Government will be more efficient from the perspective of the European Union. 
Besides, as far as another question of strategic importance for Spanish interests is concerned, the Treaty makes specific reference to the promotion of energetic interconnections among the member States, which, as the Honourable Deputies know, is an essential landmark for the security and development of our energetic policy. The Union recognizes that the principle of energetic solidarity can not be understood in Europe without the development of interconnections. This means, doubtlessly, a great support for the achievement of such interconnections, which are vital for our energetic system. Spain has also been able to keep, in the new text, the improvements established by the Constitutional Treaty with regards to a question that is rather delicate in the case of our country, that concerning the Statute of Ultra-peripheral Regions. At the same time, the Treaty reinforces the role of the national Parliaments, by increasing their capacity to intervene in the European legislative process whenever a simple majority of the votes attributed to those national Parliaments deems that the project put forward does not respect the principle of subsidiarity. Honourable Deputies, I believe that we can be really satisfied with these results. We have not left out any substantial point of the Constitutional Treaty and we have obtained some positive changes for Spain. As far as Spain, this Council has been a reinforcement of our position in Europe. We have worked in cooperation and in harmony with the German Presidency, and I would like to congratulate them once more, from here, on the success achieved thanks to this agreement; the political determination of the German Presidency has doubtlessly been essential for the command that the European Council has given to the Intergovernmental Conference. We have kept a close contact with France, which is the country with which we presented a common proposal a few hours before the meeting of the European Council. And I can tell you that the coordination of our positions and the common mediation have been very useful for the German Presidency. Similarly, we have been working with Italy, Belgium and Luxembourg in order to defend those parts of the Constitutional Treaty that we considered essential. Spain has acted in favour of stability and agreement. It has generated confidence during the whole negotiation and with this attitude we have been able to impulse the defence of the contents and the ambition of a new Treaty. Portugal , the country that will occupy the European Presidency during the next semester, will have our full support during the Intergovernmental Conference. I am convinced that we will have a new Treaty this same year and that its ratification process will take place without any further difficulties. Honourable Deputies, Even though this negotiation about the new Treaty has been the centre of attention of the debates of the Council, during this Council other conclusions about other matters have also been approved of. As you will see, these matters, which I will briefly refer to next, are also of importance for Spain. The Council went on dealing with European immigration policies, following the proposals of Spain. 
Thus, it stated the need to develop further on the actions in Africa and in the Mediterranean region, signing new Mobility Agreements with the Countries of origin and with the Countries of passage; it congratulated itself on the achievement of agreements for the creation of quick intervention teams and a network of coastal patrols, and it decided to keep reinforcing the capacity of the European Exterior Frontiers Agency. Besides, the Council reaffirmed the importance of the fact that a good management of legal immigration may contribute to dissuade illegal migration flows, and it developed some aspects of the application of this European policy on immigration in the Eastern and South-eastern frontiers of the Union. As far as economic, social and environmental policies are concerned, the Council paid attention to the progress made and to the projects that are currently being carried out with regards to matters such as joint technological initiatives or the European Institute of Technology; it repeated the importance of moving forward towards a European, efficient and sustainable transport, and it also encouraged the work on coordination of the social security systems and on the application of the Action Plan against AIDS. Finally, the conclusions of the Council also focus on the European neighbourhood policy, the strategy of the European Union for a new association with Central Asia and the dialogue process with the so-called emerging economies. Similarly, the European Council celebrated the fact that Cyprus and Malt are in condition to adopt the euro by next 1 st of January 200 Honourable Deputies, These have been the main contents of the European Council that has given us back the image of the Europe we want, the Europe in which we believe and for which we have been working so far: a Europe full of ambition and built upon consensus. Spain and Europe have come out of this process even stronger. We were the first to ratify, by referendum, the Constitutional Treaty. In so doing, we reinforced it so that it could survive in essence against the difficulties. We have now contributed in a decisive way to lay the foundations for the agreement and we have proved our solidarity throughout the negotiation. Spain is perceived at a European level as a member State that transmits stability and confidence, and assumes its responsibilities when Europe requires it. This is how we are perceived, this is how we are needed and this is how we are recognized. It is for this reason that we should feel reasonably satisfied and proud of our contribution and, also, and mainly, because Europe has achieved an agreement that will be applied and it will thus bring about a more democratic, efficient functioning of the Union, which is, no doubt, what most of the Spanish and what most of the European citizens want. Thank you very much.\n## 3 Honourable President, Honourable Deputies, I want to start with a moving mention to the six \"\"Blue Helmets\"\", soldiers of the Spanish army who died tragically last Sunday, 24th of June, in Southern Lebanon, and transmit to their families our deepest condolences for their irreparable loss. The Minister of Defence will attend, at his own request, the corresponding Commission of these Chambers in order to give a detailed explanation on the research that is currently being carried out concerning the circumstances and consequences of the attack. 
I know that I word the feelings of Your Honours by expressing in this very moment the recognition and support for the valuable, heroic task of our contingent in Southern Lebanon. And I also know that I word the feelings of the Spanish citizens by stating that Manuel David Portas, Jonathan Galea, Jefferson Vargas, Yeison Alejandro Castaño, Yhon Edisson Posada and Juan Carlos Villoria, either born in Spain or in Colombia, will always be one of us and will always be with us. They lived together, patrolled together and gave their lives together for the same cause. Their families will always feel the encouraging support of the Spanish society, the support of the institutions and the proximity of many citizens that share their immense pain today. Their cause was the cause of peace and solidarity. The cause for which our Army and our Civil Guard are there, in Lebanon, with a triple backing: legal, politic and moral. They are there following the explicit demand contained in Resolution 1071 of the United Nations Security Council; they are there with the support of all the Parliamentary Groups of these Chambers, at the suggestion of the Government, as expressed last September 2006; they are there with the moral urge of contributing to maintaining the cease-fire in a very dangerous zone, helping the locals dismantle the mines, and backing the reconstruction tasks; but they are there, mainly, on a peace operation of the United Nations, contributing with their efforts and sacrifice and even giving their lives in order to help to establish stability in an area where most of the events that place world peace at risk are elucidated. Thus, apart from us, the Heads of State and Government and the heads of international organisms, and in particular, the General Secretary of the United Nations, Mr. Ban-Ki-Moon have also paid homage to them. We have paid a very high price, but our commitment with peace in the Middle East will not be altered, nor shall we stop offering our support to the United Nations as the main factor in order to reach peace. We will also persist in our determination so that those who are guilty for this deadly blast assume it and pay for their felony and, of course, so that they never attain their aims. Honourable President, Honourable Deputies, I will now go on to analyse de results of the European Council held in Brussels last 21st and 22nd of June, this year, which makes me feel great satisfaction. We had gone through a two-year blocking, which on many occasions revealed itself as a form of paralysis, and we risked to continue in this situation, damaging the very consistency of the European project. We could have got lost in the inextricable labyrinth of the particular demands of twenty seven States and we could have been defeated by the temptation to postpone the progression to a further attempt. Nothing of this would have been useful for the European Union. Yet, Honourable Deputies, we have made it. We have an agreement that will reactivate the process of European Integration. It is a huge step forward for Europe and a good move for Spain. We, the Heads of State and Government, have agreed upon an order to carry out an Intergovernmental Conference that will adopt a new Treaty for the Reform of the European Union. This order has an extraordinary political meaning, because it develops all the relevant aspects of the future Reform Treaty. Thus, the commitment we have reached represents in fact the future Reform Treaty. 
Therefore, this commitment represents an agreement de facto and an initial agreement that affects both the form and the content of the new Treaty. It has been, Honourable Deputies, a long, complex negotiation. It was not easy to reach an agreement after such an important, varied political breach among the member States. In some cases, such as in our case, the Governments had received a clear order from the people and from the parliament in favour of the text of the Constitutional Treaty. In some other cases, such as in the case of France and Holland, the citizens had positioned themselves against it. Two years and deep reflection, comprehension and political willpower have been necessary in order to overcome this situation. The Government pointed out from the very first moment that its main aim in the negotiation was to take Europe out from the state of stagnation into which it had fallen; of course, preserving the essential contents and the balance of the Constitutional Treaty, at all times. We believed that Europe needed a solution as soon as possible and that this European Council was the chance to get it. And we have made proposals; we have been active, available and we have worked for it. We offered full support to the German Presidency and we backed its efforts through direct contacts with the member States that posed greater difficulties. We explained clearly and in due time the main points of our position and we pointed out the limits that would never be given up. In that framework we proved ourselves flexible enough to understand and add the coherent proposals to an adequate solution agreed upon through consensus. Thus, we increased confidence in the relationship with our partners. All this has been essential for these days' negotiation in Spain to contribute directly to fix the terms of the agreement. Honourable Deputies, the success of the European Council is our own success. We all had risked a lot in this negotiation. It is a success for Europe and for us as European citizens. It is a success for Spain and for the interests of Spain. All the contents of the Constitutional Treaty that we considered essential are included in the new Treaty. This means exactly that the most efficient and democratic Europe the Spanish voted for at the referendum will soon become a reality as soon as the new text of the Treaty comes into force. It is true that in order to achieve this agreement we have had to make concessions too. Spain would have preferred to have come further, with a single Treaty simplifying the European legislation, keeping the term \"\"Constitution\"\" and the reference to the symbols of the Union. Those seemed to us positive contributions, but we also knew that those were not the substantial aspects of the Treaty. It was not those aspects that placed the future of Europe at risk. Thus, it was decided that if this terminology posed difficulties to the other States concerning the agreement, we could eventually accept its modification. The final result is an excellent one. If the previous Treaty was said not to be a proper Constitution, the new one will doubtlessly have to be recognized as much more than a Treaty, from a political point of view. It is a project with a foundational character, a Treaty for the new Europe. The new Treaty establishes in a clear way the binding juridical value of the Charter of Fundamental Rights and Duties. This recognition is essential in order to bring into force our shared value system. 
Besides, the Treaty introduces a substantial advance for the efficient functioning of the European Union. The subjects that may be voted by qualified majority will increase from 36 to 85, which sets a significant limit to the principle of unanimity that slows down or blocks decision-making in Europe so many times. Once it comes into force, the qualified majority will be the regime applicable to other delicate questions for Spain such as immigration, energy and cooperation in the areas of justice and internal affairs. These areas have a great potential inside the European Union and they require a more agile system in order to be developed. Our citizens, the Spanish citizens, will be the first to experience the benefits of these measures. As the Honourable Deputies already know, the definition of the qualified majority voting system has been one of the most discussed questions at this Council. We finally have reached an agreement which consists in keeping the existing system until the 1st of November 2014, with a further transition period until the 31st of March 2017, during which the blocking minorities may constitute themselves either upon the basis of the system in force, the one known as \"\"Nice system\"\", or upon the basis of the double majority system, following the decision of the interested States. Both in the case of the existing system and in the case of the double majority system, Spain has an adequate representation, according to its population; but Spain wishes to have a superior influence, as compared to its own number of votes or to what its inhabitants represent, since it knows, by its own experience, that real power in the Union does not depend on more or less votes, but on the capacity of the member States to generate confidence, attract involvement, make alliances and defend its national positions from a European perspective. Against the option of the blocking minorities, the Treaty proposes reinforced cooperation and establishes that these may be promoted by nine States minimum. This is also especially relevant for a country such as Spain, which wants to be at the avant-garde of the integration process in almost all the fields of action of the Union. And there is something more: before the end of October this year, a proposal on the new composition of the European Parliament will have to be put forward, and with regards to this proposal, it guarantees an increase in the number of seats corresponding to Spain during the elections to the Parliament that will be celebrated in 200 As far as the institutional field is concerned, with the creation of the new figures, namely, the President of the Council of the European Union and High Representative for the Common Foreign and Security Policy, Europe will progressively reinforce its efficiency, its visibility and its importance as an authentic European Government. Thanks to these figures it will be easier to identify the personality of the Union and speak on its behalf with a single voice in the international area. It is a very important step in the process of political integration in the European Union, which will give institutional coherence to the functioning of the Council and to the direction and development of the Common Foreign and Security Policy, besides, thanks to the Treaty it will also have an external European service so that it can enter into force. 
Moreover, the Treaty is a great advance as far as the creation of a Space for Freedom, Security and Justice is concerned, these depend completely on the qualified majority after the introduction in this category of the areas of police and criminal cooperation. These are very good news for Spain and promoting such policies at a European level is a reward to our efforts, and it is also a very significant change for our citizens, as it reinforces the protection of their interests and of their security. With this new framework for actuation, the European policy on immigration promoted by the Spanish Government will be more efficient from the perspective of the European Union. Besides, as far as another question of strategic importance for Spanish interests is concerned, the Treaty makes specific reference to the promotion of energetic interconnections among the member States, which, as the Honourable Deputies know, is an essential landmark for the security and development of our energetic policy. The Union recognizes that the principle of energetic solidarity can not be understood in Europe without the development of interconnections. This means, doubtlessly, a great support for the achievement of such interconnections, which are vital for our energetic system. Spain has also been able to keep, in the new text, the improvements established by the Constitutional Treaty with regards to a question that is rather delicate in the case of our country, that concerning the Statute of Ultra-peripheral Regions. At the same time, the Treaty reinforces the role of the national Parliaments, by increasing their capacity to intervene in the European legislative process whenever a simple majority of the votes attributed to those national Parliaments deems that the project put forward does not respect the principle of subsidiarity. Honourable Deputies, I believe that we can be really satisfied with these results. We have not left out any substantial point of the Constitutional Treaty and we have obtained some positive changes for Spain. As far as Spain, this Council has been a reinforcement of our position in Europe. We have worked in cooperation and in harmony with the German Presidency, and I would like to congratulate them once more, from here, on the success achieved thanks to this agreement; the political determination of the German Presidency has doubtlessly been essential for the command that the European Council has given to the Intergovernmental Conference. We have kept a close contact with France, which is the country with which we presented a common proposal a few hours before the meeting of the European Council. And I can tell you that the coordination of our positions and the common mediation have been very useful for the German Presidency. Similarly, we have been working with Italy, Belgium and Luxembourg in order to defend those parts of the Constitutional Treaty that we considered essential. Spain has acted in favour of stability and agreement. It has generated confidence during the whole negotiation and with this attitude we have been able to impulse the defence of the contents and the ambition of a new Treaty. Portugal, the country that will occupy the European Presidency during the next semester, will have our full support during the Intergovernmental Conference. I am convinced that we will have a new Treaty this same year and that its ratification process will take place without any further difficulties. 
Honourable Deputies, Even though this negotiation about the new Treaty has been the centre of attention of the debates of the Council, during this Council other conclusions about other matters have also been approved of. As you will see, these matters, which I will briefly refer to next, are also of importance for Spain. The Council went on dealing with European immigration policies, following the proposals of Spain. Thus, it stated the need to develop further on the actions in Africa and in the Mediterranean region, signing new Mobility Agreements with the Countries of origin and with the Countries of passage; it congratulated itself on the achievement of agreements for the creation of quick intervention teams and a network of coastal patrols, and it decided to keep reinforcing the capacity of the European Exterior Frontiers Agency. Besides, the Council reaffirmed the importance of the fact that a good management of legal immigration may contribute to dissuade illegal migration flows, and it developed some aspects of the application of this European policy on immigration in the Eastern and South-eastern frontiers of the Union. As far as economic, social and environmental policies are concerned, the Council paid attention to the progress made and to the projects that are currently being carried out with regards to matters such as joint technological initiatives or the European Institute of Technology; it repeated the importance of moving forward towards a European, efficient and sustainable transport, and it also encouraged the work on coordination of the social security systems and on the application of the Action Plan against AIDS. Finally, the conclusions of the Council also focus on the European neighbourhood policy, the strategy of the European Union for a new association with Central Asia and the dialogue process with the so-called emerging economies. Similarly, the European Council celebrated the fact that Cyprus and Malt are in condition to adopt the euro by next 1st of January 200 Honourable Deputies, These have been the main contents of the European Council that has given us back the image of the Europe we want, the Europe in which we believe and for which we have been working so far: a Europe full of ambition and built upon consensus. Spain and Europe have come out of this process even stronger. We were the first to ratify, by referendum, the Constitutional Treaty. In so doing, we reinforced it so that it could survive in essence against the difficulties. We have now contributed in a decisive way to lay the foundations for the agreement and we have proved our solidarity throughout the negotiation. Spain is perceived at a European level as a member State that transmits stability and confidence, and assumes its responsibilities when Europe requires it. This is how we are perceived, this is how we are needed and this is how we are recognized. It is for this reason that we should feel reasonably satisfied and proud of our contribution and, also, and mainly, because Europe has achieved an agreement that will be applied and it will thus bring about a more democratic, efficient functioning of the Union, which is, no doubt, what most of the Spanish and what most of the European citizens want. Thank you very much.\n## 4 President .- Good morning. Thank you for attending this press conference. I hope you all have had time to rest. 
In the first place, I would like to say that today is a good day for Europe and I am satisfied as we have attained a very important agreement in order to modify in a substantial way the operation of the European Union, in order to make it more efficient so as to provide an answer to the social problems and to the problems of the European citizens. As you know, achieving this agreement was a difficult challenge after the process that we had gone through as a consequence of the referendums in France and in the Netherlands, and we have made it. We all have been willing and we all have committed in order to get the European Union going again, in order to complete a new stage, in order to make it move in the right direction in this new stage, so that it may achieve an each time more perfect, efficient and useful political union. Thus, I would like to express the satisfaction of the Government of Spain, of a Europeist country, a country that has firmly decided to support the European Union, the strengthening of the European Union and its construction, in order -as usual- to establish a compromise, in this case a compromise among twenty seven countries according to the political circumstances that we already know, which I have just mentioned. Everyone has made concessions so that everyone could win a lot. As you all know, the European Council has issued a mandate for the Intergovernmental Conference to reform the basic treaties of the European Union; a mandate whose most important aspects for the operation of the European Union, from the Spanish perspective, are the following: in the first place, the consecration of the rights, of the principles of the Chart of Fundamental Rights with legal value; in the second place, and this might be the most operative achievement, the subjects that will be decided on by qualified majority have passed from 36 to 87, thus, we will reduce the unanimity system and, accordingly, the right of veto, which will facilitate the decision making process with regards to important issues for the whole Union and for Spain, such as, for instance, immigration, energy or justice and interior. I was saying that the European Union would function more democratically with the reform of the Treaty because one of the main principles of our democracies, namely, the weight of the majority, has spread in the heart of Europe, for it could not keep on working according to the principles of Europe when it had six, nine or twelve members. Many of these issues, as I said, are very important for Spain. As you know, the definition of the concept of qualified majority has been one of the most discussed issues, among others, during this European Council. We finally have reached an agreement whereby the voting system in force will be kept, the one known as \"\"Nice\"\" system, until 2014, followed by another period, until the 31st of March 2017, during which the minorities will be allowed to use the double majority system or the \"\"Nice\"\" system, according to the decision of the States concerned if it is so requested by any other State. Besides, one of the main achievements of the reform, in my opinion, is the new definition of the EU Foreign and Security Policy. Europe is going to have a single, stronger voice in order to carry out its activities in the world, thanks to the High Representative of the Union, who will be Vice President of the Commission and, besides, will be provided with its own service, with a foreign service, in order to carry out his tasks. 
One of the main objectives of all this reform process has been to endow the European Union with a stronger voice, more efficient and unified as far of foreign policy and security are concerned. Besides, we will have a common, integral policy on immigration upon the bases that we have already established during the last few months, with the active participation of the Government of Spain, as you know. This has been an important step forward in the area of justice, freedom and security for the construction of the European space. Besides, with regards to a matter of strategic interest for Spain, at Spain's own request, an agreement has been adopted in order to introduce a specific reference in the energy policy for the promotion of energetic interconnections among the member States. Similarly, I would like to emphasise a specific matter, for the improvements established in the Constitutional Treaty concerning the Statute of Ultraperipheral Regions are maintained, for, as you know, this is a very interesting matter for Spain. The European Council has issued a limited mandate, very defined and specific, for the Intergovernmental Conference. You know that the objective of the Portuguese Presidency, as expressed yesterday, is to complete the process as soon as possible so that before the end of the year we may have approved of the reform of the Treaty; this new Treaty that will enable the European Union to operate more democratically, efficiently and according to our present times. The German Presidency has played a fundamental role in order to establish this Agreement. It has been supported by Spain in order to achieve the common backing and in order to approach the most distant positions throughout the last few weeks and, of course, the European Council itself has been supported by Spain. Similarly, we have been working intensely with the President of the Republic of France, Mr. Nicolas Sarkozy; with the Prime Minister of Italy, Mr. Romano Prodi, and, of course, we have also collaborated with the Prime Minister of Great Britain, Mr. Tony Blair, with regards to many important aspects; and, by the way, the latter received yesterday, quite logically, an affectionate applause for this was the last time for him to take part in a European Council. I would also like to express my gratitude for the fact that we have been able to set up a relationship and to work with Mr. Tony Blair at the European Council during this period. To sum up, Europe has provided an answer to a difficult situation as the Constitutional Treaty had not been approved of. We have finally included the most important aspects for the practical operation of the European Union in the reform of the Treaty. Thus, the progressively more united, political, efficient Europe that we want, the one with a more powerful voice, with a single voice in the world, will come true once the Treaty is ratified and enters into force. P.- I would like to know whether the Spanish Government has already decided on what it is going to propose to the Spanish citizens with regards to this new Treaty, for Spain ratified the referendum on the European Union back in February 2005: whether it is going to propose a new referendum, whether it is going to be approved of by the Parliament… How is that process going to be? President .- The ratification process is going to take place at the Parliament. 
Spanish citizens already gave their opinion about a text that, of course, included an important change in the operation of the Union and most of that text is going to form part of the Treaties of the European Union. Thus, the ratification is going to take place at the Parliament. P .- We all agree that this agreement was necessary because, among other things, we could not provide an answer to a two-year long crisis with an even more serious crisis if we had failed. But you, in particular, who have defended so firmly the Constitutional Treaty that has now been forgotten, don't you have a sour feeling? There are evident achievements, but Spain has had to renounce to many of its demands, during the negotiation and, mainly, during the last moments of this negotiation, Europe has had to give in to Poland. Don't you have a sour feeling, of some type, thinking of what could have been and finally has not been achieved, in spite of what has been achieved? President .- Most certainly not, and even less at dawn, for it was at dawn that we finished the meeting yesterday… I have a very positive feeling, for things, as you know very well, were very difficult, months ago, one year ago, in order to find a solution for all, which is how we work in Europe. Of course, from the point of view of the practical operation, which was the essential reason of the Constitutional Treaty and the Convention, from the practical point of view, the most important thing is the reform of the Treaties, which we passed yesterday. The fact that many issues will be approved of by qualified majority, the abandonment of unanimity, which blocks out, which hinders common policies, which does not allow the integration of a European action in important areas such as the ones I have just mentioned (immigration, energy and justice or Interior) was our essential objective. The fact that there is a voting formula and other instruments that are each time more democratic, such as the intense role of national Parliaments and the legislative initiative for citizens, all this implies an important change from the democratic point of view and from the point of view of the capacities of the European Union. Whatever will change with this Treaty is going to change for better. Could we have implemented more changes? Yes, but, once more, Europe has remained loyal to its tradition. As the founders foresaw, Europe is moving forward step by step. Yesterday's step was an important one. All the steps towards the construction of the European Union have yielded very positive results for the European Union and for the countries that form part of it. In fact, I think that we will all agree that this political project is admired all over the world and this is a political project whose current destiny we would not have believed only fifteen years ago, if we had been described -once century ago- a Europe formed by Twenty Seven members, after all the difficulties of co-existence among flags and nations, quite surely. Its destiny is democracy, which is essential, unity, peaceful coexistence and prosperity. That is Europe and, to a great extent, thanks to the European Union. The more we have a European Union, and the better the European Union may work, the better, for I am firmly convinced that the horizon of its Twenty Seven members will involve more security, welfare and prosperity. This is a step, an important step in a difficult situation. 
P .- Nonetheless, the characteristics of the negotiation in the last moment, that is to say, of what we saw yesterday, the very fact that the Polish have been left outside the Intergovernmental Conference so far, doesn't it increase or augment, to a certain extent, the sensation or the impression that this procedure and this Constitution lack legitimacy? My question is whether that sensation of lack of legitimacy might give raise to the demands or petitions for referendums and, thus, it might hinder the whole process once more. President .- Of course that is not the perspective. I think that, quite obviously, the legitimacy issues from the European Law itself and from the way the Treaties are reformed, from the way it has usually been done in most of the occasions in which a Treaty has been approved of, and also from the great political consensus, which is what gives more legitimacy. I am convinced that what European citizens wanted yesterday was an agreement and to clear out the period or the phase of vision, of paralysis and incertitude about the way in which Europe was going to face its future. We have made it and this is very positive. I honestly think that all citizens can perfectly understand that, when we are talking about Twenty Seven countries that form part of the European Union, at present -which represents almost five hundred million citizens, whose countries have very different histories and also a very different pace of incorporation to the European Union, and whose economic development is also very different in each case-, establishing an agreement in spite of all those factors is doubtlessly a great positive achievement, not to say a success. P .- One of the chapters that Poland wanted to include was the one concerning morality. I would like to know where the Spanish Government was going to get in order to impede that Polish petition. Besides, as to your conversation with Mr. Tony Blair, which lasted half an hour, we were told part of that conversation, but I would like to know whether you analysed the current situation of terrorism in Spain and whether Mr. Blair gave any piece of advice with regards to it or what was his analysis of this issue. As to terrorism, I would like to know the new data of the Government about the implementation of an operative base of ETA in Portugal, after the police research that has been carried out during the last few days. President .- As to the first question, it is evident that such proposal did not have the support of the Spanish Government and I can confirm that it was also objected to by most of the Governments of the European Union. Thus, it remained as a declaration strictly on the Polish side, for obvious reasons. In the second place, the conversation with Mr. Tony Blair, logically enough, dealt with the development of the European Council, mainly, and with the most important matters that we had to go through; but in fact, we also talked about terrorism and about ETA's terrorism. Finally, as to the operative data, it is the Ministry of the Interior and, if applicable, the General Director of the Civil Guard, that can facilitate more information in this regard. P .- President, when you became President and arrived to the European Council for the first time, you soon abandoned the defence of the Nice voting system, turning to the double majority system. Yet, it now seems that one of the things that benefits us more is precisely the prolongation of the Nice voting system. 
Don't you think that you hurried up then or do you think that you are going to be recriminated for it? President .- This is a matter of political positioning and this is a philosophy of the European Union. For Spain, which, in both systems is represented as a 45-million citizen country should be at present, this is a system that works properly. We cannot enter Europe saying, on the one hand that there should not be too many issues approved of unanimously and, at the same time, on the other hand, say that we want as many blocking instruments as possible. This is an utter contradiction. Mine has been a coherent, balanced position. I want decisions concerning most issues to be made by majority and I don't want the logics of blockage to be activated, for that paralyses Europe, and I also want Spain to be represented as it should be. And that was so yesterday and that is so today. And, if I may, from my experience and for the fact that it is an undisputable truth, on most occasions influence does not depend on a difference of one vote, instead, it depends on coherence, on the constructive capacity and on the compromise with Europe. P.- President, as to the Chart of Fundamental Rights, I would like to know the position of Great Britain if it were allowed to carry out the \"\"opt out\"\" mentioned. As to terrorism, the newspaper \"\"Gara\"\" has published today certain information according to which the Government held a meeting with the terrorist group last March and, besides, it says that you were sent a letter whose tone was not really conciliating last February. Could you confirm such information? President .- As I have said, the Chart includes the legal value that we demanded, I think that this is the main advance with regards to the principles and rights. The position of Great Britain is already known, and I respect it although, logically enough, I do not share it for I think it would be highly convenient for it to include the European Union as a whole. In the second place, obviously enough, I am not acquainted with the speculations concerning such evident propaganda, and I am not going to comment on them or assess them, and even less in the case of such a particular newspaper. P .- President, the new Treaty will confer the European Parliament greater protagonism. Are you satisfied with the representation we have, with the one that has been agreed on? In yesterday's agreement with Poland you mentioned the Eurodeputies, was it because it is also in that same position? President.- We did not talk about Eurodeputies in the agreement with Poland. What we have is what we already had in the Treaty, including a specific reference in the conclusions so that this change in the composition of the European Parliament can be carried out before the elections in 2009, before the next elections to the European Parliament. Of course, if this is so, the composition will benefit Spain. P.- As you have said that you talked about the situation of the terrorist group with the British Prime Minister, I would like to know whether he gave you any new piece of advice, whether he encouraged you to keep going on, for that was what he had told you in the past: he had told you that you should keep up communication, that you should keep up some kind of dialogue. What was his piece of advice in this new situation? 
President .- During these three long years in the Government, I have spoken to the British Prime Minister on many occasions and we have talked about terrorism and ETA, very specially due to his experience in the peace process in Northern Ireland. Logically enough, he asked me yesterday about it and we commented on it and exchanged our points of view about the situation. Of course, everything he has told me on any occasion has been very useful to me and I am thankful for that. He has always been prone to collaborating. P.- President, it seems to me that in Brussels, the preparation of Brussels has propitiated a new romance, if I may put it so, between Mr. Sarkozy, President of the Republic of France, and the President of the Government of Spain, something that was unthinkable a few months ago. You supported Ms. Segolene Royal during one of your speeches and Mr. Sarkozy supported the policy of the People's Party, criticising quite hardly the process for the massive regularisation of immigrants carried out in Spain, as you will remember. Thus, what is that romance based upon? Is it based on love at the first sight, and I beg your pardon for using this expression? What do you think about the role of Mr. Sarkozy in this first European Council? Was it difficult to deal with a snake charmer who was better than Chirac? And, yet, Mr. Sarkozy seems to have charmed everyone, even taking his shoes off, isn't it right? President .- I am not acquainted with that last detail. I must confirm that I have a very good relationship with Mr. Nicolas Sarkozy, a very good relationship; but this has been the situation from the first day we held the first meeting and now, it has grown more intense, nicer and I think that this is going to be very positive from a political point of view. Besides, I think that we have a good personal understanding, for these things always contribute to it. As you know, we put forward a joint proposal, a proposal by France and Spain, through our Ministers of Foreign Affairs who have been working hard on this question, and it contemplated the main aspects that -in our opinion- had to be included in the reform of the Treaty, the ones that have been included in it, and we have been working in a coordinated way at all times. Yesterday, as you know, the final agreement with Poland was held in the room of the French delegation, and we all were present, Mr. Tony Blair, Mr. Juncker, the President of Poland, Mr. Nicolas Sarkozy and I. That means that there has been a joint, coordinated effort with Mr. Nicolas Sarkozy and, besides, I also think that we should get free from prejudices, sometimes. He has his own political ideas and his own ideology, so do I, but the rest is Europe. That is the grandeur of Europe. Thank you.\n## 5 Thank you very much, dear Rector, for the kindness and hospitality of the Complutense. Congratulations, Josefina, on your brilliant speech. We are here today, not only because this is the sunniest day of the year, but also because we are fully aware of the fact that climatic change is one of the main challenges for Humanity in this century. Being realistic, this is the greatest risk that life on Earth is facing at the moment. Climatic change is a proven fact, although we are still discussing its consequences and eventual calendar. We can not sit and wait for that date, which has no way back, and we should not resign to its effects. At least we know that it will determine the quality of life of our generation, of our sons' and of our grandsons'. 
It is an ineluctable responsibility that we have to face on our own and for the sake of the future. Some twenty years ago, or even less, only a few would dare to warn us against what was coming over. But nowadays, the International Community has assumed it. The Intergovernmental Panel of Experts on Climatic Change has put it clearly and sharply in its conclusions; and, even, during the last Summit of the G-8, those countries that had been reluctant for the last few years, as was the case of the United States, have taken the step we all expected and have announced that they are also going to commit themselves with this global task. As on many other occasions, Europe has led an awareness-raising process, a process of international solidarity that has been subscribed by the rest of the developed countries and, as on many other occasions, this awareness-raising process has been led, in the first place, by social organizations and researchers. Europe supported, unanimously, the Kyoto Protocol back in 1997 and now it is Europe that commits itself to making new moves in this process. The European Union will defend more ambitious objectives during the post-Kyoto negotiations and it is considering a reduction of 20 to 30 per cent in the emission of hothouse gases during the upcoming commitment period. Given its geographical situation and its social and economic characteristics, some of which have already been presented - with sufficient reasons for reflection, I believe - by Ms. Josefina Garcia Mendoza, Spain is an exposed country, highly vulnerable to climatic change. The most recent projections of its eventual effects on our country during the 21st century point to a progressive, important thermal increase and to a general decrease of precipitations, unequally distributed over regions and seasons. We can not just accept it passively, we should not remain still. That is why Spain has been involved in the genesis of Kyoto and it is now making great efforts in order to fulfil its compromises. We are going to throw ourselves into this strategy. The Exhibition and the Conference that are taking place today in the framework of the activities of the Year of Science are a good proof of the clear, firm commitment of the Spanish Government and of the Spanish society to promote renewable energies and fight climatic change. The commitment of this Government, I would like to remind you, started out the very first day it entered upon office. Among its first actuations one could point out the creation of the first National Plan to assign the rights for the emission of hothouse gases. Then, the next ones were the preparation of a Governmental strategy concerning mechanisms of flexibility regarding the Kyoto Protocol, with the participation in several initiatives concerning the Carbon Fund; the approval of the Plan on Renewable Energies and of the Action Plan of the Spanish Strategy on Energetic Efficiency; the approval of the Technical Building Code and the preparation of a National Plan for the adaptation to climatic change. Yet, even though we have carried out or we are still carrying out considerable efforts, these are not enough. We must be more ambitious. We have to set up greater aims in order to attain them in shorter periods. We will soon pass the Spanish Strategy on Climatic Change and Clean Energies during a Cabinet meeting that will be exclusively devoted to climatic change. 
During that meeting we shall approve of a series of specific, urgent measures, with a clear calendar and with available resources, in order to fulfil our commitment with the Kyoto Protocol. As part of the essential part of the Plan of Urgent Measures, we are elaborating a new Saving and Energy Efficiency Action Plan for the period 2008-201 The strategy defines eleven areas of intervention, from institutional cooperation to Research, Development and Technological Innovation, with special attention to the so-called disperse diffuse sectors: transport, residential, commercial, institutional, agricultural and service sectors. Thus, as to the transport sector, we can point out the elaboration of a basic rule on sustainable mobility and the promotion of railway transport for the transportation of goods. As to the residential sector, we can point out the energetic improvement of buildings and the spread of the energetic label to all the domestic facilities. Regarding the institutional sector, we have to point out the establishment of energetic efficiency requisites in the case of public lighting. There are nearly 170 specific measures in the General Strategy against climatic change and in favour of clean, renewable energies. The Strategy will also serve to orientate the capacity of Spain to assume additional compromises in the fight against climatic change after 201 The answer to climatic change is not just a governmental matter; the Government must lead it, and we accept it as it is, but it is a matter that depends on all the society. It concerns all the Administrations, the companies, the brilliant companies belonging to this sector in our country, the consumers and the civil society in general. It implies political leadership, a cultural change and social responsibility. The effort must be a collective, shared one. Each one, each company and each Administration must adapt its own dynamics to these new commitments and the achievements will also be shared. In 2006 we managed to revert a historical tendency and Spanish society reduced the demand for primary energy in 3 per cent, in spite of the high economic increase. Besides, this has allowed us to reduce the emissions of hothouse gases by, approximately, 4 per cent. And all this has been compatible with a strong, stable economic growth. The Spanish society, the companies and the citizens have proved that the fight against climatic change is compatible with economic growth. I dare say it is the best way towards economic growth that we have in front of us nowadays. The Spanish can feel proud of the work that is being carried out as far as renewable energies are concerned. We had a certain potential as a country and we know how to make the most out of it. Sun, water and wind are nothing but potential resources if we do not turn them into a source of useful energy. This is what is being achieved by means of research and innovation. Thanks to our research centres, as is the case of the researchers I have met today at the Complutense, and thanks to our companies, we have become a leading country - I would like to emphasize this - in most renewable technologies. Thus, for example, and since this Day is specially devoted to the sun, which, by the way, has behaved, we should say that Spain is the first country in the world where a solar thermal plant of high commercial temperature has come into operation. This is partly due to the work carried out in the research on this type of energy at the Solar Platform of Almeria. 
But our contribution to renewable energies is not just that: we are the third country in the world in the manufacture of aerogenerators and our market share last year was superior to 20 per cent; we are the leaders in biofuel production; in 2006, Spain was the second producer of bioethanol in the European Union and we were also the second producer of photovoltaic solar energy, as far as installed power is concerned, with an increase of 300 per cent as compared to year 200 The sources of renewable energy cover now an important part of the energetic demand. More of 20 per cent of the electric demand in 2006 was covered with this type of energy. Wind energy alone achieved 9 per cent of the total electric production in our country. Thus, Spain has taken huge steps in very little time. This is a fact we should congratulate ourselves on, but we must be fair and recognize each one's efforts. We are now in an appropriate moment in order to thank our companies for their work and to recognize their contribution to renewable energies and to sustainable development. Some of those companies are represented here. I must thank you for your work. You have been able to explore our technology and compete at a worldwide level, you are now present in the five continents, you have generated employment, more than 18000 work on renewable energies in Spain and you have promoted the image of Spain as a country with technological capacity and respectful with the environment, committed with sustainable development and aware of the challenges of the future. The representatives of non-governmental environmental organisations are also here. Thanks to their pioneer work and to their resolution and steadiness, we all have become aware of the importance of the defence of environmental values. Thanks to their tenacity, the protection of the environment is part of everyday life. From here, I encourage you to persevere on this determination and to keep presenting society and the Governments with objectives that seem unattainable today, yet will soon be demanded by society in general. I would like to finish by repeating that the fight against climatic change is an essential matter for the Government, an absolute priority, the great question of the future, for our economic model and for our growth model. We are doing what we are doing at the moment in order to progress today, but also with the aim of ensuring the future. The fight against climatic change must be the axis of any society-building project during the next years and during the next decades. And what's more, it must be assumed as an individual commitment. It must be more present in our conscience and form part of our daily customs. It is a great objective for any country, it stimulates innovation, it stimulates a healthy way of life, it stimulates respect for our heritage and it stimulates the passion to respect what we will leave for those who will come after us. From the Government, the fight against climatic change has characterised, to a great extent, the legislature; but the efforts we have made during these years must not come to an end once the legislature is over. The fight against climatic change is an essential part of our project. 
It must be an essential objective of Spanish society and conferences such as the one that is taking place today contribute to the spread of the importance of renewable energies as a source of future and as a fundamental element in order to ensure sustainable growth, in order to fight climatic change and in order to gain an insight into the Earth, into the landscape, into our resources and into what being able to develop welfare and to respect our environment represents. This will not be the last time for the Government to coordinate an initiative of this nature. The experience is extraordinarily positive. This has been one of the occasions on which we have been able to bring together the efforts made by different sectors of society. This matter deserves it and requires it. The fight against climatic change must consist of many voices working up one powerful, single message: defending what belongs to all of us so that it keeps belonging to us all. I want my voice to join yours, to join the voice of society, the voice of the non-governmental organisations, the voice of researchers and the voice of the companies in favour of a common commitment: the commitment of Spain to be the leader in the fight against climatic change and also the leader in favour of renewable energies. Thank you for your participation and keep working. Thank you very much.\n## 6 Nicolas, eighty? No, I can't believe that. I have seen you, I have seen you while you were coming up here and I have seen you before… I can't believe it. Do you know what happens? The thing is that in the case of those who give their lives for others, life is twice as worthy; it is twice as worthy. For me, Nicolas is forty. Apply this rule of thumb to yourselves, for that's a good one, because many of us here are trying to devote part of our lives to the destiny of others. I was thinking of that a while ago, while I was sitting there, next to Nicolas: if I apply this rule of thumb to Nicolas and if I apply it to myself: that makes twenty three. Not bad. Well, I don't know whether some others in the Spanish political arena would like to be applied this rule, but just do it. Nicolas, I wanted to be here, and I wanted to, mainly, because I wanted to enjoy it. It is true that such events, which are so emotive, are always somewhat nostalgic; nostalgia is quite habitual in a nation such as the Spanish nation, a nation whose history during the 19th century and part of the 20th has been difficult, with few satisfactions. And this makes us remember, quite nostalgically, the moments in which the doors of freedom opened up. I remember that with a deep happiness, with enthusiasm. I am deeply satisfied for having the chance of being the President of the Government of Spain in 200 21st century Spain does not need to admire the others, other countries, as used to happen during the 19th and during the 20th centuries. 21st century Spain can admire itself and 21st century Spain is admired by many countries in Europe and in the world. This has been thanks to people like Nicolas. Blanco, Urbieta, Terreros, a complete saga: the Redondo saga. Socialists, members of the General Workers' Union, defenders of freedom, committed, with strong convictions -it is not easy to convince them as you know--, they have proved, during this time, what is worth the effort in life, they have proved that it is worth it to commit oneself, to defend a position, to believe in one's country and in one's ideas. They have done this on many occasions in silence. 
I think I quite know Nicolas and I can say now that he is not really fond of tributes, yet, he is a loving man. This is the meaning, for me, of this tribute in this House of People; a House of People whose origin, whose principles embody the two most important values that the socialist movement, of the General Workers' Union (UGT) and the Socialist Party have given to this country; the two values that, in fact, transform, create, generate progress: that is culture and freedom. Once more, it is in the House of People, paying a tribute to someone who has been the General Secretary of the General Workers' Union for eighteen years - and let me praise all those in our group, in the General Workers' Union, in the Free Teaching Institution (Institucion Libre de Ensenanza) and in the Houses of People - that education has been turned into the main bastion of our evolution as a party and as a union. Thank you very much to you all. Nicolas has witnessed and he has also been the protagonist of the last three most brilliant decades of modern and contemporary history, and he has also been the protagonist of some of the most essential changes that our country needed and that it managed to achieve: namely, political and syndical freedom. He was there when the Constitution was written, for he was a constituent deputy as we said in the Parliament the day before yesterday; and he was there in the victory of 1982 and he left the Parliament when the General Workers' Union and the Spanish Socialist Workers Party made the wise decision of giving way to the necessary crisis in order to assume their positions with maturity. And, as usual in any maturity crisis, maturity crises and entering maturity itself is not usually calm or peaceful. That is why we also went through such moments… It is true that I saw it with a perspective and I remember it very well, of course, for I already had responsibilities… I was now commenting on the general strike of the 14th of December 1988 with Candido, for it was a situation that expressed very well the meaning of the Socialist Party and of the General Workers' Union in Spain. I remember that I was in my organisation, the Organisation of Leon, and the members were quite happy, preparing the placards that were going to be taken out during the demonstration for the general strike next day. They were members of the party, most of them workers of the railway company, by the way; for as you know there is a long tradition among those workers. We went through this crisis, which bordered schizophrenia, it is true, but it was necessary, absolutely necessary. Historically, as a result of the dictatorship, the General Workers' Union and the Spanish Socialist Workers' Party were the same. In a democratic society they could not be the same, even though they might share values and objectives. We started out together, then we fell apart and then, as it usually happens even with the laws of physics, we went back to our corresponding position, and that is where we are now. And that has also been thanks to our General Secretary, Candido, of whom you can be proud, Nicolas, because you left the General Workers' Union in very good hands. It is true that this week has been a week of memories with the thirty years of the first democratic elections. There has been a brilliant institutional celebration in order to remember those who are still standing, as strong as ever, those who were also present in those first elections. 
But today is a good day for me -as President of the Government of Spain- to state here, in a House of the People, before all citizens, that the transition, that freedom and democracy would not have been possible without the wonderful example of the workers and of the unions of Spain. I would like this to remain in the collective memory of the last thirty years. I would like to point out that we have reasons to admire ourselves as a country and to be admired by many countries in Europe and all over the world. An example of this is the way in which the social dialogue and the social consensus are structured, and an example of this are the emotive words of someone who has been the president of the representatives of the businessmen for many years, Mr. Jose Maria Cuevas. I know that he is really convinced about what he said here about Nicolas, about the General Workers' Union, about the Workers' Commissions and about the Spanish unions, which have made a great contribution to modernisation, to progress and to welfare in this country. Thank you, Jose Maria. I am glad that this is also the pervading atmosphere in the case of the understanding or of the unity of syndical action between the General Workers' Union and the Workers' Commissions, because the words that Jose Maria has uttered here today are invaluable for me. He has been generous with the General Workers' Union, the main competitor in the syndical arena. This takes Jose Maria Fidalgo even higher, which is quite a difficult task. Thus, let's call things by their names: we have done it very well, we are doing it very well and its results will benefit the Spanish citizens and Spain, the Spain that Nicolas Redondo loves so much. What called our attention in the video was the fact that the history of the Spanish Socialist Workers' Party and the General Workers' Union melted together with the meaning of modern and contemporary history in Spain. Let's consider this piece of data: the Spanish Socialist Workers' Party was founded in Madrid; the General Workers' Union in Catalonia and we all know that the Basque Country was decisive for the growth of both the General Workers' Union and the Socialist Party, as the Redondo's know very well. This is our sign of identity: the party that resembles Spain more, the party that has structured and still structures the meaning of a common project, the one with the deepest historical roots and the one with an even better future. Nicolas, your forty years place in front of us the perspective of what should be emphasised here today. From those well-written words, namely, history, memory and future, I chose the last one, the future, because we are going to witness a near future of full employment in Spain; we are going to witness a future with full equality among men and women as far as the activity rates are concerned, as far as rights are concerned, as far as employment is concerned and as far as the management of important companies in our country is concerned; we are going to witness a near future in which we will provide not only for education, health care and, of course, for a pension system that we will progressively approach to the model of the European average values, but also one in which we will provide for those who are alone, for dependent persons and for their families. A future in which we must guarantee a worthier Minimum Salary. We have walked a long way during this Legislature, and we will walk an even longer one during the next Legislature. 
The future of a country, think of thirty years before, where there were still many emigrants, a country that knows that it must keep being an example of organisation for those who come from abroad to live and work with us; persons who are going to work with their rights and with their duties if they are to stay here working, because in this country, regardless of the origin of the person, we are not going to allow illegal, fraudulent work, we are not going to allow the exploitation of a human being, regardless of the colour of his skin. A future with social agreements, with social dialogue, as we have been doing during the last three years with twenty social agreements. And a future that must focus on employment, on two main objectives: keeping on increasing at a swifter pace the transformation of temporary employment into indefinite employment, which is functioning very well thanks to the agreement that we have signed; and, of course, winning the battle of health and accidents at work, for which we want to approve of a long-scope strategy so as to reduce accidents at work in Spain in 25 per cent. Our unions use the word productivity, they attest modernity and they know that for our country to have more prosperity, welfare, social policies, equality and rights we have to produce more and better every day. This involves education, research and innovation. We can make it and we are going to make it. Nicolas, you may be proud, but the way a socialist does: intimately proud. Surely enough, words of praise, words of recognition and tributes come out relatively easy, they are even obligatory. I know that you are proud of yourself. I know that you have proved that living, committing with the destinies of the others and committing with deep values is worth it. You belong to a generation like the ones that have come next, a generation of Spanish citizens that have risen to the occasion. You have left us a country, Spain, for which it is worth it to fight; a country, Spain, that is going to keep on progressing every day; a country, Spain, that is admired, respected; a country, Spain, whose sole signs of identity before the world are democracy, justice, equality and solidarity, and, of course, peace. Nicolas, cheers for the General Workers' Union! cheers for workers! And cheers for the Redondo's! Thank you very much.\nspeeches$flesch.kincaid <- textstat_readability(speeches$text, measure = \"Flesch.Kincaid\")\n\n# returned as quanteda data.frame with document-level information;\n# need just the score:\nspeeches$flesch.kincaid <- speeches$flesch.kincaid$Flesch.Kincaid\n#get mean and standard deviation of Flesch-Kincaid, and N of speeches for each speaker\nsum_corpus <- speeches %>%\n group_by(speaker) %>%\n summarise(mean = mean(flesch.kincaid, na.rm=TRUE),\n SD=sd(flesch.kincaid, na.rm=TRUE),\n N=length(speaker))\n\n# calculate standard errors and confidence intervals\nsum_corpus$se <- sum_corpus$SD / sqrt(sum_corpus$N)\nsum_corpus$min <- sum_corpus$mean - 1.96*sum_corpus$se\nsum_corpus$max <- sum_corpus$mean + 1.96*sum_corpus$se\nsum_corpus## # A tibble: 4 × 7\n## speaker mean SD N se min max\n## \n## 1 D. Cameron 10.9 1.70 456 0.0794 10.7 11.0\n## 2 G. Brown 13.3 2.28 277 0.137 13.1 13.6\n## 3 J.L.R. Zapatero 15.5 2.83 354 0.150 15.2 15.8\n## 4 M. 
Rajoy 13.7 2.56 389 0.130 13.4 13.9\nggplot(sum_corpus, aes(x=speaker, y=mean)) +\n geom_bar(stat=\"identity\") + \n geom_errorbar(ymin=sum_corpus$min,ymax=sum_corpus$max, width=.2) +\n coord_flip() +\n xlab(\"\") +\n ylab(\"Mean Complexity\") + \n theme_minimal() + \n ylim(c(0,20))"},{"path":"exercise-3-comparison-and-complexity.html","id":"exercises-2","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.7 Exercises","text":"Compute the “euclidean” and “manhattan” distance measures for the MP tweets above, comparing the tweets of individual MPs with the tweets of the PM, Theresa May. Estimate at least three complexity measures for the EU speeches above. Consider how the results compare to the Flesch-Kincaid measure used in the article by Schoonvelde et al. (2019). (Advanced—optional) Estimate similarity scores between the MP tweets and the PM tweets for each week contained in the data. Plot the results.","code":""},{"path":"exercise-4-scaling-techniques.html","id":"exercise-4-scaling-techniques","chapter":"20 Exercise 4: Scaling techniques","heading":"20 Exercise 4: Scaling techniques","text":"","code":""},{"path":"exercise-4-scaling-techniques.html","id":"introduction-3","chapter":"20 Exercise 4: Scaling techniques","heading":"20.1 Introduction","text":"The hands-on exercise for this week focuses on: 1) scaling texts; and 2) implementing scaling techniques using quanteda. In this tutorial, you will learn how to: scale texts using the “wordfish” algorithm; scale texts gathered from online sources; and replicate the analyses in Kaneko, Asano, and Miwa (2021). Before proceeding, we’ll load the packages we need for this tutorial. In this exercise we’ll be using the dataset we used for the sentiment analysis exercise. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation, and include any tweets sent by each news outlet from its main account.","code":"\nlibrary(dplyr)\nlibrary(quanteda) # includes functions to implement Lexicoder\nlibrary(quanteda.textmodels) # for estimating similarity and complexity measures\nlibrary(quanteda.textplots) #for visualizing text modelling results"},{"path":"exercise-4-scaling-techniques.html","id":"importing-data","chapter":"20 Exercise 4: Scaling techniques","heading":"20.2 Importing data","text":"You can download the dataset as follows. If you’re working on this document from your own computer (“locally”) you can download the tweets data in the following way. We then take a sample of the data to speed up the runtime of some of these analyses.","code":"\ntweets <- readRDS(\"data/sentanalysis/newstweets.rds\")\ntweets <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true\")))\ntweets <- tweets %>%\n sample_n(20000)"},{"path":"exercise-4-scaling-techniques.html","id":"construct-dfm-object","chapter":"20 Exercise 4: Scaling techniques","heading":"20.3 Construct dfm object","text":"Then, as in the previous exercise, we create a corpus object, specify the document-level variables by which we want to group, and generate our document feature matrix. We can then look at the number of documents (tweets) we have per newspaper Twitter account. Below is what our document feature matrix looks like, with a word count for each of the eight newspapers.","code":"\n#make corpus object, specifying tweet as text field\ntweets_corpus <- corpus(tweets, text_field = \"text\")\n\n#add in username document-level information\ndocvars(tweets_corpus, \"newspaper\") <- tweets$user_username\n\ndfm_tweets <- dfm(tokens(tweets_corpus),\n remove_punct = TRUE, \n remove = stopwords(\"english\"))\n## number of tweets per newspaper\ntable(docvars(dfm_tweets, \"newspaper\"))## \n## DailyMailUK DailyMirror EveningStandard guardian MetroUK \n## 2052 5834 2182 2939 966 \n## Telegraph TheSun thetimes \n## 1519 3840 668\ndfm_tweets## Document-feature matrix of: 20,000 documents, 48,967 features (99.98% sparse) and 31 docvars.\n## features\n## docs rt @standardnews breaking coronavirus outbreak declared pandemic 
world\n## text1 1 1 1 1 1 1 1 1\n## text2 1 0 0 0 0 0 0 0\n## text3 0 0 0 0 0 0 0 0\n## text4 0 0 0 0 0 0 0 0\n## text5 0 0 0 0 0 0 0 0\n## text6 0 0 0 0 0 0 0 0\n## features\n## docs health organisation\n## text1 1 1\n## text2 0 0\n## text3 0 0\n## text4 0 0\n## text5 0 0\n## text6 0 0\n## [ reached max_ndoc ... 19,994 more documents, reached max_nfeat ... 48,957 more features ]"},{"path":"exercise-4-scaling-techniques.html","id":"estimate-wordfish-model","chapter":"20 Exercise 4: Scaling techniques","heading":"20.4 Estimate wordfish model","text":"data format, able group trim document feature matrix estimating wordfish model.results.can plot estimates \\(\\theta\\)s—.e., estimates latent newspaper position—.Interestingly, seem captured ideology tonal dimension. see tabloid newspapers scored similarly, grouped toward right hand side latent dimension; whereas broadsheet newspapers estimated theta left.Plotting “features,” .e., word-level betas shows words positioned along dimension, words help discriminate news outlets.can also look features.words seem belong tabloid-style reportage, include emojis relating film, sports reporting “cristiano” well colloquial terms like “saucy.”","code":"\n# compress the document-feature matrix at the newspaper level\ndfm_newstweets <- dfm_group(dfm_tweets, groups = newspaper)\n# remove words not used by two or more newspapers\ndfm_newstweets <- dfm_trim(dfm_newstweets, \n min_docfreq = 2, docfreq_type = \"count\")\n\n## size of the document-feature matrix\ndim(dfm_newstweets)## [1] 8 11111\n#### estimate the Wordfish model ####\nset.seed(123L)\ndfm_newstweets_results <- textmodel_wordfish(dfm_newstweets, \n sparse = TRUE)\nsummary(dfm_newstweets_results)## \n## Call:\n## textmodel_wordfish.dfm(x = dfm_newstweets, sparse = TRUE)\n## \n## Estimated Document Positions:\n## theta se\n## DailyMailUK 0.64904 0.012949\n## DailyMirror 1.18235 0.006726\n## EveningStandard -0.22616 0.016082\n## guardian -0.95428 0.010563\n## MetroUK -0.04625 0.022759\n## Telegraph -1.05344 0.010640\n## TheSun 1.45044 0.006048\n## thetimes -1.00168 0.014966\n## \n## Estimated Feature Scores:\n## rt breaking coronavirus outbreak declared pandemic world health\n## beta 0.537 0.191 0.06918 -0.2654 -0.06525 -0.2004 -0.317 -0.3277\n## psi 5.307 3.535 5.78715 3.1348 0.50705 3.1738 3.366 3.2041\n## organisation genuinely interested see one cos fair\n## beta -0.4118 -0.2873 -0.2545 0.0005141 -0.06312 -0.2788 -0.03078\n## psi 0.5487 -0.5403 -1.4502 2.7723965 3.85881 -1.4480 0.35480\n## german care system protect troubled children #covid19 anxiety shows\n## beta -0.7424 -0.3251 -1.105 -0.1106 -0.4731 0.01205 -0.6742 0.4218 0.4165\n## psi 1.1009 3.1042 1.259 1.8918 -0.0784 2.85004 2.9703 0.5917 2.8370\n## sign man behind app explains tips\n## beta -0.1215 0.5112 0.05499 0.271 0.6687 -0.2083\n## psi 1.9427 3.5777 2.43805 1.376 1.2749 1.5341\ntextplot_scale1d(dfm_newstweets_results)\ntextplot_scale1d(dfm_newstweets_results, margin = \"features\")\nfeatures <- dfm_newstweets_results[[\"features\"]]\n\nbetas <- dfm_newstweets_results[[\"beta\"]]\n\nfeat_betas <- as.data.frame(cbind(features, betas))\nfeat_betas$betas <- as.numeric(feat_betas$betas)\n\nfeat_betas %>%\n arrange(desc(betas)) %>%\n top_n(20) %>% \n kbl() %>%\n kable_styling(bootstrap_options = \"striped\")## Selecting by betas"},{"path":"exercise-4-scaling-techniques.html","id":"replicating-kaneko-et-al.","chapter":"20 Exercise 4: Scaling techniques","heading":"20.5 Replicating Kaneko et al.","text":"section adapts code replication data 
provided by Kaneko, Asano, and Miwa (2021). We can access the data for the first study in Kaneko, Asano, and Miwa (2021) in the following way. If you’re working locally, you can download the dfm data as shown below. This data comes in the form of a document-feature matrix. We can first manipulate it in the way Kaneko, Asano, and Miwa (2021) do, grouping at the level of the newspaper and removing infrequent words.","code":"\nkaneko_dfm <- readRDS(\"data/wordscaling/study1_kaneko.rds\")\nkaneko_dfm <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/wordscaling/study1_kaneko.rds?raw=true\")))\ntable(docvars(kaneko_dfm, \"Newspaper\"))## \n## Asahi Chugoku Chunichi Hokkaido Kahoku Mainichi \n## 38 24 47 46 18 26 \n## Nikkei Nishinippon Sankei Yomiuri \n## 13 27 14 30\n## prepare the newspaper-level document-feature matrix\n# compress the document-feature matrix at the newspaper level\nkaneko_dfm_study1 <- dfm_group(kaneko_dfm, groups = Newspaper)\n# remove words not used by two or more newspapers\nkaneko_dfm_study1 <- dfm_trim(kaneko_dfm_study1, min_docfreq = 2, docfreq_type = \"count\")\n\n## size of the document-feature matrix\ndim(kaneko_dfm_study1)## [1] 10 4660"},{"path":"exercise-4-scaling-techniques.html","id":"exercises-3","chapter":"20 Exercise 4: Scaling techniques","heading":"20.6 Exercises","text":"Estimate a wordfish model for the Kaneko, Asano, and Miwa (2021) data. Visualize the results.","code":""},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"exercise-5-unsupervised-learning-topic-models","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21 Exercise 5: Unsupervised learning (topic models)","text":"","code":""},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"introduction-4","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.1 Introduction","text":"The hands-on exercise for this week focuses on: 1) estimating a topic model; and 2) interpreting and visualizing the results. In this tutorial, you will learn how to: generate document-term matrices in a format appropriate for topic modelling; estimate a topic model using the quanteda and topicmodels packages; visualize the results; reverse engineer and test model accuracy; and run validation tests.","code":""},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"setup-9","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.2 Setup","text":"Before proceeding, we’ll load the packages we need for this tutorial. We’ll be using data from Alexis de Tocqueville’s “Democracy in America.” We download the data for both Volume 1 and Volume 2 and combine them into one data frame. To do so, we’ll be using the gutenbergr package, which allows the user to download text data for over 60,000 out-of-copyright books. The ID of a book appears in the url of the book selected after a search on https://www.gutenberg.org/ebooks/. This example is adapted from Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. There, we see that Volume 1 of Tocqueville’s “Democracy in America” is stored as “815”. A separate search reveals that Volume 2 is stored as “816”. You can download the dataset as shown below; if you’re working on this document from your own computer (“locally”) you can download the data in the following way. After reading in the data, we convert it to a different data shape: a document-term matrix. We also create a new column, which we call “booknumber”, recording whether the term in question comes from Volume 1 or Volume 2. To convert the data to “DocumentTermMatrix” format we first use unnest_tokens() as we have done in past exercises, remove stop words, and then use the cast_dtm() function to convert it to a “DocumentTermMatrix” object. We see that the data are now stored as a “DocumentTermMatrix.” In this format, the matrix records each term (equivalent to a column) and each document (equivalent to a row), along with the number of times the term appears in the given document. Many terms do not appear in a given document, meaning the matrix is stored as “sparse,” with a preponderance of zeroes. Here, since we are looking at just two documents that come from a single volume set, sparsity is relatively low (27%).
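As a quick sanity check of that figure (this snippet is an addition to the exercise rather than part of the original code, and it uses only the tocq_dtm object just created), we can recompute sparsity by hand: a DocumentTermMatrix is stored as a sparse triplet matrix, so sparsity is simply the share of document-term cells that are zero.

n_nonzero <- length(tocq_dtm$v)              # number of non-zero document-term cells
n_cells <- nrow(tocq_dtm) * ncol(tocq_dtm)   # total number of document-term cells
1 - n_nonzero / n_cells                      # roughly 0.27, matching the 27% reported above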
In most applications, sparsity will be a lot higher, often approaching 99%. Estimating the topic model itself is relatively simple. We need to specify how many topics we want to search for, and we can also set a seed, which is needed to reproduce the results each time (the model is a generative probabilistic one, meaning different random iterations will produce different results). We can then extract the per-topic-per-word probabilities, called “β”, from the model. The data are now stored with one topic-per-term-per-row. The betas listed represent the probability that a given term belongs to a given topic. Here, for example, we see that the term “democratic” is most likely to belong to topic 4. Strictly speaking, this probability represents the probability that the term is generated by the topic in question. We can then plot the top terms, by beta, for each topic as follows. But how do we actually evaluate these topics? Here, the topics all seem pretty similar.","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(stringr) # to handle text elements\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(topicmodels) # to estimate topic models\nlibrary(gutenbergr) # to get text data\nlibrary(scales)\nlibrary(tm)\nlibrary(ggthemes) # to make your plots look nice\nlibrary(readr)\nlibrary(quanteda)\nlibrary(quanteda.textmodels)\n#devtools::install_github(\"matthewjdenny/preText\")\nlibrary(preText)\ntocq <- gutenberg_download(c(815, 816), \n meta_fields = \"author\")\ntocq <- readRDS(\"data/topicmodels/tocq.rds\")\ntocq <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/topicmodels/tocq.RDS?raw=true\")))\ntocq_words <- tocq %>%\n mutate(booknumber = ifelse(gutenberg_id==815, \"DiA1\", \"DiA2\")) %>%\n unnest_tokens(word, text) %>%\n filter(!is.na(word)) %>%\n count(booknumber, word, sort = TRUE) %>%\n ungroup() %>%\n anti_join(stop_words)## Joining with `by = join_by(word)`\ntocq_dtm <- tocq_words %>%\n cast_dtm(booknumber, word, n)\n\ntm::inspect(tocq_dtm)## <>\n## Non-/sparse entries: 17581/6603\n## Sparsity : 27%\n## Maximal term length: 18\n## Weighting : term frequency (tf)\n## Sample :\n## Terms\n## Docs country democratic government laws nations people power society time\n## DiA1 357 212 556 397 233 516 543 290 311\n## DiA2 167 561 162 133 313 360 263 241 309\n## Terms\n## Docs united\n## DiA1 554\n## DiA2 227\ntocq_lda <- LDA(tocq_dtm, k = 10, control = list(seed = 1234))\ntocq_topics <- tidy(tocq_lda, matrix = \"beta\")\n\nhead(tocq_topics, n = 10)## # A tibble: 10 × 3\n## topic term beta\n## \n## 1 1 democratic 0.00855\n## 2 2 democratic 0.0115 \n## 3 3 democratic 0.00444\n## 4 4 democratic 0.0193 \n## 5 5 democratic 0.00254\n## 6 6 democratic 0.00866\n## 7 7 democratic 0.00165\n## 8 8 democratic 0.0108 \n## 9 9 democratic 0.00276\n## 10 10 democratic 0.00334\ntocq_top_terms <- tocq_topics %>%\n group_by(topic) %>%\n top_n(10, beta) %>%\n ungroup() %>%\n arrange(topic, -beta)\n\ntocq_top_terms %>%\n mutate(term = reorder_within(term, beta, topic)) %>%\n ggplot(aes(beta, term, fill = factor(topic))) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~ topic, scales = \"free\", ncol = 4) +\n scale_y_reordered() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"evaluating-topic-model","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.3 Evaluating topic model","text":"Well, one way to evaluate the performance of unsupervised forms of classification is by testing the model against an outcome that is already known. Here, two obvious ‘topics’ are Volume 1 and Volume 2 of Tocqueville’s “Democracy in America.” Volume 1 of Tocqueville’s work deals most obviously with abstract constitutional ideas and questions of race; Volume 2 focuses on the more esoteric aspects of American society.
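Before re-estimating the model at the chapter level, one rough check in this spirit (this snippet is an addition to the chapter rather than part of its original code; it reuses the tocq_lda object fitted above and the per-document “gamma” matrix that is discussed in more detail below) is to look at how the two volume-level documents load on the ten estimated topics. If the volumes really are distinct, most of their probability mass should fall on different topics.

# top three topics, by estimated proportion, for each volume-level document
tocq_volume_gamma <- tidy(tocq_lda, matrix = "gamma")

tocq_volume_gamma %>%
  group_by(document) %>%
  top_n(3, gamma) %>%
  ungroup() %>%
  arrange(document, desc(gamma))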
You can listen to the “In Our Time” episode of Melvyn Bragg discussing Democracy in America here. Given these differences in focus, we might expect a generative model to be able to assign the correct topic (i.e., Volume) with reasonable accuracy.","code":""},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"plot-relative-word-frequencies","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.3.1 Plot relative word frequencies","text":"First let’s look to see whether there really are words that obviously distinguish the two Volumes. We see that there do seem to be some marked distinguishing characteristics. In the plot below, for example, we see that abstract notions of the state and its systems appear with greater frequency in Volume 1, while Volume 2 seems to contain words specific to America (e.g., “north” and “south”) with greater frequency. The way to read this plot is that words positioned away from the diagonal line appear with greater frequency in one volume versus the other.","code":"\ntidy_tocq <- tocq %>%\n unnest_tokens(word, text) %>%\n anti_join(stop_words)## Joining with `by = join_by(word)`\n## Count most common words in both\ntidy_tocq %>%\n count(word, sort = TRUE)## # A tibble: 12,092 × 2\n## word n\n## \n## 1 people 876\n## 2 power 806\n## 3 united 781\n## 4 democratic 773\n## 5 government 718\n## 6 time 620\n## 7 nations 546\n## 8 society 531\n## 9 laws 530\n## 10 country 524\n## # ℹ 12,082 more rows\nbookfreq <- tidy_tocq %>%\n mutate(booknumber = ifelse(gutenberg_id==815, \"DiA1\", \"DiA2\")) %>%\n mutate(word = str_extract(word, \"[a-z']+\")) %>%\n count(booknumber, word) %>%\n group_by(booknumber) %>%\n mutate(proportion = n / sum(n)) %>% \n select(-n) %>% \n spread(booknumber, proportion)\n\nggplot(bookfreq, aes(x = DiA1, y = DiA2, color = abs(DiA1 - DiA2))) +\n geom_abline(color = \"gray40\", lty = 2) +\n geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +\n geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +\n scale_x_log10(labels = percent_format()) +\n scale_y_log10(labels = percent_format()) +\n scale_color_gradient(limits = c(0, 0.001), low = \"darkslategray4\", high = \"gray75\") +\n theme_tufte(base_family = \"Helvetica\") +\n theme(legend.position=\"none\", \n strip.background = element_blank(), \n strip.text.x = element_blank()) +\n labs(x = \"Tocqueville DiA 2\", y = \"Tocqueville DiA 1\") +\n coord_equal()## Warning: Removed 6173 rows containing missing values (`geom_point()`).## Warning: Removed 6174 rows containing missing values (`geom_text()`)."},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"split-into-chapter-documents","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.3.2 Split into chapter documents","text":"To do this, we first separate the volumes into chapters and repeat the above procedure. The difference is that now, instead of two documents representing the two full volumes of Tocqueville’s work, we have 132 documents, each representing an individual chapter. Notice that the sparsity is now much increased: around 96%. We then re-estimate the topic model on the new DocumentTermMatrix object, this time specifying k equal to 2. This will enable us to evaluate whether the topic model is able to generatively assign chapters to the correct volume with accuracy. Before doing so, it is worth looking at another output of the latent Dirichlet allocation procedure: the
γ probability represents per-document-per-topic probability , words, probability given document (: chapter) belongs particular topic (, assuming topics represent volumes).gamma values therefore estimated proportion words within given chapter allocated given volume.","code":"\ntocq <- tocq %>%\n filter(!is.na(text))\n\n# Divide into documents, each representing one chapter\ntocq_chapter <- tocq %>%\n mutate(booknumber = ifelse(gutenberg_id==815, \"DiA1\", \"DiA2\")) %>%\n group_by(booknumber) %>%\n mutate(chapter = cumsum(str_detect(text, regex(\"^chapter \", ignore_case = TRUE)))) %>%\n ungroup() %>%\n filter(chapter > 0) %>%\n unite(document, booknumber, chapter)\n\n# Split into words\ntocq_chapter_word <- tocq_chapter %>%\n unnest_tokens(word, text)\n\n# Find document-word counts\ntocq_word_counts <- tocq_chapter_word %>%\n anti_join(stop_words) %>%\n count(document, word, sort = TRUE) %>%\n ungroup()## Joining with `by = join_by(word)`\ntocq_word_counts## # A tibble: 69,781 × 3\n## document word n\n## \n## 1 DiA2_76 united 88\n## 2 DiA2_60 honor 70\n## 3 DiA1_52 union 66\n## 4 DiA2_76 president 60\n## 5 DiA2_76 law 59\n## 6 DiA1_42 jury 57\n## 7 DiA2_76 time 50\n## 8 DiA1_11 township 49\n## 9 DiA1_21 federal 48\n## 10 DiA2_76 constitution 48\n## # ℹ 69,771 more rows\n# Cast into DTM format for LDA analysis\n\ntocq_chapters_dtm <- tocq_word_counts %>%\n cast_dtm(document, word, n)\n\ntm::inspect(tocq_chapters_dtm)## <>\n## Non-/sparse entries: 69781/1500755\n## Sparsity : 96%\n## Maximal term length: 18\n## Weighting : term frequency (tf)\n## Sample :\n## Terms\n## Docs country democratic government laws nations people power public time\n## DiA1_11 10 0 23 19 7 13 19 15 6\n## DiA1_13 13 5 34 9 12 17 37 15 6\n## DiA1_20 9 0 25 13 2 14 32 13 10\n## DiA1_21 4 0 20 29 6 12 20 5 5\n## DiA1_23 10 0 35 9 24 20 13 4 8\n## DiA1_31 7 12 10 13 4 30 18 31 6\n## DiA1_32 10 14 25 6 9 25 11 43 8\n## DiA1_47 12 2 5 3 3 6 8 0 3\n## DiA1_56 12 0 3 7 19 3 8 3 22\n## DiA2_76 11 10 24 39 12 31 27 27 50\n## Terms\n## Docs united\n## DiA1_11 13\n## DiA1_13 19\n## DiA1_20 21\n## DiA1_21 23\n## DiA1_23 15\n## DiA1_31 11\n## DiA1_32 14\n## DiA1_47 8\n## DiA1_56 25\n## DiA2_76 88\ntocq_chapters_lda <- LDA(tocq_chapters_dtm, k = 2, control = list(seed = 1234))\ntocq_chapters_gamma <- tidy(tocq_chapters_lda, matrix = \"gamma\")\ntocq_chapters_gamma## # A tibble: 264 × 3\n## document topic gamma\n## \n## 1 DiA2_76 1 0.551 \n## 2 DiA2_60 1 1.00 \n## 3 DiA1_52 1 0.0000464\n## 4 DiA1_42 1 0.0000746\n## 5 DiA1_11 1 0.0000382\n## 6 DiA1_21 1 0.0000437\n## 7 DiA1_20 1 0.0000425\n## 8 DiA1_28 1 0.249 \n## 9 DiA1_50 1 0.0000477\n## 10 DiA1_22 1 0.0000466\n## # ℹ 254 more rows"},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"examine-consensus","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.3.3 Examine consensus","text":"Now topic probabilities, can see well unsupervised learning distinguishing two volumes generatively just words contained chapter.bad! 
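Before examining the consensus in detail, a quick supplementary visual check (not part of the original code, and assuming the tocq_chapters_gamma object created above) is to plot the estimated gamma values by volume: if the model separates the volumes well, chapters from each volume should load almost entirely on one topic.

# Supplementary sketch: distribution of gamma by volume and topic
tocq_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  labs(x = "Topic", y = "Per-document-per-topic probability (gamma)")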
see model estimated accuracy 91% chapters Volume 2 79% chapters Volume 1","code":"\n# First separate the document name into title and chapter\n\ntocq_chapters_gamma <- tocq_chapters_gamma %>%\n separate(document, c(\"title\", \"chapter\"), sep = \"_\", convert = TRUE)\n\ntocq_chapter_classifications <- tocq_chapters_gamma %>%\n group_by(title, chapter) %>%\n top_n(1, gamma) %>%\n ungroup()\n\ntocq_book_topics <- tocq_chapter_classifications %>%\n count(title, topic) %>%\n group_by(title) %>%\n top_n(1, n) %>%\n ungroup() %>%\n transmute(consensus = title, topic)\n\ntocq_chapter_classifications %>%\n inner_join(tocq_book_topics, by = \"topic\") %>%\n filter(title != consensus)## # A tibble: 15 × 5\n## title chapter topic gamma consensus\n## \n## 1 DiA1 45 1 0.762 DiA2 \n## 2 DiA1 5 1 0.504 DiA2 \n## 3 DiA1 33 1 0.570 DiA2 \n## 4 DiA1 34 1 0.626 DiA2 \n## 5 DiA1 41 1 0.512 DiA2 \n## 6 DiA1 44 1 0.765 DiA2 \n## 7 DiA1 8 1 0.791 DiA2 \n## 8 DiA1 4 1 0.717 DiA2 \n## 9 DiA1 35 1 0.576 DiA2 \n## 10 DiA1 39 1 0.577 DiA2 \n## 11 DiA1 7 1 0.687 DiA2 \n## 12 DiA1 29 1 0.983 DiA2 \n## 13 DiA1 6 1 0.707 DiA2 \n## 14 DiA2 27 2 0.654 DiA1 \n## 15 DiA2 21 2 0.510 DiA1\n# Look document-word pairs were to see which words in each documents were assigned\n# to a given topic\n\nassignments <- augment(tocq_chapters_lda, data = tocq_chapters_dtm)\nassignments## # A tibble: 69,781 × 4\n## document term count .topic\n## \n## 1 DiA2_76 united 88 2\n## 2 DiA2_60 united 6 1\n## 3 DiA1_52 united 11 2\n## 4 DiA1_42 united 7 2\n## 5 DiA1_11 united 13 2\n## 6 DiA1_21 united 23 2\n## 7 DiA1_20 united 21 2\n## 8 DiA1_28 united 14 2\n## 9 DiA1_50 united 5 2\n## 10 DiA1_22 united 8 2\n## # ℹ 69,771 more rows\nassignments <- assignments %>%\n separate(document, c(\"title\", \"chapter\"), sep = \"_\", convert = TRUE) %>%\n inner_join(tocq_book_topics, by = c(\".topic\" = \"topic\"))\n\nassignments %>%\n count(title, consensus, wt = count) %>%\n group_by(title) %>%\n mutate(percent = n / sum(n)) %>%\n ggplot(aes(consensus, title, fill = percent)) +\n geom_tile() +\n scale_fill_gradient2(high = \"red\", label = percent_format()) +\n geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +\n theme_tufte(base_family = \"Helvetica\") +\n theme(axis.text.x = element_text(angle = 90, hjust = 1),\n panel.grid = element_blank()) +\n labs(x = \"Book words assigned to\",\n y = \"Book words came from\",\n fill = \"% of assignments\")"},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"validation","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.4 Validation","text":"articles Ying, Montgomery, Stewart (2021) Denny Spirling (2018) previous weeks, read potential validation techniques.section, ’ll using preText package mentioned Denny Spirling (2018) see impact different pre-processing choices text. , adapting tutorial Matthew Denny.First need reformat text quanteda corpus object.now ready preprocess different ways. , including n-grams preprocessing text 128 different ways. 
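As an aside, the figure of 128 comes from the factorial design of preText: factorial_preprocessing() crosses seven binary preprocessing decisions—in Denny and Spirling (2018) these are punctuation removal, number removal, lowercasing, stemming, stop word removal, n-gram inclusion, and infrequent-term removal—so the number of combinations is:

2^7
## [1] 128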
takes ten minutes run machine 8GB RAM.can get results pre-processing, comparing distance documents processed different ways.can plot accordingly.","code":"\n# load in corpus of Tocequeville text data.\ncorp <- corpus(tocq, text_field = \"text\")\n# use first 10 documents for example\ndocuments <- corp[sample(1:30000,1000)]\n# take a look at the document names\nprint(names(documents[1:10]))## [1] \"text26803\" \"text25102\" \"text28867\" \"text2986\" \"text1842\" \"text25718\"\n## [7] \"text3371\" \"text29925\" \"text29940\" \"text29710\"\npreprocessed_documents <- factorial_preprocessing(\n documents,\n use_ngrams = TRUE,\n infrequent_term_threshold = 0.2,\n verbose = FALSE)\npreText_results <- preText(\n preprocessed_documents,\n dataset_name = \"Tocqueville text\",\n distance_method = \"cosine\",\n num_comparisons = 20,\n verbose = FALSE)\npreText_score_plot(preText_results)"},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"exercises-4","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.5 Exercises","text":"Choose another book set books Project GutenbergRun topic model books, changing k topics, evaluating accuracy.Validate different pre-processing techniques using preText new book(s) choice.","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"exercise-6-unsupervised-learning-word-embedding","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22 Exercise 6: Unsupervised learning (word embedding)","text":"","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"introduction-5","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.1 Introduction","text":"hands-exercise week focuses word embedding provides overview data structures, functions relevant , estimating word vectors word-embedding analyses.tutorial, learn :Generate word vectors (embeddings) via SVDTrain local word embedding model GloVeVisualize inspect resultsLoad examine pre-trained embeddingsNote: Adapts tutorials Chris Bail Julia Silge Emil Hvitfeldt Julia Silge .","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"setup-10","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.2 Setup","text":"begin reading data. data come sample 1m tweets elected UK MPs period 2017-2019. data contain just name MP-user, text tweet, MP’s party. just add ID variable called “postID.”’re working document computer (“locally”) can download tweets sample data following way:","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(stringr) # to handle text elements\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(ggthemes) # to make your plots look nice\nlibrary(text2vec) # for word embedding implementation\nlibrary(widyr) # for reshaping the text data\nlibrary(irlba) # for svd\nlibrary(umap) # for dimensionality reduction\ntwts_sample <- readRDS(\"data/wordembed/twts_corpus_sample.rds\")\n\n#create tweet id\ntwts_sample$postID <- row.names(twts_sample)\ntwts_sample <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/wordembed/twts_corpus_sample.rds?raw=true\")))"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"word-vectors-via-svd","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.3 Word vectors via SVD","text":"’re going set generating set word vectors text data. 
Note many word embedding applications use pre-trained embeddings much larger corpus, generate local embeddings using neural net-based approaches., ’re instead going generate set embeddings word vectors making series calculations based frequencies words appear different contexts. use technique called “Singular Value Decomposition” (SVD). dimensionality reduction technique first axis resulting composition designed capture variance, second second-etc…achieve ?","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"implementation","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.4 Implementation","text":"first thing need get data right format calculate -called “skip-gram probabilties.” go code line line begin understand .’s going ?Well, ’re first unnesting tweet data previous exercises. importantly, , ’re unnesting individual tokens ngrams length 6 , words, postID n words k indexed , take words i1 …i6, take words i2 …i7. Try just running first two lines code see means practice., make unique ID particular ngram create postID, make unique skipgramID postID ngram. unnest words ngram associated skipgramID.can see resulting output .next?Well can now calculate set probabilities skipgrams. pairwise_count() function widyr package. Essentially, function saying: skipgramID count number times word appears another word feature (feature skipgramID). set diag TRUE also want count number times word appears near .probability calculating number times word appears another word denominated total number word pairings across whole corpus.see, example, words vote appear 4099 times together. Denominating total n word pairings (sum(skipgram_probs$n)), gives us probability p. Okay, now skipgram probabilities need get “unigram probabilities” order normalize skipgram probabilities applying singular value decomposition.“unigram probability”? Well, just technical way saying: count appearances given word corpus divide total number words corpus. can :Finally, ’s time normalize skipgram probabilities.take skipgram probabilities, filter word pairings appear twenty times less. rename words “item1” “item2,” merge unigram probabilities words.calculate joint probability skipgram probability divided unigram probability first word pairing divided unigram probability second word pairing. equivalent : P(x,y)/P(x)P(y).essence, interpretation value : “events (words) x y occur together often expect independent”?’ve recovered normalized probabilities, can look joint probabilities given item, .e., word. , look word “brexit” look words highest value “p_together.”Higher values greater 1 indicate words likely appear close ; low values less 1 indicate unlikely appear close . , words, gives indication association two words.Using normalized probabilities, calculate PMI “Pointwise Mutual Information” value, simply log joint probability calculated .Definition time: “PMI logarithm probability finding two words together, normalized probability finding words alone.”cast word pairs sparse matrix values correspond PMI two corresponding words.Notice setting vector size equal 256. just means vector length 256 given word., set numbers used represent word length limited 256. arbitrary can changed. 
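If you want a rough sense of how much each additional dimension contributes, one supplementary check (not part of the original exercise, and assuming the pmi_svd object fitted in the chunk below) is to plot the singular values returned by irlba(); they typically decay quickly, which is why a few hundred dimensions are usually considered enough.

# Supplementary sketch: inspect the decay of the singular values
plot(pmi_svd$d, type = "l", xlab = "Dimension", ylab = "Singular value")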
Typically, size low hundreds chosen representing word vector.word vectors taken “u” column, left-singular vectors, SVD.","code":"\n#create context window with length 6\ntidy_skipgrams <- twts_sample %>%\n unnest_tokens(ngram, tweet, token = \"ngrams\", n = 6) %>%\n mutate(ngramID = row_number()) %>% \n tidyr::unite(skipgramID, postID, ngramID) %>%\n unnest_tokens(word, ngram)\n\nhead(tidy_skipgrams, n=20)## # A tibble: 20 × 4\n## username party_value skipgramID word \n## \n## 1 kirstysnp Scottish National Party 1_1 in \n## 2 kirstysnp Scottish National Party 1_1 amongst\n## 3 kirstysnp Scottish National Party 1_1 all \n## 4 kirstysnp Scottish National Party 1_1 the \n## 5 kirstysnp Scottish National Party 1_1 horror \n## 6 kirstysnp Scottish National Party 1_1 at \n## 7 kirstysnp Scottish National Party 1_2 amongst\n## 8 kirstysnp Scottish National Party 1_2 all \n## 9 kirstysnp Scottish National Party 1_2 the \n## 10 kirstysnp Scottish National Party 1_2 horror \n## 11 kirstysnp Scottish National Party 1_2 at \n## 12 kirstysnp Scottish National Party 1_2 the \n## 13 kirstysnp Scottish National Party 1_3 all \n## 14 kirstysnp Scottish National Party 1_3 the \n## 15 kirstysnp Scottish National Party 1_3 horror \n## 16 kirstysnp Scottish National Party 1_3 at \n## 17 kirstysnp Scottish National Party 1_3 the \n## 18 kirstysnp Scottish National Party 1_3 notion \n## 19 kirstysnp Scottish National Party 1_4 the \n## 20 kirstysnp Scottish National Party 1_4 horror\n#calculate probabilities\nskipgram_probs <- tidy_skipgrams %>%\n pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>% # diag = T means that we also count when the word appears twice within the window\n mutate(p = n / sum(n))\n\nhead(skipgram_probs[1000:1020,], n=20)## # A tibble: 20 × 4\n## item1 item2 n p\n## \n## 1 no to 4100 0.0000531\n## 2 vote for 4099 0.0000531\n## 3 for vote 4099 0.0000531\n## 4 see the 4078 0.0000528\n## 5 the see 4078 0.0000528\n## 6 having having 4076 0.0000528\n## 7 by of 4065 0.0000527\n## 8 of by 4065 0.0000527\n## 9 this with 4051 0.0000525\n## 10 with this 4051 0.0000525\n## 11 set set 4050 0.0000525\n## 12 right the 4045 0.0000524\n## 13 the right 4045 0.0000524\n## 14 what the 4044 0.0000524\n## 15 going to 4044 0.0000524\n## 16 the what 4044 0.0000524\n## 17 to going 4044 0.0000524\n## 18 evening evening 4035 0.0000523\n## 19 get the 4032 0.0000522\n## 20 the get 4032 0.0000522\n#calculate unigram probabilities (used to normalize skipgram probabilities later)\nunigram_probs <- twts_sample %>%\n unnest_tokens(word, tweet) %>%\n count(word, sort = TRUE) %>%\n mutate(p = n / sum(n))\n#normalize skipgram probabilities\nnormalized_prob <- skipgram_probs %>%\n filter(n > 20) %>% #filter out skipgrams with n <=20\n rename(word1 = item1, word2 = item2) %>%\n left_join(unigram_probs %>%\n select(word1 = word, p1 = p),\n by = \"word1\") %>%\n left_join(unigram_probs %>%\n select(word2 = word, p2 = p),\n by = \"word2\") %>%\n mutate(p_together = p / p1 / p2)\n\nnormalized_prob %>% \n filter(word1 == \"brexit\") %>%\n arrange(-p_together)## # A tibble: 1,016 × 7\n## word1 word2 n p p1 p2 p_together\n## \n## 1 brexit scotlandsplaceineurope 37 0.000000479 0.00278 0.00000186 92.6\n## 2 brexit preparedness 22 0.000000285 0.00278 0.00000149 68.8\n## 3 brexit dividend 176 0.00000228 0.00278 0.0000127 64.8\n## 4 brexit brexit 38517 0.000499 0.00278 0.00278 64.6\n## 5 brexit softer 50 0.000000648 0.00278 0.00000410 56.9\n## 6 brexit botched 129 0.00000167 0.00278 0.0000153 39.4\n## 7 brexit impasse 53 
0.000000687 0.00278 0.00000820 30.1\n## 8 brexit smooth 30 0.000000389 0.00278 0.00000596 23.5\n## 9 brexit frustrate 28 0.000000363 0.00278 0.00000559 23.4\n## 10 brexit deadlock 120 0.00000155 0.00278 0.0000246 22.8\n## # ℹ 1,006 more rows\npmi_matrix <- normalized_prob %>%\n mutate(pmi = log10(p_together)) %>%\n cast_sparse(word1, word2, pmi)\n\n#remove missing data\npmi_matrix@x[is.na(pmi_matrix@x)] <- 0\n#run SVD\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\n\nglimpse(pmi_matrix)## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65173 -0.01915 -0.00911 0.26937 -0.52456 ...\n## ..@ factors : list()\n#next we output the word vectors:\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\n\ndim(word_vectors)## [1] 21172 256"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"exploration","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.5 Exploration","text":"can define simple function take word vector, find similar words, nearest neighbours, given word:","code":"\nnearest_words <- function(word_vectors, word){\n selected_vector = word_vectors[word,]\n mult = as.data.frame(word_vectors %*% selected_vector) #dot product of selected word vector and all word vectors\n \n mult %>%\n rownames_to_column() %>%\n rename(word = rowname,\n similarity = V1) %>%\n anti_join(get_stopwords(language = \"en\")) %>%\n arrange(-similarity)\n\n}\n\nboris_synonyms <- nearest_words(word_vectors, \"boris\")## Joining with `by = join_by(word)`\nbrexit_synonyms <- nearest_words(word_vectors, \"brexit\")## Joining with `by = join_by(word)`\nhead(boris_synonyms, n=10)## word similarity\n## 1 johnson 0.10309556\n## 2 boris 0.09940448\n## 3 jeremy 0.04823204\n## 4 trust 0.04800155\n## 5 corbyn 0.04102031\n## 6 farage 0.03973588\n## 7 trump 0.03938184\n## 8 can.t 0.03533624\n## 9 says 0.03324624\n## 10 word 0.03267437\nhead(brexit_synonyms, n=10)## word similarity\n## 1 brexit 0.38737979\n## 2 deal 0.15083433\n## 3 botched 0.05003683\n## 4 tory 0.04377030\n## 5 unleash 0.04233445\n## 6 impact 0.04139872\n## 7 theresa 0.04017608\n## 8 approach 0.03970233\n## 9 handling 0.03901461\n## 10 orderly 0.03897535\n#then we can visualize\nbrexit_synonyms %>%\n mutate(selected = \"brexit\") %>%\n bind_rows(boris_synonyms %>%\n mutate(selected = \"boris\")) %>%\n group_by(selected) %>%\n top_n(15, similarity) %>%\n mutate(token = reorder(word, similarity)) %>%\n filter(token!=selected) %>%\n ggplot(aes(token, similarity, fill = selected)) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~selected, scales = \"free\") +\n scale_fill_manual(values = c(\"#336B87\", \"#2A3132\")) +\n coord_flip() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"glove-embeddings","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.6 GloVe Embeddings","text":"section adapts tutorials Pedro Rodriguez Dmitriy Selivanov Wouter van Gils .","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"glove-algorithm","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.7 GloVe 
algorithm","text":"section taken text2vec package page .GloVe algorithm pennington_glove_2014 consists following steps:Collect word co-occurence statistics form word co-ocurrence matrix \\(X\\). element \\(X_{ij}\\) matrix represents often word appears context word j. Usually scan corpus following manner: term look context terms within area defined window_size term window_size term. Also give less weight distant words, usually using formula: \\[decay = 1/offset\\]Collect word co-occurence statistics form word co-ocurrence matrix \\(X\\). element \\(X_{ij}\\) matrix represents often word appears context word j. Usually scan corpus following manner: term look context terms within area defined window_size term window_size term. Also give less weight distant words, usually using formula: \\[decay = 1/offset\\]Define soft constraints word pair: \\[w_i^Tw_j + b_i + b_j = log(X_{ij})\\] \\(w_i\\) - vector main word, \\(w_j\\) - vector context word, \\(b_i\\), \\(b_j\\) scalar biases main context words.Define soft constraints word pair: \\[w_i^Tw_j + b_i + b_j = log(X_{ij})\\] \\(w_i\\) - vector main word, \\(w_j\\) - vector context word, \\(b_i\\), \\(b_j\\) scalar biases main context words.Define cost function\n\\[J = \\sum_{=1}^V \\sum_{j=1}^V \\; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \\log X_{ij})^2\\]\n\\(f\\) weighting function help us prevent learning extremely common word pairs. GloVe authors choose following function:Define cost function\n\\[J = \\sum_{=1}^V \\sum_{j=1}^V \\; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \\log X_{ij})^2\\]\n\\(f\\) weighting function help us prevent learning extremely common word pairs. GloVe authors choose following function:\\[\nf(X_{ij}) =\n\\begin{cases}\n(\\frac{X_{ij}}{x_{max}})^\\alpha & \\text{} X_{ij} < XMAX \\\\\n1 & \\text{otherwise}\n\\end{cases}\n\\]go implementing algorithm R?Let’s first make sure loaded packages need:","code":"\nlibrary(text2vec) # for implementation of GloVe algorithm\nlibrary(stringr) # to handle text strings\nlibrary(umap) # for dimensionality reduction later on"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"implementation-1","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.8 Implementation","text":"need set choice parameters GloVe model. first window size WINDOW_SIZE, , , arbitrary normally set around 6-8. means looking word context words 6 words around target word. image illustrates choice parameter word “cat” given sentence, increase context window size:ultimately understood matrix format :iterations parameter ITERS simply sets maximum number iterations allow model convergence. number iterations relatively high model likely converge 100 iterations.DIM parameter specifies length word vector want result (.e., just set limit 256 SVD approach ). Finally, COUNT_MIN specifying minimum count words want keep. words, word appears fewer ten times, discarded. , discarded word pairings appeared fewer twenty times.next “shuffle” text. just means randomly reordering character vector tweets.create list object, tokenizing text tweet within item list. , create vocabulary object needed implement GloVe algorithm. creating “itoken” object itoken() creating vocabulary create_vocabulary. remove words exceed specified threshold prune_vocabulary().Next vectorize vocabulary create term co-occurrence matrix. , similar created matrix PMIs word pairings corpus.set final model parameters, learning rate fit model. whole process take time. 
save time working tutorial, may also download resulting embedding Github repo linked little .Finally, get resulting word embedding save .rds file.save time working tutorial, may also download resulting embedding Github repo :","code":"\n# ================================ choice parameters\n# ================================\nWINDOW_SIZE <- 6\nDIM <- 300\nITERS <- 100\nCOUNT_MIN <- 10\n# shuffle text\nset.seed(42L)\ntext <- sample(twts_sample$tweet)\n# ================================ create vocab ================================\ntokens <- space_tokenizer(text)\nit <- itoken(tokens, progressbar = FALSE)\nvocab <- create_vocabulary(it)\nvocab_pruned <- prune_vocabulary(vocab, term_count_min = COUNT_MIN) # keep only words that meet count threshold\n# ================================ create term co-occurrence matrix\n# ================================\nvectorizer <- vocab_vectorizer(vocab_pruned)\ntcm <- create_tcm(it, vectorizer, skip_grams_window = WINDOW_SIZE, skip_grams_window_context = \"symmetric\", \n weights = rep(1, WINDOW_SIZE))\n# ================================ set model parameters\n# ================================\nglove <- GlobalVectors$new(rank = DIM, x_max = 100, learning_rate = 0.05)\n\n# ================================ fit model ================================\nword_vectors_main <- glove$fit_transform(tcm, n_iter = ITERS, convergence_tol = 0.001, \n n_threads = RcppParallel::defaultNumThreads())\n# ================================ get output ================================\nword_vectors_context <- glove$components\nglove_embedding <- word_vectors_main + t(word_vectors_context) # word vectors\n\n# ================================ save ================================\nsaveRDS(glove_embedding, file = \"local_glove.rds\")\nurl <- \"https://github.com/cjbarrie/CTA-ED/blob/main/data/wordembed/local_glove.rds?raw=true\"\nglove_embedding <- readRDS(url(url, method=\"libcurl\"))"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"visualization","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.9 Visualization","text":"explore embeddings? Well, imagine embeddings look something dissimilar visualization another embedding . words, talking something doesn’t lend projection 2D space!…hope lost, space travellers. smart technique McInnes, Healy, Melville (2020) linked describes way reduce dimensionality embedding layers using called “Uniform Manifold Approximation Projection.” ? Well, happily, umap package pretty straightforward!helpful? Well, number reasons, particularly helpful visualizing embeddings two-dimensional space.can see, , embeddings seem make sense. zoomed first little outgrowth 2D mapping, seemed correspond numbers number words. 
looked words around “economy” see related terms like “growth” “jobs.”","code":"\n# GloVe dimension reduction\nglove_umap <- umap(glove_embedding, n_components = 2, metric = \"cosine\", n_neighbors = 25, min_dist = 0.1, spread=2)\n# Put results in a dataframe for ggplot\ndf_glove_umap <- as.data.frame(glove_umap[[\"layout\"]])\n\n# Add the labels of the words to the dataframe\ndf_glove_umap$word <- rownames(df_glove_umap)\ncolnames(df_glove_umap) <- c(\"UMAP1\", \"UMAP2\", \"word\")\n\n# Plot the UMAP dimensions\nggplot(df_glove_umap) +\n geom_point(aes(x = UMAP1, y = UMAP2), colour = 'blue', size = 0.05) +\n ggplot2::annotate(\"rect\", xmin = -3, xmax = -2, ymin = 5, ymax = 7,alpha = .2) +\n labs(title = \"GloVe word embedding in 2D using UMAP\")\n# Plot the shaded part of the GloVe word embedding with labels\nggplot(df_glove_umap[df_glove_umap$UMAP1 < -2.5 & df_glove_umap$UMAP1 > -3 & df_glove_umap$UMAP2 > 5 & df_glove_umap$UMAP2 < 6.5,]) +\n geom_point(aes(x = UMAP1, y = UMAP2), colour = 'blue', size = 2) +\n geom_text(aes(UMAP1, UMAP2, label = word), size = 2.5, vjust=-1, hjust=0) +\n labs(title = \"GloVe word embedding in 2D using UMAP - partial view\") +\n theme(plot.title = element_text(hjust = .5, size = 14))\n# Plot the word embedding of words that are related for the GloVe model\nword <- glove_embedding[\"economy\",, drop = FALSE]\ncos_sim = sim2(x = glove_embedding, y = word, method = \"cosine\", norm = \"l2\")\nselect <- data.frame(rownames(as.data.frame(head(sort(cos_sim[,1], decreasing = TRUE), 25))))\ncolnames(select) <- \"word\"\nselected_words <- df_glove_umap %>% \n inner_join(y=select, by= \"word\")\n\n#The ggplot visual for GloVe\nggplot(selected_words, aes(x = UMAP1, y = UMAP2)) + \n geom_point(show.legend = FALSE) + \n geom_text(aes(UMAP1, UMAP2, label = word), show.legend = FALSE, size = 2.5, vjust=-1.5, hjust=0) +\n labs(title = \"GloVe word embedding of words related to 'economy'\") +\n theme(plot.title = element_text(hjust = .5, size = 14))"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"exercises-5","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.10 Exercises","text":"Inspect visualize nearest neighbour synonyms relevant words tweets corpusIdentify another region interest GloVe-trained model visualize","code":""},{"path":"exercise-7-sampling-text-information.html","id":"exercise-7-sampling-text-information","chapter":"23 Exercise 7: Sampling text information","heading":"23 Exercise 7: Sampling text information","text":"","code":""},{"path":"exercise-7-sampling-text-information.html","id":"introduction-6","chapter":"23 Exercise 7: Sampling text information","heading":"23.1 Introduction","text":"hands-exercise week focuses collect /sample text information.tutorial, learn :Access text information online corporaQuery text information using different APIsScrape text information programmaticallyTranscribe text information audioExtract text information images","code":""},{"path":"exercise-7-sampling-text-information.html","id":"online-corpora","chapter":"23 Exercise 7: Sampling text information","heading":"23.2 Online corpora","text":"","code":""},{"path":"exercise-7-sampling-text-information.html","id":"replication-datasets","chapter":"23 Exercise 7: Sampling text information","heading":"23.2.1 Replication datasets","text":"large numbers online corpora replication datasets available access freely online. 
first access example using dataverse package R, allows us download directly replication data repositories stored Harvard Dataverse.Let’s take example dataset might interested: UK parliamentary speech data fromWe first need set en environment variable .can search files want specifying DOI publication data question. can find series numbers letters come “https://doi.org/” shown .choose get UK data files, listed “UK_data.csv.” can download directly following way (take time file size >1GB).course, also download data manually, clicking buttons relevant Harvard Dataverse—sometimes useful build every step data collection code documentation, making analysis entirely programatically reproducible start finish.Note well don’t search specific datasets already know . can also use dataverse package search datasets dataverses. can simply following way.","code":"\nlibrary(dataverse)\nlibrary(dplyr)\nSys.setenv(\"DATAVERSE_SERVER\" = \"dataverse.harvard.edu\")\ndataset <- get_dataset(\"10.7910/DVN/QDTLYV\")\ndataset$files[c(\"filename\", \"contentType\")]## filename\n## 1 1-uk.do\n## 2 2-ireland.do\n## 3 3-word_clouds.py\n## 4 4-trends.R\n## 5 5-predictive_margins.R\n## 6 6-barplot_topics.R\n## 7 7-plot_media.R\n## 8 8-histogram.R\n## 9 commons_stats.tab\n## 10 emotive_cloud.tab\n## 11 emotive_ireland.tab\n## 12 emotive_uk.tab\n## 13 ireland_data.csv\n## 14 neutral_cloud.tab\n## 15 neutral_ireland.tab\n## 16 neutral_uk.tab\n## 17 README.docx\n## 18 uk_data.csv\n## contentType\n## 1 application/x-stata-syntax\n## 2 application/x-stata-syntax\n## 3 text/x-python\n## 4 type/x-r-syntax\n## 5 type/x-r-syntax\n## 6 type/x-r-syntax\n## 7 type/x-r-syntax\n## 8 type/x-r-syntax\n## 9 text/tab-separated-values\n## 10 text/tab-separated-values\n## 11 text/tab-separated-values\n## 12 text/tab-separated-values\n## 13 text/csv\n## 14 text/tab-separated-values\n## 15 text/tab-separated-values\n## 16 text/tab-separated-values\n## 17 application/vnd.openxmlformats-officedocument.wordprocessingml.document\n## 18 text/csv\ndata <- get_dataframe_by_name(\n \"uk_data.csv\",\n \"10.7910/DVN/QDTLYV\",\n .f = function(x) read.delim(x, sep = \",\"))\nsearch_results <- dataverse_search(\"corpus politics text\", type = \"dataset\", per_page = 10)## 10 of 37533 results retrieved\nsearch_results[,1:3]## name\n## 1 \"A Deeper Look at Interstate War Data: Interstate War Data Version 1.1\"\n## 2 \"Birth Legacies, State Making, and War.\"\n## 3 \"CBS Morning News\" Shopping Habits and Lifestyles Poll, January 1989\n## 4 \"Cuadro histórico del General Santa Anna. 2a. 
parte,\" 1857\n## 5 \"Don't Know\" Means \"Don't Know\": DK Responses and the Public's Level of Political Knowledge\n## 6 \"El déspota Santa-Anna ante los veteranos de la Independencia,\" 1844 Diciembre 09\n## 7 \"European mood\" bi-annual data, EU27 member states (1973-2014), Replication Data\n## 8 \"Government Partisanship and Electoral Accountability\" Political Research Quarterly 72(3): 727-743\n## 9 \"I Didn't Lie, I Misspoke\": Voters' Responses to Questionable Campaign Claims\n## 10 \"I Hope to Hell Nothing Goes Back to The Way It Was Before\": COVID-19, Marginalization, and Native Nations\n## type url\n## 1 dataset https://doi.org/10.7910/DVN/E2CEP5\n## 2 dataset https://doi.org/10.7910/DVN/EP7DXB\n## 3 dataset https://doi.org/10.3886/ICPSR09230.v1\n## 4 dataset https://doi.org/10.18738/T8/Z0JH2C\n## 5 dataset https://doi.org/10.7910/DVN/G9NOQO\n## 6 dataset https://doi.org/10.18738/T8/U71QSD\n## 7 dataset https://doi.org/10.7910/DVN/V42M9J\n## 8 dataset https://doi.org/10.7910/DVN/5OG9VV\n## 9 dataset https://doi.org/10.7910/DVN/GE3E8R\n## 10 dataset https://doi.org/10.7910/DVN/Y916NP"},{"path":"exercise-7-sampling-text-information.html","id":"curated-corpora","chapter":"23 Exercise 7: Sampling text information","heading":"23.2.2 Curated corpora","text":", course, many sources might go text information. list might interest :Large English-language corpora: https://www.corpusdata.org/Wikipedia data dumps: https://meta.wikimedia.org/wiki/Data_dumps\nEnglish version dumps \nEnglish version dumps hereScottish Corpus Texts & Speech: https://www.scottishcorpus.ac.uk/Corpus Scottish modern writing: https://www.scottishcorpus.ac.uk/cmsw/Manifesto Corpus: https://manifesto-project.wzb.eu/information/documents/corpusReddit Pushshift data: https://files.pushshift.io/reddit/Mediacloud: https://mediacloud.org/\nR package: https://github.com/joon-e/mediacloud\nR package: https://github.com/joon-e/mediacloudFeel free recommend sources add list, intended growing index relevant text corpora social science research!","code":""},{"path":"exercise-7-sampling-text-information.html","id":"using-apis","chapter":"23 Exercise 7: Sampling text information","heading":"23.3 Using APIs","text":"order use YouTube API, ’ll first need get authorization token. can obtained anybody, without academic profile (.e., unlike academictwitteR) previous worksheets.order get authorization credentials, can follow guide. need account Google Cloud console order . main three steps :create “Project” Google Cloud console;associate YouTube API Project;enable API keys APIOnce created Project (: called “tuberalt1” case) see landing screen like .can get credentials navigating menu left hand side selecting credentials:Now click name project (“tuberalt1”) taken page containing two pieces information: “client ID” “client secret”.client ID referred “app ID” tuber packaage client secret “app secret” mentioned tuber package.credentials, can log R environment yt_oauth function tuber package. function takes two arguments: “app ID” “app secret”. provided associated YouTube API Google Cloud console project.","code":""},{"path":"exercise-7-sampling-text-information.html","id":"getting-youtube-data","chapter":"23 Exercise 7: Sampling text information","heading":"23.4 Getting YouTube data","text":"paper (haroon2022?), authors analyze recommended videos particular used based watch history seed video. 
, won’t replicate first step look recommended videos appear based seed video.case, seed video video Jordan Peterson predicting death mainstream media. fairly “alternative” content actively taking stance mainstream media. mean YouTube learn recommend us away mainstream content?, first take unique identifying code string video. can find url video shown .can collect videos recommended basis video seed video. store data.frame object rel_vids.can look recommended videos basis seed video .seems YouTube recommends us back lot videos relating Jordan Peterson. mainstream outlets; others obscure sources.","code":"\nlibrary(tidyverse)\nlibrary(readxl)\ndevtools::install_github(\"soodoku/tuber\") # need to install development version is there is problem with CRAN versions of the package functions\nlibrary(tuber)\n\nyt_oauth(\"431484860847-1THISISNOTMYREALKEY7jlembpo3off4hhor.apps.googleusercontent.com\",\"2niTHISISMADEUPTOO-l9NPUS90fp\")\n\n#get related videos\nstartvid <- \"1Gp7xNnW5n8\"\nrel_vids <- get_related_videos(startvid, max_results = 50, safe_search = \"none\")"},{"path":"exercise-7-sampling-text-information.html","id":"questions","chapter":"23 Exercise 7: Sampling text information","heading":"23.5 Questions","text":"Make request YouTube API different seed video.Make request YouTube API different seed video.Collect one video ID channels included resulting dataCollect one video ID channels included resulting dataWrite loop collect recommended videos video IDsWrite loop collect recommended videos video IDs","code":""},{"path":"exercise-7-sampling-text-information.html","id":"other-apis-r-packages","chapter":"23 Exercise 7: Sampling text information","heading":"23.5.1 Other APIs (R packages)","text":"https://cran.r-project.org/web/packages/manifestoR/index.htmlhttps://cran.r-project.org/web/packages/academictwitteR/index.htmlhttps://cran.r-project.org/web/packages/vkR/vkR.pdf","code":""},{"path":"exercise-7-sampling-text-information.html","id":"scraping","chapter":"23 Exercise 7: Sampling text information","heading":"23.6 Scraping","text":"practice skill, use series webpages Internet Archive host material collected Arab Spring protests Egypt 2011. original website can seen .proceeding, ’ll load remaining packages need tutorial.can download final dataset produce :can also view formatted output scraping exercise, alongside images documents question, Google Sheets .’re working document computer (“locally”) can download Tahrir documents data following way:Let’s look end producing:going return Internet Archived webpages see can produce final formatted dataset. archived Tahrir Documents webpages can accessed .first want expect contents webpage stored.scroll bottom page, see listed number hyperlinks documents stored month:click documents stored March click top listed pamphlet entitled “Season Anger Sets Among Arab Peoples.” can access .store url inspect HTML contains follows:Well, isn’t particularly useful. Let’s now see can extract text contained inside.Well looks pretty terrifying now…need way quickly identifying relevant text can specify scraping. widely-used tool achieve “Selector Gadget” Chrome Extension. can add browser free .tool works allowing user point click elements webpage (“CSS selectors”). Unlike alternatives, “Inspect Element” browser tools, easily able see webpage item contained within CSS selectors (rather HTML tags alone), easier parse.can Tahrir documents :now know main text translated document contained “p” HTML tags. 
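In rvest terms, the extraction described in the next few paragraphs amounts to selecting elements by a CSS selector and pulling out the text they contain. A minimal sketch, reusing the html object we read in above (the "p" and ".calendar" selectors are the ones identified with the Selector Gadget tool below):

html %>% html_elements("p") %>% html_text()         # the translated pamphlet text
html %>% html_elements(".calendar") %>% html_text() # the date assigned to the document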
To identify the text contained within these HTML tags, we can run the two rvest calls sketched above. Well, that looks quite a lot more manageable…! So what is happening here? Essentially, the html_elements() function is scanning the page and collecting the HTML elements contained within the tags we specify, which here we collect using the “p” CSS selector. We are then just grabbing the text contained in that part of the page with the html_text() function. This gives us one way of capturing the text, but what if we wanted to get other elements of the document, for example the date or the tags attributed to the document? Well, we can do the same thing again. Let’s take the example of getting the date: we see that the date is identified by the “.calendar” CSS selector, and we enter this into the html_elements() function as before. Of course, this is all well and good, but we also need a way to scale—we can’t just keep repeating this process for every page we find, or it wouldn’t be much quicker than just copying and pasting. So how can we do this? Well, we first need to understand the URL structure of the website in question. If we scroll down the page, we see listed a number of documents. Each directs to an individual pamphlet distributed during the protests of the 2011 Egyptian Revolution. Click on one and you will see how the URL changes. Take note of what the starting URL is; if we then click on March 2011, the first month for which there are documents, we see how the url changes, and the same goes for August 2011 and for January 2012. Notice that for each month, the URL changes with the addition of the month and year between back slashes at the end of the URL. In the next section, we go over how to efficiently create the set of URLs we need in order to loop through and retrieve the information contained on each individual webpage. We are going to want to retrieve the text of the documents archived for each month. To do so, our first task is to store these webpages as a series of strings. We could do this manually, for example by pasting the year and month strings onto the end of the URL for each month from March, 2011 to January, 2012—but that wouldn’t be particularly efficient… Instead, we can wrap this in a loop. So what’s going on here? Well, we are first specifying our starting URL. We are then iterating through the numbers 3 to 13. We are telling R to take a new URL each time which, depending on the number of the loop i, takes the base starting url— https://wayback.archive-it.org/2358/20120130143023/http://www.tahrirdocuments.org/ —and pastes onto the end of it the string “2011/0”, the number of the loop i, and “/”. So, for the first “i” in the loop—the number 3—this is effectively calling the equivalent of pasting “2011/0”, 3, and “/” onto the base URL. Here, the ifelse() commands are simply telling R: if i (the number of the loop we are on) is less than 10, then paste0(url,\"2011/0\",i,\"/\"); i.e., if i is less than 10 then paste together “2011/0”, “i” and “/”. So the number 3 becomes \"https://wayback.archive-it.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/03/\", and the number 4 becomes \"https://wayback.archive-it.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/04/\". If, however, i is >=10 & <=12 (greater than or equal to 10 and less than or equal to 12), we are calling paste0(url,\"2011/\",i,\"/\") because we do not need the first \"0\" for these months. Finally, if i is greater than 12 (the else), we are calling paste0(url,\"2012/01/\"). For this last call, notice, we do not have to specify whether i is greater than or equal to 12 because we have wrapped everything in ifelse() commands. ifelse() calls like this are telling R: if x “meets this condition” then do y, otherwise do z. By wrapping multiple ifelse() calls within each other, we are effectively telling R: if x “meets this condition” then do y, if x “meets this other condition” then do z, otherwise do a. Here, the “otherwise do a” part of the ifelse() calls is saying: if i is not less than 10, and not between 10 and 12, then paste “2012/01/” onto the end of the URL. Got it? I didn’t even get it on the first reading… and I wrote it. The best way to understand what is going on is to run the code and look at each part of it. We now have a list of URLs for each month. What next? Well, if we go onto the page for a particular month, let’s say March, we see that the page has multiple paginated tabs at the bottom. Let’s see what happens to the URL when we click on one of these. We see that the starting point of the URL for March is, as before, the month URL; but if we click on page 2, and then on page 3, the URL changes again. We can see pretty clearly that as we navigate through each page, the string “page/2/” or “page/3/” is appended to the URL. So it shouldn’t be too tricky to add these to our list of URLs. But we want to avoid having to manually click through the archive for each month to figure out how many pagination tabs there are at the bottom of each page. Fortunately, we don’t have to. Using the “Selector Gadget” tool, we can automate the process of grabbing the highest number that appears in the pagination bar of each month’s pages. The code below achieves this. So what’s going on here? Well, in the first two lines, we are simply creating an empty character string that we’re going to populate in the subsequent loop. 
Remember that we have a set of eleven starting URLs for the months of the archived webpage. The code beginning for(i in seq_along(files)) is saying, in a similar way to before: for each i from the beginning to the end of this set of urls, do the following loop. First, we read in the url with url <- urls[i] and read the html it contains with html <- read_html(url). In the next line, we are getting the pages as a character vector of page numbers by calling the html_elements() function on the ".page" tag. This gives a series of pages stored as e.g. "1", "2", "3". In order to be able to see how many there are, we need to extract the highest number that appears in this string. To do so, we first need to reformat it as an "integer" object rather than a "character" object so that R can recognize the numbers. For this we call pageints <- as.integer(pages). We then get the maximum by simply calling npages <- max(pageints, na.rm = T). In the next part of the loop, we take this new information stored in "npages," i.e., the number of pagination tabs for each month, and tell R: for each of these pages, define a new url by adding "page/", the number of the pagination tab "j", and "/". Once we’ve bound these all together, we get our full list of URLs (a minimal sketch of these two loop steps is given below). What next? The next step is to get the URLs for each of the documents contained in the archive for each month. How? Well, we can again use the "Selector Gadget" tool to work this out. On the main landing pages for each month, we see listed, as before, a document list. For each of these documents, we see that the title, which links to the revolutionary leaflet in question, is contained within two CSS selectors: "h2" and ".post". We can pass these tags to html_elements() to grab what’s contained inside them. We can then grab what’s contained inside these by extracting their "children" classes. In essence, this just means going down to a lower-level tag: tags can have tags within tags, and these flow downwards a bit like a family tree (hence the name, I suppose). Since one of the "children" of these HTML tags is the link contained inside them, we can get this by calling html_children() followed by specifying that we want the specific attribute of the web link it encloses with html_attr(\"href\"). The subsequent lines just remove extraneous information. With the complete loop, then, we can retrieve the URL of the page for every leaflet contained on the website, which gives us 523 separate URLs—one for every revolutionary leaflet contained in these pages. Now we’re in a great position to be able to crawl each page and collect the information we need. In the final loop we need to go to each URL we’re interested in and collect the relevant information about the document: its text, title, date, tags, and the URL of the image of the revolutionary literature itself. See if you can work out how each part fits together. 
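For reference, here is a minimal sketch of the two URL-building steps described above: first the month URLs, then the pagination URLs. Variable names are illustrative and the original code may differ in its details; the base URL is the archived one given in the text, and it is assumed that rvest and the pipe are loaded as in the chunk below.

# Step 1: one URL per archived month (March 2011 to January 2012)
baseurl <- "https://wayback.archive-it.org/2358/20120130143023/http://www.tahrirdocuments.org/"
urls <- character()
for (i in 3:13) {
  url <- ifelse(i < 10, paste0(baseurl, "2011/0", i, "/"),
                ifelse(i >= 10 & i <= 12, paste0(baseurl, "2011/", i, "/"),
                       paste0(baseurl, "2012/01/")))
  urls <- c(urls, url)
}

# Step 2: for each month, find the highest pagination number and add one URL per page
pages_urls <- character()
for (i in seq_along(urls)) {
  url <- urls[i]
  html <- read_html(url)
  pages <- html %>% html_elements(".page") %>% html_text()
  npages <- max(as.integer(pages), na.rm = TRUE) # months with a single page would need a guard here
  for (j in 1:npages) {
    pages_urls <- c(pages_urls, paste0(url, "page/", j, "/"))
  }
}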
NOTE: want run final loop machines take several hours complete.now… ’re pretty much …back started!","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(stringr) # to handle text elements\nlibrary(rvest) #for scraping\npamphdata <- read_csv(\"data/sampling/pamphlets_formatted_gsheets.csv\")## Rows: 523 Columns: 8\n## ── Column specification ─────────────────────────────────────────────────────────\n## Delimiter: \",\"\n## chr (6): title, text, tags, imageurl, imgID, image\n## dbl (1): year\n## date (1): date\n## \n## ℹ Use `spec()` to retrieve the full column specification for this data.\n## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\npamphdata <- read_csv(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sampling/pamphlets_formatted_gsheets.csv\")\nhead(pamphdata)## # A tibble: 6 × 8\n## title date year text tags imageurl imgID image\n## \n## 1 The Season of Anger Sets in … 2011-03-30 2011 The … Soli… https:/… imgI… =Arr…\n## 2 The Most Important Workers’ … 2011-03-30 2011 [Voi… Soli… https:/… imgI… \n## 3 Yes it’s the Workers’ and Em… 2011-03-30 2011 [Voi… Soli… https:/… imgI… \n## 4 The Revolution is Still Ongo… 2011-03-30 2011 [Voi… Revo… https:/… imgI… \n## 5 Voice of the Revolution, #3 2011-03-30 2011 Febr… Revo… https:/… imgI… \n## 6 We Are Still Continuing Unti… 2011-03-29 2011 We A… Dema… https:/… imgI… \nurl <- \"https://wayback.archive-it.org/2358/20120130161341/http://www.tahrirdocuments.org/2011/03/voice-of-the-revolution-3-page-2/\"\n\nhtml <- read_html(url)\n\nhtml## {html_document}\n## \n## [1] \\n NewFile --> RScript"},{"path":"introduction-to-r.html","id":"a-simple-example","chapter":"Introduction to R","heading":"0.5 A simple example","text":"Script (top left) write commands R. can try first time writing small snipped code follows:tell R run command, highlight relevant row script click Run button (top right Script) - hold ctrl+enter Windows cmd+enter Mac - send command Console (bottom left), actual evaluation calculations taking place. shortcut keys become familiar quickly!Running command creates object named ‘x’, contains words message.can now see ‘x’ Environment (top right). view contained x, type Console (bottom left):","code":"\nx <- \"I can't wait to learn Computational Text Analysis\" #Note the quotation marks!\nprint(x)## [1] \"I can't wait to learn Computational Text Analysis\"\n# or alternatively you can just type:\n\nx## [1] \"I can't wait to learn Computational Text Analysis\""},{"path":"introduction-to-r.html","id":"loading-packages","chapter":"Introduction to R","heading":"0.6 Loading packages","text":"‘base’ version R powerful able everything , least ease. technical specialized forms analysis, need load new packages.need install -called ‘package’—program includes new tools (.e., functions) carry specific tasks. can think ‘extensions’ enhancing R’s capacities.take one example, might want something little exciting print excited course. Let’s make map instead.might sound technical. beauty packaged extensions R contain functions perform specialized types analysis ease.’ll first need install one packages, can :package installed, need load environment typing library(). Note , , don’t need wrap name package quotation marks. trick:now? 
Well, let’s see just easy visualize data using ggplot package comes bundled larger tidyverse package.wanted save ’d got making plots, want save scripts, maybe data used well, return later stage.","code":"\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy))"},{"path":"introduction-to-r.html","id":"saving-your-objects-plots-and-scripts","chapter":"Introduction to R","heading":"0.7 Saving your objects, plots and scripts","text":"Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.save individual objects (example x ) environment, run following command (choosing suitable filename):save individual objects (example x ) environment, run following command (choosing suitable filename):save objects (.e. everything top right panel) , run following command (choosing suitable filename):objects can re-loaded R next session running:many file formats might use save output. encounter course progresses.","code":"\nsave(x,file=\"myobject.RData\")\nload(file=\"myobject.RData\")\nsave.image(file=\"myfilname.RData\")\nload(file=\"myfilename.RData\")"},{"path":"introduction-to-r.html","id":"knowing-where-r-saves-your-documents","chapter":"Introduction to R","heading":"0.8 Knowing where R saves your documents","text":"home, open new script make sure check set working directory (.e. folder files create saved). check working directory use getwd() command (type Console write script Source Editor):set working directory, run following command, substituting file directory choice. Remember anything following `#’ symbol simply clarifying comment R process .","code":"\ngetwd()\n## Example for Mac \nsetwd(\"/Users/Documents/mydir/\") \n## Example for PC \nsetwd(\"c:/docs/mydir\") "},{"path":"introduction-to-r.html","id":"practicing-in-r","chapter":"Introduction to R","heading":"0.9 Practicing in R","text":"best way learn R use . workshops text analysis place become fully proficient R. , however, chance conduct hands-analysis applied examples fast-expanding field. best way learn . 
give shot!practice R programming language, look Wickham Grolemund (2017) , tidy text analysis, Silge Robinson (2017).free online book Hadley Wickham “R Data Science” available hereThe free online book Hadley Wickham “R Data Science” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereFor practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:","code":"\nlibrary(learnr)\n\navailable_tutorials() # this will tell you the names of the tutorials available\n\nrun_tutorial(name = \"ex-data-basics\", package = \"learnr\") #this will launch the interactive tutorial in a new Internet browser window"},{"path":"introduction-to-r.html","id":"one-final-note","chapter":"Introduction to R","heading":"0.10 One final note","text":"’ve dipped “R Data Science” book ’ll hear lot -called tidyverse R. essentially set packages use alternative, intuitive, way interacting data.main difference ’ll notice , instead separate lines function want run, wrapping functions inside functions, sets functions “piped” using “pipe” functions, look appearance: %>%.using “tidy” syntax weekly exercises computational text analysis workshops. anything unclear, can provide equivalents “base” R . lot useful text analysis packages now composed ‘tidy’ syntax.","code":""},{"path":"week-1-retrieving-and-analyzing-text.html","id":"week-1-retrieving-and-analyzing-text","chapter":"1 Week 1: Retrieving and analyzing text","heading":"1 Week 1: Retrieving and analyzing text","text":"first task conducting large-scale text analyses gathering curating text information . focus chapters Manning, Raghavan, Schtze (2007) listed . , ’ll find introduction different ways can reformat ‘query’ text data order begin asking questions . often referred computer science natural language processing contexts “information retrieval” foundation many search, including web search, processes.articles Tatman (2017) Pechenick, Danforth, Dodds (2015) focus seminar (Q&). articles get us thinking fundamentals text discovery sampling. reading articles think locating texts, sampling , biases might inhere sampling process, texts represent; .e., population phenomenon interest might provide inferences.Questions seminar:access text? need consider ?sample texts?biases need keep mind?Required reading:Tatman (2017)Tatman (2017)Pechenick, Danforth, Dodds (2015)Pechenick, Danforth, Dodds (2015)Manning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlKlaus Krippendorff (2004) (ch. 6)Klaus Krippendorff (2004) (ch. 6)reading:Olteanu et al. (2019)Biber (1993)Barberá Rivero (2015)Slides:Week 1 Slides","code":""},{"path":"week-2-tokenization-and-word-frequencies.html","id":"week-2-tokenization-and-word-frequencies","chapter":"2 Week 2: Tokenization and word frequencies","heading":"2 Week 2: Tokenization and word frequencies","text":"approaching large-scale quantiative analyses text, key task identify capture unit analysis. One commonly used approaches, across diverse analytical contexts, text tokenization. 
, splitting text word units: unigrams, bigrams, trigrams etc.chapters Manning, Raghavan, Schtze (2007), listed , provide technical introduction task “querying” text according different word-based queries. task studying hands-assignment week.seminar discussion, focusing widely-cited examples research applied social sciences employing token-based, word frequency, analyses large corpora. first, Michel et al. (2011) uses enormous Google books corpus measure cultural linguistic trends. second, Bollen et al. (2021a) uses corpus demonstrate specific change time—-called “cognitive distortion.” examples, attentive questions sampling covered previous weeks. question central back--forths short responses replies articles Michel et al. (2011) Bollen et al. (2021a).Questions:Tokenizing counting: capture?Corpus-based sampling: biases might threaten inference?write critique either Michel et al. (2011) Bollen et al. (2021a), focus ?Required reading:Michel et al. (2011)\nSchwartz (2011)\nMorse-Gagné (2011)\nAiden, Pickett, Michel (2011)\nSchwartz (2011)Morse-Gagné (2011)Aiden, Pickett, Michel (2011)Bollen et al. (2021a)\nSchmidt, Piantadosi, Mahowald (2021)\nBollen et al. (2021b)\nSchmidt, Piantadosi, Mahowald (2021)Bollen et al. (2021b)Manning, Raghavan, Schtze (2007) (ch. 2): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]Klaus Krippendorff (2004) (ch. 5)reading:Rozado, Al-Gharbi, Halberstadt (2021)Alshaabi et al. (2021)Campos et al. (2015)Greenfield (2013)Slides:Week 2 Slides","code":""},{"path":"week-2-demo.html","id":"week-2-demo","chapter":"3 Week 2 Demo","heading":"3 Week 2 Demo","text":"","code":""},{"path":"week-2-demo.html","id":"setup","chapter":"3 Week 2 Demo","heading":"3.1 Setup","text":"section, ’ll quick overview ’re processing text data conducting analyses word frequency. ’ll using randomly simulated text.First load packages ’ll using:","code":"\nlibrary(stringi) #to generate random text\nlibrary(dplyr) #tidyverse package for wrangling data\nlibrary(tidytext) #package for 'tidy' manipulation of text data\nlibrary(ggplot2) #package for visualizing data\nlibrary(scales) #additional package for formatting plot axes\nlibrary(kableExtra) #package for displaying data in html format (relevant for formatting this worksheet mainly)"},{"path":"week-2-demo.html","id":"tokenizing","chapter":"3 Week 2 Demo","heading":"3.2 Tokenizing","text":"’ll first get random text see looks like ’re tokenizing text.can tokenize unnest_tokens() function tidytext.Now ’ll get larger data, simulating 5000 observations (rows) random Latin text strings.’ll add another column call “weeks.” unit analysis.Now ’ll simulate trend see increasing number words weeks go . Don’t worry much code little complex, share case interest.can see week goes , text.can trend week sees decreasing number words.Now let’s check top frequency words text.’re going check frequencies word “sed” ’re gonna normalize denominating total word frequencies week.First need get total word frequencies week.can join two dataframes together left_join() function ’re joining “week” column. can pipe joined data plot.","code":"\nlipsum_text <- data.frame(text = stri_rand_lipsum(1, start_lipsum = TRUE))\n\nhead(lipsum_text$text)## [1] \"Lorem ipsum dolor sit amet, mauris dolor posuere sed sit dapibus sapien egestas semper aptent. Luctus, eu, pretium enim, sociosqu rhoncus quis aliquam. In in in auctor natoque venenatis tincidunt. At scelerisque neque porta ut mi a, congue quis curae. Facilisis, adipiscing mauris. 
Dis non interdum cum commodo, tempor sapien donec in luctus. Nascetur ullamcorper, dui non semper, arcu sed. Sed non pellentesque rutrum tempor, curabitur in. Taciti gravida ut interdum iaculis. Arcu consectetur dictum et erat vestibulum luctus ridiculus! Luctus metus ad ex bibendum, eget at maximus nisl quisque ante posuere aptent. Cubilia tellus sed aliquam, suspendisse arcu et dapibus aenean. Ultricies primis sit nulla condimentum, sed, phasellus viverra nullam, primis.\"\ntokens <- lipsum_text %>%\n unnest_tokens(word, text)\n\nhead(tokens)## word\n## 1 lorem\n## 2 ipsum\n## 3 dolor\n## 4 sit\n## 5 amet\n## 6 mauris\n## Varying total words example\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i, 2]\n morewords <-\n paste(rep(\"more lipsum words\", times = sample(1:100, 1) * week), collapse = \" \")\n lipsum_words <- lipsum_text[i, 1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i, 1] <- new_lipsum_text\n}\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\n# simulate decreasing words trend\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\n\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i,2]\n morewords <- paste(rep(\"more lipsum words\", times = sample(1:100, 1)* 1/week), collapse = \" \")\n lipsum_words <- lipsum_text[i,1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i,1] <- new_lipsum_text\n}\n\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word, sort = T) %>%\n top_n(5) %>%\n knitr::kable(format=\"html\")%>% \n kable_styling(\"striped\", full_width = F)## Selecting by n\nlipsum_totals <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word) %>%\n mutate(total = sum(n)) %>%\n distinct(week, total)\n# let's look for \"sed\"\nlipsum_sed <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n filter(word == \"sed\") %>%\n dplyr::count(word) %>%\n mutate(total_sed = sum(n)) %>%\n distinct(week, total_sed)\nlipsum_sed %>%\n left_join(lipsum_totals, by = \"week\") %>%\n mutate(sed_prop = total_sed/total) %>%\n ggplot() +\n geom_line(aes(week, sed_prop)) +\n labs(x = \"Week\", y = \"\n Proportion sed word\") +\n scale_x_continuous(breaks= pretty_breaks())"},{"path":"week-2-demo.html","id":"regexing","chapter":"3 Week 2 Demo","heading":"3.3 Regexing","text":"’ll notice worksheet word frequencies one point set parentheses str_detect() string “[-z]”. called character class use square brackets like [].character classes include, helpfully listed vignette stringr package. 
follows adapted materials regular expressions.[abc]: matches , b, c.[-z]: matches every character z\n(Unicode code point order).[^abc]: matches anything except , b, c.[\\^\\-]: matches ^ -.Several patterns match multiple characters. include:\\d: matches digit; opposite \\D, matches character \ndecimal digit.\\s: matches whitespace; opposite \\S^: matches start string$: matches end string^ $: exact string matchHold : plus signs etc. mean?+: 1 .*: 0 .?: 0 1.can tell output makes sense, ’re getting !","code":"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")## [[1]]\n## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")## [[1]]\n## [1] \" + \" \" = \"\n(text <- \"Some \\t badly\\n\\t\\tspaced \\f text\")## [1] \"Some \\t badly\\n\\t\\tspaced \\f text\"\nstr_replace_all(text, \"\\\\s+\", \" \")## [1] \"Some badly spaced text\"\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a\")## [1] \"a\" NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a$\")## [1] NA NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^apple$\")## [1] \"apple\" NA NA\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")[[1]]## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")[[1]]## [1] \" + \" \" = \"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d*\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D*\")[[1]]## [1] \"\" \" + \" \"\" \" = \" \"\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d?\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D?\")[[1]]## [1] \"\" \" \" \"+\" \" \" \"\" \" \" \"=\" \" \" \"\" \"\""},{"path":"week-2-demo.html","id":"some-more-regex-resources","chapter":"3 Week 2 Demo","heading":"3.3.1 Some more regex resources:","text":"Regex crossword: https://regexcrossword.com/.Regexone: https://regexone.com/R4DS chapter 14","code":""},{"path":"week-3-dictionary-based-techniques.html","id":"week-3-dictionary-based-techniques","chapter":"4 Week 3: Dictionary-based techniques","heading":"4 Week 3: Dictionary-based techniques","text":"extension word frequency analyses, covered last week, -called “dictionary-based” techniques. basic form, analyses use index target terms classify corpus interest based presence absence. technical dimensions type analysis covered chapter section Klaus Krippendorff (2004), issues attending article - Loughran Mcdonald (2011). article Brooke (2021) provides outstanding illustration use text analysis techniques make inferences larger questions bias.also reading two examples application techniques Martins Baumard (2020) Young Soroka (2012). , discussing successful authors measuring phenomenon interest (“prosociality” “tone” respectively). Questions sampling representativeness relevant , naturally inform assessments work.Questions:general dictionaries possible; domain-specific?know dictionary accurate?enhance/supplement dictionary-based techniques?Required reading:Martins Baumard (2020)Voigt et al. (2017)Brooke (2021)reading:Tausczik Pennebaker (2010)Klaus Krippendorff (2004) (pp.283-289)Brier Hopp (2011)Bonikowski Gidron (2015)Barberá et al. 
(2021)Young Soroka (2012)Slides:Week 3 Slides","code":""},{"path":"week-3-demo.html","id":"week-3-demo","chapter":"5 Week 3 Demo","heading":"5 Week 3 Demo","text":"section, ’ll quick overview ’re processing text data conducting basic sentiment analyses.","code":""},{"path":"week-3-demo.html","id":"setup-1","chapter":"5 Week 3 Demo","heading":"5.1 Setup","text":"’ll first load packages need.","code":"\nlibrary(stringi)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(scales)"},{"path":"week-3-demo.html","id":"happy-words","chapter":"5 Week 3 Demo","heading":"5.2 Happy words","text":"discussed lectures, might find text class’s collective thoughts increase “happy” words time.simulated dataset text split weeks, students, words plus whether word word “happy” 0 means word “happy” 1 means .three datasets: one constant number “happy” words; one increasing number “happy” words; one decreasing number “happy” words. called: happyn, happyu, happyd respectively.can see trend “happy” words week student.First, dataset constant number happy words time.now simulated data increasing number happy words.finally decreasing number happy words.","code":"\nhead(happyn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0\nhead(happyu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0\nhead(happyd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. 
You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'"},{"path":"week-3-demo.html","id":"normalizing-sentiment","chapter":"5 Week 3 Demo","heading":"5.3 Normalizing sentiment","text":"discussed lecture, also know just total number happy words increases, isn’t indication ’re getting happier class time.can begin make inference, need normalize total number words week., simulate data number happy words actually week (happyn dataset ).join data three datasets: happylipsumn, happylipsumu, happylipsumd. datasets random text, number happy words.first also number total words week. second two, however, differing number total words week: happylipsumu increasing number total words week; happylipsumd decreasing number total words week., see , ’re splitting week, student, word, whether “happy” word.plot number happy words divided number total words week student datasets, get .get normalized sentiment score–“happy” score–need create variable (column) dataframe sum happy words divided total number words dataframe.can following way.repeat datasets plot see following.plots look like ?Well, first, number total words week number happy words week. divided latter former, get proportion also stable time.second, however, increasing number total words week, number happy words time. means dividing ever larger number, giving ever smaller proportions. , trend decreasing time.third, decreasing number total words week, number happy words time. means dividing ever smaller number, giving ever larger proportions. , trend increasing time.","code":"\nhead(happylipsumn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 semper 0\nhead(happylipsumu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 commodo 0\nhead(happylipsumd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 et 0\nhappylipsumn %>%\n group_by(week, student) %>%\n mutate(index_total = n()) %>%\n filter(happy==1) %>%\n summarise(sum_hap = sum(happy),\n index_total = index_total,\n prop_hap = sum_hap/index_total) %>%\n distinct()## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. 
You can override using\n## the `.groups` argument.## # A tibble: 300 × 5\n## # Groups: week, student [300]\n## week student sum_hap index_total prop_hap\n## \n## 1 1 1 894 3548 0.252\n## 2 1 2 1164 5259 0.221\n## 3 1 3 1014 4531 0.224\n## 4 1 4 774 3654 0.212\n## 5 1 5 980 4212 0.233\n## 6 1 6 711 3579 0.199\n## 7 1 7 1254 5025 0.250\n## 8 1 8 1117 4846 0.230\n## 9 1 9 1079 4726 0.228\n## 10 1 10 1061 5111 0.208\n## # ℹ 290 more rows"},{"path":"week-4-natural-language-complexity-and-similarity.html","id":"week-4-natural-language-complexity-and-similarity","chapter":"6 Week 4: Natural language, complexity, and similarity","heading":"6 Week 4: Natural language, complexity, and similarity","text":"week delving deeply language used text. previous weeks, tried two main techniques rely, different ways, counting words. week, thinking sophisticated techniques identify measure language use, well compare texts . article Gomaa Fahmy (2013) provides overview different approaches. covering technical dimensions lecture.article Urman, Makhortykh, Ulloa (2021) investigates key question contemporary communications research—information exposed online—shows might compare web search results using similarity measures. Schoonvelde et al. (2019) article, hand, looks “complexity” texts, compares politicians different ideological stripes communicate.Questions:measure linguistic complexity/sophistication?biases might involved measuring sophistication?applications might similarity measures?Required reading:Urman, Makhortykh, Ulloa (2021)Schoonvelde et al. (2019)Gomaa Fahmy (2013)reading:Voigt et al. (2017)Peng Hengartner (2002)Lowe (2008)Bail (2012)Ziblatt, Hilbig, Bischof (2020)Benoit, Munger, Spirling (2019)Slides:Week 4 Slides","code":""},{"path":"week-4-demo.html","id":"week-4-demo","chapter":"7 Week 4 Demo","heading":"7 Week 4 Demo","text":"","code":""},{"path":"week-4-demo.html","id":"setup-2","chapter":"7 Week 4 Demo","heading":"7.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\nlibrary(quanteda)\nlibrary(quanteda.textstats)\nlibrary(quanteda.textplots)\nlibrary(tidytext)\nlibrary(stringdist)\nlibrary(corrplot)\nlibrary(janeaustenr)"},{"path":"week-4-demo.html","id":"character-based-similarity","chapter":"7 Week 4 Demo","heading":"7.2 Character-based similarity","text":"first measure text similarity level characters. can look last time (promise) example lecture see similarity compares.’ll make two sentences create two character objects . two thoughts imagined classes.know “longest common substring measure” , according stringdist package documentation, “longest string can obtained pairing characters b keeping order characters intact.”can easily get different distance/similarity measures comparing character objects b .","code":"\na <- \"We are all very happy to be at a lecture at 11AM\"\nb <- \"We are all even happier that we don’t have two lectures a week\"\n## longest common substring distance\nstringdist(a, b,\n method = \"lcs\")## [1] 36\n## levenshtein distance\nstringdist(a, b,\n method = \"lv\")## [1] 27\n## jaro distance\nstringdist(a, b,\n method = \"jw\", p =0)## [1] 0.2550103"},{"path":"week-4-demo.html","id":"term-based-similarity","chapter":"7 Week 4 Demo","heading":"7.3 Term-based similarity","text":"second example lecture, ’re taking opening line Pride Prejudice alongside versions famous opening line.can get text Jane Austen easily thanks janeaustenr package.’re going specify alternative versions sentence.Finally, ’re going convert document feature matrix. 
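The same normalization logic carries over to a real sentiment dictionary rather than the simulated “happy” indicator used above. Below is a minimal sketch (not part of the original demo), assuming the Bing lexicon that ships with tidytext via get_sentiments() and a small hypothetical data frame text_df with columns week and text:

library(dplyr)
library(tidytext)

# hypothetical input: one document per row, tagged with a week
text_df <- data.frame(
  week = c(1, 1, 2, 2),
  text = c("I am happy and excited",
           "this was a terrible, sad week",
           "what a wonderful result",
           "gloomy skies and gloomy moods")
)

tidy_words <- text_df %>%
  unnest_tokens(word, text)

totals <- tidy_words %>%
  group_by(week) %>%
  summarise(total = n())                                # denominator: total words per week

tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only words found in the lexicon
  count(week, sentiment) %>%
  left_join(totals, by = "week") %>%
  mutate(prop = n / total)                               # normalized positive/negative proportions per week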
’re quanteda package, package ’ll begin using coming weeks analyses ’re performing get gradually technical.see ?Well, ’s clear text2 text3 similar text1 —share words. also see text2 least contain words shared text1, original opening line Jane Austen’s Pride Prejudice., measure similarity distance texts?first way simply correlating two sets ones zeroes. can quanteda.textstats package like .’ll see get manipulated data tidy format (rows words columns 1s 0s).see expected text2 highly correlated text1 text3.\nEuclidean distances, can use quanteda .define function just see ’s going behind scenes.Manhattan distance, use quanteda .define function.cosine similarity, quanteda makes straightforward.make clear ’s going , write function.","code":"\n## similarity and distance example\n\ntext <- janeaustenr::prideprejudice\n\nsentences <- text[10:11]\n\nsentence1 <- paste(sentences[1], sentences[2], sep = \" \")\n\nsentence1## [1] \"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\"\nsentence2 <- \"Everyone knows that a rich man without wife will want a wife\"\n\nsentence3 <- \"He's loaded so he wants to get married. Everyone knows that's what happens.\"\ndfmat <- dfm(tokens(c(sentence1,\n sentence2,\n sentence3)),\n remove_punct = TRUE, remove = stopwords(\"english\"))\n\ndfmat## Document-feature matrix of: 3 documents, 21 features (58.73% sparse) and 0 docvars.\n## features\n## docs truth universally acknowledged single man possession good fortune must\n## text1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 1 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0\n## features\n## docs want\n## text1 1\n## text2 1\n## text3 0\n## [ reached max_nfeat ... 11 more features ]\n## correlation\ntextstat_simil(dfmat, margin = \"documents\", method = \"correlation\")## textstat_simil object; method = \"correlation\"\n## text1 text2 text3\n## text1 1.000 -0.117 -0.742\n## text2 -0.117 1.000 -0.173\n## text3 -0.742 -0.173 1.000\ntest <- tidy(dfmat)\ntest <- test %>%\n cast_dfm(term, document, count)\ntest <- as.data.frame(test)\n\nres <- cor(test[,2:4])\nres## text1 text2 text3\n## text1 1.0000000 -0.1167748 -0.7416198\n## text2 -0.1167748 1.0000000 -0.1732051\n## text3 -0.7416198 -0.1732051 1.0000000\ncorrplot(res, type = \"upper\", order = \"hclust\", \n tl.col = \"black\", tl.srt = 45)\ntextstat_dist(dfmat, margin = \"documents\", method = \"euclidean\")## textstat_dist object; method = \"euclidean\"\n## text1 text2 text3\n## text1 0 3.74 4.24\n## text2 3.74 0 3.74\n## text3 4.24 3.74 0\n# function for Euclidean distance\neuclidean <- function(a,b) sqrt(sum((a - b)^2))\n# estimating the distance\neuclidean(test$text1, test$text2)## [1] 3.741657\neuclidean(test$text1, test$text3)## [1] 4.242641\neuclidean(test$text2, test$text3)## [1] 3.741657\ntextstat_dist(dfmat, margin = \"documents\", method = \"manhattan\")## textstat_dist object; method = \"manhattan\"\n## text1 text2 text3\n## text1 0 14 18\n## text2 14 0 12\n## text3 18 12 0\n## manhattan\nmanhattan <- function(a, b){\n dist <- abs(a - b)\n dist <- sum(dist)\n return(dist)\n}\n\nmanhattan(test$text1, test$text2)## [1] 14\nmanhattan(test$text1, test$text3)## [1] 18\nmanhattan(test$text2, test$text3)## [1] 12\ntextstat_simil(dfmat, margin = \"documents\", method = \"cosine\")## textstat_simil object; method = \"cosine\"\n## text1 text2 text3\n## text1 1.000 0.364 0\n## text2 0.364 1.000 0.228\n## text3 0 0.228 1.000\n## cosine\ncos.sim <- function(a, b) \n{\n return(sum(a*b)/sqrt(sum(a^2)*sum(b^2)) )\n} 
\n\ncos.sim(test$text1, test$text2)## [1] 0.3636364\ncos.sim(test$text1, test$text3)## [1] 0\ncos.sim(test$text2, test$text3)## [1] 0.2279212"},{"path":"week-4-demo.html","id":"complexity","chapter":"7 Week 4 Demo","heading":"7.4 Complexity","text":"Note: section borrows notation materials texstat_readability() function.also talked different document-level measures text characteristics. One “complexity” readability text. One frequently used Flesch’s Reading Ease Score (Flesch 1948).computed :{:}{Flesch’s Reading Ease Score (Flesch 1948).\n}can estimate readability score respective sentences . Flesch score 1948 default.see ? original Austen opening line marked lower readability colloquial alternatives.alternatives measures might use. can check clicking links function textstat_readability(). display .One McLaughlin (1969) “Simple Measure Gobbledygook, based recurrence words 3 syllables calculated :{:}{Simple Measure Gobbledygook (SMOG) (McLaughlin 1969). = Nwmin3sy = number words 3 syllables .\nmeasure regression equation D McLaughlin’s original paper.}can calculate three sentences ., , see original Austen sentence higher level complexity (gobbledygook!).","code":"\ntextstat_readability(sentence1)## document Flesch\n## 1 text1 62.10739\ntextstat_readability(sentence2)## document Flesch\n## 1 text1 88.905\ntextstat_readability(sentence3)## document Flesch\n## 1 text1 83.09904\ntextstat_readability(sentence1, measure = \"SMOG\")## document SMOG\n## 1 text1 13.02387\ntextstat_readability(sentence2, measure = \"SMOG\")## document SMOG\n## 1 text1 8.841846\ntextstat_readability(sentence3, measure = \"SMOG\")## document SMOG\n## 1 text1 7.168622"},{"path":"week-5-scaling-techniques.html","id":"week-5-scaling-techniques","chapter":"8 Week 5: Scaling techniques","heading":"8 Week 5: Scaling techniques","text":"begin thinking automated techniques analyzing texts. bunch additional considerations now need bring mind. considerations sparked significant debates… matter means settled.stake ? weeks come, studying various techniques ‘classify,’ ‘position’ ‘score’ texts based features. success techniques depends suitability question hand also higher-level questions meaning. short, ask : way can access underlying processes governing generation text? meaning governed set structural processes? can derive ‘objective’ measures contents given text?readings Justin Grimmer, Roberts, Stewart (2021), Denny Spirling (2018), Goldenstein Poschmann (2019b) (well response replies Nelson (2019) Goldenstein Poschmann (2019a)) required reading Flexible Learning Week.Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer Stewart (2013a)Justin Grimmer Stewart (2013a)Denny Spirling (2018)Denny Spirling (2018)Goldenstein Poschmann (2019b)\nNelson (2019)\nGoldenstein Poschmann (2019a)\nGoldenstein Poschmann (2019b)Nelson (2019)Goldenstein Poschmann (2019a)substantive focus week set readings employ different types “scaling” “low-dimensional document embedding” techniques. article Lowe (2008) provides technical overview “wordfish” algorithm uses political science contexts. article Klüver (2009) also uses “wordfish” different way—measure “influence” interest groups. response article Bunea Ibenskas (2015) subsequent reply Klüver (2015) helps illuminate debates around questions. 
work Kim, Lelkes, McCrain (2022) gives insight ability text-scaling techniques capture key dimensions political communication bias.Questions:assumptions underlie scaling models text?; latent text decides?might scaling useful outside estimating ideological position/bias text?Required reading:Lowe (2008)Kim, Lelkes, McCrain (2022)Klüver (2009)\nBunea Ibenskas (2015)\nKlüver (2015)\nBunea Ibenskas (2015)Klüver (2015)reading:Benoit et al. (2016)Laver, Benoit, Garry (2003)Slapin Proksch (2008)Schwemmer Wieczorek (2020)Slides:Week 5 Slides","code":""},{"path":"week-5-demo.html","id":"week-5-demo","chapter":"9 Week 5 Demo","heading":"9 Week 5 Demo","text":"","code":""},{"path":"week-5-demo.html","id":"setup-3","chapter":"9 Week 5 Demo","heading":"9.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\ndevtools::install_github(\"conjugateprior/austin\")\nlibrary(austin)\nlibrary(quanteda)\nlibrary(quanteda.textstats)"},{"path":"week-5-demo.html","id":"wordscores","chapter":"9 Week 5 Demo","heading":"9.2 Wordscores","text":"can inspect function wordscores model Laver, Benoit, Garry (2003) following way:can take example data included austin package.reference documents documents marked “R” reference; .e., columns one five.matrix simply series words (: letters) reference texts word counts .can look wordscores words, calculated using reference dimensions reference documents.see thetas contained wordscores object, .e., reference dimensions reference documents pis, .e., estimated wordscores word.can now use score -called “virgin” texts follows.","code":"\nclassic.wordscores## function (wfm, scores) \n## {\n## if (!is.wfm(wfm)) \n## stop(\"Function not applicable to this object\")\n## if (length(scores) != length(docs(wfm))) \n## stop(\"There are not the same number of documents as scores\")\n## if (any(is.na(scores))) \n## stop(\"One of the reference document scores is NA\\nFit the model with known scores and use 'predict' to get virgin score estimates\")\n## thecall <- match.call()\n## C.all <- as.worddoc(wfm)\n## C <- C.all[rowSums(C.all) > 0, ]\n## F <- scale(C, center = FALSE, scale = colSums(C))\n## ws <- apply(F, 1, function(x) {\n## sum(scores * x)\n## })/rowSums(F)\n## pi <- matrix(ws, nrow = length(ws))\n## rownames(pi) <- rownames(C)\n## colnames(pi) <- c(\"Score\")\n## val <- list(pi = pi, theta = scores, data = wfm, call = thecall)\n## class(val) <- c(\"classic.wordscores\", \"wordscores\", class(val))\n## return(val)\n## }\n## \n## \ndata(lbg)\nref <- getdocs(lbg, 1:5)\nref## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\nws <- classic.wordscores(ref, scores=seq(-1.5,1.5,by=0.75))\nws## $pi\n## Score\n## A -1.5000000\n## B -1.5000000\n## C -1.5000000\n## D -1.5000000\n## E -1.5000000\n## F -1.4812500\n## G -1.4809322\n## H -1.4519231\n## I -1.4083333\n## J -1.3232984\n## K -1.1846154\n## L 
-1.0369898\n## M -0.8805970\n## N -0.7500000\n## O -0.6194030\n## P -0.4507576\n## Q -0.2992424\n## R -0.1305970\n## S 0.0000000\n## T 0.1305970\n## U 0.2992424\n## V 0.4507576\n## W 0.6194030\n## X 0.7500000\n## Y 0.8805970\n## Z 1.0369898\n## ZA 1.1846154\n## ZB 1.3232984\n## ZC 1.4083333\n## ZD 1.4519231\n## ZE 1.4809322\n## ZF 1.4812500\n## ZG 1.5000000\n## ZH 1.5000000\n## ZI 1.5000000\n## ZJ 1.5000000\n## ZK 1.5000000\n## \n## $theta\n## [1] -1.50 -0.75 0.00 0.75 1.50\n## \n## $data\n## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\n## \n## $call\n## classic.wordscores(wfm = ref, scores = seq(-1.5, 1.5, by = 0.75))\n## \n## attr(,\"class\")\n## [1] \"classic.wordscores\" \"wordscores\" \"list\"\n#get \"virgin\" documents\nvir <- getdocs(lbg, 'V1')\nvir## docs\n## words V1\n## A 0\n## B 0\n## C 0\n## D 0\n## E 0\n## F 0\n## G 0\n## H 2\n## I 3\n## J 10\n## K 22\n## L 45\n## M 78\n## N 115\n## O 146\n## P 158\n## Q 146\n## R 115\n## S 78\n## T 45\n## U 22\n## V 10\n## W 3\n## X 2\n## Y 0\n## Z 0\n## ZA 0\n## ZB 0\n## ZC 0\n## ZD 0\n## ZE 0\n## ZF 0\n## ZG 0\n## ZH 0\n## ZI 0\n## ZJ 0\n## ZK 0\n# predict textscores for the virgin documents\npredict(ws, newdata=vir)## 37 of 37 words (100%) are scorable\n## \n## Score Std. Err. Rescaled Lower Upper\n## V1 -0.448 0.0119 -0.448 -0.459 -0.437"},{"path":"week-5-demo.html","id":"wordfish","chapter":"9 Week 5 Demo","heading":"9.3 Wordfish","text":"wish, can inspect function wordscores model Slapin Proksch (2008) following way. 
much complex algorithm, printed , can inspect devices.can simulate data, formatted appropriately wordfiash estimation following way:can see document word-level FEs, well specified range thetas estimates.estimating document positions simply matter implementing algorithm.","code":"\nwordfish\ndd <- sim.wordfish()\n\ndd## $Y\n## docs\n## words D01 D02 D03 D04 D05 D06 D07 D08 D09 D10\n## W01 17 19 22 13 17 11 16 12 6 3\n## W02 18 21 18 16 12 19 11 7 10 4\n## W03 22 21 22 19 11 14 11 3 6 1\n## W04 22 19 18 15 16 13 18 6 2 8\n## W05 28 21 12 10 13 10 5 14 1 3\n## W06 5 7 12 13 15 8 12 13 23 19\n## W07 13 9 5 16 11 17 15 11 35 30\n## W08 8 7 7 10 9 15 18 23 21 23\n## W09 4 12 8 10 9 13 18 25 15 19\n## W10 5 3 7 11 19 16 13 18 17 18\n## W11 66 55 49 48 38 37 27 24 21 6\n## W12 53 56 47 39 49 28 22 15 12 14\n## W13 63 55 47 49 48 31 24 16 17 16\n## W14 57 64 48 51 27 36 24 27 11 12\n## W15 58 48 57 44 36 39 29 27 16 5\n## W16 17 13 24 28 24 32 41 56 67 61\n## W17 9 19 16 36 30 34 53 34 58 57\n## W18 11 19 34 27 42 38 48 58 49 66\n## W19 10 18 27 22 37 52 59 60 60 69\n## W20 14 14 20 23 37 37 36 51 53 66\n## \n## $theta\n## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446 0.1651446 0.4954337\n## [8] 0.8257228 1.1560120 1.4863011\n## \n## $doclen\n## D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 \n## 500 500 500 500 500 500 500 500 500 500 \n## \n## $psi\n## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1\n## \n## $beta\n## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1\n## \n## attr(,\"class\")\n## [1] \"wordfish.simdata\" \"list\"\nwf <- wordfish(dd$Y)\nsummary(wf)## Call:\n## wordfish(wfm = dd$Y)\n## \n## Document Positions:\n## Estimate Std. Error Lower Upper\n## D01 -1.4243 0.10560 -1.6313 -1.21736\n## D02 -1.1483 0.09747 -1.3394 -0.95727\n## D03 -0.7701 0.08954 -0.9456 -0.59455\n## D04 -0.4878 0.08591 -0.6562 -0.31942\n## D05 -0.1977 0.08414 -0.3626 -0.03279\n## D06 0.0313 0.08411 -0.1336 0.19616\n## D07 0.4346 0.08704 0.2640 0.60517\n## D08 0.7163 0.09140 0.5372 0.89546\n## D09 1.2277 0.10447 1.0229 1.43243\n## D10 1.6166 0.11933 1.3827 1.85046"},{"path":"week-5-demo.html","id":"using-quanteda","chapter":"9 Week 5 Demo","heading":"9.4 Using quanteda","text":"can also use quanteda implement scaling techniques, demonstrated Exercise 4.","code":""},{"path":"week-6-unsupervised-learning-topic-models.html","id":"week-6-unsupervised-learning-topic-models","chapter":"10 Week 6: Unsupervised learning (topic models)","heading":"10 Week 6: Unsupervised learning (topic models)","text":"week builds upon past scaling techniques explored Week 5 instead turns another form unsupervised approach—topic modelling.substantive articles Nelson (2020) Alrababa’h Blaydes (2020) provide, turn, illuminating insights using topic models categorize thematic content text information.article Ying, Montgomery, Stewart (2021) provides valuable overview accompaniment earlier work Denny Spirling (2018) thinking validate findings test robustness inferences make models.Questions:assumptions underlie topic modelling approaches?Can develop structural models text?topic modelling discovery measurement strategy?validate model?Required reading:Nelson (2020)PARTHASARATHY, RAO, PALANISWAMY (2019)Ying, Montgomery, Stewart (2021)reading:Chang et al. (2009)Alrababa’h Blaydes (2020)J. Grimmer King (2011)Denny Spirling (2018)Smith et al. (2021)Boyd et al. 
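Section 9.4 below notes that quanteda can also be used to implement these scaling techniques. As a minimal sketch of what that looks like (this code is not part of the worksheet), assuming the quanteda.textmodels package and its bundled data_corpus_irishbudget2010 example corpus are installed:

library(quanteda)
library(quanteda.textmodels)

# document-feature matrix from the example corpus
dfmat_irish <- dfm(tokens(data_corpus_irishbudget2010))

# wordfish: unsupervised scaling of documents along one latent dimension
wf_quanteda <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(wf_quanteda)

# wordscores: supervised scaling from two reference documents
refscores <- rep(NA, ndoc(dfmat_irish))
refscores[c(1, 2)] <- c(-1, 1)                  # assumed reference positions, purely for illustration
ws_quanteda <- textmodel_wordscores(dfmat_irish, y = refscores)
predict(ws_quanteda)                            # estimated positions for the remaining ("virgin") documents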
(2018)Slides:Week 6 Slides","code":""},{"path":"week-6-demo.html","id":"week-6-demo","chapter":"11 Week 6 Demo","heading":"11 Week 6 Demo","text":"","code":""},{"path":"week-6-demo.html","id":"setup-4","chapter":"11 Week 6 Demo","heading":"11.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.Estimating topic model requires us first data form document-term-matrix. another term referred previous weeks document-feature-matrix.can take example data topicmodels package. text news releases Associated Press. consists around 2,200 articles (documents) 10,000 terms (words).estimate topic model need specify document-term-matrix using, number (k) topics estimating. speed estimation, estimating 100 articles.can inspect contents topic follows.can use tidy() function tidytext gather relevant parameters ’ve estimated. get \\(\\beta\\) per-topic-per-word probabilities (.e., probability given term belongs given topic) can following.get \\(\\gamma\\) per-document-per-topic probabilities (.e., probability given document (: article) belongs particular topic) following.can easily plot \\(\\beta\\) estimates follows.shows us words associated topic, size associated \\(\\beta\\) coefficient.","code":"\nlibrary(topicmodels)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(ggthemes)\ndata(\"AssociatedPress\", \n package = \"topicmodels\")\nlda_output <- LDA(AssociatedPress[1:100,], k = 10)\nterms(lda_output, 10)## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 \n## [1,] \"bank\" \"wednesday\" \"fire\" \"bush\" \"administration\" \"noriega\" \n## [2,] \"new\" \"new\" \"barry\" \"i\" \"thats\" \"union\" \n## [3,] \"year\" \"central\" \"moore\" \"dukakis\" \"contact\" \"greyhound\"\n## [4,] \"soviet\" \"company\" \"church\" \"people\" \"farmer\" \"panama\" \n## [5,] \"last\" \"peres\" \"last\" \"year\" \"government\" \"president\"\n## [6,] \"million\" \"duracell\" \"mexico\" \"roberts\" \"grain\" \"officials\"\n## [7,] \"animals\" \"snow\" \"people\" \"campaign\" \"i\" \"national\" \n## [8,] \"florio\" \"warming\" \"died\" \"get\" \"new\" \"people\" \n## [9,] \"officials\" \"global\" \"friday\" \"two\" \"magellan\" \"plant\" \n## [10,] \"york\" \"offer\" \"pope\" \"years\" \"officials\" \"arco\" \n## Topic 7 Topic 8 Topic 9 Topic 10 \n## [1,] \"i\" \"percent\" \"state\" \"percent\" \n## [2,] \"new\" \"prices\" \"waste\" \"soviet\" \n## [3,] \"rating\" \"oil\" \"official\" \"economy\" \n## [4,] \"california\" \"year\" \"money\" \"committee\" \n## [5,] \"agents\" \"price\" \"people\" \"gorbachev\" \n## [6,] \"states\" \"gas\" \"announced\" \"union\" \n## [7,] \"mrs\" \"business\" \"company\" \"gorbachevs\"\n## [8,] \"police\" \"rate\" \"officials\" \"economic\" \n## [9,] \"percent\" \"report\" \"orr\" \"congress\" \n## [10,] \"three\" \"average\" \"senate\" \"war\"\nlda_beta <- tidy(lda_output, matrix = \"beta\")\n\nlda_beta %>%\n arrange(-beta)## # A tibble: 104,730 × 3\n## topic term beta\n## \n## 1 8 percent 0.0287\n## 2 10 percent 0.0197\n## 3 1 bank 0.0171\n## 4 8 prices 0.0170\n## 5 10 soviet 0.0160\n## 6 1 new 0.0159\n## 7 9 state 0.0158\n## 8 4 bush 0.0144\n## 9 7 i 0.0129\n## 10 8 oil 0.0118\n## # ℹ 104,720 more rows\nlda_gamma <- tidy(lda_output, matrix = \"gamma\")\n\nlda_gamma %>%\n arrange(-gamma)## # A tibble: 1,000 × 3\n## document topic gamma\n## \n## 1 76 5 1.00\n## 2 81 3 1.00\n## 3 6 6 1.00\n## 4 43 4 1.00\n## 5 31 8 1.00\n## 6 95 7 1.00\n## 7 77 4 1.00\n## 8 29 10 1.00\n## 9 80 5 1.00\n## 10 57 10 1.00\n## # ℹ 990 more rows\nlda_beta %>%\n group_by(topic) %>%\n 
top_n(10, beta) %>%\n ungroup() %>%\n arrange(topic, -beta) %>%\n mutate(term = reorder_within(term, beta, topic)) %>%\n ggplot(aes(beta, term, fill = factor(topic))) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~ topic, scales = \"free\", ncol = 4) +\n scale_y_reordered() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"week-7-unsupervised-learning-word-embedding.html","id":"week-7-unsupervised-learning-word-embedding","chapter":"12 Week 7: Unsupervised learning (word embedding)","heading":"12 Week 7: Unsupervised learning (word embedding)","text":"week discussing second form “unsupervised” learning—word embeddings. previous weeks allowed us characterize complexity text, cluster text potential topical focus, word embeddings permit us expansive form measurement. essence, producing matrix representation entire corpus.reading Pedro L. Rodriguez Spirling (2022) provides effective overview technical dimensions technique. articles Garg et al. (2018) Kozlowski, Taddy, Evans (2019) two substantive articles use word embeddings provide insights prejudice bias manifested language time.Required reading:Garg et al. (2018)Kozlowski, Taddy, Evans (2019)Waller Anderson (2021)reading:P. Rodriguez Spirling (2021)Pedro L. Rodriguez Spirling (2022)Osnabrügge, Hobolt, Rodon (2021)Rheault Cochrane (2020)Jurafsky Martin (2021, ch.6): https://web.stanford.edu/~jurafsky/slp3/]Slides:Week 7 Slides","code":""},{"path":"week-7-demo.html","id":"week-7-demo","chapter":"13 Week 7 Demo","heading":"13 Week 7 Demo","text":"","code":""},{"path":"week-7-demo.html","id":"setup-5","chapter":"13 Week 7 Demo","heading":"13.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo. pre-loading already-estimated PMI matrix results singular value decomposition approach.work?Various approaches, including:\nSVD\n\nNeural network-based techniques like GloVe Word2Vec\n\nSVD\nSVDNeural network-based techniques like GloVe Word2Vec\nNeural network-based techniques like GloVe Word2VecIn approaches, :Defining context window (see figure )Looking probabilities word appearing near another wordsThe implementation technique using singular value decomposition approach requires following data structure:Word pair matrix PMI (Pairwise mutual information)PMI = log(P(x,y)/P(x)P(y))P(x,y) probability word x appearing within six-word window word yand P(x) probability word x appearing whole corpusand P(y) probability word y appearing whole corpusAnd resulting matrix object take following format:use “Singular Value Decomposition” (SVD) techique. 
another multidimensional scaling technique, first axis resulting coordinates captures variance, second second-etc…, simply need following.can collect vectors word inspect .","code":"\nlibrary(Matrix) #for handling matrices\nlibrary(tidyverse)\nlibrary(irlba) # for SVD\nlibrary(umap) # for dimensionality reduction\n\nload(\"data/wordembed/pmi_svd.RData\")\nload(\"data/wordembed/pmi_matrix.RData\")## 6 x 6 sparse Matrix of class \"dgCMatrix\"\n## the to and of https a\n## the 0.653259169 -0.01948121 -0.006446459 0.27136395 -0.5246159 -0.32557524\n## to -0.019481205 0.75498084 -0.065170433 -0.25694210 -0.5731182 -0.04595798\n## and -0.006446459 -0.06517043 1.027782342 -0.03974904 -0.4915159 -0.05862969\n## of 0.271363948 -0.25694210 -0.039749043 1.02111517 -0.5045067 0.09829389\n## https -0.524615878 -0.57311817 -0.491515918 -0.50450674 0.5451841 -0.57956404\n## a -0.325575239 -0.04595798 -0.058629689 0.09829389 -0.5795640 1.03048355## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65326 -0.01948 -0.00645 0.27136 -0.52462 ...\n## ..@ factors : list()\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\ndim(word_vectors)## [1] 21172 256\nhead(word_vectors[1:5, 1:5])## [,1] [,2] [,3] [,4] [,5]\n## the 0.007810973 0.07024009 0.06377615 0.03139044 -0.12362108\n## to 0.006889381 -0.03210269 0.10665925 0.03537632 0.10104552\n## and -0.050498380 0.09131495 0.19658197 -0.08136253 -0.01605705\n## of -0.015628371 0.16306386 0.13296127 -0.04087709 -0.23175976\n## https 0.301718525 0.07658843 -0.01720398 0.26219147 0.07930941"},{"path":"week-7-demo.html","id":"using-glove-or-word2vec","chapter":"13 Week 7 Demo","heading":"13.2 Using GloVe or word2vec","text":"neural network approach considerably involved, figure gives overview picture differing algorithmic approaches might use.","code":""},{"path":"week-8-sampling-text-information.html","id":"week-8-sampling-text-information","chapter":"14 Week 8: Sampling text information","heading":"14 Week 8: Sampling text information","text":"week ’ll thinking best sample text information, thinking different biases might inhere data-generating process, well representativeness generalizability text corpus construct.reading Barberá Rivero (2015) invesitgates representativeness Twitter data, give us pause thinking using digital trace data general barometer public opinion.reading Michalopoulos Xue (2021) takes entirely different tack, illustrates can think systematically text information broadly representative societies general.Required reading:Barberá Rivero (2015)Michalopoulos Xue (2021)Klaus Krippendorff (2004, chs. 5 6)reading:Martins Baumard (2020)Baumard et al. (2022)Slides:Week 8 Slides","code":""},{"path":"week-9-supervised-learning.html","id":"week-9-supervised-learning","chapter":"15 Week 9: Supervised learning","heading":"15 Week 9: Supervised learning","text":"Required reading:Hopkins King (2010)King, Pan, Roberts (2017)Siegel et al. (2021)Yu, Kaufmann, Diermeier (2008)Manning, Raghavan, Schtze (2007, chs. 
13,14, 15): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]reading:Denny Spirling (2018)King, Lam, Roberts (2017)","code":""},{"path":"week-10-validation.html","id":"week-10-validation","chapter":"16 Week 10: Validation","heading":"16 Week 10: Validation","text":"week ’ll thinking validate techniques ’ve used preceding weeks. Validation necessary important part text analysis technique.Often speak validation context machine labelling large text data. validation need ——restricted automated classification tasks. articles Ying, Montgomery, Stewart (2021) Pedro L. Rodriguez, Spirling, Stewart (2021) describe ways approach validation unsupervised contexts. Finally, article Peterson Spirling (2018) shows validation accuracy might provide measure substantive significance.Required reading:Ying, Montgomery, Stewart (2021)Pedro L. Rodriguez, Spirling, Stewart (2021)Peterson Spirling (2018)Manning, Raghavan, Schtze (2007, ch.2: https://nlp.stanford.edu/IR-book/information-retrieval-book.html)reading:K. Krippendorff (2004)Denny Spirling (2018)Justin Grimmer Stewart (2013b)Barberá et al. (2021)Schiller, Daxenberger, Gurevych (2021)Slides:Week 10 Slides","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercise-1-word-frequency-analysis","chapter":"17 Exercise 1: Word frequency analysis","heading":"17 Exercise 1: Word frequency analysis","text":"","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"introduction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.1 Introduction","text":"tutorial, learn summarise, aggregate, analyze text R:tokenize filter textHow clean preprocess textHow visualize results ggplotHow perform automated gender assignment name data (think possible biases methods may enclose)","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"setup-6","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.2 Setup","text":"practice skills, use dataset already collected Edinburgh Fringe Festival website.can try : obtain data, must first obtain API key. Instructions available Edinburgh Fringe API page:","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"load-data-and-packages","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.3 Load data and packages","text":"proceeding, ’ll load remaining packages need tutorial.tutorial, using data pre-cleaned provided .csv format. data come Edinburgh Book Festival API, provide data every event taken place Edinburgh Book Festival, runs every year month August, nine years: 2012-2020. many questions might ask data. tutorial, investigate contents event, speakers event, determine trends gender representation time.first task, , read data. can read_csv() function.read_csv() function takes .csv file loads working environment data frame object called “edbfdata.” can call object anything though. Try changing name object <- arrow. Note R allow names spaces , however. 
also good idea name object something beginning numbers, means call object within ` marks.’re working document computer (“locally”) can download Edinburgh Fringe data following way:","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(babynames) #for gender predictions\nedbfdata <- read_csv(\"data/wordfreq/edbookfestall.csv\")## New names:\n## Rows: 5938 Columns: 12\n## ── Column specification\n## ───────────────────────────────────────────────────────── Delimiter: \",\" chr\n## (8): festival_id, title, sub_title, artist, description, genre, age_categ... dbl\n## (4): ...1, year, latitude, longitude\n## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ\n## Specify the column types or set `show_col_types = FALSE` to quiet this message.\n## • `` -> `...1`\nedbfdata <- read_csv(\"https://raw.githubusercontent.com/cjbarrie/RDL-Ed/main/02-text-as-data/data/edbookfestall.csv\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"inspect-and-filter-data","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.4 Inspect and filter data","text":"next job cut dataset size, including columns need. first can inspect see existing column names , variable coded. can first call::can see description event included column named “description” year event “year.” now ’ll just keep two. Remember: ’re interested tutorial firstly representation gender feminism forms cultural production given platform Edinburgh International Book Festival. Given , first foremost interested reported content artist’s event.use pipe %>% functions tidyverse package quickly efficiently select columns want edbfdata data.frame object. pass data new data.frame object, call “evdes.”let’s take quick look many events time festival. 
, first calculate number individual events (row observations) year (column variable).can plot using ggplot!Perhaps unsurprisingly, context pandemic, number recorded bookings 2020 Festival drastically reduced.","code":"\ncolnames(edbfdata)## [1] \"...1\" \"festival_id\" \"title\" \"sub_title\" \"artist\" \n## [6] \"year\" \"description\" \"genre\" \"latitude\" \"longitude\" \n## [11] \"age_category\" \"ID\"\nglimpse(edbfdata)## Rows: 5,938\n## Columns: 12\n## $ ...1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…\n## $ festival_id \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"b…\n## $ title \"Denise Mina\", \"Alex T Smith\", \"Challenging Expectations w…\n## $ sub_title \"HARD MEN AND CARDBOARD GANGSTERS\", NA, NA, \"WHAT CAUSED T…\n## $ artist \"Denise Mina\", \"Alex T Smith\", \"Peter Cocks\", \"Paul Mason\"…\n## $ year 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012…\n## $ description \"\\n\\tAs the grande dame of Scottish crime fiction, Deni…\n## $ genre \"Literature\", \"Children\", \"Children\", \"Literature\", \"Child…\n## $ latitude 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9…\n## $ longitude -3.206913, -3.206913, -3.206913, -3.206913, -3.206913, -3.…\n## $ age_category NA, \"AGE 4 - 7\", \"AGE 11 - 14\", NA, \"AGE 10 - 14\", \"AGE 6 …\n## $ ID \"Denise Mina2012\", \"Alex T Smith2012\", \"Peter Cocks2012\", …\n# get simplified dataset with only event contents and year\nevdes <- edbfdata %>%\n select(description, year)\n\nhead(evdes)## # A tibble: 6 × 2\n## description year\n## \n## 1 \"\\n\\tAs the grande dame of Scottish crime fiction, Denise Mina places… 2012\n## 2 \"
\\n\\tWhen Alex T Smith was a little boy he wanted to be a chef, a rab… 2012\n## 3 \"
\\n\\tPeter Cocks is known for his fantasy series Triskellion written … 2012\n## 4 \"
\\n\\tTwo books by influential journalists are among the first to look… 2012\n## 5 \"
\\n\\tChris d’Lacey tells you all about The Fire Ascending, the … 2012\n## 6 \"
\\n\\tIt’s time for the honourable, feisty and courageous young … 2012\nevtsperyr <- evdes %>%\n mutate(obs=1) %>%\n group_by(year) %>%\n summarise(sum_events = sum(obs))\nggplot(evtsperyr) +\n geom_line(aes(year, sum_events)) +\n theme_tufte(base_family = \"Helvetica\") + \n scale_y_continuous(expand = c(0, 0), limits = c(0, NA))"},{"path":"exercise-1-word-frequency-analysis.html","id":"tidy-the-text","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.5 Tidy the text","text":"Given data obtained API outputs data originally HTML format, text still contains HTML PHP encodings e.g. bold font paragraphs. ’ll need get rid , well punctuation analyzing data.set commands takes event descriptions, extracts individual words, counts number times appear years covered book festival data.","code":"\n#get year and word for every word and date pair in the dataset\ntidy_des <- evdes %>% \n mutate(desc = tolower(description)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))"},{"path":"exercise-1-word-frequency-analysis.html","id":"back-to-the-fringe","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.6 Back to the Fringe","text":"see resulting dataset large (~446k rows). commands first taken events text, “mutated” set lower case character string. “unnest_tokens” function taken individual string create new column called “word” contains individual word contained event description texts.terminology also appropriate . tidy text format, often refer data structures consisting “documents” “terms.” “tokenizing” text “unnest_tokens” functions generating dataset one term per row., “documents” collection descriptions events year Edinburgh Book Festival. way sort text “documents” depends choice individual researcher.Instead year, might wanted sort text “genre.” , two genres: “Literature” “Children.” done , two “documents,” contained words included event descriptions genre.Alternatively, might interested contributions individual authors time. case, sorted text documents author. case, “document” represent words included event descriptions events given author (many multiple appearances time festival given year).can yet tidy , though. First ’ll remove stop words ’ll remove apostrophes:see number rows dataset reduces half ~223k rows. natural since large proportion string contain many -called “stop words”. can see stop words typing:lexicon (list words) included tidytext package produced Julia Silge David Robinson (see ). see contains 1000 words. remove informative interested substantive content text (rather , say, grammatical content).Now let’s look common words data:can see one common words “rsquo,” HTML encoding apostrophe. Clearly need clean data bit . common issue large-n text analysis key step want conduct reliably robust forms text analysis. ’ll another go using filter command, specifying keep words included string words rsquo, em, ndash, nbsp, lsquo.’s like ! words feature seem make sense now (actual words rather random HTML UTF-8 encodings).Let’s now collect words data.frame object, ’ll call edbf_term_counts:year, see “book” common word… perhaps surprises . evidence ’re properly pre-processing cleaning data. Cleaning text data important element preparing text analysis. often process trial error text data looks alike, may come e.g. webpages HTML encoding, unrecognized fonts unicode, potential cause issues! 
finding errors also chance get know data…","code":"\ntidy_des <- tidy_des %>%\n filter(!word %in% stop_words$word)\nstop_words## # A tibble: 1,149 × 2\n## word lexicon\n## \n## 1 a SMART \n## 2 a's SMART \n## 3 able SMART \n## 4 about SMART \n## 5 above SMART \n## 6 according SMART \n## 7 accordingly SMART \n## 8 across SMART \n## 9 actually SMART \n## 10 after SMART \n## # ℹ 1,139 more rows\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,995 × 2\n## word n\n## \n## 1 rsquo 5638\n## 2 book 2088\n## 3 event 1356\n## 4 author 1332\n## 5 world 1240\n## 6 story 1159\n## 7 join 1095\n## 8 em 1064\n## 9 life 879\n## 10 strong 864\n## # ℹ 24,985 more rows\nremove_reg <- c(\"&\",\"<\",\">\",\"\", \"<\/p>\",\"&rsquo\", \"‘\", \"'\", \"\", \"<\/strong>\", \"rsquo\", \"em\", \"ndash\", \"nbsp\", \"lsquo\", \"strong\")\n \ntidy_des <- tidy_des %>%\n filter(!word %in% remove_reg)\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,989 × 2\n## word n\n## \n## 1 book 2088\n## 2 event 1356\n## 3 author 1332\n## 4 world 1240\n## 5 story 1159\n## 6 join 1095\n## 7 life 879\n## 8 stories 860\n## 9 chaired 815\n## 10 books 767\n## # ℹ 24,979 more rows\nedbf_term_counts <- tidy_des %>% \n group_by(year) %>%\n count(word, sort = TRUE)\nhead(edbf_term_counts)## # A tibble: 6 × 3\n## # Groups: year [6]\n## year word n\n## \n## 1 2016 book 295\n## 2 2018 book 283\n## 3 2019 book 265\n## 4 2012 book 254\n## 5 2013 book 241\n## 6 2015 book 239"},{"path":"exercise-1-word-frequency-analysis.html","id":"analyze-keywords","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.7 Analyze keywords","text":"Okay, now list words, number times appear, can tag words think might related issues gender inequality sexism. may decide list imprecise inexhaustive. , feel free change terms including grepl() function.","code":"\nedbf_term_counts$womword <- as.integer(grepl(\"women|feminist|feminism|gender|harassment|sexism|sexist\", \n x = edbf_term_counts$word))\nhead(edbf_term_counts)## # A tibble: 6 × 4\n## # Groups: year [6]\n## year word n womword\n## \n## 1 2016 book 295 0\n## 2 2018 book 283 0\n## 3 2019 book 265 0\n## 4 2012 book 254 0\n## 5 2013 book 241 0\n## 6 2015 book 239 0"},{"path":"exercise-1-word-frequency-analysis.html","id":"compute-aggregate-statistics","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.8 Compute aggregate statistics","text":"Now tagged individual words relating gender inequality feminism, can sum number times words appear year denominate total number words event descriptions.intuition increase decrease percentage words relating issues capturing substantive change representation issues related sex gender.think measure? adequate measure representation issues cultural sphere?keywords used precise enough? , change?","code":"\n#get counts by year and word\nedbf_counts <- edbf_term_counts %>%\n group_by(year) %>%\n mutate(year_total = sum(n)) %>%\n filter(womword==1) %>%\n summarise(sum_wom = sum(n),\n year_total= min(year_total))\nhead(edbf_counts)## # A tibble: 6 × 3\n## year sum_wom year_total\n##
———. 2013b. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
+
+Haroon, Muhammad, Anshuman Chhabra, Xin Liu, Prasant Mohapatra, Zubair Shafiq, and Magdalena Wojcieszak. 2022. “YouTube, the Great Radicalizer? Auditing and Mitigating Ideological Biases in YouTube Recommendations.” https://doi.org/10.48550/ARXIV.2203.10666.
+
Hopkins, Daniel J., and Gary King. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54 (1): 229–47. https://doi.org/10.1111/j.1540-5907.2009.00428.x.
diff --git a/docs/search.json b/docs/search.json
index 9a569f4..db31413 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -1 +1 @@
-[{"path":"index.html","id":"computational-text-analysis-pgsp11584","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"“Computational Text Analysis” (PGSP11584)","text":" dedicated webpage course Computational Text Analysis” (PGSP11584) University Edinburgh, taught Christopher Barrie. Go Course Overview Introduction tabs course overview introduction R.using online book throughout course. week set essential recommended readings. essential readings must consulted full prior Lecture Seminar week. addition, find online Exercises examples written R. “live” book amended updated course .","code":""},{"path":"index.html","id":"structure","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"0.1 Structure","text":"course structured alternating weeks substantive technical instruction.","code":""},{"path":"index.html","id":"acknowledgments","chapter":"“Computational Text Analysis” (PGSP11584)","heading":"Acknowledgments","text":"compiling course, benefited syllabus materials shared online Margaret Roberts, Alexandra Siegel, Arthur Spirling. Thanks also Justin Grimmer, Margaret Roberts, Brandon Stewart providing early view access forthcoming Text Data book.","code":""},{"path":"course-overview.html","id":"course-overview","chapter":"Course Overview","heading":"Course Overview","text":"recent years, use computational techniques quantitative analysis text exploded. volume quantity text data now access digital age enormous. led social scientists seek new means analyzing text data scale.see text records, form digital traces left social media platforms, archived works literature, parliamentary speeches, video transcripts, print news, can help us answer huge range important questions.","code":""},{"path":"course-overview.html","id":"learning-outcomes","chapter":"Course Overview","heading":"Learning outcomes","text":"course give students training use computational text analysis techniques. course prepare students dissertation work uses textual data provide hands-training use R programming language () Python.course provide venue seminar discussion examples using methods empirical social sciences well lectures technical /statistical dimensions application.","code":""},{"path":"course-overview.html","id":"course-structure","chapter":"Course Overview","heading":"Course structure","text":"using online book ten-week course “Computational Text Analysis” (PGSP11584). chapter contains readings week. book also includes worksheets example code conduct text analysis techniques discuss week.week (partial exception week 1), discussing, alternately, substantive technical dimensions published research empirical social sciences. readings week generally contain two “substantive” readings—, examples application text analysis techniques empirical data—one “technical” reading focuses mainly statistical computational aspects given technique.study first technical aspects analytical approaches , second, substantive dimensions applications. means , discussing readings, able discuss satisfactory given approach illuminating question topic hand.Lectures primarily focused technical dimensions given technique. seminar (Q&) follows give us opportunity study discuss questions social scientific interest, computational text analysis used answer .","code":""},{"path":"course-overview.html","id":"course-pre-preparation","chapter":"Course Overview","heading":"Course pre-preparation","text":"NOTE: lecture Week 2, students complete two introductory R exercises. 
students already done courses Semester 1 need .haven’t done pre-preparation tasks already, , first, consult worksheet , introduction setting understanding basics working R. Second, Ugur Ozdemir provided comprehensive introductory R course Research Training Centre University Edinburgh can follow instructions access .","code":""},{"path":"course-overview.html","id":"reference-sources","chapter":"Course Overview","heading":"Reference sources","text":"several reference texts use course:Wickham, Hadley Garrett Grolemund. R Data Science: https://r4ds..co.nz/Silge, Julia David Robinson. Text Mining R: https://www.tidytextmining.com/\nlearning tidytext, online tutorial used: https://juliasilge.shinyapps.io/learntidytext/\nlearning tidytext, online tutorial used: https://juliasilge.shinyapps.io/learntidytext/(later course) Hvitfelft, Emil Julia Silge. Supervised Machine Learning Text Analysis R: https://smltar.com/several weeks, also referring two textbooks, available online, information retrieval text processing. :Jurafsky, Dan James H. Martin. Speech Language Processing (3rd ed. draft): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Christopher D.,Prabhakar Raghavan, Hinrich Schütze. Introduction Information Retrieval: https://nlp.stanford.edu/IR-book/information-retrieval-book.html","code":""},{"path":"course-overview.html","id":"assessment","chapter":"Course Overview","heading":"Assessment","text":"","code":""},{"path":"course-overview.html","id":"fortnightly-worksheets","chapter":"Course Overview","heading":"Fortnightly worksheets","text":"fortnight, provide one worksheet walks implement different text analysis technique. end worksheets find set questions. buddy someone else class go together.called “pair programming” ’s reason . Firstly, coding can isolating difficult thing—’s good bring friend along ride! Secondly, ’s something don’t know, maybe buddy . saves time. Thirdly, buddy can check code write , vice versa. , means working together produce check something go along.subsequent week’s lecture, pick pair random answer one worksheet’s questions (.e., ~1/3 chance ’re going get picked week). ask walk us code. remember: ’s also fine struggled didn’t get end! encountered obstacle, can work together. matters try.remainder seminar worksheet weeks dedicated seminar discussion discuss readings together.","code":""},{"path":"course-overview.html","id":"fortnightly-flash-talks","chapter":"Course Overview","heading":"Fortnightly flash talks","text":"weeks going tasked coding assignment, ’re hook… selecting pair random (coding pair) talk one readings. pick different pair reading (.e., ~ 1/3 chance ).Don’t let cause great anguish: just want thirty seconds minutes lay least one—preferably two three—criticisms articles required reading week,, want think whether article really answered research question, whether data appropriate answering question, whether method appropriate answering question, whether results show author claims show.remainder seminar flash talk weeks dedicated group work go coding Worksheet together.","code":""},{"path":"course-overview.html","id":"final-assessment","chapter":"Course Overview","heading":"Final assessment","text":"Assessment takes form one summative assessment. 4000 word essay subject choosing (prior approval ). , required select range data sources provide. 
may also suggest data source.asked : ) formulate research question; b) use least one computational text analysis technique studied; c) conduct analysis data source provided; d) write initial findings; e) outline potential extensions analysis.provide code used reproducible (markdown) format assessed substantive content essay contribution (social science part) well demonstrated competency coding text analysis (computational part).","code":""},{"path":"introduction-to-r.html","id":"introduction-to-r","chapter":"Introduction to R","heading":"Introduction to R","text":"section designed ensure familiar R environment.","code":""},{"path":"introduction-to-r.html","id":"getting-started-with-r-at-home","chapter":"Introduction to R","heading":"0.2 Getting started with R at home","text":"Given ’re working home days, ’ll need download R RStudio onto devices. R name programming language ’ll using coding exercises; RStudio IDE (“Integrated Development Environment”), .e., piece software almost everyone uses working R.can download Windows Mac easily free. one first reasons use “open-source” programming language: ’s free everyone can contribute!Services University Edinburgh provided walkthrough needed get started. also break :Install R Mac : https://cran.r-project.org/bin/macosx/. Install R Windows : https://cran.r-project.org/bin/windows/base/.Install R Mac : https://cran.r-project.org/bin/macosx/. Install R Windows : https://cran.r-project.org/bin/windows/base/.Download RStudio Windows Mac : https://rstudio.com/products/rstudio/download/, choosing Free version: people use enough needs.Download RStudio Windows Mac : https://rstudio.com/products/rstudio/download/, choosing Free version: people use enough needs.programs free. Make sure load everything listed operating system R work properly!","code":""},{"path":"introduction-to-r.html","id":"some-basic-information","chapter":"Introduction to R","heading":"0.3 Some basic information","text":"script text file write commands (code) comments.script text file write commands (code) comments.put # character front line text line executed; useful add comments script!put # character front line text line executed; useful add comments script!R case sensitive, careful typing.R case sensitive, careful typing.send code script console, highlight relevant line code script click Run, select line hit ctrl+enter PCR cmd+enter MacTo send code script console, highlight relevant line code script click Run, select line hit ctrl+enter PCR cmd+enter MacAccess help files R functions preceding name function ? (e.g., ?table)Access help files R functions preceding name function ? (e.g., ?table)pressing key, can go back commands used beforeBy pressing key, can go back commands used beforePress tab key auto-complete variable names commandsPress tab key auto-complete variable names commands","code":""},{"path":"introduction-to-r.html","id":"getting-started-in-rstudio","chapter":"Introduction to R","heading":"0.4 Getting Started in RStudio","text":"Begin opening RStudio (located desktop). first task create new script (write commands). 
, click:screen now four panes:Script (top left)Script (top left)Console (bottom left)Console (bottom left)Environment/History (top right)Environment/History (top right)Files/Plots/Packages/Help/Viewer (bottom right)Files/Plots/Packages/Help/Viewer (bottom right)","code":"File --> NewFile --> RScript"},{"path":"introduction-to-r.html","id":"a-simple-example","chapter":"Introduction to R","heading":"0.5 A simple example","text":"Script (top left) write commands R. can try first time writing small snipped code follows:tell R run command, highlight relevant row script click Run button (top right Script) - hold ctrl+enter Windows cmd+enter Mac - send command Console (bottom left), actual evaluation calculations taking place. shortcut keys become familiar quickly!Running command creates object named ‘x’, contains words message.can now see ‘x’ Environment (top right). view contained x, type Console (bottom left):","code":"\nx <- \"I can't wait to learn Computational Text Analysis\" #Note the quotation marks!\nprint(x)## [1] \"I can't wait to learn Computational Text Analysis\"\n# or alternatively you can just type:\n\nx## [1] \"I can't wait to learn Computational Text Analysis\""},{"path":"introduction-to-r.html","id":"loading-packages","chapter":"Introduction to R","heading":"0.6 Loading packages","text":"‘base’ version R powerful able everything , least ease. technical specialized forms analysis, need load new packages.need install -called ‘package’—program includes new tools (.e., functions) carry specific tasks. can think ‘extensions’ enhancing R’s capacities.take one example, might want something little exciting print excited course. Let’s make map instead.might sound technical. beauty packaged extensions R contain functions perform specialized types analysis ease.’ll first need install one packages, can :package installed, need load environment typing library(). Note , , don’t need wrap name package quotation marks. trick:now? Well, let’s see just easy visualize data using ggplot package comes bundled larger tidyverse package.wanted save ’d got making plots, want save scripts, maybe data used well, return later stage.","code":"\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy))"},{"path":"introduction-to-r.html","id":"saving-your-objects-plots-and-scripts","chapter":"Introduction to R","heading":"0.7 Saving your objects, plots and scripts","text":"Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.save individual objects (example x ) environment, run following command (choosing suitable filename):save individual objects (example x ) environment, run following command (choosing suitable filename):save objects (.e. everything top right panel) , run following command (choosing suitable filename):objects can re-loaded R next session running:many file formats might use save output. 
encounter course progresses.","code":"\nsave(x,file=\"myobject.RData\")\nload(file=\"myobject.RData\")\nsave.image(file=\"myfilname.RData\")\nload(file=\"myfilename.RData\")"},{"path":"introduction-to-r.html","id":"knowing-where-r-saves-your-documents","chapter":"Introduction to R","heading":"0.8 Knowing where R saves your documents","text":"home, open new script make sure check set working directory (.e. folder files create saved). check working directory use getwd() command (type Console write script Source Editor):set working directory, run following command, substituting file directory choice. Remember anything following `#’ symbol simply clarifying comment R process .","code":"\ngetwd()\n## Example for Mac \nsetwd(\"/Users/Documents/mydir/\") \n## Example for PC \nsetwd(\"c:/docs/mydir\") "},{"path":"introduction-to-r.html","id":"practicing-in-r","chapter":"Introduction to R","heading":"0.9 Practicing in R","text":"best way learn R use . workshops text analysis place become fully proficient R. , however, chance conduct hands-analysis applied examples fast-expanding field. best way learn . give shot!practice R programming language, look Wickham Grolemund (2017) , tidy text analysis, Silge Robinson (2017).free online book Hadley Wickham “R Data Science” available hereThe free online book Hadley Wickham “R Data Science” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereFor practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:","code":"\nlibrary(learnr)\n\navailable_tutorials() # this will tell you the names of the tutorials available\n\nrun_tutorial(name = \"ex-data-basics\", package = \"learnr\") #this will launch the interactive tutorial in a new Internet browser window"},{"path":"introduction-to-r.html","id":"one-final-note","chapter":"Introduction to R","heading":"0.10 One final note","text":"’ve dipped “R Data Science” book ’ll hear lot -called tidyverse R. essentially set packages use alternative, intuitive, way interacting data.main difference ’ll notice , instead separate lines function want run, wrapping functions inside functions, sets functions “piped” using “pipe” functions, look appearance: %>%.using “tidy” syntax weekly exercises computational text analysis workshops. anything unclear, can provide equivalents “base” R . lot useful text analysis packages now composed ‘tidy’ syntax.","code":""},{"path":"week-1-retrieving-and-analyzing-text.html","id":"week-1-retrieving-and-analyzing-text","chapter":"1 Week 1: Retrieving and analyzing text","heading":"1 Week 1: Retrieving and analyzing text","text":"first task conducting large-scale text analyses gathering curating text information . focus chapters Manning, Raghavan, Schtze (2007) listed . , ’ll find introduction different ways can reformat ‘query’ text data order begin asking questions . often referred computer science natural language processing contexts “information retrieval” foundation many search, including web search, processes.articles Tatman (2017) Pechenick, Danforth, Dodds (2015) focus seminar (Q&). articles get us thinking fundamentals text discovery sampling. 
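Referring back to the note on tidyverse syntax above: the pipe passes the object on its left as the first argument to the function on its right. A minimal illustrative sketch of the same operation written first in base R and then in piped "tidy" style; the data frame and column names here are invented for illustration and are not part of the course materials:

```r
library(dplyr)

word_df <- data.frame(word = c("text", "as", "data", "text"))

# base R style: nested calls, read from the inside out
head(arrange(count(word_df, word), desc(n)))

# "tidy" style: the same steps piped left to right with %>%
word_df %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head()
```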
reading articles think locating texts, sampling , biases might inhere sampling process, texts represent; .e., population phenomenon interest might provide inferences.Questions seminar:access text? need consider ?sample texts?biases need keep mind?Required reading:Tatman (2017)Tatman (2017)Pechenick, Danforth, Dodds (2015)Pechenick, Danforth, Dodds (2015)Manning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlKlaus Krippendorff (2004) (ch. 6)Klaus Krippendorff (2004) (ch. 6)reading:Olteanu et al. (2019)Biber (1993)Barberá Rivero (2015)Slides:Week 1 Slides","code":""},{"path":"week-2-tokenization-and-word-frequencies.html","id":"week-2-tokenization-and-word-frequencies","chapter":"2 Week 2: Tokenization and word frequencies","heading":"2 Week 2: Tokenization and word frequencies","text":"approaching large-scale quantiative analyses text, key task identify capture unit analysis. One commonly used approaches, across diverse analytical contexts, text tokenization. , splitting text word units: unigrams, bigrams, trigrams etc.chapters Manning, Raghavan, Schtze (2007), listed , provide technical introduction task “querying” text according different word-based queries. task studying hands-assignment week.seminar discussion, focusing widely-cited examples research applied social sciences employing token-based, word frequency, analyses large corpora. first, Michel et al. (2011) uses enormous Google books corpus measure cultural linguistic trends. second, Bollen et al. (2021a) uses corpus demonstrate specific change time—-called “cognitive distortion.” examples, attentive questions sampling covered previous weeks. question central back--forths short responses replies articles Michel et al. (2011) Bollen et al. (2021a).Questions:Tokenizing counting: capture?Corpus-based sampling: biases might threaten inference?write critique either Michel et al. (2011) Bollen et al. (2021a), focus ?Required reading:Michel et al. (2011)\nSchwartz (2011)\nMorse-Gagné (2011)\nAiden, Pickett, Michel (2011)\nSchwartz (2011)Morse-Gagné (2011)Aiden, Pickett, Michel (2011)Bollen et al. (2021a)\nSchmidt, Piantadosi, Mahowald (2021)\nBollen et al. (2021b)\nSchmidt, Piantadosi, Mahowald (2021)Bollen et al. (2021b)Manning, Raghavan, Schtze (2007) (ch. 2): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]Klaus Krippendorff (2004) (ch. 5)reading:Rozado, Al-Gharbi, Halberstadt (2021)Alshaabi et al. (2021)Campos et al. (2015)Greenfield (2013)Slides:Week 2 Slides","code":""},{"path":"week-2-demo.html","id":"week-2-demo","chapter":"3 Week 2 Demo","heading":"3 Week 2 Demo","text":"","code":""},{"path":"week-2-demo.html","id":"setup","chapter":"3 Week 2 Demo","heading":"3.1 Setup","text":"section, ’ll quick overview ’re processing text data conducting analyses word frequency. 
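The tokenization discussed above need not stop at single words. As a minimal sketch (not part of the demo that follows, and using an invented example sentence), tidytext can also split text into bigrams by changing the tokenizer:

```r
library(dplyr)
library(tidytext)

sentence_df <- data.frame(text = "we are all very happy to be at a lecture at 11am")

# unigrams: one word per row (the unit used in the demo below)
sentence_df %>%
  unnest_tokens(word, text)

# bigrams: overlapping two-word units
sentence_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```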
’ll using randomly simulated text.First load packages ’ll using:","code":"\nlibrary(stringi) #to generate random text\nlibrary(dplyr) #tidyverse package for wrangling data\nlibrary(tidytext) #package for 'tidy' manipulation of text data\nlibrary(ggplot2) #package for visualizing data\nlibrary(scales) #additional package for formatting plot axes\nlibrary(kableExtra) #package for displaying data in html format (relevant for formatting this worksheet mainly)"},{"path":"week-2-demo.html","id":"tokenizing","chapter":"3 Week 2 Demo","heading":"3.2 Tokenizing","text":"’ll first get random text see looks like ’re tokenizing text.can tokenize unnest_tokens() function tidytext.Now ’ll get larger data, simulating 5000 observations (rows) random Latin text strings.’ll add another column call “weeks.” unit analysis.Now ’ll simulate trend see increasing number words weeks go . Don’t worry much code little complex, share case interest.can see week goes , text.can trend week sees decreasing number words.Now let’s check top frequency words text.’re going check frequencies word “sed” ’re gonna normalize denominating total word frequencies week.First need get total word frequencies week.can join two dataframes together left_join() function ’re joining “week” column. can pipe joined data plot.","code":"\nlipsum_text <- data.frame(text = stri_rand_lipsum(1, start_lipsum = TRUE))\n\nhead(lipsum_text$text)## [1] \"Lorem ipsum dolor sit amet, consectetur dictum ante id urna, quis convallis. Eros ut magnis mauris, eros auctor! Auctor ipsum eu himenaeos interdum. Dictum, litora urna sapien ut morbi, dui at ante at. Lorem vitae ac ut commodo. Id non ridiculus leo erat, tristique inceptos mauris faucibus consectetur erat et. Ex sed at accumsan. Molestie ultricies eu nisl congue duis volutpat ac. 
Lectus, est ornare sed vel dignissim ac parturient nisl vivamus.\"\ntokens <- lipsum_text %>%\n unnest_tokens(word, text)\n\nhead(tokens)## word\n## 1 lorem\n## 2 ipsum\n## 3 dolor\n## 4 sit\n## 5 amet\n## 6 consectetur\n## Varying total words example\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i, 2]\n morewords <-\n paste(rep(\"more lipsum words\", times = sample(1:100, 1) * week), collapse = \" \")\n lipsum_words <- lipsum_text[i, 1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i, 1] <- new_lipsum_text\n}\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\n# simulate decreasing words trend\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\n\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i,2]\n morewords <- paste(rep(\"more lipsum words\", times = sample(1:100, 1)* 1/week), collapse = \" \")\n lipsum_words <- lipsum_text[i,1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i,1] <- new_lipsum_text\n}\n\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word, sort = T) %>%\n top_n(5) %>%\n knitr::kable(format=\"html\")%>% \n kable_styling(\"striped\", full_width = F)## Selecting by n\nlipsum_totals <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word) %>%\n mutate(total = sum(n)) %>%\n distinct(week, total)\n# let's look for \"sed\"\nlipsum_sed <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n filter(word == \"sed\") %>%\n dplyr::count(word) %>%\n mutate(total_sed = sum(n)) %>%\n distinct(week, total_sed)\nlipsum_sed %>%\n left_join(lipsum_totals, by = \"week\") %>%\n mutate(sed_prop = total_sed/total) %>%\n ggplot() +\n geom_line(aes(week, sed_prop)) +\n labs(x = \"Week\", y = \"\n Proportion sed word\") +\n scale_x_continuous(breaks= pretty_breaks())"},{"path":"week-2-demo.html","id":"regexing","chapter":"3 Week 2 Demo","heading":"3.3 Regexing","text":"’ll notice worksheet word frequencies one point set parentheses str_detect() string “[-z]”. called character class use square brackets like [].character classes include, helpfully listed vignette stringr package. follows adapted materials regular expressions.[abc]: matches , b, c.[-z]: matches every character z\n(Unicode code point order).[^abc]: matches anything except , b, c.[\\^\\-]: matches ^ -.Several patterns match multiple characters. include:\\d: matches digit; opposite \\D, matches character \ndecimal digit.\\s: matches whitespace; opposite \\S^: matches start string$: matches end string^ $: exact string matchHold : plus signs etc. 
mean?+: 1 .*: 0 .?: 0 1.can tell output makes sense, ’re getting !","code":"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")## [[1]]\n## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")## [[1]]\n## [1] \" + \" \" = \"\n(text <- \"Some \\t badly\\n\\t\\tspaced \\f text\")## [1] \"Some \\t badly\\n\\t\\tspaced \\f text\"\nstr_replace_all(text, \"\\\\s+\", \" \")## [1] \"Some badly spaced text\"\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a\")## [1] \"a\" NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a$\")## [1] NA NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^apple$\")## [1] \"apple\" NA NA\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")[[1]]## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")[[1]]## [1] \" + \" \" = \"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d*\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D*\")[[1]]## [1] \"\" \" + \" \"\" \" = \" \"\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d?\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D?\")[[1]]## [1] \"\" \" \" \"+\" \" \" \"\" \" \" \"=\" \" \" \"\" \"\""},{"path":"week-2-demo.html","id":"some-more-regex-resources","chapter":"3 Week 2 Demo","heading":"3.3.1 Some more regex resources:","text":"Regex crossword: https://regexcrossword.com/.Regexone: https://regexone.com/R4DS chapter 14","code":""},{"path":"week-3-dictionary-based-techniques.html","id":"week-3-dictionary-based-techniques","chapter":"4 Week 3: Dictionary-based techniques","heading":"4 Week 3: Dictionary-based techniques","text":"extension word frequency analyses, covered last week, -called “dictionary-based” techniques. basic form, analyses use index target terms classify corpus interest based presence absence. technical dimensions type analysis covered chapter section Klaus Krippendorff (2004), issues attending article - Loughran Mcdonald (2011).also reading two examples application techniques Martins Baumard (2020) Young Soroka (2012). , discussing successful authors measuring phenomenon interest (“prosociality” “tone” respectively). Questions sampling representativeness relevant , naturally inform assessments work.Questions:general dictionaries possible; domain-specific?know dictionary accurate?enhance/supplement dictionary-based techniques?Required reading:Martins Baumard (2020)Voigt et al. (2017)reading:Tausczik Pennebaker (2010)Klaus Krippendorff (2004) (pp.283-289)Brier Hopp (2011)Bonikowski Gidron (2015)Barberá et al. (2021)Young Soroka (2012)Slides:Week 3 Slides","code":""},{"path":"week-3-demo.html","id":"week-3-demo","chapter":"5 Week 3 Demo","heading":"5 Week 3 Demo","text":"section, ’ll quick overview ’re processing text data conducting basic sentiment analyses.","code":""},{"path":"week-3-demo.html","id":"setup-1","chapter":"5 Week 3 Demo","heading":"5.1 Setup","text":"’ll first load packages need.","code":"\nlibrary(stringi)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(scales)"},{"path":"week-3-demo.html","id":"happy-words","chapter":"5 Week 3 Demo","heading":"5.2 Happy words","text":"discussed lectures, might find text class’s collective thoughts increase “happy” words time.simulated dataset text split weeks, students, words plus whether word word “happy” 0 means word “happy” 1 means .three datasets: one constant number “happy” words; one increasing number “happy” words; one decreasing number “happy” words. 
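The simulated data described here have one row per word, tagged with the week, the student, and a 0/1 "happy" indicator. Below is a minimal sketch of how data in this shape might be simulated; all names and numbers are invented for illustration, and this is not the code used to build the course datasets:

```r
library(dplyr)
library(stringi)
library(tidytext)

set.seed(123)

# one block of random text per student per week
sim <- expand.grid(week = 1:10, student = 1:30) %>%
  mutate(text = stri_rand_lipsum(n(), start_lipsum = FALSE))

# sprinkle in the word "happy", then tokenize and flag it
sim_tokens <- sim %>%
  mutate(text = paste(text, strrep("happy ", sample(1:5, n(), replace = TRUE)))) %>%
  unnest_tokens(word, text) %>%
  mutate(happy = as.integer(word == "happy"))

head(sim_tokens)
```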
called: happyn, happyu, happyd respectively.can see trend “happy” words week student.First, dataset constant number happy words time.now simulated data increasing number happy words.finally decreasing number happy words.","code":"\nhead(happyn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0\nhead(happyu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0\nhead(happyd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 amet 0## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'"},{"path":"week-3-demo.html","id":"normalizing-sentiment","chapter":"5 Week 3 Demo","heading":"5.3 Normalizing sentiment","text":"discussed lecture, also know just total number happy words increases, isn’t indication ’re getting happier class time.can begin make inference, need normalize total number words week., simulate data number happy words actually week (happyn dataset ).join data three datasets: happylipsumn, happylipsumu, happylipsumd. datasets random text, number happy words.first also number total words week. 
second two, however, differing number total words week: happylipsumu increasing number total words week; happylipsumd decreasing number total words week., see , ’re splitting week, student, word, whether “happy” word.plot number happy words divided number total words week student datasets, get .get normalized sentiment score–“happy” score–need create variable (column) dataframe sum happy words divided total number words dataframe.can following way.repeat datasets plot see following.plots look like ?Well, first, number total words week number happy words week. divided latter former, get proportion also stable time.second, however, increasing number total words week, number happy words time. means dividing ever larger number, giving ever smaller proportions. , trend decreasing time.third, decreasing number total words week, number happy words time. means dividing ever smaller number, giving ever larger proportions. , trend increasing time.","code":"\nhead(happylipsumn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 taciti 0\nhead(happylipsumu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 maecenas 0\nhead(happylipsumd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 23 lorem 0\n## 2 1 23 ipsum 0\n## 3 1 23 dolor 0\n## 4 1 23 sit 0\n## 5 1 23 amet 0\n## 6 1 23 et 0\nhappylipsumn %>%\n group_by(week, student) %>%\n mutate(index_total = n()) %>%\n filter(happy==1) %>%\n summarise(sum_hap = sum(happy),\n index_total = index_total,\n prop_hap = sum_hap/index_total) %>%\n distinct()## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.## # A tibble: 300 × 5\n## # Groups: week, student [300]\n## week student sum_hap index_total prop_hap\n## \n## 1 1 1 904 4471 0.202\n## 2 1 2 977 3970 0.246\n## 3 1 3 974 4452 0.219\n## 4 1 4 1188 5644 0.210\n## 5 1 5 962 4468 0.215\n## 6 1 6 686 2758 0.249\n## 7 1 7 1105 4493 0.246\n## 8 1 8 1182 5373 0.220\n## 9 1 9 733 3578 0.205\n## 10 1 10 1235 4537 0.272\n## # ℹ 290 more rows"},{"path":"week-4-natural-language-complexity-and-similarity.html","id":"week-4-natural-language-complexity-and-similarity","chapter":"6 Week 4: Natural language, complexity, and similarity","heading":"6 Week 4: Natural language, complexity, and similarity","text":"week delving deeply language used text. previous weeks, tried two main techniques rely, different ways, counting words. week, thinking sophisticated techniques identify measure language use, well compare texts . article Gomaa Fahmy (2013) provides overview different approaches. covering technical dimensions lecture.article Urman, Makhortykh, Ulloa (2021) investigates key question contemporary communications research—information exposed online—shows might compare web search results using similarity measures. Schoonvelde et al. 
(2019) article, hand, looks “complexity” texts, compares politicians different ideological stripes communicate.Questions:measure linguistic complexity/sophistication?biases might involved measuring sophistication?applications might similarity measures?Required reading:Urman, Makhortykh, Ulloa (2021)Schoonvelde et al. (2019)Gomaa Fahmy (2013)reading:Voigt et al. (2017)Peng Hengartner (2002)Lowe (2008)Bail (2012)Ziblatt, Hilbig, Bischof (2020)Benoit, Munger, Spirling (2019)Slides:Week 4 Slides","code":""},{"path":"week-4-demo.html","id":"week-4-demo","chapter":"7 Week 4 Demo","heading":"7 Week 4 Demo","text":"","code":""},{"path":"week-4-demo.html","id":"setup-2","chapter":"7 Week 4 Demo","heading":"7.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\nlibrary(quanteda)\nlibrary(quanteda.textstats)\nlibrary(quanteda.textplots)\nlibrary(tidytext)\nlibrary(stringdist)\nlibrary(corrplot)\nlibrary(janeaustenr)"},{"path":"week-4-demo.html","id":"character-based-similarity","chapter":"7 Week 4 Demo","heading":"7.2 Character-based similarity","text":"first measure text similarity level characters. can look last time (promise) example lecture see similarity compares.’ll make two sentences create two character objects . two thoughts imagined classes.know “longest common substring measure” , according stringdist package documentation, “longest string can obtained pairing characters b keeping order characters intact.”can easily get different distance/similarity measures comparing character objects b .","code":"\na <- \"We are all very happy to be at a lecture at 11AM\"\nb <- \"We are all even happier that we don’t have two lectures a week\"\n## longest common substring distance\nstringdist(a, b,\n method = \"lcs\")## [1] 36\n## levenshtein distance\nstringdist(a, b,\n method = \"lv\")## [1] 27\n## jaro distance\nstringdist(a, b,\n method = \"jw\", p =0)## [1] 0.2550103"},{"path":"week-4-demo.html","id":"term-based-similarity","chapter":"7 Week 4 Demo","heading":"7.3 Term-based similarity","text":"second example lecture, ’re taking opening line Pride Prejudice alongside versions famous opening line.can get text Jane Austen easily thanks janeaustenr package.’re going specify alternative versions sentence.Finally, ’re going convert document feature matrix. ’re quanteda package, package ’ll begin using coming weeks analyses ’re performing get gradually technical.see ?Well, ’s clear text2 text3 similar text1 —share words. also see text2 least contain words shared text1, original opening line Jane Austen’s Pride Prejudice., measure similarity distance texts?first way simply correlating two sets ones zeroes. can quanteda.textstats package like .’ll see get manipulated data tidy format (rows words columns 1s 0s).see expected text2 highly correlated text1 text3.\nEuclidean distances, can use quanteda .define function just see ’s going behind scenes.Manhattan distance, use quanteda .define function.cosine similarity, quanteda makes straightforward.make clear ’s going , write function.","code":"\n## similarity and distance example\n\ntext <- janeaustenr::prideprejudice\n\nsentences <- text[10:11]\n\nsentence1 <- paste(sentences[1], sentences[2], sep = \" \")\n\nsentence1## [1] \"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\"\nsentence2 <- \"Everyone knows that a rich man without wife will want a wife\"\n\nsentence3 <- \"He's loaded so he wants to get married. 
Everyone knows that's what happens.\"\ndfmat <- dfm(tokens(c(sentence1,\n sentence2,\n sentence3)),\n remove_punct = TRUE, remove = stopwords(\"english\"))\n\ndfmat## Document-feature matrix of: 3 documents, 21 features (58.73% sparse) and 0 docvars.\n## features\n## docs truth universally acknowledged single man possession good fortune must\n## text1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 1 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0\n## features\n## docs want\n## text1 1\n## text2 1\n## text3 0\n## [ reached max_nfeat ... 11 more features ]\n## correlation\ntextstat_simil(dfmat, margin = \"documents\", method = \"correlation\")## textstat_simil object; method = \"correlation\"\n## text1 text2 text3\n## text1 1.000 -0.117 -0.742\n## text2 -0.117 1.000 -0.173\n## text3 -0.742 -0.173 1.000\ntest <- tidy(dfmat)\ntest <- test %>%\n cast_dfm(term, document, count)\ntest <- as.data.frame(test)\n\nres <- cor(test[,2:4])\nres## text1 text2 text3\n## text1 1.0000000 -0.1167748 -0.7416198\n## text2 -0.1167748 1.0000000 -0.1732051\n## text3 -0.7416198 -0.1732051 1.0000000\ncorrplot(res, type = \"upper\", order = \"hclust\", \n tl.col = \"black\", tl.srt = 45)\ntextstat_dist(dfmat, margin = \"documents\", method = \"euclidean\")## textstat_dist object; method = \"euclidean\"\n## text1 text2 text3\n## text1 0 3.74 4.24\n## text2 3.74 0 3.74\n## text3 4.24 3.74 0\n# function for Euclidean distance\neuclidean <- function(a,b) sqrt(sum((a - b)^2))\n# estimating the distance\neuclidean(test$text1, test$text2)## [1] 3.741657\neuclidean(test$text1, test$text3)## [1] 4.242641\neuclidean(test$text2, test$text3)## [1] 3.741657\ntextstat_dist(dfmat, margin = \"documents\", method = \"manhattan\")## textstat_dist object; method = \"manhattan\"\n## text1 text2 text3\n## text1 0 14 18\n## text2 14 0 12\n## text3 18 12 0\n## manhattan\nmanhattan <- function(a, b){\n dist <- abs(a - b)\n dist <- sum(dist)\n return(dist)\n}\n\nmanhattan(test$text1, test$text2)## [1] 14\nmanhattan(test$text1, test$text3)## [1] 18\nmanhattan(test$text2, test$text3)## [1] 12\ntextstat_simil(dfmat, margin = \"documents\", method = \"cosine\")## textstat_simil object; method = \"cosine\"\n## text1 text2 text3\n## text1 1.000 0.364 0\n## text2 0.364 1.000 0.228\n## text3 0 0.228 1.000\n## cosine\ncos.sim <- function(a, b) \n{\n return(sum(a*b)/sqrt(sum(a^2)*sum(b^2)) )\n} \n\ncos.sim(test$text1, test$text2)## [1] 0.3636364\ncos.sim(test$text1, test$text3)## [1] 0\ncos.sim(test$text2, test$text3)## [1] 0.2279212"},{"path":"week-4-demo.html","id":"complexity","chapter":"7 Week 4 Demo","heading":"7.4 Complexity","text":"Note: section borrows notation materials texstat_readability() function.also talked different document-level measures text characteristics. One “complexity” readability text. One frequently used Flesch’s Reading Ease Score (Flesch 1948).computed :{:}{Flesch’s Reading Ease Score (Flesch 1948).\n}can estimate readability score respective sentences . Flesch score 1948 default.see ? original Austen opening line marked lower readability colloquial alternatives.alternatives measures might use. can check clicking links function textstat_readability(). display .One McLaughlin (1969) “Simple Measure Gobbledygook, based recurrence words 3 syllables calculated :{:}{Simple Measure Gobbledygook (SMOG) (McLaughlin 1969). 
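For reference, the two readability measures discussed in this section are commonly given as follows, where \(N_{words}\), \(N_{sentences}\), and \(N_{syllables}\) are the numbers of words, sentences, and syllables, and \(N_{wmin3sy}\) is the number of words of three or more syllables. This is a standard statement of the formulas rather than a quotation from the quanteda documentation; consult the textstat_readability() help page for the exact form implemented there.

\[
\textrm{Flesch Reading Ease} = 206.835 - 1.015\,\frac{N_{words}}{N_{sentences}} - 84.6\,\frac{N_{syllables}}{N_{words}}
\]

\[
\textrm{SMOG} = 1.0430\,\sqrt{N_{wmin3sy}\times\frac{30}{N_{sentences}}} + 3.1291
\]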
= Nwmin3sy = number words 3 syllables .\nmeasure regression equation D McLaughlin’s original paper.}can calculate three sentences ., , see original Austen sentence higher level complexity (gobbledygook!).","code":"\ntextstat_readability(sentence1)## document Flesch\n## 1 text1 62.10739\ntextstat_readability(sentence2)## document Flesch\n## 1 text1 88.905\ntextstat_readability(sentence3)## document Flesch\n## 1 text1 83.09904\ntextstat_readability(sentence1, measure = \"SMOG\")## document SMOG\n## 1 text1 13.02387\ntextstat_readability(sentence2, measure = \"SMOG\")## document SMOG\n## 1 text1 8.841846\ntextstat_readability(sentence3, measure = \"SMOG\")## document SMOG\n## 1 text1 7.168622"},{"path":"week-5-scaling-techniques.html","id":"week-5-scaling-techniques","chapter":"8 Week 5: Scaling techniques","heading":"8 Week 5: Scaling techniques","text":"begin thinking automated techniques analyzing texts. bunch additional considerations now need bring mind. considerations sparked significant debates… matter means settled.stake ? weeks come, studying various techniques ‘classify,’ ‘position’ ‘score’ texts based features. success techniques depends suitability question hand also higher-level questions meaning. short, ask : way can access underlying processes governing generation text? meaning governed set structural processes? can derive ‘objective’ measures contents given text?readings Justin Grimmer, Roberts, Stewart (2021), Denny Spirling (2018), Goldenstein Poschmann (2019b) (well response replies Nelson (2019) Goldenstein Poschmann (2019a)) required reading Flexible Learning Week.Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer Stewart (2013a)Justin Grimmer Stewart (2013a)Denny Spirling (2018)Denny Spirling (2018)Goldenstein Poschmann (2019b)\nNelson (2019)\nGoldenstein Poschmann (2019a)\nGoldenstein Poschmann (2019b)Nelson (2019)Goldenstein Poschmann (2019a)substantive focus week set readings employ different types “scaling” “low-dimensional document embedding” techniques. article Lowe (2008) provides technical overview “wordfish” algorithm uses political science contexts. article Klüver (2009) also uses “wordfish” different way—measure “influence” interest groups. response article Bunea Ibenskas (2015) subsequent reply Klüver (2015) helps illuminate debates around questions. work Kim, Lelkes, McCrain (2022) gives insight ability text-scaling techniques capture key dimensions political communication bias.Questions:assumptions underlie scaling models text?; latent text decides?might scaling useful outside estimating ideological position/bias text?Required reading:Lowe (2008)Kim, Lelkes, McCrain (2022)Klüver (2009)\nBunea Ibenskas (2015)\nKlüver (2015)\nBunea Ibenskas (2015)Klüver (2015)reading:Benoit et al. 
(2016)Laver, Benoit, Garry (2003)Slapin Proksch (2008)Schwemmer Wieczorek (2020)Slides:Week 5 Slides","code":""},{"path":"week-5-demo.html","id":"week-5-demo","chapter":"9 Week 5 Demo","heading":"9 Week 5 Demo","text":"","code":""},{"path":"week-5-demo.html","id":"setup-3","chapter":"9 Week 5 Demo","heading":"9.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\ndevtools::install_github(\"conjugateprior/austin\")\nlibrary(austin)\nlibrary(quanteda)\nlibrary(quanteda.textstats)"},{"path":"week-5-demo.html","id":"wordscores","chapter":"9 Week 5 Demo","heading":"9.2 Wordscores","text":"can inspect function wordscores model Laver, Benoit, Garry (2003) following way:can take example data included austin package.reference documents documents marked “R” reference; .e., columns one five.matrix simply series words (: letters) reference texts word counts .can look wordscores words, calculated using reference dimensions reference documents.see thetas contained wordscores object, .e., reference dimensions reference documents pis, .e., estimated wordscores word.can now use score -called “virgin” texts follows.","code":"\nclassic.wordscores## function (wfm, scores) \n## {\n## if (!is.wfm(wfm)) \n## stop(\"Function not applicable to this object\")\n## if (length(scores) != length(docs(wfm))) \n## stop(\"There are not the same number of documents as scores\")\n## if (any(is.na(scores))) \n## stop(\"One of the reference document scores is NA\\nFit the model with known scores and use 'predict' to get virgin score estimates\")\n## thecall <- match.call()\n## C.all <- as.worddoc(wfm)\n## C <- C.all[rowSums(C.all) > 0, ]\n## F <- scale(C, center = FALSE, scale = colSums(C))\n## ws <- apply(F, 1, function(x) {\n## sum(scores * x)\n## })/rowSums(F)\n## pi <- matrix(ws, nrow = length(ws))\n## rownames(pi) <- rownames(C)\n## colnames(pi) <- c(\"Score\")\n## val <- list(pi = pi, theta = scores, data = wfm, call = thecall)\n## class(val) <- c(\"classic.wordscores\", \"wordscores\", class(val))\n## return(val)\n## }\n## \n## \ndata(lbg)\nref <- getdocs(lbg, 1:5)\nref## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\nws <- classic.wordscores(ref, scores=seq(-1.5,1.5,by=0.75))\nws## $pi\n## Score\n## A -1.5000000\n## B -1.5000000\n## C -1.5000000\n## D -1.5000000\n## E -1.5000000\n## F -1.4812500\n## G -1.4809322\n## H -1.4519231\n## I -1.4083333\n## J -1.3232984\n## K -1.1846154\n## L -1.0369898\n## M -0.8805970\n## N -0.7500000\n## O -0.6194030\n## P -0.4507576\n## Q -0.2992424\n## R -0.1305970\n## S 0.0000000\n## T 0.1305970\n## U 0.2992424\n## V 0.4507576\n## W 0.6194030\n## X 0.7500000\n## Y 0.8805970\n## Z 1.0369898\n## ZA 1.1846154\n## ZB 1.3232984\n## ZC 1.4083333\n## ZD 1.4519231\n## ZE 1.4809322\n## ZF 1.4812500\n## ZG 1.5000000\n## ZH 1.5000000\n## ZI 1.5000000\n## ZJ 1.5000000\n## ZK 1.5000000\n## \n## $theta\n## 
[1] -1.50 -0.75 0.00 0.75 1.50\n## \n## $data\n## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\n## \n## $call\n## classic.wordscores(wfm = ref, scores = seq(-1.5, 1.5, by = 0.75))\n## \n## attr(,\"class\")\n## [1] \"classic.wordscores\" \"wordscores\" \"list\"\n#get \"virgin\" documents\nvir <- getdocs(lbg, 'V1')\nvir## docs\n## words V1\n## A 0\n## B 0\n## C 0\n## D 0\n## E 0\n## F 0\n## G 0\n## H 2\n## I 3\n## J 10\n## K 22\n## L 45\n## M 78\n## N 115\n## O 146\n## P 158\n## Q 146\n## R 115\n## S 78\n## T 45\n## U 22\n## V 10\n## W 3\n## X 2\n## Y 0\n## Z 0\n## ZA 0\n## ZB 0\n## ZC 0\n## ZD 0\n## ZE 0\n## ZF 0\n## ZG 0\n## ZH 0\n## ZI 0\n## ZJ 0\n## ZK 0\n# predict textscores for the virgin documents\npredict(ws, newdata=vir)## 37 of 37 words (100%) are scorable\n## \n## Score Std. Err. Rescaled Lower Upper\n## V1 -0.448 0.0119 -0.448 -0.459 -0.437"},{"path":"week-5-demo.html","id":"wordfish","chapter":"9 Week 5 Demo","heading":"9.3 Wordfish","text":"wish, can inspect function wordscores model Slapin Proksch (2008) following way. much complex algorithm, printed , can inspect devices.can simulate data, formatted appropriately wordfiash estimation following way:can see document word-level FEs, well specified range thetas estimates.estimating document positions simply matter implementing algorithm.","code":"\nwordfish\ndd <- sim.wordfish()\n\ndd## $Y\n## docs\n## words D01 D02 D03 D04 D05 D06 D07 D08 D09 D10\n## W01 19 24 23 18 14 12 8 13 6 4\n## W02 25 11 22 22 12 11 6 10 4 4\n## W03 14 21 18 19 13 16 17 10 3 11\n## W04 34 23 25 11 19 16 10 6 13 7\n## W05 25 19 20 20 16 10 10 12 7 2\n## W06 4 5 12 7 13 20 19 19 23 31\n## W07 6 6 15 7 13 16 14 15 19 28\n## W08 5 4 12 14 15 13 18 19 19 20\n## W09 6 7 7 9 8 17 19 20 17 20\n## W10 6 6 9 6 13 13 13 19 17 27\n## W11 59 59 46 38 39 28 26 25 15 17\n## W12 58 52 53 58 36 38 26 19 26 19\n## W13 59 55 49 44 41 27 24 18 21 10\n## W14 59 59 45 45 32 30 31 15 17 12\n## W15 65 54 43 34 44 39 21 36 13 14\n## W16 12 13 22 36 31 34 49 40 55 53\n## W17 9 23 19 24 31 39 59 50 51 46\n## W18 7 21 10 29 36 34 52 58 57 58\n## W19 14 21 22 27 41 45 42 59 49 58\n## W20 14 17 28 32 33 42 36 37 68 59\n## \n## $theta\n## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446 0.1651446 0.4954337\n## [8] 0.8257228 1.1560120 1.4863011\n## \n## $doclen\n## D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 \n## 500 500 500 500 500 500 500 500 500 500 \n## \n## $psi\n## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1\n## \n## $beta\n## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1\n## \n## attr(,\"class\")\n## [1] \"wordfish.simdata\" \"list\"\nwf <- wordfish(dd$Y)\nsummary(wf)## Call:\n## wordfish(wfm = dd$Y)\n## \n## Document Positions:\n## Estimate Std. 
Error Lower Upper\n## D01 -1.6378 0.12078 -1.87454 -1.40109\n## D02 -1.0988 0.10363 -1.30193 -0.89571\n## D03 -0.7959 0.09716 -0.98635 -0.60548\n## D04 -0.4694 0.09256 -0.65084 -0.28802\n## D05 -0.1188 0.09023 -0.29565 0.05807\n## D06 0.2096 0.09047 0.03232 0.38695\n## D07 0.6201 0.09404 0.43578 0.80442\n## D08 0.7459 0.09588 0.55795 0.93381\n## D09 1.1088 0.10322 0.90646 1.31108\n## D10 1.4366 0.11257 1.21598 1.65725"},{"path":"week-5-demo.html","id":"using-quanteda","chapter":"9 Week 5 Demo","heading":"9.4 Using quanteda","text":"can also use quanteda implement scaling techniques, demonstrated Exercise 4.","code":""},{"path":"week-6-unsupervised-learning-topic-models.html","id":"week-6-unsupervised-learning-topic-models","chapter":"10 Week 6: Unsupervised learning (topic models)","heading":"10 Week 6: Unsupervised learning (topic models)","text":"week builds upon past scaling techniques explored Week 5 instead turns another form unsupervised approach—topic modelling.substantive articles Nelson (2020) Alrababa’h Blaydes (2020) provide, turn, illuminating insights using topic models categorize thematic content text information.article Ying, Montgomery, Stewart (2021) provides valuable overview accompaniment earlier work Denny Spirling (2018) thinking validate findings test robustness inferences make models.Questions:assumptions underlie topic modelling approaches?Can develop structural models text?topic modelling discovery measurement strategy?validate model?Required reading:Nelson (2020)PARTHASARATHY, RAO, PALANISWAMY (2019)Ying, Montgomery, Stewart (2021)reading:Chang et al. (2009)Alrababa’h Blaydes (2020)J. Grimmer King (2011)Denny Spirling (2018)Smith et al. (2021)Boyd et al. (2018)Slides:Week 6 Slides","code":""},{"path":"week-6-demo.html","id":"week-6-demo","chapter":"11 Week 6 Demo","heading":"11 Week 6 Demo","text":"","code":""},{"path":"week-6-demo.html","id":"setup-4","chapter":"11 Week 6 Demo","heading":"11.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.Estimating topic model requires us first data form document-term-matrix. another term referred previous weeks document-feature-matrix.can take example data topicmodels package. text news releases Associated Press. consists around 2,200 articles (documents) 10,000 terms (words).estimate topic model need specify document-term-matrix using, number (k) topics estimating. speed estimation, estimating 100 articles.can inspect contents topic follows.can use tidy() function tidytext gather relevant parameters ’ve estimated. 
get \\(\\beta\\) per-topic-per-word probabilities (.e., probability given term belongs given topic) can following.get \\(\\gamma\\) per-document-per-topic probabilities (.e., probability given document (: article) belongs particular topic) following.can easily plot \\(\\beta\\) estimates follows.shows us words associated topic, size associated \\(\\beta\\) coefficient.","code":"\nlibrary(topicmodels)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(ggthemes)\ndata(\"AssociatedPress\", \n package = \"topicmodels\")\nlda_output <- LDA(AssociatedPress[1:100,], k = 10)\nterms(lda_output, 10)## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 \n## [1,] \"soviet\" \"government\" \"i\" \"dukakis\" \"new\" \n## [2,] \"roberts\" \"congress\" \"administration\" \"bush\" \"immigration\"\n## [3,] \"years\" \"jewish\" \"people\" \"rating\" \"central\" \n## [4,] \"gorbachev\" \"million\" \"bush\" \"new\" \"year\" \n## [5,] \"million\" \"soviet\" \"president\" \"president\" \"company\" \n## [6,] \"year\" \"jews\" \"noriega\" \"i\" \"greyhound\" \n## [7,] \"officers\" \"new\" \"thats\" \"day\" \"snow\" \n## [8,] \"gas\" \"people\" \"american\" \"told\" \"southern\" \n## [9,] \"polish\" \"church\" \"peres\" \"blackowned\" \"union\" \n## [10,] \"study\" \"east\" \"official\" \"rose\" \"contact\" \n## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 \n## [1,] \"fire\" \"percent\" \"i\" \"percent\" \"police\" \n## [2,] \"barry\" \"state\" \"people\" \"new\" \"two\" \n## [3,] \"warming\" \"year\" \"new\" \"bank\" \"mrs\" \n## [4,] \"global\" \"man\" \"duracell\" \"prices\" \"i\" \n## [5,] \"moore\" \"years\" \"soviet\" \"york\" \"last\" \n## [6,] \"summit\" \"last\" \"waste\" \"year\" \"school\" \n## [7,] \"mundy\" \"north\" \"agents\" \"california\" \"get\" \n## [8,] \"saudi\" \"government\" \"children\" \"economy\" \"liberace\"\n## [9,] \"asked\" \"national\" \"like\" \"oil\" \"shot\" \n## [10,] \"monday\" \"black\" \"company\" \"report\" \"man\"\nlda_beta <- tidy(lda_output, matrix = \"beta\")\n\nlda_beta %>%\n arrange(-beta)## # A tibble: 104,730 × 3\n## topic term beta\n## \n## 1 9 percent 0.0207\n## 2 7 percent 0.0184\n## 3 1 soviet 0.0143\n## 4 9 new 0.0143\n## 5 4 dukakis 0.0131\n## 6 7 state 0.0125\n## 7 10 police 0.0124\n## 8 8 i 0.0124\n## 9 4 bush 0.0120\n## 10 6 fire 0.0120\n## # ℹ 104,720 more rows\nlda_gamma <- tidy(lda_output, matrix = \"gamma\")\n\nlda_gamma %>%\n arrange(-gamma)## # A tibble: 1,000 × 3\n## document topic gamma\n## \n## 1 76 3 1.00\n## 2 81 2 1.00\n## 3 6 3 1.00\n## 4 43 1 1.00\n## 5 31 9 1.00\n## 6 95 9 1.00\n## 7 77 4 1.00\n## 8 29 1 1.00\n## 9 80 9 1.00\n## 10 57 2 1.00\n## # ℹ 990 more rows\nlda_beta %>%\n group_by(topic) %>%\n top_n(10, beta) %>%\n ungroup() %>%\n arrange(topic, -beta) %>%\n mutate(term = reorder_within(term, beta, topic)) %>%\n ggplot(aes(beta, term, fill = factor(topic))) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~ topic, scales = \"free\", ncol = 4) +\n scale_y_reordered() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"week-7-unsupervised-learning-word-embedding.html","id":"week-7-unsupervised-learning-word-embedding","chapter":"12 Week 7: Unsupervised learning (word embedding)","heading":"12 Week 7: Unsupervised learning (word embedding)","text":"week discussing second form “unsupervised” learning—word embeddings. previous weeks allowed us characterize complexity text, cluster text potential topical focus, word embeddings permit us expansive form measurement. essence, producing matrix representation entire corpus.reading Pedro L. 
Rodriguez Spirling (2022) provides effective overview technical dimensions technique. articles Garg et al. (2018) Kozlowski, Taddy, Evans (2019) two substantive articles use word embeddings provide insights prejudice bias manifested language time.Required reading:Garg et al. (2018)Kozlowski, Taddy, Evans (2019)Waller Anderson (2021)reading:P. Rodriguez Spirling (2021)Pedro L. Rodriguez Spirling (2022)Osnabrügge, Hobolt, Rodon (2021)Rheault Cochrane (2020)Jurafsky Martin (2021, ch.6): https://web.stanford.edu/~jurafsky/slp3/]Slides:Week 7 Slides","code":""},{"path":"week-7-demo.html","id":"week-7-demo","chapter":"13 Week 7 Demo","heading":"13 Week 7 Demo","text":"","code":""},{"path":"week-7-demo.html","id":"setup-5","chapter":"13 Week 7 Demo","heading":"13.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo. pre-loading already-estimated PMI matrix results singular value decomposition approach.work?Various approaches, including:\nSVD\n\nNeural network-based techniques like GloVe Word2Vec\n\nSVD\nSVDNeural network-based techniques like GloVe Word2Vec\nNeural network-based techniques like GloVe Word2VecIn approaches, :Defining context window (see figure )Looking probabilities word appearing near another wordsThe implementation technique using singular value decomposition approach requires following data structure:Word pair matrix PMI (Pairwise mutual information)PMI = log(P(x,y)/P(x)P(y))P(x,y) probability word x appearing within six-word window word yand P(x) probability word x appearing whole corpusand P(y) probability word y appearing whole corpusAnd resulting matrix object take following format:use “Singular Value Decomposition” (SVD) techique. another multidimensional scaling technique, first axis resulting coordinates captures variance, second second-etc…, simply need following.can collect vectors word inspect .","code":"\nlibrary(Matrix) #for handling matrices\nlibrary(tidyverse)\nlibrary(irlba) # for SVD\nlibrary(umap) # for dimensionality reduction\n\nload(\"data/wordembed/pmi_svd.RData\")\nload(\"data/wordembed/pmi_matrix.RData\")## 6 x 6 sparse Matrix of class \"dgCMatrix\"\n## the to and of https a\n## the 0.653259169 -0.01948121 -0.006446459 0.27136395 -0.5246159 -0.32557524\n## to -0.019481205 0.75498084 -0.065170433 -0.25694210 -0.5731182 -0.04595798\n## and -0.006446459 -0.06517043 1.027782342 -0.03974904 -0.4915159 -0.05862969\n## of 0.271363948 -0.25694210 -0.039749043 1.02111517 -0.5045067 0.09829389\n## https -0.524615878 -0.57311817 -0.491515918 -0.50450674 0.5451841 -0.57956404\n## a -0.325575239 -0.04595798 -0.058629689 0.09829389 -0.5795640 1.03048355## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. 
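## Illustrative aside (a minimal sketch, not part of the original demo): computing a
## single PMI value from co-occurrence counts, under the definition given above,
## PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ).
## All counts below are invented toy numbers.

n_windows <- 100000   # total number of context windows in a corpus
n_x       <- 500      # windows containing word x
n_y       <- 800      # windows containing word y
n_xy      <- 60       # windows containing both x and y

p_x  <- n_x  / n_windows
p_y  <- n_y  / n_windows
p_xy <- n_xy / n_windows

# positive values: x and y co-occur more often than expected by chance
pmi_xy <- log(p_xy / (p_x * p_y))
pmi_xy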
..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65326 -0.01948 -0.00645 0.27136 -0.52462 ...\n## ..@ factors : list()\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\ndim(word_vectors)## [1] 21172 256\nhead(word_vectors[1:5, 1:5])## [,1] [,2] [,3] [,4] [,5]\n## the 0.007810973 0.07024009 0.06377615 0.03139044 -0.12362108\n## to 0.006889381 -0.03210269 0.10665925 0.03537632 0.10104552\n## and -0.050498380 0.09131495 0.19658197 -0.08136253 -0.01605705\n## of -0.015628371 0.16306386 0.13296127 -0.04087709 -0.23175976\n## https 0.301718525 0.07658843 -0.01720398 0.26219147 0.07930941"},{"path":"week-7-demo.html","id":"using-glove-or-word2vec","chapter":"13 Week 7 Demo","heading":"13.2 Using GloVe or word2vec","text":"The neural network approach is considerably more involved, but the figure gives an overview picture of the differing algorithmic approaches we might use.","code":""},{"path":"week-8-sampling-text-information.html","id":"week-8-sampling-text-information","chapter":"14 Week 8: Sampling text information","heading":"14 Week 8: Sampling text information","text":"This week we will be thinking about how best to sample text information, about the different biases that might inhere in the data-generating process, and about the representativeness and generalizability of the text corpus we construct. The reading by Barberá and Rivero (2015) investigates the representativeness of Twitter data, and should give us pause when thinking about using digital trace data as a general barometer of public opinion. The reading by Michalopoulos and Xue (2021) takes an entirely different tack, and illustrates how we can think systematically about whether text information is broadly representative of societies in general. Required reading: Barberá and Rivero (2015); Michalopoulos and Xue (2021); Klaus Krippendorff (2004, chs. 5 and 6). Further reading: Martins and Baumard (2020); Baumard et al. (2022). Slides: Week 8 Slides","code":""},{"path":"week-9-supervised-learning.html","id":"week-9-supervised-learning","chapter":"15 Week 9: Supervised learning","heading":"15 Week 9: Supervised learning","text":"Required reading: Hopkins and King (2010); King, Pan, and Roberts (2017); Siegel et al. (2021); Yu, Kaufmann, and Diermeier (2008); Manning, Raghavan, and Schütze (2007, chs. 13, 14, and 15): https://nlp.stanford.edu/IR-book/information-retrieval-book.html. Further reading: Denny and Spirling (2018); King, Lam, and Roberts (2017)","code":""},{"path":"week-10-validation.html","id":"week-10-validation","chapter":"16 Week 10: Validation","heading":"16 Week 10: Validation","text":"This week we will be thinking about how to validate the techniques we have used in the preceding weeks. Validation is a necessary and important part of any text analysis technique. Often we speak of validation in the context of machine labelling of large text data. But validation need not be restricted to automated classification tasks. The articles by Ying, Montgomery, and Stewart (2021) and Pedro L. Rodriguez, Spirling, and Stewart (2021) describe ways to approach validation in unsupervised contexts. Finally, the article by Peterson and Spirling (2018) shows how validation accuracy might also provide a measure of substantive significance. Required reading: Ying, Montgomery, and Stewart (2021); Pedro L. Rodriguez, Spirling, and Stewart (2021); Peterson and Spirling (2018); Manning, Raghavan, and Schütze (2007, ch. 2: https://nlp.stanford.edu/IR-book/information-retrieval-book.html). Further reading: K. Krippendorff (2004); Denny and Spirling (2018); Justin Grimmer and Stewart (2013b); Barberá et al. 
(2021)Schiller, Daxenberger, Gurevych (2021)Slides:Week 10 Slides","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercise-1-word-frequency-analysis","chapter":"17 Exercise 1: Word frequency analysis","heading":"17 Exercise 1: Word frequency analysis","text":"","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"introduction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.1 Introduction","text":"tutorial, learn summarise, aggregate, analyze text R:tokenize filter textHow clean preprocess textHow visualize results ggplotHow perform automated gender assignment name data (think possible biases methods may enclose)","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"setup-6","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.2 Setup","text":"practice skills, use dataset already collected Edinburgh Fringe Festival website.can try : obtain data, must first obtain API key. Instructions available Edinburgh Fringe API page:","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"load-data-and-packages","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.3 Load data and packages","text":"proceeding, ’ll load remaining packages need tutorial.tutorial, using data pre-cleaned provided .csv format. data come Edinburgh Book Festival API, provide data every event taken place Edinburgh Book Festival, runs every year month August, nine years: 2012-2020. many questions might ask data. tutorial, investigate contents event, speakers event, determine trends gender representation time.first task, , read data. can read_csv() function.read_csv() function takes .csv file loads working environment data frame object called “edbfdata.” can call object anything though. Try changing name object <- arrow. Note R allow names spaces , however. also good idea name object something beginning numbers, means call object within ` marks.’re working document computer (“locally”) can download Edinburgh Fringe data following way:","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(babynames) #for gender predictions\nedbfdata <- read_csv(\"data/wordfreq/edbookfestall.csv\")## New names:\n## Rows: 5938 Columns: 12\n## ── Column specification\n## ───────────────────────────────────────────────────────── Delimiter: \",\" chr\n## (8): festival_id, title, sub_title, artist, description, genre, age_categ... dbl\n## (4): ...1, year, latitude, longitude\n## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ\n## Specify the column types or set `show_col_types = FALSE` to quiet this message.\n## • `` -> `...1`\nedbfdata <- read_csv(\"https://raw.githubusercontent.com/cjbarrie/RDL-Ed/main/02-text-as-data/data/edbookfestall.csv\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"inspect-and-filter-data","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.4 Inspect and filter data","text":"next job cut dataset size, including columns need. first can inspect see existing column names , variable coded. can first call::can see description event included column named “description” year event “year.” now ’ll just keep two. 
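Before narrowing the data down, it can also be worth checking what values the other columns take. As a minimal optional sketch (assuming the edbfdata object loaded above; this step is not part of the original walkthrough), we could tabulate the genre column, which the exercises at the end of this chapter return to:

# optional: how many events fall into each genre?
edbfdata %>%
  count(genre, sort = TRUE)

count() here comes from dplyr, which is already loaded via the tidyverse.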
Remember: ’re interested tutorial firstly representation gender feminism forms cultural production given platform Edinburgh International Book Festival. Given , first foremost interested reported content artist’s event.use pipe %>% functions tidyverse package quickly efficiently select columns want edbfdata data.frame object. pass data new data.frame object, call “evdes.”let’s take quick look many events time festival. , first calculate number individual events (row observations) year (column variable).can plot using ggplot!Perhaps unsurprisingly, context pandemic, number recorded bookings 2020 Festival drastically reduced.","code":"\ncolnames(edbfdata)## [1] \"...1\" \"festival_id\" \"title\" \"sub_title\" \"artist\" \n## [6] \"year\" \"description\" \"genre\" \"latitude\" \"longitude\" \n## [11] \"age_category\" \"ID\"\nglimpse(edbfdata)## Rows: 5,938\n## Columns: 12\n## $ ...1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…\n## $ festival_id \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"b…\n## $ title \"Denise Mina\", \"Alex T Smith\", \"Challenging Expectations w…\n## $ sub_title \"HARD MEN AND CARDBOARD GANGSTERS\", NA, NA, \"WHAT CAUSED T…\n## $ artist \"Denise Mina\", \"Alex T Smith\", \"Peter Cocks\", \"Paul Mason\"…\n## $ year 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012…\n## $ description \"\\n\\tAs the grande dame of Scottish crime fiction, Deni…\n## $ genre \"Literature\", \"Children\", \"Children\", \"Literature\", \"Child…\n## $ latitude 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9…\n## $ longitude -3.206913, -3.206913, -3.206913, -3.206913, -3.206913, -3.…\n## $ age_category NA, \"AGE 4 - 7\", \"AGE 11 - 14\", NA, \"AGE 10 - 14\", \"AGE 6 …\n## $ ID \"Denise Mina2012\", \"Alex T Smith2012\", \"Peter Cocks2012\", …\n# get simplified dataset with only event contents and year\nevdes <- edbfdata %>%\n select(description, year)\n\nhead(evdes)## # A tibble: 6 × 2\n## description year\n## \n## 1 \"\\n\\tAs the grande dame of Scottish crime fiction, Denise Mina places… 2012\n## 2 \"
\\n\\tWhen Alex T Smith was a little boy he wanted to be a chef, a rab… 2012\n## 3 \"
\\n\\tPeter Cocks is known for his fantasy series Triskellion written … 2012\n## 4 \"
\\n\\tTwo books by influential journalists are among the first to look… 2012\n## 5 \"
\\n\\tChris d’Lacey tells you all about The Fire Ascending, the … 2012\n## 6 \"
\\n\\tIt’s time for the honourable, feisty and courageous young … 2012\nevtsperyr <- evdes %>%\n mutate(obs=1) %>%\n group_by(year) %>%\n summarise(sum_events = sum(obs))\nggplot(evtsperyr) +\n geom_line(aes(year, sum_events)) +\n theme_tufte(base_family = \"Helvetica\") + \n scale_y_continuous(expand = c(0, 0), limits = c(0, NA))"},{"path":"exercise-1-word-frequency-analysis.html","id":"tidy-the-text","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.5 Tidy the text","text":"Given data obtained API outputs data originally HTML format, text still contains HTML PHP encodings e.g. bold font paragraphs. ’ll need get rid , well punctuation analyzing data.set commands takes event descriptions, extracts individual words, counts number times appear years covered book festival data.","code":"\n#get year and word for every word and date pair in the dataset\ntidy_des <- evdes %>% \n mutate(desc = tolower(description)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))"},{"path":"exercise-1-word-frequency-analysis.html","id":"back-to-the-fringe","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.6 Back to the Fringe","text":"see resulting dataset large (~446k rows). commands first taken events text, “mutated” set lower case character string. “unnest_tokens” function taken individual string create new column called “word” contains individual word contained event description texts.terminology also appropriate . tidy text format, often refer data structures consisting “documents” “terms.” “tokenizing” text “unnest_tokens” functions generating dataset one term per row., “documents” collection descriptions events year Edinburgh Book Festival. way sort text “documents” depends choice individual researcher.Instead year, might wanted sort text “genre.” , two genres: “Literature” “Children.” done , two “documents,” contained words included event descriptions genre.Alternatively, might interested contributions individual authors time. case, sorted text documents author. case, “document” represent words included event descriptions events given author (many multiple appearances time festival given year).can yet tidy , though. First ’ll remove stop words ’ll remove apostrophes:see number rows dataset reduces half ~223k rows. natural since large proportion string contain many -called “stop words”. can see stop words typing:lexicon (list words) included tidytext package produced Julia Silge David Robinson (see ). see contains 1000 words. remove informative interested substantive content text (rather , say, grammatical content).Now let’s look common words data:can see one common words “rsquo,” HTML encoding apostrophe. Clearly need clean data bit . common issue large-n text analysis key step want conduct reliably robust forms text analysis. ’ll another go using filter command, specifying keep words included string words rsquo, em, ndash, nbsp, lsquo.’s like ! words feature seem make sense now (actual words rather random HTML UTF-8 encodings).Let’s now collect words data.frame object, ’ll call edbf_term_counts:year, see “book” common word… perhaps surprises . evidence ’re properly pre-processing cleaning data. Cleaning text data important element preparing text analysis. often process trial error text data looks alike, may come e.g. webpages HTML encoding, unrecognized fonts unicode, potential cause issues! 
finding errors also chance get know data…","code":"\ntidy_des <- tidy_des %>%\n filter(!word %in% stop_words$word)\nstop_words## # A tibble: 1,149 × 2\n## word lexicon\n## \n## 1 a SMART \n## 2 a's SMART \n## 3 able SMART \n## 4 about SMART \n## 5 above SMART \n## 6 according SMART \n## 7 accordingly SMART \n## 8 across SMART \n## 9 actually SMART \n## 10 after SMART \n## # ℹ 1,139 more rows\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,995 × 2\n## word n\n## \n## 1 rsquo 5638\n## 2 book 2088\n## 3 event 1356\n## 4 author 1332\n## 5 world 1240\n## 6 story 1159\n## 7 join 1095\n## 8 em 1064\n## 9 life 879\n## 10 strong 864\n## # ℹ 24,985 more rows\nremove_reg <- c(\"&\",\"<\",\">\",\"\", \"<\/p>\",\"&rsquo\", \"‘\", \"'\", \"\", \"<\/strong>\", \"rsquo\", \"em\", \"ndash\", \"nbsp\", \"lsquo\", \"strong\")\n \ntidy_des <- tidy_des %>%\n filter(!word %in% remove_reg)\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,989 × 2\n## word n\n## \n## 1 book 2088\n## 2 event 1356\n## 3 author 1332\n## 4 world 1240\n## 5 story 1159\n## 6 join 1095\n## 7 life 879\n## 8 stories 860\n## 9 chaired 815\n## 10 books 767\n## # ℹ 24,979 more rows\nedbf_term_counts <- tidy_des %>% \n group_by(year) %>%\n count(word, sort = TRUE)\nhead(edbf_term_counts)## # A tibble: 6 × 3\n## # Groups: year [6]\n## year word n\n## \n## 1 2016 book 295\n## 2 2018 book 283\n## 3 2019 book 265\n## 4 2012 book 254\n## 5 2013 book 241\n## 6 2015 book 239"},{"path":"exercise-1-word-frequency-analysis.html","id":"analyze-keywords","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.7 Analyze keywords","text":"Okay, now list words, number times appear, can tag words think might related issues gender inequality sexism. may decide list imprecise inexhaustive. , feel free change terms including grepl() function.","code":"\nedbf_term_counts$womword <- as.integer(grepl(\"women|feminist|feminism|gender|harassment|sexism|sexist\", \n x = edbf_term_counts$word))\nhead(edbf_term_counts)## # A tibble: 6 × 4\n## # Groups: year [6]\n## year word n womword\n## \n## 1 2016 book 295 0\n## 2 2018 book 283 0\n## 3 2019 book 265 0\n## 4 2012 book 254 0\n## 5 2013 book 241 0\n## 6 2015 book 239 0"},{"path":"exercise-1-word-frequency-analysis.html","id":"compute-aggregate-statistics","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.8 Compute aggregate statistics","text":"Now tagged individual words relating gender inequality feminism, can sum number times words appear year denominate total number words event descriptions.intuition increase decrease percentage words relating issues capturing substantive change representation issues related sex gender.think measure? adequate measure representation issues cultural sphere?keywords used precise enough? , change?","code":"\n#get counts by year and word\nedbf_counts <- edbf_term_counts %>%\n group_by(year) %>%\n mutate(year_total = sum(n)) %>%\n filter(womword==1) %>%\n summarise(sum_wom = sum(n),\n year_total= min(year_total))\nhead(edbf_counts)## # A tibble: 6 × 3\n## year sum_wom year_total\n## \n## 1 2012 22 23146\n## 2 2013 40 23277\n## 3 2014 30 25366\n## 4 2015 24 22158\n## 5 2016 34 24356\n## 6 2017 55 27602"},{"path":"exercise-1-word-frequency-analysis.html","id":"plot-time-trends","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.9 Plot time trends","text":"see? Let’s take count words relating gender dataset, denominate total number words data per year.can add visual guides draw attention apparent changes data. 
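As a quick worked check of the denominator logic described above (using the 2012 row of edbf_counts shown earlier; an illustration added here, not part of the original walkthrough):

# proportion of gender-related words in 2012
22 / 23146   # ~0.00095, i.e. roughly 0.1% of all words that year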
, might wish signal year #MeToo movement 2017.label highlighting year 2017 including text label along vertical line.","code":"\nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\") \nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n geom_vline(xintercept = 2017, col=\"red\") +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\")\nggplot(edbf_counts, aes(year, sum_wom / year_total, group=1)) +\n geom_line() +\n geom_vline(xintercept = 2017, col=\"red\") +\n geom_text(aes(x=2017.1, label=\"#metoo year\", y=.0015), \n colour=\"black\", angle=90, text=element_text(size=8)) +\n xlab(\"Year\") +\n ylab(\"% gender-related words\") +\n scale_y_continuous(labels = scales::percent_format(),\n expand = c(0, 0), limits = c(0, NA)) +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"bonus-gender-prediction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.10 Bonus: gender prediction","text":"might decide measure inadequate expansive answer question hand. Another way measuring representation cultural production measure gender authors spoke events.course, take quite time individually code approximately 6000 events included dataset.exist alternative techniques imputing gender based name individual.first create new data.frame object, selecting just columns artist name year. generate new column containing just artist’s (author’s) first name:set packages called gender genderdata used make process predicting gender based given individual’s name pretty straightforward. technique worked reference U.S. Social Security Administration baby name data.Given common gender associated given name changes time, function also allows us specify range years cohort question whose gender inferring. Given don’t know wide cohort artists , specify broad range 1920-2000.Unfortunately, package longer works newer versions R; fortunately, recreated using original “babynames” data, comes bundled babynames package.don’t necessarily follow step done —include information sake completeness.babynames package. contains, year, number children born given name, well sex. information, can calculate total number individuals given name born sex given year.Given also total number babies born total cross records, can denominate (divide) sums name total number births sex year. can take proportion representing probability given individual Edinburgh Fringe dataset male female.information babynames package can found .first load babynames package R environment data.frame object. data.frame “babynames” contained babynames package can just call object store .dataset contains names years period 1800–2019. variable “n” represents number babies born given name sex year, “prop” represents, according package materials accessible , “n divided total number applicants year, means proportions people gender name born year.”calculate total number babies female male sex born year. merge get combined dataset male female baby names year. merge information back original babynames data.frame object.can calculate, babies born 1920, number babies born name sex. 
information, can get proportion babies given name particular sex. example, 92% babies born name “Mary” female, give us .92 probability individual name “Mary” female.every name dataset, excluding names proportion equal .5; .e., names adjudicate whether less likely male female.proportions names, can merge back names artists Edinburgh Fringe Book Festival. can easily plot proportion artists Festival male versus female year Festival.can conclude form graph?Note merged proportions th “babynames” data Edinburgh Fringe data lost observations. names Edinburgh Fringe data match “babynames” data. Let’s look names match:notice anything names? tell us potential biases using sources US baby names data foundation gender prediction? alternative ways might go task?","code":"\n# get columns for artist name and year, omitting NAs\ngendes <- edbfdata %>%\n select(artist, year) %>%\n na.omit()\n\n# generate new column with just the artist's (author's) first name\ngendes$name <- sub(\" .*\", \"\", gendes$artist)\ngenpred <- gender(gendes$name,\n years = c(1920, 2000))\nbabynames <- babynames\nhead(babynames)## # A tibble: 6 × 5\n## year sex name n prop\n## \n## 1 1880 F Mary 7065 0.0724\n## 2 1880 F Anna 2604 0.0267\n## 3 1880 F Emma 2003 0.0205\n## 4 1880 F Elizabeth 1939 0.0199\n## 5 1880 F Minnie 1746 0.0179\n## 6 1880 F Margaret 1578 0.0162\ntotals_female <- babynames %>%\n filter(sex==\"F\") %>%\n group_by(year) %>%\n summarise(total_female = sum(n))\n\ntotals_male <- babynames %>%\n filter(sex==\"M\") %>%\n group_by(year) %>%\n summarise(total_male = sum(n))\n\ntotals <- merge(totals_female, totals_male)\n\ntotsm <- merge(babynames, totals, by = \"year\")\nhead(totsm)## year sex name n prop total_female total_male\n## 1 1880 F Mary 7065 0.07238359 90993 110491\n## 2 1880 F Anna 2604 0.02667896 90993 110491\n## 3 1880 F Emma 2003 0.02052149 90993 110491\n## 4 1880 F Elizabeth 1939 0.01986579 90993 110491\n## 5 1880 F Minnie 1746 0.01788843 90993 110491\n## 6 1880 F Margaret 1578 0.01616720 90993 110491\ntotprops <- totsm %>%\n filter(year >= 1920) %>%\n group_by(name, year) %>%\n mutate(sumname = sum(n),\n prop = ifelse(sumname==n, 1,\n n/sumname)) %>%\n filter(prop!=.5) %>%\n group_by(name) %>%\n slice(which.max(prop)) %>%\n summarise(prop = max(prop),\n totaln = sum(n),\n name = max(name),\n sex = unique(sex))\n\nhead(totprops)## # A tibble: 6 × 4\n## name prop totaln sex \n## \n## 1 Aaban 1 5 M \n## 2 Aabha 1 7 F \n## 3 Aabid 1 5 M \n## 4 Aabir 1 5 M \n## 5 Aabriella 1 5 F \n## 6 Aada 1 5 F\nednameprops <- merge(totprops, gendes, by = \"name\")\n\nggplot(ednameprops, aes(x=year, fill = factor(sex))) +\n geom_bar(position = \"fill\") +\n xlab(\"Year\") +\n ylab(\"% women authors\") +\n labs(fill=\"\") +\n scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +\n theme_tufte(base_family = \"Helvetica\") +\n geom_abline(slope=0, intercept=0.5, col = \"black\",lty=2)\nnames1 <- ednameprops$name\nnames2 <- gendes$name\ndiffs <- setdiff(names2, names1)\ndiffs## [1] \"L\" \"Kapka\" \"Menzies\" \"Ros\" \n## [5] \"G\" \"Pankaj\" \"Uzodinma\" \"Rodge\" \n## [9] \"A\" \"Zoë\" \"László\" \"Sadakat\" \n## [13] \"Michèle\" \"Maajid\" \"Yrsa\" \"Ahdaf\" \n## [17] \"Noo\" \"Dilip\" \"Sjón\" \"François\" \n## [21] \"J\" \"K\" \"Aonghas\" \"S\" \n## [25] \"Bashabi\" \"Kjartan\" \"Romesh\" \"T\" \n## [29] \"Chibundu\" \"Yiyun\" \"Fiammetta\" \"W\" \n## [33] \"Sindiwe\" \"Cat\" \"Jez\" \"Fi\" \n## [37] \"Sunder\" \"Saci\" \"C.J\" \"Halik\" \n## [41] \"Niccolò\" \"Sifiso\" \"C.S.\" \"DBC\" \n## [45] 
\"Phyllida\" \"R\" \"Struan\" \"C.J.\" \n## [49] \"SF\" \"Nadifa\" \"Jérome\" \"D\" \n## [53] \"Xiaolu\" \"Ramita\" \"John-Paul\" \"Ha-Joon\" \n## [57] \"Niq\" \"Andrés\" \"Sasenarine\" \"Frane\" \n## [61] \"Alev\" \"Gruff\" \"Line\" \"Zakes\" \n## [65] \"Pip\" \"Witi\" \"Halsted\" \"Ziauddin\" \n## [69] \"J.\" \"Åsne\" \"Alecos\" \".\" \n## [73] \"Julián\" \"Sunjeev\" \"A.C.S\" \"Etgar\" \n## [77] \"Hyeonseo\" \"Jaume\" \"A.\" \"Jesús\" \n## [81] \"Jón\" \"Helle\" \"M\" \"Jussi\" \n## [85] \"Aarathi\" \"Shappi\" \"Macastory\" \"Odafe\" \n## [89] \"Chimwemwe\" \"Hrefna\" \"Bidisha\" \"Packie\" \n## [93] \"Tahmima\" \"Sara-Jane\" \"Tahar\" \"Lemn\" \n## [97] \"Neu!\" \"Jürgen\" \"Barroux\" \"Jan-Philipp\" \n## [101] \"Non\" \"Metaphrog\" \"Wilko\" \"Álvaro\" \n## [105] \"Stef\" \"Erlend\" \"Grinagog\" \"Norma-Ann\" \n## [109] \"Fuchsia\" \"Giddy\" \"Joudie\" \"Sav\" \n## [113] \"Liu\" \"Jayne-Anne\" \"Wioletta\" \"Sinéad\" \n## [117] \"Katherena\" \"Siân\" \"Dervla\" \"Teju\" \n## [121] \"Iosi\" \"Daša\" \"Cosey\" \"Bettany\" \n## [125] \"Thordis\" \"Uršuľa\" \"Limmy\" \"Meik\" \n## [129] \"Zindzi\" \"Dougie\" \"Ngugi\" \"Inua\" \n## [133] \"Ottessa\" \"Bjørn\" \"Novuyo\" \"Rhidian\" \n## [137] \"Sibéal\" \"Hsiao-Hung\" \"Audur\" \"Sadek\" \n## [141] \"Özlem\" \"Zaffar\" \"Jean-Pierre\" \"Lalage\" \n## [145] \"Yaba\" \"H\" \"DJ\" \"Sigitas\" \n## [149] \"Clémentine\" \"Celeste-Marie\" \"Marawa\" \"Ghillie\" \n## [153] \"Ahdam\" \"Suketu\" \"Goenawan\" \"Niviaq\" \n## [157] \"Steinunn\" \"Shoo\" \"Ibram\" \"Venki\" \n## [161] \"DeRay\" \"Diarmaid\" \"Serhii\" \"Harkaitz\" \n## [165] \"Adélaïde\" \"Agustín\" \"Jérôme\" \"Siobhán\" \n## [169] \"Nesrine\" \"Jokha\" \"Gulnar\" \"Uxue\" \n## [173] \"Taqralik\" \"Tayi\" \"E\" \"Dapo\" \n## [177] \"Dunja\" \"Maaza\" \"Wayétu\" \"Shokoofeh\""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercises","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.11 Exercises","text":"Filter books genre (selecting e.g., “Literature” “Children”) plot frequency women-related words time.Choose another set terms filter (e.g., race-related words) plot frequency time.","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"references","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.12 References","text":"","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"exercise-2-dictionary-based-methods","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18 Exercise 2: Dictionary-based methods","text":"","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"introduction-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.1 Introduction","text":"tutorial, learn :Use dictionary-based techniques analyze textUse common sentiment dictionariesCreate “dictionary”Use Lexicoder sentiment dictionary Young Soroka (2012)","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"setup-7","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.2 Setup","text":"hands-exercise week uses dictionary-based methods filtering scoring words. Dictionary-based methods use pre-generated lexicons, list words associated scores variables measuring valence particular word. sense, exercise unlike analysis Edinburgh Book Festival event descriptions. , filtering descriptions based presence absence word related women gender. can understand approach particularly simple type “dictionary-based” method. 
There, the “dictionary” or “lexicon” contained just words related to gender.","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"load-data-and-packages-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.3 Load data and packages","text":"Before proceeding, we’ll load the remaining packages we will need for this tutorial. In this exercise we’ll be using another new dataset. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation. You can see the names of the newspapers in the code below: For more details on how to access Twitter data with academictwitteR, check out the details of the package. You can download the final dataset here: If you’re working through this document on your own computer (“locally”) you can download the tweets data in the following way:","code":"\nlibrary(academictwitteR) # for fetching Twitter data\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(readr) # more informative and easy way to import data\nlibrary(stringr) # to handle text elements\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(quanteda) # includes functions to implement Lexicoder\nlibrary(textdata)\nnewspapers = c(\"TheSun\", \"DailyMailUK\", \"MetroUK\", \"DailyMirror\", \n \"EveningStandard\", \"thetimes\", \"Telegraph\", \"guardian\")\n\ntweets <-\n get_all_tweets(\n users = newspapers,\n start_tweets = \"2020-01-01T00:00:00Z\",\n end_tweets = \"2020-05-01T00:00:00Z\",\n data_path = \"data/sentanalysis/\",\n n = Inf,\n )\n\ntweets <- \n bind_tweets(data_path = \"data/sentanalysis/\", output_format = \"tidy\")\n\nsaveRDS(tweets, \"data/sentanalysis/newstweets.rds\")\ntweets <- readRDS(\"data/sentanalysis/newstweets.rds\")\ntweets <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true\")))"},{"path":"exercise-2-dictionary-based-methods.html","id":"inspect-and-filter-data-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.4 Inspect and filter data","text":"Let’s have a look at the data: Each row is a tweet produced by one of the news outlets detailed above over a five-month period, January–May 2020. Note also that each tweet has a particular date. 
can therefore use look time changes.won’t need variables let’s just keep interest us:manipulate data tidy format , unnesting token (: words) tweet text.’ll tidy , previous example, removing stop words:","code":"\nhead(tweets)## # A tibble: 6 × 31\n## tweet_id user_username text lang author_id source possibly_sensitive\n## \n## 1 1212334402266521… DailyMirror \"Sec… en 16887175 Tweet… FALSE \n## 2 1212334169457676… DailyMirror \"RT … en 16887175 Tweet… FALSE \n## 3 1212333195879993… thetimes \"A c… en 6107422 Echob… FALSE \n## 4 1212333194864988… TheSun \"Way… en 34655603 Echob… FALSE \n## 5 1212332920507191… DailyMailUK \"Stu… en 111556423 Socia… FALSE \n## 6 1212332640570875… TheSun \"Dad… en 34655603 Twitt… FALSE \n## # ℹ 24 more variables: conversation_id , created_at , user_url ,\n## # user_location , user_protected , user_verified ,\n## # user_name , user_profile_image_url , user_description ,\n## # user_created_at , user_pinned_tweet_id , retweet_count ,\n## # like_count , quote_count , user_tweet_count ,\n## # user_list_count , user_followers_count ,\n## # user_following_count , sourcetweet_type , sourcetweet_id , …\ncolnames(tweets)## [1] \"tweet_id\" \"user_username\" \"text\" \n## [4] \"lang\" \"author_id\" \"source\" \n## [7] \"possibly_sensitive\" \"conversation_id\" \"created_at\" \n## [10] \"user_url\" \"user_location\" \"user_protected\" \n## [13] \"user_verified\" \"user_name\" \"user_profile_image_url\"\n## [16] \"user_description\" \"user_created_at\" \"user_pinned_tweet_id\" \n## [19] \"retweet_count\" \"like_count\" \"quote_count\" \n## [22] \"user_tweet_count\" \"user_list_count\" \"user_followers_count\" \n## [25] \"user_following_count\" \"sourcetweet_type\" \"sourcetweet_id\" \n## [28] \"sourcetweet_text\" \"sourcetweet_lang\" \"sourcetweet_author_id\" \n## [31] \"in_reply_to_user_id\"\ntweets <- tweets %>%\n select(user_username, text, created_at, user_name,\n retweet_count, like_count, quote_count) %>%\n rename(username = user_username,\n newspaper = user_name,\n tweet = text)\ntidy_tweets <- tweets %>% \n mutate(desc = tolower(tweet)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))\ntidy_tweets <- tidy_tweets %>%\n filter(!word %in% stop_words$word)"},{"path":"exercise-2-dictionary-based-methods.html","id":"get-sentiment-dictionaries","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.5 Get sentiment dictionaries","text":"Several sentiment dictionaries come bundled tidytext package. :AFINN Finn Årup Nielsen,bing Bing Liu collaborators, andnrc Saif Mohammad Peter TurneyWe can look see relevant dictionaries stored.see . First, AFINN lexicon gives words score -5 +5, negative scores indicate negative sentiment positive scores indicate positive sentiment. nrc lexicon opts binary classification: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust, word given score 1/0 sentiments. words, nrc lexicon, words appear multiple times enclose one emotion (see, e.g., “abandon” ). bing lexicon minimal, classifying words simply binary “positive” “negative” categories.Let’s see might filter texts selecting dictionary, subset dictionary, using inner_join() filter tweet data. might, example, interested fear words. Maybe, might hypothesize, uptick fear toward beginning coronavirus outbreak. First, let’s look words tweet data nrc lexicon codes fear-related words.total 1,174 words fear valence tweet data according nrc classification. 
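As an optional aside (a sketch added here, not part of the original walkthrough), it can be instructive to check how a single word such as “death” is treated across the bundled lexicons; if a lexicon does not contain the word, the filter simply returns zero rows.

# how is one illustrative word coded in each lexicon?
get_sentiments("afinn") %>% filter(word == "death")
get_sentiments("bing") %>% filter(word == "death")
get_sentiments("nrc") %>% filter(word == "death")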
Several seem reasonable (e.g., “death,” “pandemic”); others seems less (e.g., “mum,” “fight”).","code":"\nget_sentiments(\"afinn\")## # A tibble: 2,477 × 2\n## word value\n## \n## 1 abandon -2\n## 2 abandoned -2\n## 3 abandons -2\n## 4 abducted -2\n## 5 abduction -2\n## 6 abductions -2\n## 7 abhor -3\n## 8 abhorred -3\n## 9 abhorrent -3\n## 10 abhors -3\n## # ℹ 2,467 more rows\nget_sentiments(\"bing\")## # A tibble: 6,786 × 2\n## word sentiment\n## \n## 1 2-faces negative \n## 2 abnormal negative \n## 3 abolish negative \n## 4 abominable negative \n## 5 abominably negative \n## 6 abominate negative \n## 7 abomination negative \n## 8 abort negative \n## 9 aborted negative \n## 10 aborts negative \n## # ℹ 6,776 more rows\nget_sentiments(\"nrc\")## # A tibble: 13,875 × 2\n## word sentiment\n## \n## 1 abacus trust \n## 2 abandon fear \n## 3 abandon negative \n## 4 abandon sadness \n## 5 abandoned anger \n## 6 abandoned fear \n## 7 abandoned negative \n## 8 abandoned sadness \n## 9 abandonment anger \n## 10 abandonment fear \n## # ℹ 13,865 more rows\nnrc_fear <- get_sentiments(\"nrc\") %>% \n filter(sentiment == \"fear\")\n\ntidy_tweets %>%\n inner_join(nrc_fear) %>%\n count(word, sort = TRUE)## Joining with `by = join_by(word)`## # A tibble: 1,173 × 2\n## word n\n## \n## 1 mum 4509\n## 2 death 4073\n## 3 police 3275\n## 4 hospital 2240\n## 5 government 2179\n## 6 pandemic 1877\n## 7 fight 1309\n## 8 die 1199\n## 9 attack 1099\n## 10 murder 1064\n## # ℹ 1,163 more rows"},{"path":"exercise-2-dictionary-based-methods.html","id":"sentiment-trends-over-time","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.6 Sentiment trends over time","text":"see time trends? First let’s make sure data properly arranged ascending order date. ’ll add column, ’ll call “order,” use become clear sentiment analysis.Remember structure tweet data one token (word) per document (tweet) format. order look sentiment trends time, ’ll need decide many words estimate sentiment., first add sentiment dictionary inner_join(). use count() function, specifying want count dates, words indexed order (.e., row number) every 1000 rows (.e., every 1000 words).means one date many tweets totalling >1000 words, multiple observations given date; one two tweets might just one row associated sentiment score date.calculate sentiment scores sentiment types (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust) use spread() function convert separate columns (rather rows). Finally calculate net sentiment score subtracting score negative sentiment positive sentiment.different sentiment dictionaries look compared ? 
can plot sentiment scores time sentiment dictionaries like :see look pretty similar… interestingly seems overall sentiment positivity increases pandemic breaks.","code":"\n#gen data variable, order and format date\ntidy_tweets$date <- as.Date(tidy_tweets$created_at)\n\ntidy_tweets <- tidy_tweets %>%\n arrange(date)\n\ntidy_tweets$order <- 1:nrow(tidy_tweets)\n#get tweet sentiment by date\ntweets_nrc_sentiment <- tidy_tweets %>%\n inner_join(get_sentiments(\"nrc\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative)## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"nrc\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 2 of `x` matches multiple rows in `y`.\n## ℹ Row 7712 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.\ntweets_nrc_sentiment %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25)## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"bing\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative) %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"bing sentiment\")## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"bing\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 54114 of `x` matches multiple rows in `y`.\n## ℹ Row 3848 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"nrc\")) %>%\n count(date, index = order %/% 1000, sentiment) %>%\n spread(sentiment, n, fill = 0) %>%\n mutate(sentiment = positive - negative) %>%\n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"nrc sentiment\")## Joining with `by = join_by(word)`## Warning in inner_join(., get_sentiments(\"nrc\")): Detected an unexpected many-to-many relationship between `x` and `y`.\n## ℹ Row 2 of `x` matches multiple rows in `y`.\n## ℹ Row 7712 of `y` matches multiple rows in `x`.\n## ℹ If a many-to-many relationship is expected, set `relationship =\n## \"many-to-many\"` to silence this warning.## `geom_smooth()` using formula = 'y ~ x'\ntidy_tweets %>%\n inner_join(get_sentiments(\"afinn\")) %>%\n group_by(date, index = order %/% 1000) %>% \n summarise(sentiment = sum(value)) %>% \n ggplot(aes(date, sentiment)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n ylab(\"afinn sentiment\")## Joining with `by = join_by(word)`\n## `summarise()` has grouped output by 'date'. You can override using the `.groups`\n## argument.\n## `geom_smooth()` using formula = 'y ~ x'"},{"path":"exercise-2-dictionary-based-methods.html","id":"domain-specific-lexicons","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.7 Domain-specific lexicons","text":"course, list- dictionary-based methods need focus sentiment, even one common uses. essence, ’ll seen sentiment analysis techniques rely given lexicon score words appropriately. nothing stopping us making dictionaries, whether measure sentiment . 
For the data we have, we might be interested, for example, in the prevalence of mortality-related words in the news. As such, we might choose to make our own dictionary of terms. What would this look like? A minimal example would choose, for example, words like “death” and its synonyms and score them all as 1. We then combine these into a dictionary, which we’ve called “mordict” here. We can then use the same technique as above to bind these scores with our data and look at the incidence of these words over time. Combining the above into a sequence of scripts gives the following: This simply counts the number of mortality words over time. This might be misleading if, for example, there are more or longer tweets at certain points in time; i.e., if the length and quantity of text is not time-constant. Why does this matter? Well, there might be more mortality words later on simply because there are more tweets later on. By just counting words, we are not taking account of the denominator. An alternative, and preferable, method is to simply take the character string of the relevant words, and to sum the total number of words across all tweets over time. We then filter our tweet words by whether or not they are a mortality word, according to the dictionary of words we have constructed; in other words, we sum the number of times these words appear for each date. After this, we join with our data frame of total words for each date. Note that we are using full_join() because we want to include dates that appear in the “totals” data frame but do not appear when we filter for mortality words; i.e., days on which the number of mortality words is equal to 0. We then go about plotting as before.","code":"\nword <- c('death', 'illness', 'hospital', 'life', 'health',\n 'fatality', 'morbidity', 'deadly', 'dead', 'victim')\nvalue <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)\nmordict <- data.frame(word, value)\nmordict## word value\n## 1 death 1\n## 2 illness 1\n## 3 hospital 1\n## 4 life 1\n## 5 health 1\n## 6 fatality 1\n## 7 morbidity 1\n## 8 deadly 1\n## 9 dead 1\n## 10 victim 1\ntidy_tweets %>%\n inner_join(mordict) %>%\n group_by(date, index = order %/% 1000) %>% \n summarise(morwords = sum(value)) %>% \n ggplot(aes(date, morwords)) +\n geom_bar(stat= \"identity\") +\n ylab(\"mortality words\")## Joining with `by = join_by(word)`\n## `summarise()` has grouped output by 'date'. You can override using the `.groups`\n## argument.\nmordict <- c('death', 'illness', 'hospital', 'life', 'health',\n 'fatality', 'morbidity', 'deadly', 'dead', 'victim')\n\n#get total tweets per day (no missing dates so no date completion required)\ntotals <- tidy_tweets %>%\n mutate(obs=1) %>%\n group_by(date) %>%\n summarise(sum_words = sum(obs))\n\n#plot\ntidy_tweets %>%\n mutate(obs=1) %>%\n filter(grepl(paste0(mordict, collapse = \"|\"),word, ignore.case = T)) %>%\n group_by(date) %>%\n summarise(sum_mwords = sum(obs)) %>%\n full_join(totals, word, by=\"date\") %>%\n mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),\n pctmwords = sum_mwords/sum_words) %>%\n ggplot(aes(date, pctmwords)) +\n geom_point(alpha=0.5) +\n geom_smooth(method= loess, alpha=0.25) +\n xlab(\"Date\") + ylab(\"% mortality words\")## `geom_smooth()` using formula = 'y ~ x'"},{"path":"exercise-2-dictionary-based-methods.html","id":"using-lexicoder","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.8 Using Lexicoder","text":"The above approaches use general dictionary-based techniques that were not designed for domain-specific text such as news text. The Lexicoder Sentiment Dictionary, by Young and Soroka (2012), was designed specifically for examining the affective content of news text. In what follows, we will see how to implement an analysis using this dictionary. We conduct the analysis with the quanteda package, and you will see that we can tokenize the text in a similar way using functions included in that package. With quanteda we first need to create a “corpus” object, by declaring our tweets to be a corpus. Here, we make sure our date column is correctly stored and then create the corpus object with the corpus() function. Note that we are specifying the text_field as “tweet,” as this is where the text data of interest are, and we are including information on the date each tweet was published. This information is specified with the docvars argument. You’ll see then that the corpus consists of the text and so-called “docvars,” which are just the variables (columns) from the original dataset. 
, included date column.tokenize text using tokens() function quanteda, removing punctuation along way:take data_dictionary_LSD2015 comes bundled quanteda select positive negative categories, excluding words deemed “neutral.” , ready “look ” dictionary tokens corpus scored tokens_lookup() function.creates long list texts (tweets) annotated series ‘positive’ ‘negative’ annotations depending valence words text. creators quanteda recommend generate document feature matric . Grouping date, get dfm object, quite convoluted list object can plot using base graphics functions plotting matrices.Alternatively, can recreate tidy format follows:plot accordingly:","code":"\ntweets$date <- as.Date(tweets$created_at)\n\ntweet_corpus <- corpus(tweets, text_field = \"tweet\", docvars = \"date\")## Warning: docvars argument is not used.\ntoks_news <- tokens(tweet_corpus, remove_punct = TRUE)\n# select only the \"negative\" and \"positive\" categories\ndata_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]\n\ntoks_news_lsd <- tokens_lookup(toks_news, dictionary = data_dictionary_LSD2015_pos_neg)\n# create a document document-feature matrix and group it by date\ndfmat_news_lsd <- dfm(toks_news_lsd) %>% \n dfm_group(groups = date)\n\n# plot positive and negative valence over time\nmatplot(dfmat_news_lsd$date, dfmat_news_lsd, type = \"l\", lty = 1, col = 1:2,\n ylab = \"Frequency\", xlab = \"\")\ngrid()\nlegend(\"topleft\", col = 1:2, legend = colnames(dfmat_news_lsd), lty = 1, bg = \"white\")\n# plot overall sentiment (positive - negative) over time\n\nplot(dfmat_news_lsd$date, dfmat_news_lsd[,\"positive\"] - dfmat_news_lsd[,\"negative\"], \n type = \"l\", ylab = \"Sentiment\", xlab = \"\")\ngrid()\nabline(h = 0, lty = 2)\nnegative <- dfmat_news_lsd@x[1:121]\npositive <- dfmat_news_lsd@x[122:242]\ndate <- dfmat_news_lsd@Dimnames$docs\n\n\ntidy_sent <- as.data.frame(cbind(negative, positive, date))\n\ntidy_sent$negative <- as.numeric(tidy_sent$negative)\ntidy_sent$positive <- as.numeric(tidy_sent$positive)\ntidy_sent$sentiment <- tidy_sent$positive - tidy_sent$negative\ntidy_sent$date <- as.Date(tidy_sent$date)\ntidy_sent %>%\n ggplot() +\n geom_line(aes(date, sentiment))"},{"path":"exercise-2-dictionary-based-methods.html","id":"exercises-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.9 Exercises","text":"Take subset tweets data “user_name” names describe name newspaper source Twitter account. see different sentiment dynamics look different newspaper sources?Build (minimal) dictionary-based filter technique plot resultApply Lexicoder Sentiment Dictionary news tweets, break analysis newspaper","code":""},{"path":"exercise-2-dictionary-based-methods.html","id":"references-1","chapter":"18 Exercise 2: Dictionary-based methods","heading":"18.10 References","text":"","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"exercise-3-comparison-and-complexity","chapter":"19 Exercise 3: Comparison and complexity","heading":"19 Exercise 3: Comparison and complexity","text":"","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"introduction-2","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.1 Introduction","text":"hands-exercise week focuses : 1) comparing texts; 2) measuring document-level characteristics text—, complexity.tutorial, learn :Compare texts using character-based measures similarity distanceCompare texts using term-based measures similarity distanceCalculate complexity textsReplicate analyses Schoonvelde et al. 
(2019)","code":""},{"path":"exercise-3-comparison-and-complexity.html","id":"setup-8","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.2 Setup","text":"proceeding, ’ll load remaining packages need tutorial.example ’ll using data 2017-2018 Theresa May Cabinet UK. data tweets members cabinet.can load data follows.’re working document computer (“locally”) can download tweets data following way:see data contain three variables: “username,” username MP question; “tweet,” text given tweet, “date” days yyyy-mm-dd format.24 MPs whose tweets ’re examining.","code":"\nlibrary(readr) # more informative and easy way to import data\nlibrary(quanteda) # includes functions to implement Lexicoder\nlibrary(quanteda.textstats) # for estimating similarity and complexity measures\nlibrary(stringdist) # for basic character-based distance measures\nlibrary(dplyr) #for wrangling data\nlibrary(tibble) #for wrangling data\nlibrary(ggplot2) #for visualization\ntweets <- readRDS(\"data/comparison-complexity/cabinet_tweets.rds\")\ntweets <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/cabinet_tweets.rds?raw=true\")))\nhead(tweets)## # A tibble: 6 × 3\n## username tweet date \n## \n## 1 aluncairns \"A good luck message to Chris Coleman’s squad @FAWales a… 2017-10-09\n## 2 aluncairns \".@AlunCairns “The close relationship between industry a… 2017-10-09\n## 3 aluncairns \"@BarclaysCorp & @SPTS_Tech \\\"voice of Welsh Manufac… 2017-10-09\n## 4 aluncairns \"Today we announced plans to ban the sale of ivory in th… 2017-10-06\n## 5 aluncairns \"Unbeaten Wales overcome Georgia to boost their @FIFAWor… 2017-10-06\n## 6 aluncairns \".@GutoAberconwy marks 25 years of engine production @to… 2017-10-06\nunique(tweets$username)## [1] \"aluncairns\" \"amberrudduk\" \"andrealeadsom\" \"borisjohnson\" \n## [5] \"brandonlewis\" \"damiangreen\" \"damianhinds\" \"daviddavismp\" \n## [9] \"davidgauke\" \"davidmundelldct\" \"dlidington\" \"gavinwilliamson\"\n## [13] \"gregclarkmp\" \"jbrokenshire\" \"jeremy_hunt\" \"juliansmithuk\" \n## [17] \"justinegreening\" \"liamfox\" \"michaelgove\" \"pennymordaunt\" \n## [21] \"philiphammonduk\" \"sajidjavid\" \"theresa_may\" \"trussliz\"\nlength(unique(tweets$username))## [1] 24"},{"path":"exercise-3-comparison-and-complexity.html","id":"generate-document-feature-matrix","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.3 Generate document feature matrix","text":"order use quanteda package accompanying quanteda.textstats package, need reformat data quanteda “corpus” object. just need specify text ’re interested well associated document-level variables ’re interested.can follows.now ready reformat data document feature matrix.Note need tokenized corpus object first. can wrapping tokens function inside dfm() function .object? Well documents tweets. matrix sparse (.e., mostly zeroes) matrix 1s 0s whether given word appears document (tweet) question.vertical elements (columns) vector made words used tweets combined. 
, helps imagine every tweet positioned side side understand ’s going .","code":"\n#make corpus object, specifying tweet as text field\ntweets_corpus <- corpus(tweets, text_field = \"tweet\")\n\n#add in username document-level information\ndocvars(tweets_corpus, \"username\") <- tweets$username\n\ntweets_corpus## Corpus consisting of 10,321 documents and 2 docvars.\n## text1 :\n## \"A good luck message to Chris Coleman’s squad @FAWales ahead ...\"\n## \n## text2 :\n## \".@AlunCairns “The close relationship between industry and go...\"\n## \n## text3 :\n## \"@BarclaysCorp & @SPTS_Tech \"voice of Welsh Manufacturing...\"\n## \n## text4 :\n## \"Today we announced plans to ban the sale of ivory in the UK....\"\n## \n## text5 :\n## \"Unbeaten Wales overcome Georgia to boost their @FIFAWorldCup...\"\n## \n## text6 :\n## \".@GutoAberconwy marks 25 years of engine production @toyotaf...\"\n## \n## [ reached max_ndoc ... 10,315 more documents ]\ndfmat <- dfm(tokens(tweets_corpus),\n remove_punct = TRUE, \n remove = stopwords(\"english\"))## Warning: '...' should not be used for tokens() arguments; use 'tokens()' first.## Warning: 'remove' is deprecated; use dfm_remove() instead\ndfmat## Document-feature matrix of: 10,321 documents, 26,956 features (99.95% sparse) and 2 docvars.\n## features\n## docs good luck message chris coleman’s squad @fawales ahead tonight’s crucial\n## text1 1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 0 0 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0 0\n## text4 0 0 0 0 0 0 0 0 0 0\n## text5 0 0 0 0 0 0 0 0 0 0\n## text6 0 0 0 0 0 0 0 0 0 0\n## [ reached max_ndoc ... 10,315 more documents, reached max_nfeat ... 26,946 more features ]"},{"path":"exercise-3-comparison-and-complexity.html","id":"compare-between-mps","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.4 Compare between MPs","text":"data format, ready compare text produced members Theresa May’s Cabinet.’s example correlations combined tweets 5 MPs .Note ’re using dfm_group() function, allows take document feature matrix make calculations grouping one document-level variables specified .many different measures similarity, however, might think using., combine four different measures similarity, see compare across MPs. 
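As a brief aside before that comparison: the stringdist package loaded in the setup provides character-based (rather than term-based) measures of similarity and distance. A minimal sketch with two made-up strings (not drawn from the tweet data):

# Levenshtein edit distance between two example strings
library(stringdist)
stringdist("we will strengthen the economy",
           "we will strengthen our economy",
           method = "lv")

Term-based measures such as those used below operate on word counts instead, which is usually what we want when comparing entire accounts.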
Note ’re looking similarity MP’s tweets Prime Minister, Theresa May.","code":"\ncorrmat <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = \"correlation\")\n\ncorrmat[1:5,1:5]## 5 x 5 Matrix of class \"dspMatrix\"\n## aluncairns amberrudduk andrealeadsom borisjohnson brandonlewis\n## aluncairns 1.0000000 0.3610579 0.4717627 0.4137785 0.4815319\n## amberrudduk 0.3610579 1.0000000 0.4746674 0.4657415 0.5866139\n## andrealeadsom 0.4717627 0.4746674 1.0000000 0.5605795 0.6905958\n## borisjohnson 0.4137785 0.4657415 0.5605795 1.0000000 0.6685258\n## brandonlewis 0.4815319 0.5866139 0.6905958 0.6685258 1.0000000"},{"path":"exercise-3-comparison-and-complexity.html","id":"compare-between-measures","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.5 Compare between measures","text":"Let’s see looks like one measures—cosine similarity.first get similarities text MP tweets MPs.remember ’re interested compare Theresa May saying.need take cosine similarities retain similarity measures corresponding text Theresa May’s tweets.first convert textstat_simil() output matrix.can see 23rd row matrix contains similarity measures Theresa May tweets.take row, removing similarity Theresa May (always = 1), convert datframe object.rename cosine similarity column appropriate name convert row names column variable cells containing information MP cosine similarity measure refers.like data tidy format, can plot like .Combining steps single loop, can see different similarity measures interest compare.","code":"\n#estimate similarity, grouping by username\n\ncos_sim <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = \"cosine\") #specify method here as character object\ncosmat <- as.matrix(cos_sim) #convert to a matrix\n#generate data frame keeping only the row for Theresa May\ncosmatdf <- as.data.frame(cosmat[23, c(1:22, 24)])\n#rename column\ncolnames(cosmatdf) <- \"corr_may\"\n \n#create column variable from rownames\ncosmatdf <- tibble::rownames_to_column(cosmatdf, \"username\")\nggplot(cosmatdf) +\n geom_point(aes(x=reorder(username, -corr_may), y= corr_may)) + \n coord_flip() +\n xlab(\"MP username\") +\n ylab(\"Cosine similarity score\") + \n theme_minimal()\n#specify different similarity measures to explore\nmethods <- c(\"correlation\", \"cosine\", \"dice\", \"edice\")\n\n#create empty dataframe\ntestdf_all <- data.frame()\n\n#gen for loop across methods types\nfor (i in seq_along(methods)) {\n \n #pass method to character string object\n sim_method <- methods[[i]]\n \n #estimate similarity, grouping by username\n test <- dfmat %>%\n dfm_group(groups = username) %>%\n textstat_simil(margin = \"documents\", method = sim_method) #specify method here as character object created above\n \n testm <- as.matrix(test) #convert to a matrix\n \n #generate data frame keeping only the row for Theresa May\n testdf <- as.data.frame(testm[23, c(1:22, 24)])\n \n #rename column\n colnames(testdf) <- \"corr_may\"\n \n #create column variable from rownames\n testdf <- tibble::rownames_to_column(testdf, \"username\")\n \n #record method in new column variable\n testdf$method <- sim_method\n\n #bind all together\n testdf_all <- rbind(testdf_all, testdf) \n \n}\n\n#create variable (for viz only) that is mean of similarity scores for each MP\ntestdf_all <- testdf_all %>%\n group_by(username) %>%\n mutate(mean_sim = mean(corr_may))\n\nggplot(testdf_all) +\n geom_point( aes(x=reorder(username, -mean_sim), y= corr_may, color = 
method)) + \n coord_flip() +\n xlab(\"MP username\") +\n ylab(\"Similarity score\") + \n theme_minimal()"},{"path":"exercise-3-comparison-and-complexity.html","id":"complexity-1","chapter":"19 Exercise 3: Comparison and complexity","heading":"19.6 Complexity","text":"now move document-level measures text characteristics. focus paper Schoonvelde et al. (2019).using subset data, taken EU speeches given four politicians. provided authors https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/S4IZ8K.can load data follows.’re working document computer (“locally”) can download tweets data following way:can take look data contains .data contain speeches four different politicians, positioned different points liberal-conservative scale.can calculate Flesch-Kincaid readability/complexity score quanteda.textstats package like .want information aggregated politicians: Gordon Brown, Jose Zapatero”, David Cameron, Mariano Rajoy. recorded data column called “speaker.”gives us data tidy format looks like .can plot—see results look like Figure 1 published article Schoonvelde et al. (2019).","code":"\nspeeches <- readRDS(\"data/comparison-complexity/speeches.rds\")\nspeeches <- readRDS(gzcon(url(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/speeches.rds?raw=true\")))\nhead(speeches)## speaker\n## 1 J.L.R. Zapatero\n## 2 J.L.R. Zapatero\n## 3 J.L.R. Zapatero\n## 4 J.L.R. Zapatero\n## 5 J.L.R. Zapatero\n## 6 J.L.R. Zapatero\n## text\n## 1 Dear friends, good morning to you all, both men and women. Thank you very much, Enriqueta, for you work, for your attitude and for your e-mails, for your messages of affection -some of them, so beautiful- and for making the Federation of Progressit Women possible, creating it and providing it with dignity. Thanks to all those women who form part of this Federation for their work, for their spirit and for their temperament. I haven't heard anything resembling a cry or an insult in all the voices that have spoken out from this tribune, because we are talking about equality and equality is the deepest expression of the dignity of men, of the rights of men, of citizens; it is not compatible with looking down on anyone, with shouting or insulting. Equality and dignity are a form of respect. Thanks for working with respect. Thanks to the Federation of Progressist Women. I have been awarded this prize - and I am going to use the stereotype - and apart from picking it up with affection, which I do, I pick up interpreting that it is a prize for most of the Spanish society, which has made it possible for us to do what we can do from the Government; or better, to do what must be done from the Government. The Spanish society is not afraid of equality. The Spanish society defends and believes in equality, in the equality between men and women, of course; in the equality that still has a long way to go. Thus, for me, this prize represents a firm, committed souvenir of the way we still have to go, rather than the recognition to the task of a Government, so that not a single woman is dominated, discriminated, mistreated, forgotten in this society. And there are still too many. I would like to tell you that the equality and the dignity of men and women is the great motor, the great horizon of my political project. 
I can assure that there are persons, women and, fortunately enough, more men each time, who are going to keep moving on in favour of equality and dignity, and I can assure that I am going to lead those men and women in this country who want more equality and more rights and freedoms for women and for all citizens. You can trust my commitment. This prize has a great value for it was awarded by the associational movement, which I always try to praise and value and to which I try to show my gratitude; an associational movement that has taught and is still teaching us a lesson, that has spotted the problem, that has called the attention of long-forgotten problems and that has lent its voice to those who had been deprived of it, to those who had been silenced. Thanks to the associational movement, to all the organisations that step by step, with their constant effort, give our society a horizon with more rights, with more freedoms and with more equality. I have said it more than once and I must repeat it again here: I am a convinced feminist and I am proud of it. And I would like to tell you something that is very important: I think I have passed this spirit on to the Vice President, to practically the whole Government and to practically the whole Socialist Party. I can tell you that your twenty years of work in favour of the equality of rights and opportunities have been worth it. You may feel proud of yourselves. You have broken conventions and stereotypes that, as Pedro said when he referred to his career from his childhood and to his mother, when one looks back and reflects upon the meaning of the treatment, the consideration and the dominance exerted upon women for centuries in our society, one is tempted to state - and I think that it is a fair thing to do - that we cannot feel proud of History and we cannot feel proud of the history of civilisation, because in most societies it is blotted by the fact that it has dominated, forgotten, marginalized and discriminated women. The 20th century was basically the first time in History when the dawn of equality brought hope to the 21st century, not only to the Spanish society, but to all societies, and there are many societies characterised by painful, intolerable and unbearable marginalization and discrimination and this must become a turning point in History. I would also like to congratulate all those who have received the prize. Congratulations to Mr. Arsenio Escolar, director of \"\"20 Minutos\"\". Life always forces us to chose, to decide among different dilemmas and you chose to support the dignity of women when you had to choose among publicity benefits and the dignity of women, taking out those short ads referring to prostitution. This honours you and this honours us. Thank you very much, Arsenio. Ms. Paz Hernandez Felgueroso, Ms. Maria Jose Ramos and Ms. Begona Fernandez, with their Casa Malva, a welcoming, nice name, have given an exemplary public answer to a social blot and its consequences, namely, gender violence. Congratulations on this deserved prize and I wish you will keep on going, with courage, working in that direction, so that there may be each time more centres, more homes like the Casa Malva, in order to make us think about who need investments and public resources and also think how can a society be dignified, a common task, such as the task that leaders carry out in a democracy. A Casa Malva to restore the dignity of mistreated women. 
It has been mentioned here that we have observed the compromise of making a first law on gender violence in this Legislature. This was the first one elaborated and taken to the Parliament by the Government; a law that is the beginning; a law without which we would not have had any hope and a law that is not going to do away with the blot of criminal male chauvinism on its own. But this Law must be added our will and yours; this Law must be added measures, resources and a calling to the general awareness, which I would like to repeat today, from all the public operators, from all the public Administrations of Justice, of the State Security Forces and Bodies, and from all the social, support network, so that they may know and remember that the Government is asking them, as their first duty as public servers, to pay attention and be near those women who might suffer or have already suffered gender violence. You know that our Government has placed equality at the first position among the values that any country should defend; an equality that can only be efficient and real if it is based upon rights and rights are included in Laws. Thus, the Law on equality has opened up a space at work, harmonising work and family life, in the public arena, and also in the big companies, in the area of economic power, a determinant, decisive space for equality. I am very sorry that this Law has not been supported by all the political parties and I am even sorrier about the fact that it has been appealed before the Constitutional Court. It is not possible to appeal equality between men and women, no amendment is possible; no amendment or petition. Pedro, I start with you now. You know I admire you and I can tell you, because people sometimes say things about you, as President of the Government of Spain, that this country is proud of you. What else could a country offer to the rest of the world? Its culture and the different artistic creations. It is there that talent, creativity, the seed of freedom and the values of equality lie. It is there. It is there that the origin lies and having a well-known, internationally praised director is one of the best occasions to feel a patriot, to feel Spanish in the broad sense of the word.Thank you, Pedro. The whole work of Pedro, especially as far as his female characters are concerned, is a move in favour of equality, because his female characters break the conventions and the stereotypes, and they are a great proof of the hard, everyday life of working women, who devote all their efforts to a family that does not thank them for it or recognise their work. He shows us the profiles of that extraordinary strength that Pedro was mentioning a while ago, with great realism; strength, mainly in sweetness, mainly in love and mainly in courage, because only courage makes it possible to attain equality, that type of equality that is present in all your films, defending women in your films. Yet, women are the best characters in your films, in my opinion, Pedro, the most solid ones, and sometimes the most tormented and the best defined ones. You have been able to outline them in your films and several generations of Spanish women have seen themselves reflected in your films. We saw this when we were having a look at some of the photographs. 
I would also like to praise, through this prize, in a special way, in a very special way, all those generations of Spanish women who have had to live without being able to speak up, without being able to study, to ask; having to remain silent, to obey; having to assume that they were different and inferior. All that generation of women who left many things behind in their lives, because they were not allowed to have a life; all those women deserve my deepest praise, those Spanish women who have not been able to live in freedom. I would like to conclude with two ideas: one has to do with democracy and politics, and the other one with emotions. I will start with politics and democracy. When one arrives enters power -I am not going to comment on those matters pointed out by Pedro Almodovar, for as you know I am not a specialist in that- one contemplates, knows the social reality even better than before being in power, of course. The conclusion I would like to express, the one I have always expressed and I will always keep expressing from my experience as President of the Government. I am convinced that in those countries where there is equality between men and women there is more freedom, more life, more creativity, more respect and more democracy; that wherever there is more power - and let's analyse and see through the eyes of a social reality - wherever there is more power in the hardest sense of the word, and also in the less democratic sense of the word, there are less women. Therefore, for a progressist, for someone who believes in changes, in reform, in transformation and in equality, changing a society implies having more women in those places in which they were not in the past and it also implies that men will have to assume that they do it just like us and, in most cases, better than us. The second thing I would like to say, just to conclude, has to do with emotions, with the field of emotions. This celebration is for me one of the most dearly ones that I have attended so far as President of the Government. Nothing excites me more than contemplating a country with a clear horizon in favour of freedom and equality. Nothing excites me more than being able to contribute with my grain of sand or with many grains of sand so that every day in Spain women may have more freedom, dignity and equality. Today is a good day to say this. I feel very proud of being able to lead the values that you represent. Thank you very much.\n## 2 Honourable President, Honourable Deputies, I want to start with a moving mention to the six \"\"Blue Helmets\"\", soldiers of the Spanish army who died tragically last Sunday, 24 th of June, in Southern Lebanon, and transmit to their families our deepest condolences for their irreparable loss. The Minister of Defence will attend, at his own request, the corresponding Commission of these Chambers in order to give a detailed explanation on the research that is currently being carried out concerning the circumstances and consequences of the attack. I know that I word the feelings of Your Honours by expressing in this very moment the recognition and support for the valuable, heroic task of our contingent in Southern Lebanon. And I also know that I word the feelings of the Spanish citizens by stating that Manuel David Portas, Jonathan Galea, Jefferson Vargas, Yeison Alejandro Castaño, Yhon Edisson Posada and Juan Carlos Villoria, either born in Spain or in Colombia, will always be one of us and will always be with us. 
They lived together, patrolled together and gave their lives together for the same cause. Their families will always feel the encouraging support of the Spanish society, the support of the institutions and the proximity of many citizens that share their immense pain today. Their cause was the cause of peace and solidarity. The cause for which our Army and our Civil Guard are there, in Lebanon, with a triple backing: legal, politic and moral. They are there following the explicit demand contained in Resolution 1071 of the United Nations Security Council; they are there with the support of all the Parliamentary Groups of these Chambers, at the suggestion of the Government, as expressed last September 2006; they are there with the moral urge of contributing to maintaining the cease-fire in a very dangerous zone, helping the locals dismantle the mines, and backing the reconstruction tasks; but they are there, mainly, on a peace operation of the United Nations, contributing with their efforts and sacrifice and even giving their lives in order to help to establish stability in an area where most of the events that place world peace at risk are elucidated. Thus, apart from us, the Heads of State and Government and the heads of international organisms, and in particular, the General Secretary of the United Nations, Mr. Ban-Ki-Moon have also paid homage to them. We have paid a very high price, but our commitment with peace in the Middle East will not be altered, nor shall we stop offering our support to the United Nations as the main factor in order to reach peace. We will also persist in our determination so that those who are guilty for this deadly blast assume it and pay for their felony and, of course, so that they never attain their aims. Honourable President, Honourable Deputies, I will now go on to analyse de results of the European Council held in Brussels last 21 st and 22 nd of June, this year, which makes me feel great satisfaction. We had gone through a two-year blocking, which on many occasions revealed itself as a form of paralysis, and we risked to continue in this situation, damaging the very consistency of the European project. We could have got lost in the inextricable labyrinth of the particular demands of twenty seven States and we could have been defeated by the temptation to postpone the progression to a further attempt. Nothing of this would have been useful for the European Union. Yet, Honourable Deputies, we have made it. We have an agreement that will reactivate the process of European Integration. It is a huge step forward for Europe and a good move for Spain. We, the Heads of State and Government, have agreed upon an order to carry out an Intergovernmental Conference that will adopt a new Treaty for the Reform of the European Union. This order has an extraordinary political meaning, because it develops all the relevant aspects of the future Reform Treaty. Thus, the commitment we have reached represents in fact the future Reform Treaty. Therefore, this commitment represents an agreement de facto and an initial agreement that affects both the form and the content of the new Treaty. It has been, Honourable Deputies, a long, complex negotiation. It was not easy to reach an agreement after such an important, varied political breach among the member States. In some cases, such as in our case, the Governments had received a clear order from the people and from the parliament in favour of the text of the Constitutional Treaty. 
In some other cases, such as in the case of France and Holland, the citizens had positioned themselves against it. Two years and deep reflection, comprehension and political willpower have been necessary in order to overcome this situation. The Government pointed out from the very first moment that its main aim in the negotiation was to take Europe out from the state of stagnation into which it had fallen; of course, preserving the essential contents and the balance of the Constitutional Treaty, at all times. We believed that Europe needed a solution as soon as possible and that this European Council was the chance to get it. And we have made proposals; we have been active, available and we have worked for it. We offered full support to the German Presidency and we backed its efforts through direct contacts with the member States that posed greater difficulties. We explained clearly and in due time the main points of our position and we pointed out the limits that would never be given up. In that framework we proved ourselves flexible enough to understand and add the coherent proposals to an adequate solution agreed upon through consensus. Thus, we increased confidence in the relationship with our partners. All this has been essential for these days' negotiation in Spain to contribute directly to fix the terms of the agreement. Honourable Deputies, the success of the European Council is our own success. We all had risked a lot in this negotiation. It is a success for Europe and for us as European citizens. It is a success for Spain and for the interests of Spain. All the contents of the Constitutional Treaty that we considered essential are included in the new Treaty. This means exactly that the most efficient and democratic Europe the Spanish voted for at the referendum will soon become a reality as soon as the new text of the Treaty comes into force. It is true that in order to achieve this agreement we have had to make concessions too. Spain would have preferred to have come further, with a single Treaty simplifying the European legislation, keeping the term \"\"Constitution\"\" and the reference to the symbols of the Union. Those seemed to us positive contributions, but we also knew that those were not the substantial aspects of the Treaty. It was not those aspects that placed the future of Europe at risk. Thus, it was decided that if this terminology posed difficulties to the other States concerning the agreement, we could eventually accept its modification. The final result is an excellent one. If the previous Treaty was said not to be a proper Constitution, the new one will doubtlessly have to be recognized as much more than a Treaty, from a political point of view. It is a project with a foundational character, a Treaty for the new Europe. The new Treaty establishes in a clear way the binding juridical value of the Charter of Fundamental Rights and Duties. This recognition is essential in order to bring into force our shared value system. Besides, the Treaty introduces a substantial advance for the efficient functioning of the European Union. The subjects that may be voted by qualified majority will increase from 36 to 85, which sets a significant limit to the principle of unanimity that slows down or blocks decision-making in Europe so many times. Once it comes into force, the qualified majority will be the regime applicable to other delicate questions for Spain such as immigration, energy and cooperation in the areas of justice and internal affairs. 
These areas have a great potential inside the European Union and they require a more agile system in order to be developed. Our citizens, the Spanish citizens, will be the first to experience the benefits of these measures. As the Honourable Deputies already know, the definition of the qualified majority voting system has been one of the most discussed questions at this Council. We finally have reached an agreement which consists in keeping the existing system until the 1 st of November 2014, with a further transition period until the 31 st of March 2017, during which the blocking minorities may constitute themselves either upon the basis of the system in force, the one known as \"\"Nice system\"\", or upon the basis of the double majority system, following the decision of the interested States. Both in the case of the existing system and in the case of the double majority system, Spain has an adequate representation, according to its population; but Spain wishes to have a superior influence, as compared to its own number of votes or to what its inhabitants represent, since it knows, by its own experience, that real power in the Union does not depend on more or less votes, but on the capacity of the member States to generate confidence, attract involvement, make alliances and defend its national positions from a European perspective. Against the option of the blocking minorities, the Treaty proposes reinforced cooperation and establishes that these may be promoted by nine States minimum. This is also especially relevant for a country such as Spain, which wants to be at the avant-garde of the integration process in almost all the fields of action of the Union. And there is something more: before the end of October this year, a proposal on the new composition of the European Parliament will have to be put forward, and with regards to this proposal, it guarantees an increase in the number of seats corresponding to Spain during the elections to the Parliament that will be celebrated in 200 As far as the institutional field is concerned, with the creation of the new figures, namely, the President of the Council of the European Union and High Representative for the Common Foreign and Security Policy, Europe will progressively reinforce its efficiency, its visibility and its importance as an authentic European Government. Thanks to these figures it will be easier to identify the personality of the Union and speak on its behalf with a single voice in the international area. It is a very important step in the process of political integration in the European Union, which will give institutional coherence to the functioning of the Council and to the direction and development of the Common Foreign and Security Policy, besides, thanks to the Treaty it will also have an external European service so that it can enter into force. Moreover, the Treaty is a great advance as far as the creation of a Space for Freedom, Security and Justice is concerned, these depend completely on the qualified majority after the introduction in this category of the areas of police and criminal cooperation. These are very good news for Spain and promoting such policies at a European level is a reward to our efforts, and it is also a very significant change for our citizens, as it reinforces the protection of their interests and of their security. With this new framework for actuation, the European policy on immigration promoted by the Spanish Government will be more efficient from the perspective of the European Union. 
Besides, as far as another question of strategic importance for Spanish interests is concerned, the Treaty makes specific reference to the promotion of energetic interconnections among the member States, which, as the Honourable Deputies know, is an essential landmark for the security and development of our energetic policy. The Union recognizes that the principle of energetic solidarity can not be understood in Europe without the development of interconnections. This means, doubtlessly, a great support for the achievement of such interconnections, which are vital for our energetic system. Spain has also been able to keep, in the new text, the improvements established by the Constitutional Treaty with regards to a question that is rather delicate in the case of our country, that concerning the Statute of Ultra-peripheral Regions. At the same time, the Treaty reinforces the role of the national Parliaments, by increasing their capacity to intervene in the European legislative process whenever a simple majority of the votes attributed to those national Parliaments deems that the project put forward does not respect the principle of subsidiarity. Honourable Deputies, I believe that we can be really satisfied with these results. We have not left out any substantial point of the Constitutional Treaty and we have obtained some positive changes for Spain. As far as Spain, this Council has been a reinforcement of our position in Europe. We have worked in cooperation and in harmony with the German Presidency, and I would like to congratulate them once more, from here, on the success achieved thanks to this agreement; the political determination of the German Presidency has doubtlessly been essential for the command that the European Council has given to the Intergovernmental Conference. We have kept a close contact with France, which is the country with which we presented a common proposal a few hours before the meeting of the European Council. And I can tell you that the coordination of our positions and the common mediation have been very useful for the German Presidency. Similarly, we have been working with Italy, Belgium and Luxembourg in order to defend those parts of the Constitutional Treaty that we considered essential. Spain has acted in favour of stability and agreement. It has generated confidence during the whole negotiation and with this attitude we have been able to impulse the defence of the contents and the ambition of a new Treaty. Portugal , the country that will occupy the European Presidency during the next semester, will have our full support during the Intergovernmental Conference. I am convinced that we will have a new Treaty this same year and that its ratification process will take place without any further difficulties. Honourable Deputies, Even though this negotiation about the new Treaty has been the centre of attention of the debates of the Council, during this Council other conclusions about other matters have also been approved of. As you will see, these matters, which I will briefly refer to next, are also of importance for Spain. The Council went on dealing with European immigration policies, following the proposals of Spain. 
Thus, it stated the need to develop further on the actions in Africa and in the Mediterranean region, signing new Mobility Agreements with the Countries of origin and with the Countries of passage; it congratulated itself on the achievement of agreements for the creation of quick intervention teams and a network of coastal patrols, and it decided to keep reinforcing the capacity of the European Exterior Frontiers Agency. Besides, the Council reaffirmed the importance of the fact that a good management of legal immigration may contribute to dissuade illegal migration flows, and it developed some aspects of the application of this European policy on immigration in the Eastern and South-eastern frontiers of the Union. As far as economic, social and environmental policies are concerned, the Council paid attention to the progress made and to the projects that are currently being carried out with regards to matters such as joint technological initiatives or the European Institute of Technology; it repeated the importance of moving forward towards a European, efficient and sustainable transport, and it also encouraged the work on coordination of the social security systems and on the application of the Action Plan against AIDS. Finally, the conclusions of the Council also focus on the European neighbourhood policy, the strategy of the European Union for a new association with Central Asia and the dialogue process with the so-called emerging economies. Similarly, the European Council celebrated the fact that Cyprus and Malt are in condition to adopt the euro by next 1 st of January 200 Honourable Deputies, These have been the main contents of the European Council that has given us back the image of the Europe we want, the Europe in which we believe and for which we have been working so far: a Europe full of ambition and built upon consensus. Spain and Europe have come out of this process even stronger. We were the first to ratify, by referendum, the Constitutional Treaty. In so doing, we reinforced it so that it could survive in essence against the difficulties. We have now contributed in a decisive way to lay the foundations for the agreement and we have proved our solidarity throughout the negotiation. Spain is perceived at a European level as a member State that transmits stability and confidence, and assumes its responsibilities when Europe requires it. This is how we are perceived, this is how we are needed and this is how we are recognized. It is for this reason that we should feel reasonably satisfied and proud of our contribution and, also, and mainly, because Europe has achieved an agreement that will be applied and it will thus bring about a more democratic, efficient functioning of the Union, which is, no doubt, what most of the Spanish and what most of the European citizens want. Thank you very much.\n## 3 Honourable President, Honourable Deputies, I want to start with a moving mention to the six \"\"Blue Helmets\"\", soldiers of the Spanish army who died tragically last Sunday, 24th of June, in Southern Lebanon, and transmit to their families our deepest condolences for their irreparable loss. The Minister of Defence will attend, at his own request, the corresponding Commission of these Chambers in order to give a detailed explanation on the research that is currently being carried out concerning the circumstances and consequences of the attack. 
I know that I word the feelings of Your Honours by expressing in this very moment the recognition and support for the valuable, heroic task of our contingent in Southern Lebanon. And I also know that I word the feelings of the Spanish citizens by stating that Manuel David Portas, Jonathan Galea, Jefferson Vargas, Yeison Alejandro Castaño, Yhon Edisson Posada and Juan Carlos Villoria, either born in Spain or in Colombia, will always be one of us and will always be with us. They lived together, patrolled together and gave their lives together for the same cause. Their families will always feel the encouraging support of the Spanish society, the support of the institutions and the proximity of many citizens that share their immense pain today. Their cause was the cause of peace and solidarity. The cause for which our Army and our Civil Guard are there, in Lebanon, with a triple backing: legal, politic and moral. They are there following the explicit demand contained in Resolution 1071 of the United Nations Security Council; they are there with the support of all the Parliamentary Groups of these Chambers, at the suggestion of the Government, as expressed last September 2006; they are there with the moral urge of contributing to maintaining the cease-fire in a very dangerous zone, helping the locals dismantle the mines, and backing the reconstruction tasks; but they are there, mainly, on a peace operation of the United Nations, contributing with their efforts and sacrifice and even giving their lives in order to help to establish stability in an area where most of the events that place world peace at risk are elucidated. Thus, apart from us, the Heads of State and Government and the heads of international organisms, and in particular, the General Secretary of the United Nations, Mr. Ban-Ki-Moon have also paid homage to them. We have paid a very high price, but our commitment with peace in the Middle East will not be altered, nor shall we stop offering our support to the United Nations as the main factor in order to reach peace. We will also persist in our determination so that those who are guilty for this deadly blast assume it and pay for their felony and, of course, so that they never attain their aims. Honourable President, Honourable Deputies, I will now go on to analyse de results of the European Council held in Brussels last 21st and 22nd of June, this year, which makes me feel great satisfaction. We had gone through a two-year blocking, which on many occasions revealed itself as a form of paralysis, and we risked to continue in this situation, damaging the very consistency of the European project. We could have got lost in the inextricable labyrinth of the particular demands of twenty seven States and we could have been defeated by the temptation to postpone the progression to a further attempt. Nothing of this would have been useful for the European Union. Yet, Honourable Deputies, we have made it. We have an agreement that will reactivate the process of European Integration. It is a huge step forward for Europe and a good move for Spain. We, the Heads of State and Government, have agreed upon an order to carry out an Intergovernmental Conference that will adopt a new Treaty for the Reform of the European Union. This order has an extraordinary political meaning, because it develops all the relevant aspects of the future Reform Treaty. Thus, the commitment we have reached represents in fact the future Reform Treaty. 
Therefore, this commitment represents an agreement de facto and an initial agreement that affects both the form and the content of the new Treaty. It has been, Honourable Deputies, a long, complex negotiation. It was not easy to reach an agreement after such an important, varied political breach among the member States. In some cases, such as in our case, the Governments had received a clear order from the people and from the parliament in favour of the text of the Constitutional Treaty. In some other cases, such as in the case of France and Holland, the citizens had positioned themselves against it. Two years and deep reflection, comprehension and political willpower have been necessary in order to overcome this situation. The Government pointed out from the very first moment that its main aim in the negotiation was to take Europe out from the state of stagnation into which it had fallen; of course, preserving the essential contents and the balance of the Constitutional Treaty, at all times. We believed that Europe needed a solution as soon as possible and that this European Council was the chance to get it. And we have made proposals; we have been active, available and we have worked for it. We offered full support to the German Presidency and we backed its efforts through direct contacts with the member States that posed greater difficulties. We explained clearly and in due time the main points of our position and we pointed out the limits that would never be given up. In that framework we proved ourselves flexible enough to understand and add the coherent proposals to an adequate solution agreed upon through consensus. Thus, we increased confidence in the relationship with our partners. All this has been essential for these days' negotiation in Spain to contribute directly to fix the terms of the agreement. Honourable Deputies, the success of the European Council is our own success. We all had risked a lot in this negotiation. It is a success for Europe and for us as European citizens. It is a success for Spain and for the interests of Spain. All the contents of the Constitutional Treaty that we considered essential are included in the new Treaty. This means exactly that the most efficient and democratic Europe the Spanish voted for at the referendum will soon become a reality as soon as the new text of the Treaty comes into force. It is true that in order to achieve this agreement we have had to make concessions too. Spain would have preferred to have come further, with a single Treaty simplifying the European legislation, keeping the term \"\"Constitution\"\" and the reference to the symbols of the Union. Those seemed to us positive contributions, but we also knew that those were not the substantial aspects of the Treaty. It was not those aspects that placed the future of Europe at risk. Thus, it was decided that if this terminology posed difficulties to the other States concerning the agreement, we could eventually accept its modification. The final result is an excellent one. If the previous Treaty was said not to be a proper Constitution, the new one will doubtlessly have to be recognized as much more than a Treaty, from a political point of view. It is a project with a foundational character, a Treaty for the new Europe. The new Treaty establishes in a clear way the binding juridical value of the Charter of Fundamental Rights and Duties. This recognition is essential in order to bring into force our shared value system. 
Besides, the Treaty introduces a substantial advance for the efficient functioning of the European Union. The subjects that may be voted by qualified majority will increase from 36 to 85, which sets a significant limit to the principle of unanimity that slows down or blocks decision-making in Europe so many times. Once it comes into force, the qualified majority will be the regime applicable to other delicate questions for Spain such as immigration, energy and cooperation in the areas of justice and internal affairs. These areas have a great potential inside the European Union and they require a more agile system in order to be developed. Our citizens, the Spanish citizens, will be the first to experience the benefits of these measures. As the Honourable Deputies already know, the definition of the qualified majority voting system has been one of the most discussed questions at this Council. We finally have reached an agreement which consists in keeping the existing system until the 1st of November 2014, with a further transition period until the 31st of March 2017, during which the blocking minorities may constitute themselves either upon the basis of the system in force, the one known as \"\"Nice system\"\", or upon the basis of the double majority system, following the decision of the interested States. Both in the case of the existing system and in the case of the double majority system, Spain has an adequate representation, according to its population; but Spain wishes to have a superior influence, as compared to its own number of votes or to what its inhabitants represent, since it knows, by its own experience, that real power in the Union does not depend on more or less votes, but on the capacity of the member States to generate confidence, attract involvement, make alliances and defend its national positions from a European perspective. Against the option of the blocking minorities, the Treaty proposes reinforced cooperation and establishes that these may be promoted by nine States minimum. This is also especially relevant for a country such as Spain, which wants to be at the avant-garde of the integration process in almost all the fields of action of the Union. And there is something more: before the end of October this year, a proposal on the new composition of the European Parliament will have to be put forward, and with regards to this proposal, it guarantees an increase in the number of seats corresponding to Spain during the elections to the Parliament that will be celebrated in 200 As far as the institutional field is concerned, with the creation of the new figures, namely, the President of the Council of the European Union and High Representative for the Common Foreign and Security Policy, Europe will progressively reinforce its efficiency, its visibility and its importance as an authentic European Government. Thanks to these figures it will be easier to identify the personality of the Union and speak on its behalf with a single voice in the international area. It is a very important step in the process of political integration in the European Union, which will give institutional coherence to the functioning of the Council and to the direction and development of the Common Foreign and Security Policy, besides, thanks to the Treaty it will also have an external European service so that it can enter into force. 
Moreover, the Treaty is a great advance as far as the creation of a Space for Freedom, Security and Justice is concerned, these depend completely on the qualified majority after the introduction in this category of the areas of police and criminal cooperation. These are very good news for Spain and promoting such policies at a European level is a reward to our efforts, and it is also a very significant change for our citizens, as it reinforces the protection of their interests and of their security. With this new framework for actuation, the European policy on immigration promoted by the Spanish Government will be more efficient from the perspective of the European Union. Besides, as far as another question of strategic importance for Spanish interests is concerned, the Treaty makes specific reference to the promotion of energetic interconnections among the member States, which, as the Honourable Deputies know, is an essential landmark for the security and development of our energetic policy. The Union recognizes that the principle of energetic solidarity can not be understood in Europe without the development of interconnections. This means, doubtlessly, a great support for the achievement of such interconnections, which are vital for our energetic system. Spain has also been able to keep, in the new text, the improvements established by the Constitutional Treaty with regards to a question that is rather delicate in the case of our country, that concerning the Statute of Ultra-peripheral Regions. At the same time, the Treaty reinforces the role of the national Parliaments, by increasing their capacity to intervene in the European legislative process whenever a simple majority of the votes attributed to those national Parliaments deems that the project put forward does not respect the principle of subsidiarity. Honourable Deputies, I believe that we can be really satisfied with these results. We have not left out any substantial point of the Constitutional Treaty and we have obtained some positive changes for Spain. As far as Spain, this Council has been a reinforcement of our position in Europe. We have worked in cooperation and in harmony with the German Presidency, and I would like to congratulate them once more, from here, on the success achieved thanks to this agreement; the political determination of the German Presidency has doubtlessly been essential for the command that the European Council has given to the Intergovernmental Conference. We have kept a close contact with France, which is the country with which we presented a common proposal a few hours before the meeting of the European Council. And I can tell you that the coordination of our positions and the common mediation have been very useful for the German Presidency. Similarly, we have been working with Italy, Belgium and Luxembourg in order to defend those parts of the Constitutional Treaty that we considered essential. Spain has acted in favour of stability and agreement. It has generated confidence during the whole negotiation and with this attitude we have been able to impulse the defence of the contents and the ambition of a new Treaty. Portugal, the country that will occupy the European Presidency during the next semester, will have our full support during the Intergovernmental Conference. I am convinced that we will have a new Treaty this same year and that its ratification process will take place without any further difficulties. 
Honourable Deputies, Even though this negotiation about the new Treaty has been the centre of attention of the debates of the Council, during this Council other conclusions about other matters have also been approved of. As you will see, these matters, which I will briefly refer to next, are also of importance for Spain. The Council went on dealing with European immigration policies, following the proposals of Spain. Thus, it stated the need to develop further on the actions in Africa and in the Mediterranean region, signing new Mobility Agreements with the Countries of origin and with the Countries of passage; it congratulated itself on the achievement of agreements for the creation of quick intervention teams and a network of coastal patrols, and it decided to keep reinforcing the capacity of the European Exterior Frontiers Agency. Besides, the Council reaffirmed the importance of the fact that a good management of legal immigration may contribute to dissuade illegal migration flows, and it developed some aspects of the application of this European policy on immigration in the Eastern and South-eastern frontiers of the Union. As far as economic, social and environmental policies are concerned, the Council paid attention to the progress made and to the projects that are currently being carried out with regards to matters such as joint technological initiatives or the European Institute of Technology; it repeated the importance of moving forward towards a European, efficient and sustainable transport, and it also encouraged the work on coordination of the social security systems and on the application of the Action Plan against AIDS. Finally, the conclusions of the Council also focus on the European neighbourhood policy, the strategy of the European Union for a new association with Central Asia and the dialogue process with the so-called emerging economies. Similarly, the European Council celebrated the fact that Cyprus and Malt are in condition to adopt the euro by next 1st of January 200 Honourable Deputies, These have been the main contents of the European Council that has given us back the image of the Europe we want, the Europe in which we believe and for which we have been working so far: a Europe full of ambition and built upon consensus. Spain and Europe have come out of this process even stronger. We were the first to ratify, by referendum, the Constitutional Treaty. In so doing, we reinforced it so that it could survive in essence against the difficulties. We have now contributed in a decisive way to lay the foundations for the agreement and we have proved our solidarity throughout the negotiation. Spain is perceived at a European level as a member State that transmits stability and confidence, and assumes its responsibilities when Europe requires it. This is how we are perceived, this is how we are needed and this is how we are recognized. It is for this reason that we should feel reasonably satisfied and proud of our contribution and, also, and mainly, because Europe has achieved an agreement that will be applied and it will thus bring about a more democratic, efficient functioning of the Union, which is, no doubt, what most of the Spanish and what most of the European citizens want. Thank you very much.\n## 4 President .- Good morning. Thank you for attending this press conference. I hope you all have had time to rest. 
In the first place, I would like to say that today is a good day for Europe and I am satisfied as we have attained a very important agreement in order to modify in a substantial way the operation of the European Union, in order to make it more efficient so as to provide an answer to the social problems and to the problems of the European citizens. As you know, achieving this agreement was a difficult challenge after the process that we had gone through as a consequence of the referendums in France and in the Netherlands, and we have made it. We all have been willing and we all have committed in order to get the European Union going again, in order to complete a new stage, in order to make it move in the right direction in this new stage, so that it may achieve an each time more perfect, efficient and useful political union. Thus, I would like to express the satisfaction of the Government of Spain, of a Europeist country, a country that has firmly decided to support the European Union, the strengthening of the European Union and its construction, in order -as usual- to establish a compromise, in this case a compromise among twenty seven countries according to the political circumstances that we already know, which I have just mentioned. Everyone has made concessions so that everyone could win a lot. As you all know, the European Council has issued a mandate for the Intergovernmental Conference to reform the basic treaties of the European Union; a mandate whose most important aspects for the operation of the European Union, from the Spanish perspective, are the following: in the first place, the consecration of the rights, of the principles of the Chart of Fundamental Rights with legal value; in the second place, and this might be the most operative achievement, the subjects that will be decided on by qualified majority have passed from 36 to 87, thus, we will reduce the unanimity system and, accordingly, the right of veto, which will facilitate the decision making process with regards to important issues for the whole Union and for Spain, such as, for instance, immigration, energy or justice and interior. I was saying that the European Union would function more democratically with the reform of the Treaty because one of the main principles of our democracies, namely, the weight of the majority, has spread in the heart of Europe, for it could not keep on working according to the principles of Europe when it had six, nine or twelve members. Many of these issues, as I said, are very important for Spain. As you know, the definition of the concept of qualified majority has been one of the most discussed issues, among others, during this European Council. We finally have reached an agreement whereby the voting system in force will be kept, the one known as \"\"Nice\"\" system, until 2014, followed by another period, until the 31st of March 2017, during which the minorities will be allowed to use the double majority system or the \"\"Nice\"\" system, according to the decision of the States concerned if it is so requested by any other State. Besides, one of the main achievements of the reform, in my opinion, is the new definition of the EU Foreign and Security Policy. Europe is going to have a single, stronger voice in order to carry out its activities in the world, thanks to the High Representative of the Union, who will be Vice President of the Commission and, besides, will be provided with its own service, with a foreign service, in order to carry out his tasks. 
One of the main objectives of all this reform process has been to endow the European Union with a stronger voice, more efficient and unified as far of foreign policy and security are concerned. Besides, we will have a common, integral policy on immigration upon the bases that we have already established during the last few months, with the active participation of the Government of Spain, as you know. This has been an important step forward in the area of justice, freedom and security for the construction of the European space. Besides, with regards to a matter of strategic interest for Spain, at Spain's own request, an agreement has been adopted in order to introduce a specific reference in the energy policy for the promotion of energetic interconnections among the member States. Similarly, I would like to emphasise a specific matter, for the improvements established in the Constitutional Treaty concerning the Statute of Ultraperipheral Regions are maintained, for, as you know, this is a very interesting matter for Spain. The European Council has issued a limited mandate, very defined and specific, for the Intergovernmental Conference. You know that the objective of the Portuguese Presidency, as expressed yesterday, is to complete the process as soon as possible so that before the end of the year we may have approved of the reform of the Treaty; this new Treaty that will enable the European Union to operate more democratically, efficiently and according to our present times. The German Presidency has played a fundamental role in order to establish this Agreement. It has been supported by Spain in order to achieve the common backing and in order to approach the most distant positions throughout the last few weeks and, of course, the European Council itself has been supported by Spain. Similarly, we have been working intensely with the President of the Republic of France, Mr. Nicolas Sarkozy; with the Prime Minister of Italy, Mr. Romano Prodi, and, of course, we have also collaborated with the Prime Minister of Great Britain, Mr. Tony Blair, with regards to many important aspects; and, by the way, the latter received yesterday, quite logically, an affectionate applause for this was the last time for him to take part in a European Council. I would also like to express my gratitude for the fact that we have been able to set up a relationship and to work with Mr. Tony Blair at the European Council during this period. To sum up, Europe has provided an answer to a difficult situation as the Constitutional Treaty had not been approved of. We have finally included the most important aspects for the practical operation of the European Union in the reform of the Treaty. Thus, the progressively more united, political, efficient Europe that we want, the one with a more powerful voice, with a single voice in the world, will come true once the Treaty is ratified and enters into force. P.- I would like to know whether the Spanish Government has already decided on what it is going to propose to the Spanish citizens with regards to this new Treaty, for Spain ratified the referendum on the European Union back in February 2005: whether it is going to propose a new referendum, whether it is going to be approved of by the Parliament… How is that process going to be? President .- The ratification process is going to take place at the Parliament. 
Spanish citizens already gave their opinion about a text that, of course, included an important change in the operation of the Union and most of that text is going to form part of the Treaties of the European Union. Thus, the ratification is going to take place at the Parliament. P .- We all agree that this agreement was necessary because, among other things, we could not provide an answer to a two-year long crisis with an even more serious crisis if we had failed. But you, in particular, who have defended so firmly the Constitutional Treaty that has now been forgotten, don't you have a sour feeling? There are evident achievements, but Spain has had to renounce to many of its demands, during the negotiation and, mainly, during the last moments of this negotiation, Europe has had to give in to Poland. Don't you have a sour feeling, of some type, thinking of what could have been and finally has not been achieved, in spite of what has been achieved? President .- Most certainly not, and even less at dawn, for it was at dawn that we finished the meeting yesterday… I have a very positive feeling, for things, as you know very well, were very difficult, months ago, one year ago, in order to find a solution for all, which is how we work in Europe. Of course, from the point of view of the practical operation, which was the essential reason of the Constitutional Treaty and the Convention, from the practical point of view, the most important thing is the reform of the Treaties, which we passed yesterday. The fact that many issues will be approved of by qualified majority, the abandonment of unanimity, which blocks out, which hinders common policies, which does not allow the integration of a European action in important areas such as the ones I have just mentioned (immigration, energy and justice or Interior) was our essential objective. The fact that there is a voting formula and other instruments that are each time more democratic, such as the intense role of national Parliaments and the legislative initiative for citizens, all this implies an important change from the democratic point of view and from the point of view of the capacities of the European Union. Whatever will change with this Treaty is going to change for better. Could we have implemented more changes? Yes, but, once more, Europe has remained loyal to its tradition. As the founders foresaw, Europe is moving forward step by step. Yesterday's step was an important one. All the steps towards the construction of the European Union have yielded very positive results for the European Union and for the countries that form part of it. In fact, I think that we will all agree that this political project is admired all over the world and this is a political project whose current destiny we would not have believed only fifteen years ago, if we had been described -once century ago- a Europe formed by Twenty Seven members, after all the difficulties of co-existence among flags and nations, quite surely. Its destiny is democracy, which is essential, unity, peaceful coexistence and prosperity. That is Europe and, to a great extent, thanks to the European Union. The more we have a European Union, and the better the European Union may work, the better, for I am firmly convinced that the horizon of its Twenty Seven members will involve more security, welfare and prosperity. This is a step, an important step in a difficult situation. 
P .- Nonetheless, the characteristics of the negotiation in the last moment, that is to say, of what we saw yesterday, the very fact that the Polish have been left outside the Intergovernmental Conference so far, doesn't it increase or augment, to a certain extent, the sensation or the impression that this procedure and this Constitution lack legitimacy? My question is whether that sensation of lack of legitimacy might give raise to the demands or petitions for referendums and, thus, it might hinder the whole process once more. President .- Of course that is not the perspective. I think that, quite obviously, the legitimacy issues from the European Law itself and from the way the Treaties are reformed, from the way it has usually been done in most of the occasions in which a Treaty has been approved of, and also from the great political consensus, which is what gives more legitimacy. I am convinced that what European citizens wanted yesterday was an agreement and to clear out the period or the phase of vision, of paralysis and incertitude about the way in which Europe was going to face its future. We have made it and this is very positive. I honestly think that all citizens can perfectly understand that, when we are talking about Twenty Seven countries that form part of the European Union, at present -which represents almost five hundred million citizens, whose countries have very different histories and also a very different pace of incorporation to the European Union, and whose economic development is also very different in each case-, establishing an agreement in spite of all those factors is doubtlessly a great positive achievement, not to say a success. P .- One of the chapters that Poland wanted to include was the one concerning morality. I would like to know where the Spanish Government was going to get in order to impede that Polish petition. Besides, as to your conversation with Mr. Tony Blair, which lasted half an hour, we were told part of that conversation, but I would like to know whether you analysed the current situation of terrorism in Spain and whether Mr. Blair gave any piece of advice with regards to it or what was his analysis of this issue. As to terrorism, I would like to know the new data of the Government about the implementation of an operative base of ETA in Portugal, after the police research that has been carried out during the last few days. President .- As to the first question, it is evident that such proposal did not have the support of the Spanish Government and I can confirm that it was also objected to by most of the Governments of the European Union. Thus, it remained as a declaration strictly on the Polish side, for obvious reasons. In the second place, the conversation with Mr. Tony Blair, logically enough, dealt with the development of the European Council, mainly, and with the most important matters that we had to go through; but in fact, we also talked about terrorism and about ETA's terrorism. Finally, as to the operative data, it is the Ministry of the Interior and, if applicable, the General Director of the Civil Guard, that can facilitate more information in this regard. P .- President, when you became President and arrived to the European Council for the first time, you soon abandoned the defence of the Nice voting system, turning to the double majority system. Yet, it now seems that one of the things that benefits us more is precisely the prolongation of the Nice voting system. 
Don't you think that you hurried up then or do you think that you are going to be recriminated for it? President .- This is a matter of political positioning and this is a philosophy of the European Union. For Spain, which, in both systems is represented as a 45-million citizen country should be at present, this is a system that works properly. We cannot enter Europe saying, on the one hand that there should not be too many issues approved of unanimously and, at the same time, on the other hand, say that we want as many blocking instruments as possible. This is an utter contradiction. Mine has been a coherent, balanced position. I want decisions concerning most issues to be made by majority and I don't want the logics of blockage to be activated, for that paralyses Europe, and I also want Spain to be represented as it should be. And that was so yesterday and that is so today. And, if I may, from my experience and for the fact that it is an undisputable truth, on most occasions influence does not depend on a difference of one vote, instead, it depends on coherence, on the constructive capacity and on the compromise with Europe. P.- President, as to the Chart of Fundamental Rights, I would like to know the position of Great Britain if it were allowed to carry out the \"\"opt out\"\" mentioned. As to terrorism, the newspaper \"\"Gara\"\" has published today certain information according to which the Government held a meeting with the terrorist group last March and, besides, it says that you were sent a letter whose tone was not really conciliating last February. Could you confirm such information? President .- As I have said, the Chart includes the legal value that we demanded, I think that this is the main advance with regards to the principles and rights. The position of Great Britain is already known, and I respect it although, logically enough, I do not share it for I think it would be highly convenient for it to include the European Union as a whole. In the second place, obviously enough, I am not acquainted with the speculations concerning such evident propaganda, and I am not going to comment on them or assess them, and even less in the case of such a particular newspaper. P .- President, the new Treaty will confer the European Parliament greater protagonism. Are you satisfied with the representation we have, with the one that has been agreed on? In yesterday's agreement with Poland you mentioned the Eurodeputies, was it because it is also in that same position? President.- We did not talk about Eurodeputies in the agreement with Poland. What we have is what we already had in the Treaty, including a specific reference in the conclusions so that this change in the composition of the European Parliament can be carried out before the elections in 2009, before the next elections to the European Parliament. Of course, if this is so, the composition will benefit Spain. P.- As you have said that you talked about the situation of the terrorist group with the British Prime Minister, I would like to know whether he gave you any new piece of advice, whether he encouraged you to keep going on, for that was what he had told you in the past: he had told you that you should keep up communication, that you should keep up some kind of dialogue. What was his piece of advice in this new situation? 
President .- During these three long years in the Government, I have spoken to the British Prime Minister on many occasions and we have talked about terrorism and ETA, very specially due to his experience in the peace process in Northern Ireland. Logically enough, he asked me yesterday about it and we commented on it and exchanged our points of view about the situation. Of course, everything he has told me on any occasion has been very useful to me and I am thankful for that. He has always been prone to collaborating. P.- President, it seems to me that in Brussels, the preparation of Brussels has propitiated a new romance, if I may put it so, between Mr. Sarkozy, President of the Republic of France, and the President of the Government of Spain, something that was unthinkable a few months ago. You supported Ms. Segolene Royal during one of your speeches and Mr. Sarkozy supported the policy of the People's Party, criticising quite hardly the process for the massive regularisation of immigrants carried out in Spain, as you will remember. Thus, what is that romance based upon? Is it based on love at the first sight, and I beg your pardon for using this expression? What do you think about the role of Mr. Sarkozy in this first European Council? Was it difficult to deal with a snake charmer who was better than Chirac? And, yet, Mr. Sarkozy seems to have charmed everyone, even taking his shoes off, isn't it right? President .- I am not acquainted with that last detail. I must confirm that I have a very good relationship with Mr. Nicolas Sarkozy, a very good relationship; but this has been the situation from the first day we held the first meeting and now, it has grown more intense, nicer and I think that this is going to be very positive from a political point of view. Besides, I think that we have a good personal understanding, for these things always contribute to it. As you know, we put forward a joint proposal, a proposal by France and Spain, through our Ministers of Foreign Affairs who have been working hard on this question, and it contemplated the main aspects that -in our opinion- had to be included in the reform of the Treaty, the ones that have been included in it, and we have been working in a coordinated way at all times. Yesterday, as you know, the final agreement with Poland was held in the room of the French delegation, and we all were present, Mr. Tony Blair, Mr. Juncker, the President of Poland, Mr. Nicolas Sarkozy and I. That means that there has been a joint, coordinated effort with Mr. Nicolas Sarkozy and, besides, I also think that we should get free from prejudices, sometimes. He has his own political ideas and his own ideology, so do I, but the rest is Europe. That is the grandeur of Europe. Thank you.\n## 5 Thank you very much, dear Rector, for the kindness and hospitality of the Complutense. Congratulations, Josefina, on your brilliant speech. We are here today, not only because this is the sunniest day of the year, but also because we are fully aware of the fact that climatic change is one of the main challenges for Humanity in this century. Being realistic, this is the greatest risk that life on Earth is facing at the moment. Climatic change is a proven fact, although we are still discussing its consequences and eventual calendar. We can not sit and wait for that date, which has no way back, and we should not resign to its effects. At least we know that it will determine the quality of life of our generation, of our sons' and of our grandsons'. 
It is an ineluctable responsibility that we have to face on our own and for the sake of the future. Some twenty years ago, or even less, only a few would dare to warn us against what was coming over. But nowadays, the International Community has assumed it. The Intergovernmental Panel of Experts on Climatic Change has put it clearly and sharply in its conclusions; and, even, during the last Summit of the G-8, those countries that had been reluctant for the last few years, as was the case of the United States, have taken the step we all expected and have announced that they are also going to commit themselves with this global task. As on many other occasions, Europe has led an awareness-raising process, a process of international solidarity that has been subscribed by the rest of the developed countries and, as on many other occasions, this awareness-raising process has been led, in the first place, by social organizations and researchers. Europe supported, unanimously, the Kyoto Protocol back in 1997 and now it is Europe that commits itself to making new moves in this process. The European Union will defend more ambitious objectives during the post-Kyoto negotiations and it is considering a reduction of 20 to 30 per cent in the emission of hothouse gases during the upcoming commitment period. Given its geographical situation and its social and economic characteristics, some of which have already been presented - with sufficient reasons for reflection, I believe - by Ms. Josefina Garcia Mendoza, Spain is an exposed country, highly vulnerable to climatic change. The most recent projections of its eventual effects on our country during the 21st century point to a progressive, important thermal increase and to a general decrease of precipitations, unequally distributed over regions and seasons. We can not just accept it passively, we should not remain still. That is why Spain has been involved in the genesis of Kyoto and it is now making great efforts in order to fulfil its compromises. We are going to throw ourselves into this strategy. The Exhibition and the Conference that are taking place today in the framework of the activities of the Year of Science are a good proof of the clear, firm commitment of the Spanish Government and of the Spanish society to promote renewable energies and fight climatic change. The commitment of this Government, I would like to remind you, started out the very first day it entered upon office. Among its first actuations one could point out the creation of the first National Plan to assign the rights for the emission of hothouse gases. Then, the next ones were the preparation of a Governmental strategy concerning mechanisms of flexibility regarding the Kyoto Protocol, with the participation in several initiatives concerning the Carbon Fund; the approval of the Plan on Renewable Energies and of the Action Plan of the Spanish Strategy on Energetic Efficiency; the approval of the Technical Building Code and the preparation of a National Plan for the adaptation to climatic change. Yet, even though we have carried out or we are still carrying out considerable efforts, these are not enough. We must be more ambitious. We have to set up greater aims in order to attain them in shorter periods. We will soon pass the Spanish Strategy on Climatic Change and Clean Energies during a Cabinet meeting that will be exclusively devoted to climatic change. 
During that meeting we shall approve of a series of specific, urgent measures, with a clear calendar and with available resources, in order to fulfil our commitment with the Kyoto Protocol. As part of the essential part of the Plan of Urgent Measures, we are elaborating a new Saving and Energy Efficiency Action Plan for the period 2008-201 The strategy defines eleven areas of intervention, from institutional cooperation to Research, Development and Technological Innovation, with special attention to the so-called disperse diffuse sectors: transport, residential, commercial, institutional, agricultural and service sectors. Thus, as to the transport sector, we can point out the elaboration of a basic rule on sustainable mobility and the promotion of railway transport for the transportation of goods. As to the residential sector, we can point out the energetic improvement of buildings and the spread of the energetic label to all the domestic facilities. Regarding the institutional sector, we have to point out the establishment of energetic efficiency requisites in the case of public lighting. There are nearly 170 specific measures in the General Strategy against climatic change and in favour of clean, renewable energies. The Strategy will also serve to orientate the capacity of Spain to assume additional compromises in the fight against climatic change after 201 The answer to climatic change is not just a governmental matter; the Government must lead it, and we accept it as it is, but it is a matter that depends on all the society. It concerns all the Administrations, the companies, the brilliant companies belonging to this sector in our country, the consumers and the civil society in general. It implies political leadership, a cultural change and social responsibility. The effort must be a collective, shared one. Each one, each company and each Administration must adapt its own dynamics to these new commitments and the achievements will also be shared. In 2006 we managed to revert a historical tendency and Spanish society reduced the demand for primary energy in 3 per cent, in spite of the high economic increase. Besides, this has allowed us to reduce the emissions of hothouse gases by, approximately, 4 per cent. And all this has been compatible with a strong, stable economic growth. The Spanish society, the companies and the citizens have proved that the fight against climatic change is compatible with economic growth. I dare say it is the best way towards economic growth that we have in front of us nowadays. The Spanish can feel proud of the work that is being carried out as far as renewable energies are concerned. We had a certain potential as a country and we know how to make the most out of it. Sun, water and wind are nothing but potential resources if we do not turn them into a source of useful energy. This is what is being achieved by means of research and innovation. Thanks to our research centres, as is the case of the researchers I have met today at the Complutense, and thanks to our companies, we have become a leading country - I would like to emphasize this - in most renewable technologies. Thus, for example, and since this Day is specially devoted to the sun, which, by the way, has behaved, we should say that Spain is the first country in the world where a solar thermal plant of high commercial temperature has come into operation. This is partly due to the work carried out in the research on this type of energy at the Solar Platform of Almeria. 
But our contribution to renewable energies is not just that: we are the third country in the world in the manufacture of aerogenerators and our market share last year was superior to 20 per cent; we are the leaders in biofuel production; in 2006, Spain was the second producer of bioethanol in the European Union and we were also the second producer of photovoltaic solar energy, as far as installed power is concerned, with an increase of 300 per cent as compared to year 200 The sources of renewable energy cover now an important part of the energetic demand. More of 20 per cent of the electric demand in 2006 was covered with this type of energy. Wind energy alone achieved 9 per cent of the total electric production in our country. Thus, Spain has taken huge steps in very little time. This is a fact we should congratulate ourselves on, but we must be fair and recognize each one's efforts. We are now in an appropriate moment in order to thank our companies for their work and to recognize their contribution to renewable energies and to sustainable development. Some of those companies are represented here. I must thank you for your work. You have been able to explore our technology and compete at a worldwide level, you are now present in the five continents, you have generated employment, more than 18000 work on renewable energies in Spain and you have promoted the image of Spain as a country with technological capacity and respectful with the environment, committed with sustainable development and aware of the challenges of the future. The representatives of non-governmental environmental organisations are also here. Thanks to their pioneer work and to their resolution and steadiness, we all have become aware of the importance of the defence of environmental values. Thanks to their tenacity, the protection of the environment is part of everyday life. From here, I encourage you to persevere on this determination and to keep presenting society and the Governments with objectives that seem unattainable today, yet will soon be demanded by society in general. I would like to finish by repeating that the fight against climatic change is an essential matter for the Government, an absolute priority, the great question of the future, for our economic model and for our growth model. We are doing what we are doing at the moment in order to progress today, but also with the aim of ensuring the future. The fight against climatic change must be the axis of any society-building project during the next years and during the next decades. And what's more, it must be assumed as an individual commitment. It must be more present in our conscience and form part of our daily customs. It is a great objective for any country, it stimulates innovation, it stimulates a healthy way of life, it stimulates respect for our heritage and it stimulates the passion to respect what we will leave for those who will come after us. From the Government, the fight against climatic change has characterised, to a great extent, the legislature; but the efforts we have made during these years must not come to an end once the legislature is over. The fight against climatic change is an essential part of our project. 
It must be an essential objective of Spanish society and conferences such as the one that is taking place today contribute to the spread of the importance of renewable energies as a source of future and as a fundamental element in order to ensure sustainable growth, in order to fight climatic change and in order to gain an insight into the Earth, into the landscape, into our resources and into what being able to develop welfare and to respect our environment represents. This will not be the last time for the Government to coordinate an initiative of this nature. The experience is extraordinarily positive. This has been one of the occasions on which we have been able to bring together the efforts made by different sectors of society. This matter deserves it and requires it. The fight against climatic change must consist of many voices working up one powerful, single message: defending what belongs to all of us so that it keeps belonging to us all. I want my voice to join yours, to join the voice of society, the voice of the non-governmental organisations, the voice of researchers and the voice of the companies in favour of a common commitment: the commitment of Spain to be the leader in the fight against climatic change and also the leader in favour of renewable energies. Thank you for your participation and keep working. Thank you very much.\n## 6 Nicolas, eighty? No, I can't believe that. I have seen you, I have seen you while you were coming up here and I have seen you before… I can't believe it. Do you know what happens? The thing is that in the case of those who give their lives for others, life is twice as worthy; it is twice as worthy. For me, Nicolas is forty. Apply this rule of thumb to yourselves, for that's a good one, because many of us here are trying to devote part of our lives to the destiny of others. I was thinking of that a while ago, while I was sitting there, next to Nicolas: if I apply this rule of thumb to Nicolas and if I apply it to myself: that makes twenty three. Not bad. Well, I don't know whether some others in the Spanish political arena would like to be applied this rule, but just do it. Nicolas, I wanted to be here, and I wanted to, mainly, because I wanted to enjoy it. It is true that such events, which are so emotive, are always somewhat nostalgic; nostalgia is quite habitual in a nation such as the Spanish nation, a nation whose history during the 19th century and part of the 20th has been difficult, with few satisfactions. And this makes us remember, quite nostalgically, the moments in which the doors of freedom opened up. I remember that with a deep happiness, with enthusiasm. I am deeply satisfied for having the chance of being the President of the Government of Spain in 200 21st century Spain does not need to admire the others, other countries, as used to happen during the 19th and during the 20th centuries. 21st century Spain can admire itself and 21st century Spain is admired by many countries in Europe and in the world. This has been thanks to people like Nicolas. Blanco, Urbieta, Terreros, a complete saga: the Redondo saga. Socialists, members of the General Workers' Union, defenders of freedom, committed, with strong convictions -it is not easy to convince them as you know--, they have proved, during this time, what is worth the effort in life, they have proved that it is worth it to commit oneself, to defend a position, to believe in one's country and in one's ideas. They have done this on many occasions in silence. 
I think I quite know Nicolas and I can say now that he is not really fond of tributes, yet, he is a loving man. This is the meaning, for me, of this tribute in this House of People; a House of People whose origin, whose principles embody the two most important values that the socialist movement, of the General Workers' Union (UGT) and the Socialist Party have given to this country; the two values that, in fact, transform, create, generate progress: that is culture and freedom. Once more, it is in the House of People, paying a tribute to someone who has been the General Secretary of the General Workers' Union for eighteen years - and let me praise all those in our group, in the General Workers' Union, in the Free Teaching Institution (Institucion Libre de Ensenanza) and in the Houses of People - that education has been turned into the main bastion of our evolution as a party and as a union. Thank you very much to you all. Nicolas has witnessed and he has also been the protagonist of the last three most brilliant decades of modern and contemporary history, and he has also been the protagonist of some of the most essential changes that our country needed and that it managed to achieve: namely, political and syndical freedom. He was there when the Constitution was written, for he was a constituent deputy as we said in the Parliament the day before yesterday; and he was there in the victory of 1982 and he left the Parliament when the General Workers' Union and the Spanish Socialist Workers Party made the wise decision of giving way to the necessary crisis in order to assume their positions with maturity. And, as usual in any maturity crisis, maturity crises and entering maturity itself is not usually calm or peaceful. That is why we also went through such moments… It is true that I saw it with a perspective and I remember it very well, of course, for I already had responsibilities… I was now commenting on the general strike of the 14th of December 1988 with Candido, for it was a situation that expressed very well the meaning of the Socialist Party and of the General Workers' Union in Spain. I remember that I was in my organisation, the Organisation of Leon, and the members were quite happy, preparing the placards that were going to be taken out during the demonstration for the general strike next day. They were members of the party, most of them workers of the railway company, by the way; for as you know there is a long tradition among those workers. We went through this crisis, which bordered schizophrenia, it is true, but it was necessary, absolutely necessary. Historically, as a result of the dictatorship, the General Workers' Union and the Spanish Socialist Workers' Party were the same. In a democratic society they could not be the same, even though they might share values and objectives. We started out together, then we fell apart and then, as it usually happens even with the laws of physics, we went back to our corresponding position, and that is where we are now. And that has also been thanks to our General Secretary, Candido, of whom you can be proud, Nicolas, because you left the General Workers' Union in very good hands. It is true that this week has been a week of memories with the thirty years of the first democratic elections. There has been a brilliant institutional celebration in order to remember those who are still standing, as strong as ever, those who were also present in those first elections. 
But today is a good day for me -as President of the Government of Spain- to state here, in a House of the People, before all citizens, that the transition, that freedom and democracy would not have been possible without the wonderful example of the workers and of the unions of Spain. I would like this to remain in the collective memory of the last thirty years. I would like to point out that we have reasons to admire ourselves as a country and to be admired by many countries in Europe and all over the world. An example of this is the way in which the social dialogue and the social consensus are structured, and an example of this are the emotive words of someone who has been the president of the representatives of the businessmen for many years, Mr. Jose Maria Cuevas. I know that he is really convinced about what he said here about Nicolas, about the General Workers' Union, about the Workers' Commissions and about the Spanish unions, which have made a great contribution to modernisation, to progress and to welfare in this country. Thank you, Jose Maria. I am glad that this is also the pervading atmosphere in the case of the understanding or of the unity of syndical action between the General Workers' Union and the Workers' Commissions, because the words that Jose Maria has uttered here today are invaluable for me. He has been generous with the General Workers' Union, the main competitor in the syndical arena. This takes Jose Maria Fidalgo even higher, which is quite a difficult task. Thus, let's call things by their names: we have done it very well, we are doing it very well and its results will benefit the Spanish citizens and Spain, the Spain that Nicolas Redondo loves so much. What called our attention in the video was the fact that the history of the Spanish Socialist Workers' Party and the General Workers' Union melted together with the meaning of modern and contemporary history in Spain. Let's consider this piece of data: the Spanish Socialist Workers' Party was founded in Madrid; the General Workers' Union in Catalonia and we all know that the Basque Country was decisive for the growth of both the General Workers' Union and the Socialist Party, as the Redondo's know very well. This is our sign of identity: the party that resembles Spain more, the party that has structured and still structures the meaning of a common project, the one with the deepest historical roots and the one with an even better future. Nicolas, your forty years place in front of us the perspective of what should be emphasised here today. From those well-written words, namely, history, memory and future, I chose the last one, the future, because we are going to witness a near future of full employment in Spain; we are going to witness a future with full equality among men and women as far as the activity rates are concerned, as far as rights are concerned, as far as employment is concerned and as far as the management of important companies in our country is concerned; we are going to witness a near future in which we will provide not only for education, health care and, of course, for a pension system that we will progressively approach to the model of the European average values, but also one in which we will provide for those who are alone, for dependent persons and for their families. A future in which we must guarantee a worthier Minimum Salary. We have walked a long way during this Legislature, and we will walk an even longer one during the next Legislature. 
The future of a country, think of thirty years before, where there were still many emigrants, a country that knows that it must keep being an example of organisation for those who come from abroad to live and work with us; persons who are going to work with their rights and with their duties if they are to stay here working, because in this country, regardless of the origin of the person, we are not going to allow illegal, fraudulent work, we are not going to allow the exploitation of a human being, regardless of the colour of his skin. A future with social agreements, with social dialogue, as we have been doing during the last three years with twenty social agreements. And a future that must focus on employment, on two main objectives: keeping on increasing at a swifter pace the transformation of temporary employment into indefinite employment, which is functioning very well thanks to the agreement that we have signed; and, of course, winning the battle of health and accidents at work, for which we want to approve of a long-scope strategy so as to reduce accidents at work in Spain in 25 per cent. Our unions use the word productivity, they attest modernity and they know that for our country to have more prosperity, welfare, social policies, equality and rights we have to produce more and better every day. This involves education, research and innovation. We can make it and we are going to make it. Nicolas, you may be proud, but the way a socialist does: intimately proud. Surely enough, words of praise, words of recognition and tributes come out relatively easy, they are even obligatory. I know that you are proud of yourself. I know that you have proved that living, committing with the destinies of the others and committing with deep values is worth it. You belong to a generation like the ones that have come next, a generation of Spanish citizens that have risen to the occasion. You have left us a country, Spain, for which it is worth it to fight; a country, Spain, that is going to keep on progressing every day; a country, Spain, that is admired, respected; a country, Spain, whose sole signs of identity before the world are democracy, justice, equality and solidarity, and, of course, peace. Nicolas, cheers for the General Workers' Union! cheers for workers! And cheers for the Redondo's! Thank you very much.\nspeeches$flesch.kincaid <- textstat_readability(speeches$text, measure = \"Flesch.Kincaid\")\n\n# returned as quanteda data.frame with document-level information;\n# need just the score:\nspeeches$flesch.kincaid <- speeches$flesch.kincaid$Flesch.Kincaid\n#get mean and standard deviation of Flesch-Kincaid, and N of speeches for each speaker\nsum_corpus <- speeches %>%\n group_by(speaker) %>%\n summarise(mean = mean(flesch.kincaid, na.rm=TRUE),\n SD=sd(flesch.kincaid, na.rm=TRUE),\n N=length(speaker))\n\n# calculate standard errors and confidence intervals\nsum_corpus$se <- sum_corpus$SD / sqrt(sum_corpus$N)\nsum_corpus$min <- sum_corpus$mean - 1.96*sum_corpus$se\nsum_corpus$max <- sum_corpus$mean + 1.96*sum_corpus$se\nsum_corpus## # A tibble: 4 × 7\n## speaker mean SD N se min max\n## \n## 1 D. Cameron 10.9 1.70 456 0.0794 10.7 11.0\n## 2 G. Brown 13.3 2.28 277 0.137 13.1 13.6\n## 3 J.L.R. Zapatero 15.5 2.83 354 0.150 15.2 15.8\n## 4 M. 
Rajoy 13.7 2.56 389 0.130 13.4 13.9

ggplot(sum_corpus, aes(x=speaker, y=mean)) +
  geom_bar(stat="identity") + 
  geom_errorbar(ymin=sum_corpus$min, ymax=sum_corpus$max, width=.2) +
  coord_flip() +
  xlab("") +
  ylab("Mean Complexity") + 
  theme_minimal() + 
  ylim(c(0,20))

19.7 Exercises

1. Compute distance measures such as "euclidean" and "manhattan" for the MP tweets above, comparing the tweets of MPs with the tweets of the PM, Theresa May.
2. Estimate at least three complexity measures for the EU speeches above. Consider how the results compare to the Flesch-Kincaid measure used in the article by Schoonvelde et al. (2019).
3. (Advanced, optional) Estimate similarity scores between MP tweets and PM tweets for each week contained in the data. Plot the results.

A starting-point sketch for the first two exercises follows below.
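The following is a rough sketch of how you might begin the first two exercises, not a model answer. It assumes the quanteda.textstats package (where recent versions of quanteda keep the textstat_* functions) is installed, that "dfm_mptweets" is a placeholder name for a document-feature matrix you have built from the MP tweets, and that the speeches data frame used above is still loaded.

library(quanteda.textstats)

# Exercise 1 (sketch): pairwise document distances on a dfm of MP tweets.
# "dfm_mptweets" is a placeholder for a dfm you construct yourself.
dists_euclidean <- textstat_dist(dfm_mptweets, method = "euclidean")
dists_manhattan <- textstat_dist(dfm_mptweets, method = "manhattan")

# Exercise 2 (sketch): several readability measures in one call, to compare
# against the Flesch-Kincaid scores computed above.
complexity_scores <- textstat_readability(
  speeches$text,
  measure = c("Flesch.Kincaid", "FOG", "SMOG")
)
head(complexity_scores)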
20 Exercise 4: Scaling techniques

20.1 Introduction

The hands-on exercise for this week focuses on: 1) scaling texts; and 2) implementing scaling techniques using quanteda.

In this tutorial, you will learn how to:
- Scale texts using the "wordfish" algorithm
- Scale texts gathered from online sources
- Replicate the analyses by Kaneko, Asano, and Miwa (2021)

Before proceeding, we'll load the packages we need for this tutorial.

library(dplyr)
library(quanteda) # includes functions to implement Lexicoder
library(quanteda.textmodels) # for estimating similarity and complexity measures
library(quanteda.textplots) # for visualizing text modelling results

In this exercise we'll be using the dataset we used for the sentiment analysis exercise. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation. The tweets include any tweets sent by each news outlet from its main account.

20.2 Importing data

The data are included in the course repository and can be read in as follows:

tweets <- readRDS("data/sentanalysis/newstweets.rds")

If you're working on this document on your own computer ("locally"), you can download the tweets data in the following way:

tweets <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true")))

We first take a sample of the data to speed up the runtime of some of the analyses.

tweets <- tweets %>%
  sample_n(20000)

20.3 Construct dfm object

Then, as in the previous exercise, we create a corpus object, specify the document-level variables by which we want to group, and generate our document feature matrix.

#make corpus object, specifying tweet as text field
tweets_corpus <- corpus(tweets, text_field = "text")

#add in username document-level information
docvars(tweets_corpus, "newspaper") <- tweets$user_username

dfm_tweets <- dfm(tokens(tweets_corpus),
                  remove_punct = TRUE, 
                  remove = stopwords("english"))

We can then look at the number of documents (tweets) per newspaper Twitter account.

## number of tweets per newspaper
table(docvars(dfm_tweets, "newspaper"))

## 
## DailyMailUK DailyMirror EveningStandard guardian MetroUK 
## 2052 5834 2182 2939 966 
## Telegraph TheSun thetimes 
## 1519 3840 668

And this is what the document feature matrix looks like, with word counts for each of the eight newspapers.

dfm_tweets

## Document-feature matrix of: 20,000 documents, 48,967 features (99.98% sparse) and 31 docvars.
## features
## docs rt @standardnews breaking coronavirus outbreak declared pandemic world
## text1 1 1 1 1 1 1 1 1
## text2 1 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0
## features
## docs health organisation
## text1 1 1
## text2 0 0
## text3 0 0
## text4 0 0
## text5 0 0
## text6 0 0
## [ reached max_ndoc ... 19,994 more documents, reached max_nfeat ... 48,957 more features ]
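Before fitting a scaling model, it can be useful to glance at the most frequent features, since very heavy features (retweet markers, URL fragments and the like) sometimes warrant extra cleaning. This is a quick sketch rather than part of the original exercise; it uses the dfm_tweets object created above.

# sanity check (sketch): the 20 most frequent features in the dfm
topfeatures(dfm_tweets, 20)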
20.4 Estimate wordfish model

With the data in this format, we are able to group and trim the document feature matrix before estimating the wordfish model.

# compress the document-feature matrix at the newspaper level
dfm_newstweets <- dfm_group(dfm_tweets, groups = newspaper)
# remove words not used by two or more newspapers
dfm_newstweets <- dfm_trim(dfm_newstweets, 
                           min_docfreq = 2, docfreq_type = "count")

## size of the document-feature matrix
dim(dfm_newstweets)

## [1] 8 11111

We can then estimate the model and inspect the results.

#### estimate the Wordfish model ####
set.seed(123L)
dfm_newstweets_results <- textmodel_wordfish(dfm_newstweets, 
                                             sparse = TRUE)

summary(dfm_newstweets_results)

## 
## Call:
## textmodel_wordfish.dfm(x = dfm_newstweets, sparse = TRUE)
## 
## Estimated Document Positions:
## theta se
## DailyMailUK 0.64904 0.012949
## DailyMirror 1.18235 0.006726
## EveningStandard -0.22616 0.016082
## guardian -0.95428 0.010563
## MetroUK -0.04625 0.022759
## Telegraph -1.05344 0.010640
## TheSun 1.45044 0.006048
## thetimes -1.00168 0.014966
## 
## Estimated Feature Scores:
## rt breaking coronavirus outbreak declared pandemic world health
## beta 0.537 0.191 0.06918 -0.2654 -0.06525 -0.2004 -0.317 -0.3277
## psi 5.307 3.535 5.78715 3.1348 0.50705 3.1738 3.366 3.2041
## organisation genuinely interested see one cos fair
## beta -0.4118 -0.2873 -0.2545 0.0005141 -0.06312 -0.2788 -0.03078
## psi 0.5487 -0.5403 -1.4502 2.7723965 3.85881 -1.4480 0.35480
## german care system protect troubled children #covid19 anxiety shows
## beta -0.7424 -0.3251 -1.105 -0.1106 -0.4731 0.01205 -0.6742 0.4218 0.4165
## psi 1.1009 3.1042 1.259 1.8918 -0.0784 2.85004 2.9703 0.5917 2.8370
## sign man behind app explains tips
## beta -0.1215 0.5112 0.05499 0.271 0.6687 -0.2083
## psi 1.9427 3.5777 2.43805 1.376 1.2749 1.5341

We can plot the estimates of the θs, i.e., the estimates of the latent newspaper positions. Interestingly, we seem to have captured a tonal dimension as much as a straightforwardly ideological one: the tabloid newspapers are scored similarly and grouped toward the right-hand side of the latent dimension, whereas the broadsheet newspapers have estimated thetas further to the left.

textplot_scale1d(dfm_newstweets_results)

Plotting the "features," i.e., the word-level betas, shows how words are positioned along this dimension, and which words help to discriminate between news outlets.

textplot_scale1d(dfm_newstweets_results, margin = "features")

We can also look at the most discriminating features directly.

features <- dfm_newstweets_results[["features"]]

betas <- dfm_newstweets_results[["beta"]]

feat_betas <- as.data.frame(cbind(features, betas))
feat_betas$betas <- as.numeric(feat_betas$betas)

feat_betas %>%
  arrange(desc(betas)) %>%
  top_n(20) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = "striped")

## Selecting by betas

These words seem to belong to more tabloid-style reportage, and include emojis relating to film and sports reporting such as "cristiano," as well as colloquial terms like "saucy."
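If you would rather have the document positions as a table than a plot, something like the following sketch works. It is not part of the original exercise, and it assumes the fitted object stores its estimates in components named docs, theta and se.theta (as in current versions of quanteda.textmodels); check the actual component names with str(dfm_newstweets_results) before relying on them.

# sketch: pull out document positions and build 95% confidence intervals.
# Component names (docs, theta, se.theta) are an assumption; verify with str().
theta_df <- data.frame(
  newspaper = dfm_newstweets_results$docs,
  theta = dfm_newstweets_results$theta,
  se = dfm_newstweets_results$se.theta
)
theta_df$lo <- theta_df$theta - 1.96 * theta_df$se
theta_df$hi <- theta_df$theta + 1.96 * theta_df$se
theta_df[order(theta_df$theta), ]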
20.5 Replicating Kaneko et al.

This section adapts code from the replication data provided by Kaneko, Asano, and Miwa (2021). We can access the data from the first study in Kaneko, Asano, and Miwa (2021) in the following way:

kaneko_dfm <- readRDS("data/wordscaling/study1_kaneko.rds")

If you're working on this document on your own computer ("locally"), you can download the dfm data with:

kaneko_dfm <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/wordscaling/study1_kaneko.rds?raw=true")))

These data are already in the form of a document-feature matrix.

table(docvars(kaneko_dfm, "Newspaper"))

## 
## Asahi Chugoku Chunichi Hokkaido Kahoku Mainichi 
## 38 24 47 46 18 26 
## Nikkei Nishinippon Sankei Yomiuri 
## 13 27 14 30

We can first manipulate the data in the same way as Kaneko, Asano, and Miwa (2021), by grouping at the level of the newspaper and removing infrequent words.

## prepare the newspaper-level document-feature matrix
# compress the document-feature matrix at the newspaper level
kaneko_dfm_study1 <- dfm_group(kaneko_dfm, groups = Newspaper)
# remove words not used by two or more newspapers
kaneko_dfm_study1 <- dfm_trim(kaneko_dfm_study1, min_docfreq = 2, docfreq_type = "count")

## size of the document-feature matrix
dim(kaneko_dfm_study1)

## [1] 10 4660

20.6 Exercises

1. Estimate a wordfish model for the Kaneko, Asano, and Miwa (2021) data.
2. Visualize the results.

A sketch of the workflow follows below.
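A minimal sketch of the exercise workflow, mirroring the wordfish code used above for the newspaper tweets; treat it as a starting point rather than the model answer.

# sketch: fit wordfish to the grouped, trimmed Kaneko et al. dfm and plot it
set.seed(123L)
kaneko_results <- textmodel_wordfish(kaneko_dfm_study1, sparse = TRUE)
summary(kaneko_results)

# document positions and discriminating features
textplot_scale1d(kaneko_results)
textplot_scale1d(kaneko_results, margin = "features")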
21 Exercise 5: Unsupervised learning (topic models)

21.1 Introduction

The hands-on exercise for this week focuses on: 1) estimating a topic model; and 2) interpreting and visualizing the results.

In this tutorial, you will learn how to:
- Generate document-term matrices in a format appropriate for topic modelling
- Estimate a topic model using the quanteda and topicmodels packages
- Visualize the results
- Reverse engineer and test model accuracy
- Run validation tests

21.2 Setup

Before proceeding, we'll load the packages we need for this tutorial.

library(tidyverse) # loads dplyr, ggplot2, and others
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
library(topicmodels) # to estimate topic models
library(gutenbergr) # to get text data
library(scales)
library(tm)
library(ggthemes) # to make your plots look nice
library(readr)
library(quanteda)
library(quanteda.textmodels)
#devtools::install_github("matthewjdenny/preText")
library(preText)

We'll be using data from Alexis de Tocqueville's "Democracy in America." We download the data for both Volume 1 and Volume 2 and combine them into one data frame. For this, we'll be using the gutenbergr package, which allows the user to download text data from more than 60,000 out-of-copyright books. The ID of a book appears in the url of the book selected after a search on https://www.gutenberg.org/ebooks/. This example is adapted from Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Here, we see that Volume 1 of Tocqueville's "Democracy in America" is stored under ID "815". A separate search reveals that Volume 2 is stored under "816".

tocq <- gutenberg_download(c(815, 816), 
                           meta_fields = "author")

We can also read in the dataset as follows:

tocq <- readRDS("data/topicmodels/tocq.rds")

If you're working on this document on your own computer ("locally"), you can download the data in the following way:

tocq <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/topicmodels/tocq.RDS?raw=true")))

Having read in the data, we convert it into a different data shape: a document-term matrix. We also create a new column, which we call "booknumber", recording whether the term in question comes from Volume 1 or Volume 2. To convert from tidy format to a "DocumentTermMatrix" we can first use unnest_tokens() as we have done in past exercises, remove stop words, and then use the cast_dtm() function to convert to a "DocumentTermMatrix" object.

tocq_words <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
  unnest_tokens(word, text) %>%
  filter(!is.na(word)) %>%
  count(booknumber, word, sort = TRUE) %>%
  ungroup() %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

tocq_dtm <- tocq_words %>%
  cast_dtm(booknumber, word, n)

tm::inspect(tocq_dtm)

## <>
## Non-/sparse entries: 17581/6603
## Sparsity : 27%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs country democratic government laws nations people power society time
## DiA1 357 212 556 397 233 516 543 290 311
## DiA2 167 561 162 133 313 360 263 241 309
## Terms
## Docs united
## DiA1 554
## DiA2 227

We see that the data are now stored as a "DocumentTermMatrix." In this format, the matrix records each term (equivalent to a column) and each document (equivalent to a row), along with the number of times the term appears in the given document. Many terms do not appear in a given document at all, meaning the matrix is stored as "sparse," with a preponderance of zeroes. Here, since we are looking at two documents that come from a single volume set, the sparsity is relatively low (27%). In other applications, the sparsity will be a lot higher, often approaching 99%.

Estimating the topic model is relatively simple. We need to specify how many topics we want to search for, and we can also set a seed, which is needed to reproduce the same results each time (the model is a generative probabilistic one, meaning different random iterations would produce different results).

tocq_lda <- LDA(tocq_dtm, k = 10, control = list(seed = 1234))

We can then extract the per-topic-per-word probabilities, called "β", from the model:

tocq_topics <- tidy(tocq_lda, matrix = "beta")

head(tocq_topics, n = 10)

## # A tibble: 10 × 3
## topic term beta
## 
## 1 1 democratic 0.00855
## 2 2 democratic 0.0115 
## 3 3 democratic 0.00444
## 4 4 democratic 0.0193 
## 5 5 democratic 0.00254
## 6 6 democratic 0.00866
## 7 7 democratic 0.00165
## 8 8 democratic 0.0108 
## 9 9 democratic 0.00276
## 10 10 democratic 0.00334

The data are now stored as one topic-per-term-per-row. The betas listed represent the probability that a given term belongs to a given topic. Here, for example, we see that the term "democratic" is most likely to belong to topic 4. Strictly, this probability represents the probability that the term is generated by the topic in question.

We can then plot the top terms, by beta, for each topic as follows:

tocq_top_terms <- tocq_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

tocq_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

How do we actually evaluate these topics? Here, the topics all seem pretty similar.
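One quick, informal check, offered here as a sketch in the spirit of Silge and Robinson's book rather than as part of the original exercise, is to compare two topics directly by the log ratio of their betas, which surfaces the terms that most distinguish them:

# sketch: which terms most distinguish topic 1 from topic 2?
beta_wide <- tocq_topics %>%
  filter(topic %in% c(1, 2)) %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(desc(abs(log_ratio)))

head(beta_wide)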
21.3 Evaluating topic model

Well, one way to evaluate the performance of unsupervised forms of classification is to test the model on a case where the outcome is already known. Here, two obvious 'topics' are Volume 1 and Volume 2 of Tocqueville's "Democracy in America." Volume 1 of Tocqueville's work deals more obviously with abstract constitutional ideas and questions of race; Volume 2 focuses on more esoteric aspects of American society. Listen to the "In Our Time" episode in which Melvyn Bragg and guests discuss Democracy in America. Given these differences in focus, we might expect a generative model to be able to assign each chapter to its topic (i.e., Volume) with some accuracy.

21.3.1 Plot relative word frequencies

First let's look to see whether there really are words that obviously distinguish the two Volumes.

tidy_tocq <- tocq %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

## Count most common words in both
tidy_tocq %>%
  count(word, sort = TRUE)

## # A tibble: 12,092 × 2
## word n
## 
## 1 people 876
## 2 power 806
## 3 united 781
## 4 democratic 773
## 5 government 718
## 6 time 620
## 7 nations 546
## 8 society 531
## 9 laws 530
## 10 country 524
## # ℹ 12,082 more rows

There do seem to be marked distinguishing characteristics. In the plot below, for example, we see that abstract notions of the state and its systems appear with greater frequency in Volume 1, while Volume 2 seems to contain words more specific to America (e.g., "north" and "south") with greater frequency. The way to read the plot is that words positioned away from the diagonal line appear with greater frequency in one volume versus the other.

bookfreq <- tidy_tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(booknumber, word) %>%
  group_by(booknumber) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(booknumber, proportion)

ggplot(bookfreq, aes(x = DiA1, y = DiA2, color = abs(DiA1 - DiA2))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  theme_tufte(base_family = "Helvetica") +
  theme(legend.position="none", 
        strip.background = element_blank(), 
        strip.text.x = element_blank()) +
  labs(x = "Tocqueville DiA 1", y = "Tocqueville DiA 2") +
  coord_equal()

## Warning: Removed 6173 rows containing missing values (`geom_point()`).
## Warning: Removed 6174 rows containing missing values (`geom_text()`).

21.3.2 Split into chapter documents

Here, we first separate the volumes into chapters and repeat the procedure above. The difference is that now, instead of two documents representing the two full volumes of Tocqueville's work, we have 132 documents, each representing an individual chapter. Notice that the sparsity is now much higher: around 96%.

tocq <- tocq %>%
  filter(!is.na(text))

# Divide into documents, each representing one chapter
tocq_chapter <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
  group_by(booknumber) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, booknumber, chapter)

# Split into words
tocq_chapter_word <- tocq_chapter %>%
  unnest_tokens(word, text)

# Find document-word counts
tocq_word_counts <- tocq_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

## Joining with `by = join_by(word)`

tocq_word_counts

## # A tibble: 69,781 × 3
## document word n
## 
## 1 DiA2_76 united 88
## 2 DiA2_60 honor 70
## 3 DiA1_52 union 66
## 4 DiA2_76 president 60
## 5 DiA2_76 law 59
## 6 DiA1_42 jury 57
## 7 DiA2_76 time 50
## 8 DiA1_11 township 49
## 9 DiA1_21 federal 48
## 10 DiA2_76 constitution 48
## # ℹ 69,771 more rows

# Cast into DTM format for LDA analysis

tocq_chapters_dtm <- tocq_word_counts %>%
  cast_dtm(document, word, n)

tm::inspect(tocq_chapters_dtm)

## <>
## Non-/sparse entries: 69781/1500755
## Sparsity : 96%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs country democratic government laws nations people power public time
## DiA1_11 10 0 23 19 7 13 19 15 6
## DiA1_13 13 5 34 9 12 17 37 15 6
## DiA1_20 9 0 25 13 2 14 32 13 10
## DiA1_21 4 0 20 29 6 12 20 5 5
## DiA1_23 10 0 35 9 24 20 13 4 8
## DiA1_31 7 12 10 13 4 30 18 31 6
## DiA1_32 10 14 25 6 9 25 11 43 8
## DiA1_47 12 2 5 3 3 6 8 0 3
## DiA1_56 12 0 3 7 19 3 8 3 22
## DiA2_76 11 10 24 39 12 31 27 27 50
## Terms
## Docs united
## DiA1_11 13
## DiA1_13 19
## DiA1_20 21
## DiA1_21 23
## DiA1_23 15
## DiA1_31 11
## DiA1_32 14
## DiA1_47 8
## DiA1_56 25
## DiA2_76 88

We then re-estimate the topic model on this new DocumentTermMatrix object, specifying k equal to 2. This will enable us to evaluate whether the topic model is able to generatively assign chapters to the correct volume with accuracy.

tocq_chapters_lda <- LDA(tocq_chapters_dtm, k = 2, control = list(seed = 1234))

Before examining this, it is worth looking at another output of the latent Dirichlet allocation procedure. The γ probability represents the per-document-per-topic probability or, in other words, the probability that a given document (here: a chapter) belongs to a particular topic (here, assuming the topics represent the volumes). The gamma values are therefore the estimated proportion of words within a given chapter allocated to a given volume.

tocq_chapters_gamma <- tidy(tocq_chapters_lda, matrix = "gamma")
tocq_chapters_gamma

## # A tibble: 264 × 3
## document topic gamma
## 
## 1 DiA2_76 1 0.551 
## 2 DiA2_60 1 1.00 
## 3 DiA1_52 1 0.0000464
## 4 DiA1_42 1 0.0000746
## 5 DiA1_11 1 0.0000382
## 6 DiA1_21 1 0.0000437
## 7 DiA1_20 1 0.0000425
## 8 DiA1_28 1 0.249 
## 9 DiA1_50 1 0.0000477
## 10 DiA1_22 1 0.0000466
## # ℹ 254 more rows
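As an extra, optional check (a sketch, not part of the original exercise), you can visualize how cleanly the gamma values separate chapters by volume; most chapters should sit close to 0 or 1 for one of the two topics.

# sketch: distribution of gamma by volume and topic
tocq_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  labs(x = "Topic", y = "Gamma")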
see model estimated accuracy 91% chapters Volume 2 79% chapters Volume 1","code":"\n# First separate the document name into title and chapter\n\ntocq_chapters_gamma <- tocq_chapters_gamma %>%\n separate(document, c(\"title\", \"chapter\"), sep = \"_\", convert = TRUE)\n\ntocq_chapter_classifications <- tocq_chapters_gamma %>%\n group_by(title, chapter) %>%\n top_n(1, gamma) %>%\n ungroup()\n\ntocq_book_topics <- tocq_chapter_classifications %>%\n count(title, topic) %>%\n group_by(title) %>%\n top_n(1, n) %>%\n ungroup() %>%\n transmute(consensus = title, topic)\n\ntocq_chapter_classifications %>%\n inner_join(tocq_book_topics, by = \"topic\") %>%\n filter(title != consensus)## # A tibble: 15 × 5\n## title chapter topic gamma consensus\n## \n## 1 DiA1 45 1 0.762 DiA2 \n## 2 DiA1 5 1 0.504 DiA2 \n## 3 DiA1 33 1 0.570 DiA2 \n## 4 DiA1 34 1 0.626 DiA2 \n## 5 DiA1 41 1 0.512 DiA2 \n## 6 DiA1 44 1 0.765 DiA2 \n## 7 DiA1 8 1 0.791 DiA2 \n## 8 DiA1 4 1 0.717 DiA2 \n## 9 DiA1 35 1 0.576 DiA2 \n## 10 DiA1 39 1 0.577 DiA2 \n## 11 DiA1 7 1 0.687 DiA2 \n## 12 DiA1 29 1 0.983 DiA2 \n## 13 DiA1 6 1 0.707 DiA2 \n## 14 DiA2 27 2 0.654 DiA1 \n## 15 DiA2 21 2 0.510 DiA1\n# Look document-word pairs were to see which words in each documents were assigned\n# to a given topic\n\nassignments <- augment(tocq_chapters_lda, data = tocq_chapters_dtm)\nassignments## # A tibble: 69,781 × 4\n## document term count .topic\n## \n## 1 DiA2_76 united 88 2\n## 2 DiA2_60 united 6 1\n## 3 DiA1_52 united 11 2\n## 4 DiA1_42 united 7 2\n## 5 DiA1_11 united 13 2\n## 6 DiA1_21 united 23 2\n## 7 DiA1_20 united 21 2\n## 8 DiA1_28 united 14 2\n## 9 DiA1_50 united 5 2\n## 10 DiA1_22 united 8 2\n## # ℹ 69,771 more rows\nassignments <- assignments %>%\n separate(document, c(\"title\", \"chapter\"), sep = \"_\", convert = TRUE) %>%\n inner_join(tocq_book_topics, by = c(\".topic\" = \"topic\"))\n\nassignments %>%\n count(title, consensus, wt = count) %>%\n group_by(title) %>%\n mutate(percent = n / sum(n)) %>%\n ggplot(aes(consensus, title, fill = percent)) +\n geom_tile() +\n scale_fill_gradient2(high = \"red\", label = percent_format()) +\n geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +\n theme_tufte(base_family = \"Helvetica\") +\n theme(axis.text.x = element_text(angle = 90, hjust = 1),\n panel.grid = element_blank()) +\n labs(x = \"Book words assigned to\",\n y = \"Book words came from\",\n fill = \"% of assignments\")"},{"path":"exercise-5-unsupervised-learning-topic-models.html","id":"validation","chapter":"21 Exercise 5: Unsupervised learning (topic models)","heading":"21.4 Validation","text":"articles Ying, Montgomery, Stewart (2021) Denny Spirling (2018) previous weeks, read potential validation techniques.section, ’ll using preText package mentioned Denny Spirling (2018) see impact different pre-processing choices text. , adapting tutorial Matthew Denny.First need reformat text quanteda corpus object.now ready preprocess different ways. , including n-grams preprocessing text 128 different ways. 
21.4 Validation

In the articles by Ying, Montgomery, and Stewart (2021) and Denny and Spirling (2018) from previous weeks, you read about potential validation techniques. In this section, we'll be using the preText package mentioned in Denny and Spirling (2018) to see the impact of different pre-processing choices on our text. Here, we are adapting a tutorial by Matthew Denny.

First we need to reformat the text as a quanteda corpus object.

# load in corpus of Tocqueville text data.
corp <- corpus(tocq, text_field = "text")
# take a random sample of 1,000 documents for the example
documents <- corp[sample(1:30000,1000)]
# take a look at the document names
print(names(documents[1:10]))

## [1] "text26803" "text25102" "text28867" "text2986" "text1842" "text25718"
## [7] "text3371" "text29925" "text29940" "text29710"

We are now ready to preprocess the text in different ways. Here, by including n-grams, we are preprocessing the text in 128 different ways. This takes about ten minutes to run on a machine with 8GB of RAM.

preprocessed_documents <- factorial_preprocessing(
  documents,
  use_ngrams = TRUE,
  infrequent_term_threshold = 0.2,
  verbose = FALSE)

We can then get the results of the pre-processing, comparing the distance between documents processed in different ways, and plot them accordingly.

preText_results <- preText(
  preprocessed_documents,
  dataset_name = "Tocqueville text",
  distance_method = "cosine",
  num_comparisons = 20,
  verbose = FALSE)

preText_score_plot(preText_results)

21.5 Exercises

1. Choose another book or set of books from Project Gutenberg.
2. Run a topic model on these books, changing k, the number of topics, and evaluating accuracy.
3. Validate different pre-processing techniques using preText on the new book(s) of your choice.

A sketch of how to find and download another book follows below.
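To get started on the first exercise, you can search the metadata bundled with gutenbergr rather than the website. A minimal sketch, in which the title searched for is only an example and the commented-out IDs are placeholders:

# sketch: find the Gutenberg ID(s) for a title, then download
library(gutenbergr)
library(dplyr)
library(stringr)

gutenberg_metadata %>%
  filter(str_detect(title, "Federalist"))   # example search term; use any title

# once you know the ID(s), download as we did for Tocqueville, e.g.:
# newbook <- gutenberg_download(c(ID1, ID2), meta_fields = "author")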
Note many word embedding applications use pre-trained embeddings much larger corpus, generate local embeddings using neural net-based approaches., ’re instead going generate set embeddings word vectors making series calculations based frequencies words appear different contexts. use technique called “Singular Value Decomposition” (SVD). dimensionality reduction technique first axis resulting composition designed capture variance, second second-etc…achieve ?","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"implementation","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.4 Implementation","text":"first thing need get data right format calculate -called “skip-gram probabilties.” go code line line begin understand .’s going ?Well, ’re first unnesting tweet data previous exercises. importantly, , ’re unnesting individual tokens ngrams length 6 , words, postID n words k indexed , take words i1 …i6, take words i2 …i7. Try just running first two lines code see means practice., make unique ID particular ngram create postID, make unique skipgramID postID ngram. unnest words ngram associated skipgramID.can see resulting output .next?Well can now calculate set probabilities skipgrams. pairwise_count() function widyr package. Essentially, function saying: skipgramID count number times word appears another word feature (feature skipgramID). set diag TRUE also want count number times word appears near .probability calculating number times word appears another word denominated total number word pairings across whole corpus.see, example, words vote appear 4099 times together. Denominating total n word pairings (sum(skipgram_probs$n)), gives us probability p. Okay, now skipgram probabilities need get “unigram probabilities” order normalize skipgram probabilities applying singular value decomposition.“unigram probability”? Well, just technical way saying: count appearances given word corpus divide total number words corpus. can :Finally, ’s time normalize skipgram probabilities.take skipgram probabilities, filter word pairings appear twenty times less. rename words “item1” “item2,” merge unigram probabilities words.calculate joint probability skipgram probability divided unigram probability first word pairing divided unigram probability second word pairing. equivalent : P(x,y)/P(x)P(y).essence, interpretation value : “events (words) x y occur together often expect independent”?’ve recovered normalized probabilities, can look joint probabilities given item, .e., word. , look word “brexit” look words highest value “p_together.”Higher values greater 1 indicate words likely appear close ; low values less 1 indicate unlikely appear close . , words, gives indication association two words.Using normalized probabilities, calculate PMI “Pointwise Mutual Information” value, simply log joint probability calculated .Definition time: “PMI logarithm probability finding two words together, normalized probability finding words alone.”cast word pairs sparse matrix values correspond PMI two corresponding words.Notice setting vector size equal 256. just means vector length 256 given word., set numbers used represent word length limited 256. arbitrary can changed. 
Typically, size low hundreds chosen representing word vector.word vectors taken “u” column, left-singular vectors, SVD.","code":"\n#create context window with length 6\ntidy_skipgrams <- twts_sample %>%\n unnest_tokens(ngram, tweet, token = \"ngrams\", n = 6) %>%\n mutate(ngramID = row_number()) %>% \n tidyr::unite(skipgramID, postID, ngramID) %>%\n unnest_tokens(word, ngram)\n\nhead(tidy_skipgrams, n=20)## # A tibble: 20 × 4\n## username party_value skipgramID word \n## \n## 1 kirstysnp Scottish National Party 1_1 in \n## 2 kirstysnp Scottish National Party 1_1 amongst\n## 3 kirstysnp Scottish National Party 1_1 all \n## 4 kirstysnp Scottish National Party 1_1 the \n## 5 kirstysnp Scottish National Party 1_1 horror \n## 6 kirstysnp Scottish National Party 1_1 at \n## 7 kirstysnp Scottish National Party 1_2 amongst\n## 8 kirstysnp Scottish National Party 1_2 all \n## 9 kirstysnp Scottish National Party 1_2 the \n## 10 kirstysnp Scottish National Party 1_2 horror \n## 11 kirstysnp Scottish National Party 1_2 at \n## 12 kirstysnp Scottish National Party 1_2 the \n## 13 kirstysnp Scottish National Party 1_3 all \n## 14 kirstysnp Scottish National Party 1_3 the \n## 15 kirstysnp Scottish National Party 1_3 horror \n## 16 kirstysnp Scottish National Party 1_3 at \n## 17 kirstysnp Scottish National Party 1_3 the \n## 18 kirstysnp Scottish National Party 1_3 notion \n## 19 kirstysnp Scottish National Party 1_4 the \n## 20 kirstysnp Scottish National Party 1_4 horror\n#calculate probabilities\nskipgram_probs <- tidy_skipgrams %>%\n pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>% # diag = T means that we also count when the word appears twice within the window\n mutate(p = n / sum(n))\n\nhead(skipgram_probs[1000:1020,], n=20)## # A tibble: 20 × 4\n## item1 item2 n p\n## \n## 1 no to 4100 0.0000531\n## 2 vote for 4099 0.0000531\n## 3 for vote 4099 0.0000531\n## 4 see the 4078 0.0000528\n## 5 the see 4078 0.0000528\n## 6 having having 4076 0.0000528\n## 7 by of 4065 0.0000527\n## 8 of by 4065 0.0000527\n## 9 this with 4051 0.0000525\n## 10 with this 4051 0.0000525\n## 11 set set 4050 0.0000525\n## 12 right the 4045 0.0000524\n## 13 the right 4045 0.0000524\n## 14 what the 4044 0.0000524\n## 15 going to 4044 0.0000524\n## 16 the what 4044 0.0000524\n## 17 to going 4044 0.0000524\n## 18 evening evening 4035 0.0000523\n## 19 get the 4032 0.0000522\n## 20 the get 4032 0.0000522\n#calculate unigram probabilities (used to normalize skipgram probabilities later)\nunigram_probs <- twts_sample %>%\n unnest_tokens(word, tweet) %>%\n count(word, sort = TRUE) %>%\n mutate(p = n / sum(n))\n#normalize skipgram probabilities\nnormalized_prob <- skipgram_probs %>%\n filter(n > 20) %>% #filter out skipgrams with n <=20\n rename(word1 = item1, word2 = item2) %>%\n left_join(unigram_probs %>%\n select(word1 = word, p1 = p),\n by = \"word1\") %>%\n left_join(unigram_probs %>%\n select(word2 = word, p2 = p),\n by = \"word2\") %>%\n mutate(p_together = p / p1 / p2)\n\nnormalized_prob %>% \n filter(word1 == \"brexit\") %>%\n arrange(-p_together)## # A tibble: 1,016 × 7\n## word1 word2 n p p1 p2 p_together\n## \n## 1 brexit scotlandsplaceineurope 37 0.000000479 0.00278 0.00000186 92.6\n## 2 brexit preparedness 22 0.000000285 0.00278 0.00000149 68.8\n## 3 brexit dividend 176 0.00000228 0.00278 0.0000127 64.8\n## 4 brexit brexit 38517 0.000499 0.00278 0.00278 64.6\n## 5 brexit softer 50 0.000000648 0.00278 0.00000410 56.9\n## 6 brexit botched 129 0.00000167 0.00278 0.0000153 39.4\n## 7 brexit impasse 53 
0.000000687 0.00278 0.00000820 30.1\n## 8 brexit smooth 30 0.000000389 0.00278 0.00000596 23.5\n## 9 brexit frustrate 28 0.000000363 0.00278 0.00000559 23.4\n## 10 brexit deadlock 120 0.00000155 0.00278 0.0000246 22.8\n## # ℹ 1,006 more rows\npmi_matrix <- normalized_prob %>%\n mutate(pmi = log10(p_together)) %>%\n cast_sparse(word1, word2, pmi)\n\n#remove missing data\npmi_matrix@x[is.na(pmi_matrix@x)] <- 0\n#run SVD\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\n\nglimpse(pmi_matrix)## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65173 -0.01915 -0.00911 0.26937 -0.52456 ...\n## ..@ factors : list()\n#next we output the word vectors:\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\n\ndim(word_vectors)## [1] 21172 256"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"exploration","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.5 Exploration","text":"can define simple function take word vector, find similar words, nearest neighbours, given word:","code":"\nnearest_words <- function(word_vectors, word){\n selected_vector = word_vectors[word,]\n mult = as.data.frame(word_vectors %*% selected_vector) #dot product of selected word vector and all word vectors\n \n mult %>%\n rownames_to_column() %>%\n rename(word = rowname,\n similarity = V1) %>%\n anti_join(get_stopwords(language = \"en\")) %>%\n arrange(-similarity)\n\n}\n\nboris_synonyms <- nearest_words(word_vectors, \"boris\")## Joining with `by = join_by(word)`\nbrexit_synonyms <- nearest_words(word_vectors, \"brexit\")## Joining with `by = join_by(word)`\nhead(boris_synonyms, n=10)## word similarity\n## 1 johnson 0.10309556\n## 2 boris 0.09940448\n## 3 jeremy 0.04823204\n## 4 trust 0.04800155\n## 5 corbyn 0.04102031\n## 6 farage 0.03973588\n## 7 trump 0.03938184\n## 8 can.t 0.03533624\n## 9 says 0.03324624\n## 10 word 0.03267437\nhead(brexit_synonyms, n=10)## word similarity\n## 1 brexit 0.38737979\n## 2 deal 0.15083433\n## 3 botched 0.05003683\n## 4 tory 0.04377030\n## 5 unleash 0.04233445\n## 6 impact 0.04139872\n## 7 theresa 0.04017608\n## 8 approach 0.03970233\n## 9 handling 0.03901461\n## 10 orderly 0.03897535\n#then we can visualize\nbrexit_synonyms %>%\n mutate(selected = \"brexit\") %>%\n bind_rows(boris_synonyms %>%\n mutate(selected = \"boris\")) %>%\n group_by(selected) %>%\n top_n(15, similarity) %>%\n mutate(token = reorder(word, similarity)) %>%\n filter(token!=selected) %>%\n ggplot(aes(token, similarity, fill = selected)) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~selected, scales = \"free\") +\n scale_fill_manual(values = c(\"#336B87\", \"#2A3132\")) +\n coord_flip() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"glove-embeddings","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.6 GloVe Embeddings","text":"section adapts tutorials Pedro Rodriguez Dmitriy Selivanov Wouter van Gils .","code":""},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"glove-algorithm","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.7 GloVe 
algorithm","text":"section taken text2vec package page .GloVe algorithm pennington_glove_2014 consists following steps:Collect word co-occurence statistics form word co-ocurrence matrix \\(X\\). element \\(X_{ij}\\) matrix represents often word appears context word j. Usually scan corpus following manner: term look context terms within area defined window_size term window_size term. Also give less weight distant words, usually using formula: \\[decay = 1/offset\\]Collect word co-occurence statistics form word co-ocurrence matrix \\(X\\). element \\(X_{ij}\\) matrix represents often word appears context word j. Usually scan corpus following manner: term look context terms within area defined window_size term window_size term. Also give less weight distant words, usually using formula: \\[decay = 1/offset\\]Define soft constraints word pair: \\[w_i^Tw_j + b_i + b_j = log(X_{ij})\\] \\(w_i\\) - vector main word, \\(w_j\\) - vector context word, \\(b_i\\), \\(b_j\\) scalar biases main context words.Define soft constraints word pair: \\[w_i^Tw_j + b_i + b_j = log(X_{ij})\\] \\(w_i\\) - vector main word, \\(w_j\\) - vector context word, \\(b_i\\), \\(b_j\\) scalar biases main context words.Define cost function\n\\[J = \\sum_{=1}^V \\sum_{j=1}^V \\; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \\log X_{ij})^2\\]\n\\(f\\) weighting function help us prevent learning extremely common word pairs. GloVe authors choose following function:Define cost function\n\\[J = \\sum_{=1}^V \\sum_{j=1}^V \\; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \\log X_{ij})^2\\]\n\\(f\\) weighting function help us prevent learning extremely common word pairs. GloVe authors choose following function:\\[\nf(X_{ij}) =\n\\begin{cases}\n(\\frac{X_{ij}}{x_{max}})^\\alpha & \\text{} X_{ij} < XMAX \\\\\n1 & \\text{otherwise}\n\\end{cases}\n\\]go implementing algorithm R?Let’s first make sure loaded packages need:","code":"\nlibrary(text2vec) # for implementation of GloVe algorithm\nlibrary(stringr) # to handle text strings\nlibrary(umap) # for dimensionality reduction later on"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"implementation-1","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.8 Implementation","text":"need set choice parameters GloVe model. first window size WINDOW_SIZE, , , arbitrary normally set around 6-8. means looking word context words 6 words around target word. image illustrates choice parameter word “cat” given sentence, increase context window size:ultimately understood matrix format :iterations parameter ITERS simply sets maximum number iterations allow model convergence. number iterations relatively high model likely converge 100 iterations.DIM parameter specifies length word vector want result (.e., just set limit 256 SVD approach ). Finally, COUNT_MIN specifying minimum count words want keep. words, word appears fewer ten times, discarded. , discarded word pairings appeared fewer twenty times.next “shuffle” text. just means randomly reordering character vector tweets.create list object, tokenizing text tweet within item list. , create vocabulary object needed implement GloVe algorithm. creating “itoken” object itoken() creating vocabulary create_vocabulary. remove words exceed specified threshold prune_vocabulary().Next vectorize vocabulary create term co-occurrence matrix. , similar created matrix PMIs word pairings corpus.set final model parameters, learning rate fit model. whole process take time. 
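The weighting function f defined above is simple enough to inspect directly. The snippet below is a standalone illustration rather than code taken from the text2vec internals; it uses x_max = 100 and alpha = 0.75, the values suggested in the original GloVe paper and matching the x_max passed to the model in the code below:

# Illustrative implementation of the GloVe weighting function f(X_ij):
# rare co-occurrences are down-weighted, frequent ones are capped at 1.
glove_weight <- function(x, x_max = 100, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

glove_weight(c(1, 10, 100, 1000))
# 0.0316 0.1778 1.0000 1.0000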
save time working tutorial, may also download resulting embedding Github repo linked little .Finally, get resulting word embedding save .rds file.save time working tutorial, may also download resulting embedding Github repo :","code":"\n# ================================ choice parameters\n# ================================\nWINDOW_SIZE <- 6\nDIM <- 300\nITERS <- 100\nCOUNT_MIN <- 10\n# shuffle text\nset.seed(42L)\ntext <- sample(twts_sample$tweet)\n# ================================ create vocab ================================\ntokens <- space_tokenizer(text)\nit <- itoken(tokens, progressbar = FALSE)\nvocab <- create_vocabulary(it)\nvocab_pruned <- prune_vocabulary(vocab, term_count_min = COUNT_MIN) # keep only words that meet count threshold\n# ================================ create term co-occurrence matrix\n# ================================\nvectorizer <- vocab_vectorizer(vocab_pruned)\ntcm <- create_tcm(it, vectorizer, skip_grams_window = WINDOW_SIZE, skip_grams_window_context = \"symmetric\", \n weights = rep(1, WINDOW_SIZE))\n# ================================ set model parameters\n# ================================\nglove <- GlobalVectors$new(rank = DIM, x_max = 100, learning_rate = 0.05)\n\n# ================================ fit model ================================\nword_vectors_main <- glove$fit_transform(tcm, n_iter = ITERS, convergence_tol = 0.001, \n n_threads = RcppParallel::defaultNumThreads())\n# ================================ get output ================================\nword_vectors_context <- glove$components\nglove_embedding <- word_vectors_main + t(word_vectors_context) # word vectors\n\n# ================================ save ================================\nsaveRDS(glove_embedding, file = \"local_glove.rds\")\nurl <- \"https://github.com/cjbarrie/CTA-ED/blob/main/data/wordembed/local_glove.rds?raw=true\"\nglove_embedding <- readRDS(url(url, method=\"libcurl\"))"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"visualization","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.9 Visualization","text":"explore embeddings? Well, imagine embeddings look something dissimilar visualization another embedding . words, talking something doesn’t lend projection 2D space!…hope lost, space travellers. smart technique McInnes, Healy, Melville (2020) linked describes way reduce dimensionality embedding layers using called “Uniform Manifold Approximation Projection.” ? Well, happily, umap package pretty straightforward!helpful? Well, number reasons, particularly helpful visualizing embeddings two-dimensional space.can see, , embeddings seem make sense. zoomed first little outgrowth 2D mapping, seemed correspond numbers number words. 
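The fitted embedding can also be probed directly with vector arithmetic, as a complement to the plots described here. A sketch, assuming the glove_embedding object and using sim2() as in the code below; the word choices are illustrative and the results are corpus-dependent, so they are not guaranteed to be clean:

# If the embedding captures the first-name/surname relation, the vector
# boris - johnson + corbyn should land near "jeremy".
vec <- glove_embedding["boris", , drop = FALSE] -
  glove_embedding["johnson", , drop = FALSE] +
  glove_embedding["corbyn", , drop = FALSE]

cos_sim <- sim2(x = glove_embedding, y = vec, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)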
looked words around “economy” see related terms like “growth” “jobs.”","code":"\n# GloVe dimension reduction\nglove_umap <- umap(glove_embedding, n_components = 2, metric = \"cosine\", n_neighbors = 25, min_dist = 0.1, spread=2)\n# Put results in a dataframe for ggplot\ndf_glove_umap <- as.data.frame(glove_umap[[\"layout\"]])\n\n# Add the labels of the words to the dataframe\ndf_glove_umap$word <- rownames(df_glove_umap)\ncolnames(df_glove_umap) <- c(\"UMAP1\", \"UMAP2\", \"word\")\n\n# Plot the UMAP dimensions\nggplot(df_glove_umap) +\n geom_point(aes(x = UMAP1, y = UMAP2), colour = 'blue', size = 0.05) +\n ggplot2::annotate(\"rect\", xmin = -3, xmax = -2, ymin = 5, ymax = 7,alpha = .2) +\n labs(title = \"GloVe word embedding in 2D using UMAP\")\n# Plot the shaded part of the GloVe word embedding with labels\nggplot(df_glove_umap[df_glove_umap$UMAP1 < -2.5 & df_glove_umap$UMAP1 > -3 & df_glove_umap$UMAP2 > 5 & df_glove_umap$UMAP2 < 6.5,]) +\n geom_point(aes(x = UMAP1, y = UMAP2), colour = 'blue', size = 2) +\n geom_text(aes(UMAP1, UMAP2, label = word), size = 2.5, vjust=-1, hjust=0) +\n labs(title = \"GloVe word embedding in 2D using UMAP - partial view\") +\n theme(plot.title = element_text(hjust = .5, size = 14))\n# Plot the word embedding of words that are related for the GloVe model\nword <- glove_embedding[\"economy\",, drop = FALSE]\ncos_sim = sim2(x = glove_embedding, y = word, method = \"cosine\", norm = \"l2\")\nselect <- data.frame(rownames(as.data.frame(head(sort(cos_sim[,1], decreasing = TRUE), 25))))\ncolnames(select) <- \"word\"\nselected_words <- df_glove_umap %>% \n inner_join(y=select, by= \"word\")\n\n#The ggplot visual for GloVe\nggplot(selected_words, aes(x = UMAP1, y = UMAP2)) + \n geom_point(show.legend = FALSE) + \n geom_text(aes(UMAP1, UMAP2, label = word), show.legend = FALSE, size = 2.5, vjust=-1.5, hjust=0) +\n labs(title = \"GloVe word embedding of words related to 'economy'\") +\n theme(plot.title = element_text(hjust = .5, size = 14))"},{"path":"exercise-6-unsupervised-learning-word-embedding.html","id":"exercises-5","chapter":"22 Exercise 6: Unsupervised learning (word embedding)","heading":"22.10 Exercises","text":"Inspect visualize nearest neighbour synonyms relevant words tweets corpusIdentify another region interest GloVe-trained model visualize","code":""},{"path":"exercise-7-sampling-text-information.html","id":"exercise-7-sampling-text-information","chapter":"23 Exercise 7: Sampling text information","heading":"23 Exercise 7: Sampling text information","text":"","code":""},{"path":"exercise-7-sampling-text-information.html","id":"introduction-6","chapter":"23 Exercise 7: Sampling text information","heading":"23.1 Introduction","text":"hands-exercise week focuses collect /sample text information.tutorial, learn :Access text information online corporaQuery text information using different APIsScrape text information programmaticallyTranscribe text information audioExtract text information images","code":""},{"path":"exercise-7-sampling-text-information.html","id":"online-corpora","chapter":"23 Exercise 7: Sampling text information","heading":"23.2 Online corpora","text":"","code":""},{"path":"exercise-7-sampling-text-information.html","id":"replication-datasets","chapter":"23 Exercise 7: Sampling text information","heading":"23.2.1 Replication datasets","text":"large numbers online corpora replication datasets available access freely online. 
first access example using dataverse package R, allows us download directly replication data repositories stored Harvard Dataverse.Let’s take example dataset might interested: UK parliamentary speech data fromWe first need set en environment variable .can search files want specifying DOI publication data question. can find series numbers letters come “https://doi.org/” shown .choose get UK data files, listed “UK_data.csv.” can download directly following way (take time file size >1GB).course, also download data manually, clicking buttons relevant Harvard Dataverse—sometimes useful build every step data collection code documentation, making analysis entirely programatically reproducible start finish.Note well don’t search specific datasets already know . can also use dataverse package search datasets dataverses. can simply following way.","code":"\nlibrary(dataverse)\nlibrary(dplyr)\nSys.setenv(\"DATAVERSE_SERVER\" = \"dataverse.harvard.edu\")\ndataset <- get_dataset(\"10.7910/DVN/QDTLYV\")\ndataset$files[c(\"filename\", \"contentType\")]## filename\n## 1 1-uk.do\n## 2 2-ireland.do\n## 3 3-word_clouds.py\n## 4 4-trends.R\n## 5 5-predictive_margins.R\n## 6 6-barplot_topics.R\n## 7 7-plot_media.R\n## 8 8-histogram.R\n## 9 commons_stats.tab\n## 10 emotive_cloud.tab\n## 11 emotive_ireland.tab\n## 12 emotive_uk.tab\n## 13 ireland_data.csv\n## 14 neutral_cloud.tab\n## 15 neutral_ireland.tab\n## 16 neutral_uk.tab\n## 17 README.docx\n## 18 uk_data.csv\n## contentType\n## 1 application/x-stata-syntax\n## 2 application/x-stata-syntax\n## 3 text/x-python\n## 4 type/x-r-syntax\n## 5 type/x-r-syntax\n## 6 type/x-r-syntax\n## 7 type/x-r-syntax\n## 8 type/x-r-syntax\n## 9 text/tab-separated-values\n## 10 text/tab-separated-values\n## 11 text/tab-separated-values\n## 12 text/tab-separated-values\n## 13 text/csv\n## 14 text/tab-separated-values\n## 15 text/tab-separated-values\n## 16 text/tab-separated-values\n## 17 application/vnd.openxmlformats-officedocument.wordprocessingml.document\n## 18 text/csv\ndata <- get_dataframe_by_name(\n \"uk_data.csv\",\n \"10.7910/DVN/QDTLYV\",\n .f = function(x) read.delim(x, sep = \",\"))\nsearch_results <- dataverse_search(\"corpus politics text\", type = \"dataset\", per_page = 10)## 10 of 37533 results retrieved\nsearch_results[,1:3]## name\n## 1 \"A Deeper Look at Interstate War Data: Interstate War Data Version 1.1\"\n## 2 \"Birth Legacies, State Making, and War.\"\n## 3 \"CBS Morning News\" Shopping Habits and Lifestyles Poll, January 1989\n## 4 \"Cuadro histórico del General Santa Anna. 2a. 
parte,\" 1857\n## 5 \"Don't Know\" Means \"Don't Know\": DK Responses and the Public's Level of Political Knowledge\n## 6 \"El déspota Santa-Anna ante los veteranos de la Independencia,\" 1844 Diciembre 09\n## 7 \"European mood\" bi-annual data, EU27 member states (1973-2014), Replication Data\n## 8 \"Government Partisanship and Electoral Accountability\" Political Research Quarterly 72(3): 727-743\n## 9 \"I Didn't Lie, I Misspoke\": Voters' Responses to Questionable Campaign Claims\n## 10 \"I Hope to Hell Nothing Goes Back to The Way It Was Before\": COVID-19, Marginalization, and Native Nations\n## type url\n## 1 dataset https://doi.org/10.7910/DVN/E2CEP5\n## 2 dataset https://doi.org/10.7910/DVN/EP7DXB\n## 3 dataset https://doi.org/10.3886/ICPSR09230.v1\n## 4 dataset https://doi.org/10.18738/T8/Z0JH2C\n## 5 dataset https://doi.org/10.7910/DVN/G9NOQO\n## 6 dataset https://doi.org/10.18738/T8/U71QSD\n## 7 dataset https://doi.org/10.7910/DVN/V42M9J\n## 8 dataset https://doi.org/10.7910/DVN/5OG9VV\n## 9 dataset https://doi.org/10.7910/DVN/GE3E8R\n## 10 dataset https://doi.org/10.7910/DVN/Y916NP"},{"path":"exercise-7-sampling-text-information.html","id":"curated-corpora","chapter":"23 Exercise 7: Sampling text information","heading":"23.2.2 Curated corpora","text":", course, many sources might go text information. list might interest :Large English-language corpora: https://www.corpusdata.org/Wikipedia data dumps: https://meta.wikimedia.org/wiki/Data_dumps\nEnglish version dumps \nEnglish version dumps hereScottish Corpus Texts & Speech: https://www.scottishcorpus.ac.uk/Corpus Scottish modern writing: https://www.scottishcorpus.ac.uk/cmsw/Manifesto Corpus: https://manifesto-project.wzb.eu/information/documents/corpusReddit Pushshift data: https://files.pushshift.io/reddit/Mediacloud: https://mediacloud.org/\nR package: https://github.com/joon-e/mediacloud\nR package: https://github.com/joon-e/mediacloudFeel free recommend sources add list, intended growing index relevant text corpora social science research!","code":""},{"path":"exercise-7-sampling-text-information.html","id":"using-apis","chapter":"23 Exercise 7: Sampling text information","heading":"23.3 Using APIs","text":"order use YouTube API, ’ll first need get authorization token. can obtained anybody, without academic profile (.e., unlike academictwitteR) previous worksheets.order get authorization credentials, can follow guide. need account Google Cloud console order . main three steps :create “Project” Google Cloud console;associate YouTube API Project;enable API keys APIOnce created Project (: called “tuberalt1” case) see landing screen like .can get credentials navigating menu left hand side selecting credentials:Now click name project (“tuberalt1”) taken page containing two pieces information: “client ID” “client secret”.client ID referred “app ID” tuber packaage client secret “app secret” mentioned tuber package.credentials, can log R environment yt_oauth function tuber package. function takes two arguments: “app ID” “app secret”. provided associated YouTube API Google Cloud console project.","code":""},{"path":"exercise-7-sampling-text-information.html","id":"getting-youtube-data","chapter":"23 Exercise 7: Sampling text information","heading":"23.4 Getting YouTube data","text":"paper (haroon2022?), authors analyze recommended videos particular used based watch history seed video. 
, won’t replicate first step look recommended videos appear based seed video.case, seed video video Jordan Peterson predicting death mainstream media. fairly “alternative” content actively taking stance mainstream media. mean YouTube learn recommend us away mainstream content?, first take unique identifying code string video. can find url video shown .can collect videos recommended basis video seed video. store data.frame object rel_vids.can look recommended videos basis seed video .seems YouTube recommends us back lot videos relating Jordan Peterson. mainstream outlets; others obscure sources.","code":"\nlibrary(tidyverse)\nlibrary(readxl)\ndevtools::install_github(\"soodoku/tuber\") # need to install development version is there is problem with CRAN versions of the package functions\nlibrary(tuber)\n\nyt_oauth(\"431484860847-1THISISNOTMYREALKEY7jlembpo3off4hhor.apps.googleusercontent.com\",\"2niTHISISMADEUPTOO-l9NPUS90fp\")\n\n#get related videos\nstartvid <- \"1Gp7xNnW5n8\"\nrel_vids <- get_related_videos(startvid, max_results = 50, safe_search = \"none\")"},{"path":"exercise-7-sampling-text-information.html","id":"questions","chapter":"23 Exercise 7: Sampling text information","heading":"23.5 Questions","text":"Make request YouTube API different seed video.Make request YouTube API different seed video.Collect one video ID channels included resulting dataCollect one video ID channels included resulting dataWrite loop collect recommended videos video IDsWrite loop collect recommended videos video IDs","code":""},{"path":"exercise-7-sampling-text-information.html","id":"other-apis-r-packages","chapter":"23 Exercise 7: Sampling text information","heading":"23.5.1 Other APIs (R packages)","text":"https://cran.r-project.org/web/packages/manifestoR/index.htmlhttps://cran.r-project.org/web/packages/academictwitteR/index.htmlhttps://cran.r-project.org/web/packages/vkR/vkR.pdf","code":""},{"path":"exercise-7-sampling-text-information.html","id":"scraping","chapter":"23 Exercise 7: Sampling text information","heading":"23.6 Scraping","text":"practice skill, use series webpages Internet Archive host material collected Arab Spring protests Egypt 2011. original website can seen .proceeding, ’ll load remaining packages need tutorial.can download final dataset produce :can also view formatted output scraping exercise, alongside images documents question, Google Sheets .’re working document computer (“locally”) can download Tahrir documents data following way:Let’s look end producing:going return Internet Archived webpages see can produce final formatted dataset. archived Tahrir Documents webpages can accessed .first want expect contents webpage stored.scroll bottom page, see listed number hyperlinks documents stored month:click documents stored March click top listed pamphlet entitled “Season Anger Sets Among Arab Peoples.” can access .store url inspect HTML contains follows:Well, isn’t particularly useful. Let’s now see can extract text contained inside.Well looks pretty terrifying now…need way quickly identifying relevant text can specify scraping. widely-used tool achieve “Selector Gadget” Chrome Extension. can add browser free .tool works allowing user point click elements webpage (“CSS selectors”). Unlike alternatives, “Inspect Element” browser tools, easily able see webpage item contained within CSS selectors (rather HTML tags alone), easier parse.can Tahrir documents :now know main text translated document contained “p” HTML tags. 
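The next paragraphs walk through that extraction step; for reference, a minimal sketch, assuming the html object created with read_html(url) in the code below:

# Keep only the elements matching the "p" CSS selector, then strip the markup
pamph_text <- html %>%
  html_elements("p") %>%
  html_text()

head(pamph_text)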
identify text HTML tags can run:, looks quite lot manageable…!happening ? Essentially, html_elements() function scanning page collecting HTML elements contained tags, collect using “p” CSS selector. just grabbing text contained part page html_text() function.gives us one way capturing text, wanted get elements document, example date tags attributed document? Well can thing . Let’s take example getting date:see date identified “.calendar” CSS selector enter html_elements() function :course, well good, also need way scale—can’t just keep repeating process every page find wouldn’t much quicker just copy pasting. can ? Well need first understand URL structure website question.scroll page see listed number documents. directs individual pamphlet distributed protests 2011 Egyptian Revolution.Click one see URL changes.see starting URL :click March 2011, first month documents, see url becomes:, August 2011 becomes:, January 2012 becomes:notice month, URL changes addition month year back slashes end URL. next section, go efficiently create set URLs loop retrieve information contained individual webpage.going want retrieve text documents archived month. , first task store webpages series strings. manually , example, pasting year month strings end URL month March, 2011 January, 2012:wouldn’t particularly efficient…Instead, can wrap loop.’s going ? Well, first specifying starting URL . iterating numbers 3 13. telling R take new URL , depending number loop , take base starting url— https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/ — paste end string “2011/0”, number loop , “/”. , first “” loop—number 3—effectively calling equivalent :gives:, ifelse() commands simply telling R: (number loop ) less 10 paste0(url,\"2011/0\",,\"/\"); .e., less 10 paste “2011/0”, “” “/”. number 3 becomes:\"https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/03/\", number 4 becomes\"https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/04/\", however, >=10 & <=12 (greater equal 10 less equal 12) calling paste0(url,\"2011/\",,\"/\") need first “0” months.Finally, (else) greater 12 calling paste0(url,\"2012/01/\"). last call, notice, specify whether greater equal 12 wrapping everything ifelse() commands. ifelse() calls like , telling R x “meets condition” y, otherwise z. wrapping multiple ifelse() calls within , effectively telling R x “meets condition” y, x “meets condition” z, otherwise . , “otherwise ” part ifelse() calls saying: less 10, 10 12, paste “2012/01/” end URL.Got ? didn’t even get first reading… wrote . best way understand going run code look part .now list URLs month. next?Well go onto page particular month, let’s say March, see page multiple paginated tabs bottom. Let’s see happens URL click one :see starting point URL March, , :click page 2 becomes:page 3 becomes:can see pretty clearly navigate page, appears appended URL string “page/2/” “page/3/”. shouldn’t tricky add list URLs. want avoid manually click archive month figure many pagination tabs bottom page.Fortunately, don’t . Using “Selector Gadget” tool can automate process grabbing highest number appears pagination bar month’s pages. code achieves :’s going ? Well, first two lines, simply creating empty character string ’re going populate subsequent loop. 
Remember set eleven starting URLs months archived webpage.code beginning (seq_along(files) saying, similar , beginning url end url, following loop: first, read url url <- urls[] read html contains html <- read_html(url).line, getting pages character vector page numbers calling html_elements() function “.page” tag. gives series pages stored e.g. “1” “2” “3”.order able see many , need extract highest number appears string. , first need reformat “integer” object rather “character” object R can recognize numbers. call pageints <- .integer(pages). get maximum simply calling: npages <- max(pageints, na.rm = T).next part loop, taking new information stored “npages,” .e., number pagination tabs month, telling R: pages, define new url adding “page/” number pagination tab “j”, “/”. ’ve bound together, get list URLs look like :next?next step get URLs documents contained archive month. ? Well, can use “Selector Gadget” tool work . main landing pages month, see listed, , document list. documents, see title, links revolutionary leaflet question, two CSS selectors: “h2” “.post”.can pass tags html_elements() grab ’s contained inside. can grab ’s contained inside extracting “children” classes. essence, just means lower level tag: tags can tags within tags flow downwards like family tree (hence name, suppose).one “children” HTML tag link contained inside, can get calling html_children() followed specifying want specific attribute web link encloses html_attr(\"href\"). subsequent lines just remove extraneous information.complete loop, , retrieve URL page every leaflet contained website :gives us:see now collected 523 separate URLs every revolutionary leaflet contained pages. Now ’re great position able crawl page collect information need. final loop need go URL ’re interested collect relevant information document text, title, date, tags, URL image revolutionary literature .See can work part fitting together. 
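A sketch of the shape such a loop can take is below. The object name doc_urls is a stand-in for the vector of 523 pamphlet URLs collected above, and the selectors follow those identified earlier with Selector Gadget ("p" for the body text, ".calendar" for the date); the full version used for the course also collects the tags and image URLs:

# Hypothetical final scraping loop (selector and object names are assumptions)
pamphdata_scraped <- data.frame()

for (doc_url in doc_urls) {
  html <- read_html(doc_url)
  title <- (html %>% html_elements("h2") %>% html_text())[1]
  date <- (html %>% html_elements(".calendar") %>% html_text())[1]
  text <- html %>% html_elements("p") %>% html_text() %>% paste(collapse = " ")
  pamphdata_scraped <- rbind(pamphdata_scraped,
                             data.frame(title = title, date = date, text = text,
                                        url = doc_url, stringsAsFactors = FALSE))
}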
NOTE: want run final loop machines take several hours complete.now… ’re pretty much …back started!","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(stringr) # to handle text elements\nlibrary(rvest) #for scraping\npamphdata <- read_csv(\"data/sampling/pamphlets_formatted_gsheets.csv\")## Rows: 523 Columns: 8\n## ── Column specification ─────────────────────────────────────────────────────────\n## Delimiter: \",\"\n## chr (6): title, text, tags, imageurl, imgID, image\n## dbl (1): year\n## date (1): date\n## \n## ℹ Use `spec()` to retrieve the full column specification for this data.\n## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\npamphdata <- read_csv(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sampling/pamphlets_formatted_gsheets.csv\")\nhead(pamphdata)## # A tibble: 6 × 8\n## title date year text tags imageurl imgID image\n## \n## 1 The Season of Anger Sets in … 2011-03-30 2011 The … Soli… https:/… imgI… =Arr…\n## 2 The Most Important Workers’ … 2011-03-30 2011 [Voi… Soli… https:/… imgI… \n## 3 Yes it’s the Workers’ and Em… 2011-03-30 2011 [Voi… Soli… https:/… imgI… \n## 4 The Revolution is Still Ongo… 2011-03-30 2011 [Voi… Revo… https:/… imgI… \n## 5 Voice of the Revolution, #3 2011-03-30 2011 Febr… Revo… https:/… imgI… \n## 6 We Are Still Continuing Unti… 2011-03-29 2011 We A… Dema… https:/… imgI… \nurl <- \"https://wayback.archive-it.org/2358/20120130161341/http://www.tahrirdocuments.org/2011/03/voice-of-the-revolution-3-page-2/\"\n\nhtml <- read_html(url)\n\nhtml## {html_document}\n## \n## [1] \\n NewFile --> RScript"},{"path":"introduction-to-r.html","id":"a-simple-example","chapter":"Introduction to R","heading":"0.5 A simple example","text":"Script (top left) write commands R. can try first time writing small snipped code follows:tell R run command, highlight relevant row script click Run button (top right Script) - hold ctrl+enter Windows cmd+enter Mac - send command Console (bottom left), actual evaluation calculations taking place. shortcut keys become familiar quickly!Running command creates object named ‘x’, contains words message.can now see ‘x’ Environment (top right). view contained x, type Console (bottom left):","code":"\nx <- \"I can't wait to learn Computational Text Analysis\" #Note the quotation marks!\nprint(x)## [1] \"I can't wait to learn Computational Text Analysis\"\n# or alternatively you can just type:\n\nx## [1] \"I can't wait to learn Computational Text Analysis\""},{"path":"introduction-to-r.html","id":"loading-packages","chapter":"Introduction to R","heading":"0.6 Loading packages","text":"‘base’ version R powerful able everything , least ease. technical specialized forms analysis, need load new packages.need install -called ‘package’—program includes new tools (.e., functions) carry specific tasks. can think ‘extensions’ enhancing R’s capacities.take one example, might want something little exciting print excited course. Let’s make map instead.might sound technical. beauty packaged extensions R contain functions perform specialized types analysis ease.’ll first need install one packages, can :package installed, need load environment typing library(). Note , , don’t need wrap name package quotation marks. trick:now? 
Well, let’s see just easy visualize data using ggplot package comes bundled larger tidyverse package.wanted save ’d got making plots, want save scripts, maybe data used well, return later stage.","code":"\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\nggplot(data = mpg) + \n geom_point(mapping = aes(x = displ, y = hwy))"},{"path":"introduction-to-r.html","id":"saving-your-objects-plots-and-scripts","chapter":"Introduction to R","heading":"0.7 Saving your objects, plots and scripts","text":"Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving scripts: save script RStudio (.e. top left panel), need click File –> Save (choose name script). script something like: myfilename.R.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.Saving plots: made plots like save, click Export (plotting pane) choose relevant file extension (e.g. .png, .pdf, etc.) size.save individual objects (example x ) environment, run following command (choosing suitable filename):save individual objects (example x ) environment, run following command (choosing suitable filename):save objects (.e. everything top right panel) , run following command (choosing suitable filename):objects can re-loaded R next session running:many file formats might use save output. encounter course progresses.","code":"\nsave(x,file=\"myobject.RData\")\nload(file=\"myobject.RData\")\nsave.image(file=\"myfilname.RData\")\nload(file=\"myfilename.RData\")"},{"path":"introduction-to-r.html","id":"knowing-where-r-saves-your-documents","chapter":"Introduction to R","heading":"0.8 Knowing where R saves your documents","text":"home, open new script make sure check set working directory (.e. folder files create saved). check working directory use getwd() command (type Console write script Source Editor):set working directory, run following command, substituting file directory choice. Remember anything following `#’ symbol simply clarifying comment R process .","code":"\ngetwd()\n## Example for Mac \nsetwd(\"/Users/Documents/mydir/\") \n## Example for PC \nsetwd(\"c:/docs/mydir\") "},{"path":"introduction-to-r.html","id":"practicing-in-r","chapter":"Introduction to R","heading":"0.9 Practicing in R","text":"best way learn R use . workshops text analysis place become fully proficient R. , however, chance conduct hands-analysis applied examples fast-expanding field. best way learn . 
give shot!practice R programming language, look Wickham Grolemund (2017) , tidy text analysis, Silge Robinson (2017).free online book Hadley Wickham “R Data Science” available hereThe free online book Hadley Wickham “R Data Science” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereThe free online book Julia Silge David Robinson “Text Mining R” available hereFor practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:practice R, may want consult set interactive tutorials, available package “learnr.” ’ve installed package, can go tutorials calling:","code":"\nlibrary(learnr)\n\navailable_tutorials() # this will tell you the names of the tutorials available\n\nrun_tutorial(name = \"ex-data-basics\", package = \"learnr\") #this will launch the interactive tutorial in a new Internet browser window"},{"path":"introduction-to-r.html","id":"one-final-note","chapter":"Introduction to R","heading":"0.10 One final note","text":"’ve dipped “R Data Science” book ’ll hear lot -called tidyverse R. essentially set packages use alternative, intuitive, way interacting data.main difference ’ll notice , instead separate lines function want run, wrapping functions inside functions, sets functions “piped” using “pipe” functions, look appearance: %>%.using “tidy” syntax weekly exercises computational text analysis workshops. anything unclear, can provide equivalents “base” R . lot useful text analysis packages now composed ‘tidy’ syntax.","code":""},{"path":"week-1-retrieving-and-analyzing-text.html","id":"week-1-retrieving-and-analyzing-text","chapter":"1 Week 1: Retrieving and analyzing text","heading":"1 Week 1: Retrieving and analyzing text","text":"first task conducting large-scale text analyses gathering curating text information . focus chapters Manning, Raghavan, Schtze (2007) listed . , ’ll find introduction different ways can reformat ‘query’ text data order begin asking questions . often referred computer science natural language processing contexts “information retrieval” foundation many search, including web search, processes.articles Tatman (2017) Pechenick, Danforth, Dodds (2015) focus seminar (Q&). articles get us thinking fundamentals text discovery sampling. reading articles think locating texts, sampling , biases might inhere sampling process, texts represent; .e., population phenomenon interest might provide inferences.Questions seminar:access text? need consider ?sample texts?biases need keep mind?Required reading:Tatman (2017)Tatman (2017)Pechenick, Danforth, Dodds (2015)Pechenick, Danforth, Dodds (2015)Manning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlManning, Raghavan, Schtze (2007) (chs.1 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.htmlKlaus Krippendorff (2004) (ch. 6)Klaus Krippendorff (2004) (ch. 6)reading:Olteanu et al. (2019)Biber (1993)Barberá Rivero (2015)Slides:Week 1 Slides","code":""},{"path":"week-2-tokenization-and-word-frequencies.html","id":"week-2-tokenization-and-word-frequencies","chapter":"2 Week 2: Tokenization and word frequencies","heading":"2 Week 2: Tokenization and word frequencies","text":"approaching large-scale quantiative analyses text, key task identify capture unit analysis. One commonly used approaches, across diverse analytical contexts, text tokenization. 
, splitting text word units: unigrams, bigrams, trigrams etc.chapters Manning, Raghavan, Schtze (2007), listed , provide technical introduction task “querying” text according different word-based queries. task studying hands-assignment week.seminar discussion, focusing widely-cited examples research applied social sciences employing token-based, word frequency, analyses large corpora. first, Michel et al. (2011) uses enormous Google books corpus measure cultural linguistic trends. second, Bollen et al. (2021a) uses corpus demonstrate specific change time—-called “cognitive distortion.” examples, attentive questions sampling covered previous weeks. question central back--forths short responses replies articles Michel et al. (2011) Bollen et al. (2021a).Questions:Tokenizing counting: capture?Corpus-based sampling: biases might threaten inference?write critique either Michel et al. (2011) Bollen et al. (2021a), focus ?Required reading:Michel et al. (2011)\nSchwartz (2011)\nMorse-Gagné (2011)\nAiden, Pickett, Michel (2011)\nSchwartz (2011)Morse-Gagné (2011)Aiden, Pickett, Michel (2011)Bollen et al. (2021a)\nSchmidt, Piantadosi, Mahowald (2021)\nBollen et al. (2021b)\nSchmidt, Piantadosi, Mahowald (2021)Bollen et al. (2021b)Manning, Raghavan, Schtze (2007) (ch. 2): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]Klaus Krippendorff (2004) (ch. 5)reading:Rozado, Al-Gharbi, Halberstadt (2021)Alshaabi et al. (2021)Campos et al. (2015)Greenfield (2013)Slides:Week 2 Slides","code":""},{"path":"week-2-demo.html","id":"week-2-demo","chapter":"3 Week 2 Demo","heading":"3 Week 2 Demo","text":"","code":""},{"path":"week-2-demo.html","id":"setup","chapter":"3 Week 2 Demo","heading":"3.1 Setup","text":"section, ’ll quick overview ’re processing text data conducting analyses word frequency. ’ll using randomly simulated text.First load packages ’ll using:","code":"\nlibrary(stringi) #to generate random text\nlibrary(dplyr) #tidyverse package for wrangling data\nlibrary(tidytext) #package for 'tidy' manipulation of text data\nlibrary(ggplot2) #package for visualizing data\nlibrary(scales) #additional package for formatting plot axes\nlibrary(kableExtra) #package for displaying data in html format (relevant for formatting this worksheet mainly)"},{"path":"week-2-demo.html","id":"tokenizing","chapter":"3 Week 2 Demo","heading":"3.2 Tokenizing","text":"’ll first get random text see looks like ’re tokenizing text.can tokenize unnest_tokens() function tidytext.Now ’ll get larger data, simulating 5000 observations (rows) random Latin text strings.’ll add another column call “weeks.” unit analysis.Now ’ll simulate trend see increasing number words weeks go . Don’t worry much code little complex, share case interest.can see week goes , text.can trend week sees decreasing number words.Now let’s check top frequency words text.’re going check frequencies word “sed” ’re gonna normalize denominating total word frequencies week.First need get total word frequencies week.can join two dataframes together left_join() function ’re joining “week” column. can pipe joined data plot.","code":"\nlipsum_text <- data.frame(text = stri_rand_lipsum(1, start_lipsum = TRUE))\n\nhead(lipsum_text$text)## [1] \"Lorem ipsum dolor sit amet, mauris dolor posuere sed sit dapibus sapien egestas semper aptent. Luctus, eu, pretium enim, sociosqu rhoncus quis aliquam. In in in auctor natoque venenatis tincidunt. At scelerisque neque porta ut mi a, congue quis curae. Facilisis, adipiscing mauris. 
Dis non interdum cum commodo, tempor sapien donec in luctus. Nascetur ullamcorper, dui non semper, arcu sed. Sed non pellentesque rutrum tempor, curabitur in. Taciti gravida ut interdum iaculis. Arcu consectetur dictum et erat vestibulum luctus ridiculus! Luctus metus ad ex bibendum, eget at maximus nisl quisque ante posuere aptent. Cubilia tellus sed aliquam, suspendisse arcu et dapibus aenean. Ultricies primis sit nulla condimentum, sed, phasellus viverra nullam, primis.\"\ntokens <- lipsum_text %>%\n unnest_tokens(word, text)\n\nhead(tokens)## word\n## 1 lorem\n## 2 ipsum\n## 3 dolor\n## 4 sit\n## 5 amet\n## 6 mauris\n## Varying total words example\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i, 2]\n morewords <-\n paste(rep(\"more lipsum words\", times = sample(1:100, 1) * week), collapse = \" \")\n lipsum_words <- lipsum_text[i, 1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i, 1] <- new_lipsum_text\n}\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\n# simulate decreasing words trend\nlipsum_text <- data.frame(text = stri_rand_lipsum(5000, start_lipsum = TRUE))\n\n# make some weeks one to ten\nlipsum_text$week <- as.integer(rep(seq.int(1:10), 5000/10))\n\nfor(i in 1:nrow(lipsum_text)) {\n week <- lipsum_text[i,2]\n morewords <- paste(rep(\"more lipsum words\", times = sample(1:100, 1)* 1/week), collapse = \" \")\n lipsum_words <- lipsum_text[i,1]\n new_lipsum_text <- paste0(morewords, lipsum_words, collapse = \" \")\n lipsum_text[i,1] <- new_lipsum_text\n}\n\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n group_by(week) %>%\n dplyr::count(word) %>%\n select(week, n) %>%\n distinct() %>%\n ggplot() +\n geom_bar(aes(week, n), stat = \"identity\") +\n labs(x = \"Week\", y = \"n words\") +\n scale_x_continuous(breaks= pretty_breaks())\nlipsum_text %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word, sort = T) %>%\n top_n(5) %>%\n knitr::kable(format=\"html\")%>% \n kable_styling(\"striped\", full_width = F)## Selecting by n\nlipsum_totals <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n dplyr::count(word) %>%\n mutate(total = sum(n)) %>%\n distinct(week, total)\n# let's look for \"sed\"\nlipsum_sed <- lipsum_text %>%\n group_by(week) %>%\n unnest_tokens(word, text) %>%\n filter(word == \"sed\") %>%\n dplyr::count(word) %>%\n mutate(total_sed = sum(n)) %>%\n distinct(week, total_sed)\nlipsum_sed %>%\n left_join(lipsum_totals, by = \"week\") %>%\n mutate(sed_prop = total_sed/total) %>%\n ggplot() +\n geom_line(aes(week, sed_prop)) +\n labs(x = \"Week\", y = \"\n Proportion sed word\") +\n scale_x_continuous(breaks= pretty_breaks())"},{"path":"week-2-demo.html","id":"regexing","chapter":"3 Week 2 Demo","heading":"3.3 Regexing","text":"’ll notice worksheet word frequencies one point set parentheses str_detect() string “[-z]”. called character class use square brackets like [].character classes include, helpfully listed vignette stringr package. 
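As a quick standalone illustration of that character class:

library(stringr)

# "[a-z]" matches any single lower-case letter, so str_detect() flags strings
# containing at least one of them (matching is case-sensitive by default).
str_detect(c("apple", "APPLE", "1234"), "[a-z]")
# TRUE FALSE FALSE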
follows adapted materials regular expressions.[abc]: matches , b, c.[-z]: matches every character z\n(Unicode code point order).[^abc]: matches anything except , b, c.[\\^\\-]: matches ^ -.Several patterns match multiple characters. include:\\d: matches digit; opposite \\D, matches character \ndecimal digit.\\s: matches whitespace; opposite \\S^: matches start string$: matches end string^ $: exact string matchHold : plus signs etc. mean?+: 1 .*: 0 .?: 0 1.can tell output makes sense, ’re getting !","code":"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")## [[1]]\n## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")## [[1]]\n## [1] \" + \" \" = \"\n(text <- \"Some \\t badly\\n\\t\\tspaced \\f text\")## [1] \"Some \\t badly\\n\\t\\tspaced \\f text\"\nstr_replace_all(text, \"\\\\s+\", \" \")## [1] \"Some badly spaced text\"\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a\")## [1] \"a\" NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^a$\")## [1] NA NA NA\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_extract(x, \"^apple$\")## [1] \"apple\" NA NA\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d+\")[[1]]## [1] \"1\" \"2\" \"3\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D+\")[[1]]## [1] \" + \" \" = \"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d*\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D*\")[[1]]## [1] \"\" \" + \" \"\" \" = \" \"\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\d?\")[[1]]## [1] \"1\" \"\" \"\" \"\" \"2\" \"\" \"\" \"\" \"3\" \"\"\nstr_extract_all(\"1 + 2 = 3\", \"\\\\D?\")[[1]]## [1] \"\" \" \" \"+\" \" \" \"\" \" \" \"=\" \" \" \"\" \"\""},{"path":"week-2-demo.html","id":"some-more-regex-resources","chapter":"3 Week 2 Demo","heading":"3.3.1 Some more regex resources:","text":"Regex crossword: https://regexcrossword.com/.Regexone: https://regexone.com/R4DS chapter 14","code":""},{"path":"week-3-dictionary-based-techniques.html","id":"week-3-dictionary-based-techniques","chapter":"4 Week 3: Dictionary-based techniques","heading":"4 Week 3: Dictionary-based techniques","text":"extension word frequency analyses, covered last week, -called “dictionary-based” techniques. basic form, analyses use index target terms classify corpus interest based presence absence. technical dimensions type analysis covered chapter section Klaus Krippendorff (2004), issues attending article - Loughran Mcdonald (2011). article Brooke (2021) provides outstanding illustration use text analysis techniques make inferences larger questions bias.also reading two examples application techniques Martins Baumard (2020) Young Soroka (2012). , discussing successful authors measuring phenomenon interest (“prosociality” “tone” respectively). Questions sampling representativeness relevant , naturally inform assessments work.Questions:general dictionaries possible; domain-specific?know dictionary accurate?enhance/supplement dictionary-based techniques?Required reading:Martins Baumard (2020)Voigt et al. (2017)Brooke (2021)reading:Tausczik Pennebaker (2010)Klaus Krippendorff (2004) (pp.283-289)Brier Hopp (2011)Bonikowski Gidron (2015)Barberá et al. 
(2021)Young Soroka (2012)Slides:Week 3 Slides","code":""},{"path":"week-3-demo.html","id":"week-3-demo","chapter":"5 Week 3 Demo","heading":"5 Week 3 Demo","text":"section, ’ll quick overview ’re processing text data conducting basic sentiment analyses.","code":""},{"path":"week-3-demo.html","id":"setup-1","chapter":"5 Week 3 Demo","heading":"5.1 Setup","text":"’ll first load packages need.","code":"\nlibrary(stringi)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(scales)"},{"path":"week-3-demo.html","id":"happy-words","chapter":"5 Week 3 Demo","heading":"5.2 Happy words","text":"discussed lectures, might find text class’s collective thoughts increase “happy” words time.simulated dataset text split weeks, students, words plus whether word word “happy” 0 means word “happy” 1 means .three datasets: one constant number “happy” words; one increasing number “happy” words; one decreasing number “happy” words. called: happyn, happyu, happyd respectively.can see trend “happy” words week student.First, dataset constant number happy words time.now simulated data increasing number happy words.finally decreasing number happy words.","code":"\nhead(happyn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0\nhead(happyu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0\nhead(happyd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 nam 0## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. 
You can override using\n## the `.groups` argument.\n## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'"},{"path":"week-3-demo.html","id":"normalizing-sentiment","chapter":"5 Week 3 Demo","heading":"5.3 Normalizing sentiment","text":"discussed lecture, also know just total number happy words increases, isn’t indication ’re getting happier class time.can begin make inference, need normalize total number words week., simulate data number happy words actually week (happyn dataset ).join data three datasets: happylipsumn, happylipsumu, happylipsumd. datasets random text, number happy words.first also number total words week. second two, however, differing number total words week: happylipsumu increasing number total words week; happylipsumd decreasing number total words week., see , ’re splitting week, student, word, whether “happy” word.plot number happy words divided number total words week student datasets, get .get normalized sentiment score–“happy” score–need create variable (column) dataframe sum happy words divided total number words dataframe.can following way.repeat datasets plot see following.plots look like ?Well, first, number total words week number happy words week. divided latter former, get proportion also stable time.second, however, increasing number total words week, number happy words time. means dividing ever larger number, giving ever smaller proportions. , trend decreasing time.third, decreasing number total words week, number happy words time. means dividing ever smaller number, giving ever larger proportions. , trend increasing time.","code":"\nhead(happylipsumn)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 semper 0\nhead(happylipsumu)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 commodo 0\nhead(happylipsumd)## # A tibble: 6 × 4\n## # Groups: week, student [1]\n## week student word happy\n## \n## 1 1 9 lorem 0\n## 2 1 9 ipsum 0\n## 3 1 9 dolor 0\n## 4 1 9 sit 0\n## 5 1 9 amet 0\n## 6 1 9 et 0\nhappylipsumn %>%\n group_by(week, student) %>%\n mutate(index_total = n()) %>%\n filter(happy==1) %>%\n summarise(sum_hap = sum(happy),\n index_total = index_total,\n prop_hap = sum_hap/index_total) %>%\n distinct()## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\n## dplyr 1.1.0.\n## ℹ Please use `reframe()` instead.\n## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n## always returns an ungrouped data frame and adjust accordingly.\n## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was\n## generated.## `summarise()` has grouped output by 'week', 'student'. 
You can override using\n## the `.groups` argument.## # A tibble: 300 × 5\n## # Groups: week, student [300]\n## week student sum_hap index_total prop_hap\n## \n## 1 1 1 894 3548 0.252\n## 2 1 2 1164 5259 0.221\n## 3 1 3 1014 4531 0.224\n## 4 1 4 774 3654 0.212\n## 5 1 5 980 4212 0.233\n## 6 1 6 711 3579 0.199\n## 7 1 7 1254 5025 0.250\n## 8 1 8 1117 4846 0.230\n## 9 1 9 1079 4726 0.228\n## 10 1 10 1061 5111 0.208\n## # ℹ 290 more rows"},{"path":"week-4-natural-language-complexity-and-similarity.html","id":"week-4-natural-language-complexity-and-similarity","chapter":"6 Week 4: Natural language, complexity, and similarity","heading":"6 Week 4: Natural language, complexity, and similarity","text":"week delving deeply language used text. previous weeks, tried two main techniques rely, different ways, counting words. week, thinking sophisticated techniques identify measure language use, well compare texts . article Gomaa Fahmy (2013) provides overview different approaches. covering technical dimensions lecture.article Urman, Makhortykh, Ulloa (2021) investigates key question contemporary communications research—information exposed online—shows might compare web search results using similarity measures. Schoonvelde et al. (2019) article, hand, looks “complexity” texts, compares politicians different ideological stripes communicate.Questions:measure linguistic complexity/sophistication?biases might involved measuring sophistication?applications might similarity measures?Required reading:Urman, Makhortykh, Ulloa (2021)Schoonvelde et al. (2019)Gomaa Fahmy (2013)reading:Voigt et al. (2017)Peng Hengartner (2002)Lowe (2008)Bail (2012)Ziblatt, Hilbig, Bischof (2020)Benoit, Munger, Spirling (2019)Slides:Week 4 Slides","code":""},{"path":"week-4-demo.html","id":"week-4-demo","chapter":"7 Week 4 Demo","heading":"7 Week 4 Demo","text":"","code":""},{"path":"week-4-demo.html","id":"setup-2","chapter":"7 Week 4 Demo","heading":"7.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\nlibrary(quanteda)\nlibrary(quanteda.textstats)\nlibrary(quanteda.textplots)\nlibrary(tidytext)\nlibrary(stringdist)\nlibrary(corrplot)\nlibrary(janeaustenr)"},{"path":"week-4-demo.html","id":"character-based-similarity","chapter":"7 Week 4 Demo","heading":"7.2 Character-based similarity","text":"first measure text similarity level characters. can look last time (promise) example lecture see similarity compares.’ll make two sentences create two character objects . two thoughts imagined classes.know “longest common substring measure” , according stringdist package documentation, “longest string can obtained pairing characters b keeping order characters intact.”can easily get different distance/similarity measures comparing character objects b .","code":"\na <- \"We are all very happy to be at a lecture at 11AM\"\nb <- \"We are all even happier that we don’t have two lectures a week\"\n## longest common substring distance\nstringdist(a, b,\n method = \"lcs\")## [1] 36\n## levenshtein distance\nstringdist(a, b,\n method = \"lv\")## [1] 27\n## jaro distance\nstringdist(a, b,\n method = \"jw\", p =0)## [1] 0.2550103"},{"path":"week-4-demo.html","id":"term-based-similarity","chapter":"7 Week 4 Demo","heading":"7.3 Term-based similarity","text":"second example lecture, ’re taking opening line Pride Prejudice alongside versions famous opening line.can get text Jane Austen easily thanks janeaustenr package.’re going specify alternative versions sentence.Finally, ’re going convert document feature matrix. 
’re quanteda package, package ’ll begin using coming weeks analyses ’re performing get gradually technical.see ?Well, ’s clear text2 text3 similar text1 —share words. also see text2 least contain words shared text1, original opening line Jane Austen’s Pride Prejudice., measure similarity distance texts?first way simply correlating two sets ones zeroes. can quanteda.textstats package like .’ll see get manipulated data tidy format (rows words columns 1s 0s).see expected text2 highly correlated text1 text3.\nEuclidean distances, can use quanteda .define function just see ’s going behind scenes.Manhattan distance, use quanteda .define function.cosine similarity, quanteda makes straightforward.make clear ’s going , write function.","code":"\n## similarity and distance example\n\ntext <- janeaustenr::prideprejudice\n\nsentences <- text[10:11]\n\nsentence1 <- paste(sentences[1], sentences[2], sep = \" \")\n\nsentence1## [1] \"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\"\nsentence2 <- \"Everyone knows that a rich man without wife will want a wife\"\n\nsentence3 <- \"He's loaded so he wants to get married. Everyone knows that's what happens.\"\ndfmat <- dfm(tokens(c(sentence1,\n sentence2,\n sentence3)),\n remove_punct = TRUE, remove = stopwords(\"english\"))\n\ndfmat## Document-feature matrix of: 3 documents, 21 features (58.73% sparse) and 0 docvars.\n## features\n## docs truth universally acknowledged single man possession good fortune must\n## text1 1 1 1 1 1 1 1 1 1\n## text2 0 0 0 0 1 0 0 0 0\n## text3 0 0 0 0 0 0 0 0 0\n## features\n## docs want\n## text1 1\n## text2 1\n## text3 0\n## [ reached max_nfeat ... 11 more features ]\n## correlation\ntextstat_simil(dfmat, margin = \"documents\", method = \"correlation\")## textstat_simil object; method = \"correlation\"\n## text1 text2 text3\n## text1 1.000 -0.117 -0.742\n## text2 -0.117 1.000 -0.173\n## text3 -0.742 -0.173 1.000\ntest <- tidy(dfmat)\ntest <- test %>%\n cast_dfm(term, document, count)\ntest <- as.data.frame(test)\n\nres <- cor(test[,2:4])\nres## text1 text2 text3\n## text1 1.0000000 -0.1167748 -0.7416198\n## text2 -0.1167748 1.0000000 -0.1732051\n## text3 -0.7416198 -0.1732051 1.0000000\ncorrplot(res, type = \"upper\", order = \"hclust\", \n tl.col = \"black\", tl.srt = 45)\ntextstat_dist(dfmat, margin = \"documents\", method = \"euclidean\")## textstat_dist object; method = \"euclidean\"\n## text1 text2 text3\n## text1 0 3.74 4.24\n## text2 3.74 0 3.74\n## text3 4.24 3.74 0\n# function for Euclidean distance\neuclidean <- function(a,b) sqrt(sum((a - b)^2))\n# estimating the distance\neuclidean(test$text1, test$text2)## [1] 3.741657\neuclidean(test$text1, test$text3)## [1] 4.242641\neuclidean(test$text2, test$text3)## [1] 3.741657\ntextstat_dist(dfmat, margin = \"documents\", method = \"manhattan\")## textstat_dist object; method = \"manhattan\"\n## text1 text2 text3\n## text1 0 14 18\n## text2 14 0 12\n## text3 18 12 0\n## manhattan\nmanhattan <- function(a, b){\n dist <- abs(a - b)\n dist <- sum(dist)\n return(dist)\n}\n\nmanhattan(test$text1, test$text2)## [1] 14\nmanhattan(test$text1, test$text3)## [1] 18\nmanhattan(test$text2, test$text3)## [1] 12\ntextstat_simil(dfmat, margin = \"documents\", method = \"cosine\")## textstat_simil object; method = \"cosine\"\n## text1 text2 text3\n## text1 1.000 0.364 0\n## text2 0.364 1.000 0.228\n## text3 0 0.228 1.000\n## cosine\ncos.sim <- function(a, b) \n{\n return(sum(a*b)/sqrt(sum(a^2)*sum(b^2)) )\n} 
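\n\n# Aside (not part of the original demo): another common overlap measure is Jaccard\n# similarity: the number of terms two texts share divided by the number of terms\n# either of them uses. A minimal hand-rolled sketch, reusing the `test` data frame\n# of term counts created above:\njaccard <- function(a, b) {\n  sum(a > 0 & b > 0) / sum(a > 0 | b > 0) # shared terms / all terms used\n}\n# e.g. jaccard(test$text1, test$text2)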
\n\ncos.sim(test$text1, test$text2)## [1] 0.3636364\ncos.sim(test$text1, test$text3)## [1] 0\ncos.sim(test$text2, test$text3)## [1] 0.2279212"},{"path":"week-4-demo.html","id":"complexity","chapter":"7 Week 4 Demo","heading":"7.4 Complexity","text":"Note: section borrows notation materials texstat_readability() function.also talked different document-level measures text characteristics. One “complexity” readability text. One frequently used Flesch’s Reading Ease Score (Flesch 1948).computed :{:}{Flesch’s Reading Ease Score (Flesch 1948).\n}can estimate readability score respective sentences . Flesch score 1948 default.see ? original Austen opening line marked lower readability colloquial alternatives.alternatives measures might use. can check clicking links function textstat_readability(). display .One McLaughlin (1969) “Simple Measure Gobbledygook, based recurrence words 3 syllables calculated :{:}{Simple Measure Gobbledygook (SMOG) (McLaughlin 1969). = Nwmin3sy = number words 3 syllables .\nmeasure regression equation D McLaughlin’s original paper.}can calculate three sentences ., , see original Austen sentence higher level complexity (gobbledygook!).","code":"\ntextstat_readability(sentence1)## document Flesch\n## 1 text1 62.10739\ntextstat_readability(sentence2)## document Flesch\n## 1 text1 88.905\ntextstat_readability(sentence3)## document Flesch\n## 1 text1 83.09904\ntextstat_readability(sentence1, measure = \"SMOG\")## document SMOG\n## 1 text1 13.02387\ntextstat_readability(sentence2, measure = \"SMOG\")## document SMOG\n## 1 text1 8.841846\ntextstat_readability(sentence3, measure = \"SMOG\")## document SMOG\n## 1 text1 7.168622"},{"path":"week-5-scaling-techniques.html","id":"week-5-scaling-techniques","chapter":"8 Week 5: Scaling techniques","heading":"8 Week 5: Scaling techniques","text":"begin thinking automated techniques analyzing texts. bunch additional considerations now need bring mind. considerations sparked significant debates… matter means settled.stake ? weeks come, studying various techniques ‘classify,’ ‘position’ ‘score’ texts based features. success techniques depends suitability question hand also higher-level questions meaning. short, ask : way can access underlying processes governing generation text? meaning governed set structural processes? can derive ‘objective’ measures contents given text?readings Justin Grimmer, Roberts, Stewart (2021), Denny Spirling (2018), Goldenstein Poschmann (2019b) (well response replies Nelson (2019) Goldenstein Poschmann (2019a)) required reading Flexible Learning Week.Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer, Roberts, Stewart (2021)Justin Grimmer Stewart (2013a)Justin Grimmer Stewart (2013a)Denny Spirling (2018)Denny Spirling (2018)Goldenstein Poschmann (2019b)\nNelson (2019)\nGoldenstein Poschmann (2019a)\nGoldenstein Poschmann (2019b)Nelson (2019)Goldenstein Poschmann (2019a)substantive focus week set readings employ different types “scaling” “low-dimensional document embedding” techniques. article Lowe (2008) provides technical overview “wordfish” algorithm uses political science contexts. article Klüver (2009) also uses “wordfish” different way—measure “influence” interest groups. response article Bunea Ibenskas (2015) subsequent reply Klüver (2015) helps illuminate debates around questions. 
work Kim, Lelkes, McCrain (2022) gives insight ability text-scaling techniques capture key dimensions political communication bias.Questions:assumptions underlie scaling models text?; latent text decides?might scaling useful outside estimating ideological position/bias text?Required reading:Lowe (2008)Kim, Lelkes, McCrain (2022)Klüver (2009)\nBunea Ibenskas (2015)\nKlüver (2015)\nBunea Ibenskas (2015)Klüver (2015)reading:Benoit et al. (2016)Laver, Benoit, Garry (2003)Slapin Proksch (2008)Schwemmer Wieczorek (2020)Slides:Week 5 Slides","code":""},{"path":"week-5-demo.html","id":"week-5-demo","chapter":"9 Week 5 Demo","heading":"9 Week 5 Demo","text":"","code":""},{"path":"week-5-demo.html","id":"setup-3","chapter":"9 Week 5 Demo","heading":"9.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.","code":"\ndevtools::install_github(\"conjugateprior/austin\")\nlibrary(austin)\nlibrary(quanteda)\nlibrary(quanteda.textstats)"},{"path":"week-5-demo.html","id":"wordscores","chapter":"9 Week 5 Demo","heading":"9.2 Wordscores","text":"can inspect function wordscores model Laver, Benoit, Garry (2003) following way:can take example data included austin package.reference documents documents marked “R” reference; .e., columns one five.matrix simply series words (: letters) reference texts word counts .can look wordscores words, calculated using reference dimensions reference documents.see thetas contained wordscores object, .e., reference dimensions reference documents pis, .e., estimated wordscores word.can now use score -called “virgin” texts follows.","code":"\nclassic.wordscores## function (wfm, scores) \n## {\n## if (!is.wfm(wfm)) \n## stop(\"Function not applicable to this object\")\n## if (length(scores) != length(docs(wfm))) \n## stop(\"There are not the same number of documents as scores\")\n## if (any(is.na(scores))) \n## stop(\"One of the reference document scores is NA\\nFit the model with known scores and use 'predict' to get virgin score estimates\")\n## thecall <- match.call()\n## C.all <- as.worddoc(wfm)\n## C <- C.all[rowSums(C.all) > 0, ]\n## F <- scale(C, center = FALSE, scale = colSums(C))\n## ws <- apply(F, 1, function(x) {\n## sum(scores * x)\n## })/rowSums(F)\n## pi <- matrix(ws, nrow = length(ws))\n## rownames(pi) <- rownames(C)\n## colnames(pi) <- c(\"Score\")\n## val <- list(pi = pi, theta = scores, data = wfm, call = thecall)\n## class(val) <- c(\"classic.wordscores\", \"wordscores\", class(val))\n## return(val)\n## }\n## \n## \ndata(lbg)\nref <- getdocs(lbg, 1:5)\nref## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\nws <- classic.wordscores(ref, scores=seq(-1.5,1.5,by=0.75))\nws## $pi\n## Score\n## A -1.5000000\n## B -1.5000000\n## C -1.5000000\n## D -1.5000000\n## E -1.5000000\n## F -1.4812500\n## G -1.4809322\n## H -1.4519231\n## I -1.4083333\n## J -1.3232984\n## K -1.1846154\n## L 
-1.0369898\n## M -0.8805970\n## N -0.7500000\n## O -0.6194030\n## P -0.4507576\n## Q -0.2992424\n## R -0.1305970\n## S 0.0000000\n## T 0.1305970\n## U 0.2992424\n## V 0.4507576\n## W 0.6194030\n## X 0.7500000\n## Y 0.8805970\n## Z 1.0369898\n## ZA 1.1846154\n## ZB 1.3232984\n## ZC 1.4083333\n## ZD 1.4519231\n## ZE 1.4809322\n## ZF 1.4812500\n## ZG 1.5000000\n## ZH 1.5000000\n## ZI 1.5000000\n## ZJ 1.5000000\n## ZK 1.5000000\n## \n## $theta\n## [1] -1.50 -0.75 0.00 0.75 1.50\n## \n## $data\n## docs\n## words R1 R2 R3 R4 R5\n## A 2 0 0 0 0\n## B 3 0 0 0 0\n## C 10 0 0 0 0\n## D 22 0 0 0 0\n## E 45 0 0 0 0\n## F 78 2 0 0 0\n## G 115 3 0 0 0\n## H 146 10 0 0 0\n## I 158 22 0 0 0\n## J 146 45 0 0 0\n## K 115 78 2 0 0\n## L 78 115 3 0 0\n## M 45 146 10 0 0\n## N 22 158 22 0 0\n## O 10 146 45 0 0\n## P 3 115 78 2 0\n## Q 2 78 115 3 0\n## R 0 45 146 10 0\n## S 0 22 158 22 0\n## T 0 10 146 45 0\n## U 0 3 115 78 2\n## V 0 2 78 115 3\n## W 0 0 45 146 10\n## X 0 0 22 158 22\n## Y 0 0 10 146 45\n## Z 0 0 3 115 78\n## ZA 0 0 2 78 115\n## ZB 0 0 0 45 146\n## ZC 0 0 0 22 158\n## ZD 0 0 0 10 146\n## ZE 0 0 0 3 115\n## ZF 0 0 0 2 78\n## ZG 0 0 0 0 45\n## ZH 0 0 0 0 22\n## ZI 0 0 0 0 10\n## ZJ 0 0 0 0 3\n## ZK 0 0 0 0 2\n## \n## $call\n## classic.wordscores(wfm = ref, scores = seq(-1.5, 1.5, by = 0.75))\n## \n## attr(,\"class\")\n## [1] \"classic.wordscores\" \"wordscores\" \"list\"\n#get \"virgin\" documents\nvir <- getdocs(lbg, 'V1')\nvir## docs\n## words V1\n## A 0\n## B 0\n## C 0\n## D 0\n## E 0\n## F 0\n## G 0\n## H 2\n## I 3\n## J 10\n## K 22\n## L 45\n## M 78\n## N 115\n## O 146\n## P 158\n## Q 146\n## R 115\n## S 78\n## T 45\n## U 22\n## V 10\n## W 3\n## X 2\n## Y 0\n## Z 0\n## ZA 0\n## ZB 0\n## ZC 0\n## ZD 0\n## ZE 0\n## ZF 0\n## ZG 0\n## ZH 0\n## ZI 0\n## ZJ 0\n## ZK 0\n# predict textscores for the virgin documents\npredict(ws, newdata=vir)## 37 of 37 words (100%) are scorable\n## \n## Score Std. Err. Rescaled Lower Upper\n## V1 -0.448 0.0119 -0.448 -0.459 -0.437"},{"path":"week-5-demo.html","id":"wordfish","chapter":"9 Week 5 Demo","heading":"9.3 Wordfish","text":"wish, can inspect function wordscores model Slapin Proksch (2008) following way. 
much complex algorithm, printed , can inspect devices.can simulate data, formatted appropriately wordfiash estimation following way:can see document word-level FEs, well specified range thetas estimates.estimating document positions simply matter implementing algorithm.","code":"\nwordfish\ndd <- sim.wordfish()\n\ndd## $Y\n## docs\n## words D01 D02 D03 D04 D05 D06 D07 D08 D09 D10\n## W01 17 19 22 13 17 11 16 12 6 3\n## W02 18 21 18 16 12 19 11 7 10 4\n## W03 22 21 22 19 11 14 11 3 6 1\n## W04 22 19 18 15 16 13 18 6 2 8\n## W05 28 21 12 10 13 10 5 14 1 3\n## W06 5 7 12 13 15 8 12 13 23 19\n## W07 13 9 5 16 11 17 15 11 35 30\n## W08 8 7 7 10 9 15 18 23 21 23\n## W09 4 12 8 10 9 13 18 25 15 19\n## W10 5 3 7 11 19 16 13 18 17 18\n## W11 66 55 49 48 38 37 27 24 21 6\n## W12 53 56 47 39 49 28 22 15 12 14\n## W13 63 55 47 49 48 31 24 16 17 16\n## W14 57 64 48 51 27 36 24 27 11 12\n## W15 58 48 57 44 36 39 29 27 16 5\n## W16 17 13 24 28 24 32 41 56 67 61\n## W17 9 19 16 36 30 34 53 34 58 57\n## W18 11 19 34 27 42 38 48 58 49 66\n## W19 10 18 27 22 37 52 59 60 60 69\n## W20 14 14 20 23 37 37 36 51 53 66\n## \n## $theta\n## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446 0.1651446 0.4954337\n## [8] 0.8257228 1.1560120 1.4863011\n## \n## $doclen\n## D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 \n## 500 500 500 500 500 500 500 500 500 500 \n## \n## $psi\n## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1\n## \n## $beta\n## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1\n## \n## attr(,\"class\")\n## [1] \"wordfish.simdata\" \"list\"\nwf <- wordfish(dd$Y)\nsummary(wf)## Call:\n## wordfish(wfm = dd$Y)\n## \n## Document Positions:\n## Estimate Std. Error Lower Upper\n## D01 -1.4243 0.10560 -1.6313 -1.21736\n## D02 -1.1483 0.09747 -1.3394 -0.95727\n## D03 -0.7701 0.08954 -0.9456 -0.59455\n## D04 -0.4878 0.08591 -0.6562 -0.31942\n## D05 -0.1977 0.08414 -0.3626 -0.03279\n## D06 0.0313 0.08411 -0.1336 0.19616\n## D07 0.4346 0.08704 0.2640 0.60517\n## D08 0.7163 0.09140 0.5372 0.89546\n## D09 1.2277 0.10447 1.0229 1.43243\n## D10 1.6166 0.11933 1.3827 1.85046"},{"path":"week-5-demo.html","id":"using-quanteda","chapter":"9 Week 5 Demo","heading":"9.4 Using quanteda","text":"can also use quanteda implement scaling techniques, demonstrated Exercise 4.","code":""},{"path":"week-6-unsupervised-learning-topic-models.html","id":"week-6-unsupervised-learning-topic-models","chapter":"10 Week 6: Unsupervised learning (topic models)","heading":"10 Week 6: Unsupervised learning (topic models)","text":"week builds upon past scaling techniques explored Week 5 instead turns another form unsupervised approach—topic modelling.substantive articles Nelson (2020) Alrababa’h Blaydes (2020) provide, turn, illuminating insights using topic models categorize thematic content text information.article Ying, Montgomery, Stewart (2021) provides valuable overview accompaniment earlier work Denny Spirling (2018) thinking validate findings test robustness inferences make models.Questions:assumptions underlie topic modelling approaches?Can develop structural models text?topic modelling discovery measurement strategy?validate model?Required reading:Nelson (2020)PARTHASARATHY, RAO, PALANISWAMY (2019)Ying, Montgomery, Stewart (2021)reading:Chang et al. (2009)Alrababa’h Blaydes (2020)J. Grimmer King (2011)Denny Spirling (2018)Smith et al. (2021)Boyd et al. 
(2018)Slides:Week 6 Slides","code":""},{"path":"week-6-demo.html","id":"week-6-demo","chapter":"11 Week 6 Demo","heading":"11 Week 6 Demo","text":"","code":""},{"path":"week-6-demo.html","id":"setup-4","chapter":"11 Week 6 Demo","heading":"11.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo.Estimating topic model requires us first data form document-term-matrix. another term referred previous weeks document-feature-matrix.can take example data topicmodels package. text news releases Associated Press. consists around 2,200 articles (documents) 10,000 terms (words).estimate topic model need specify document-term-matrix using, number (k) topics estimating. speed estimation, estimating 100 articles.can inspect contents topic follows.can use tidy() function tidytext gather relevant parameters ’ve estimated. get \\(\\beta\\) per-topic-per-word probabilities (.e., probability given term belongs given topic) can following.get \\(\\gamma\\) per-document-per-topic probabilities (.e., probability given document (: article) belongs particular topic) following.can easily plot \\(\\beta\\) estimates follows.shows us words associated topic, size associated \\(\\beta\\) coefficient.","code":"\nlibrary(topicmodels)\nlibrary(dplyr)\nlibrary(tidytext)\nlibrary(ggplot2)\nlibrary(ggthemes)\ndata(\"AssociatedPress\", \n package = \"topicmodels\")\nlda_output <- LDA(AssociatedPress[1:100,], k = 10)\nterms(lda_output, 10)## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 \n## [1,] \"bank\" \"wednesday\" \"fire\" \"bush\" \"administration\" \"noriega\" \n## [2,] \"new\" \"new\" \"barry\" \"i\" \"thats\" \"union\" \n## [3,] \"year\" \"central\" \"moore\" \"dukakis\" \"contact\" \"greyhound\"\n## [4,] \"soviet\" \"company\" \"church\" \"people\" \"farmer\" \"panama\" \n## [5,] \"last\" \"peres\" \"last\" \"year\" \"government\" \"president\"\n## [6,] \"million\" \"duracell\" \"mexico\" \"roberts\" \"grain\" \"officials\"\n## [7,] \"animals\" \"snow\" \"people\" \"campaign\" \"i\" \"national\" \n## [8,] \"florio\" \"warming\" \"died\" \"get\" \"new\" \"people\" \n## [9,] \"officials\" \"global\" \"friday\" \"two\" \"magellan\" \"plant\" \n## [10,] \"york\" \"offer\" \"pope\" \"years\" \"officials\" \"arco\" \n## Topic 7 Topic 8 Topic 9 Topic 10 \n## [1,] \"i\" \"percent\" \"state\" \"percent\" \n## [2,] \"new\" \"prices\" \"waste\" \"soviet\" \n## [3,] \"rating\" \"oil\" \"official\" \"economy\" \n## [4,] \"california\" \"year\" \"money\" \"committee\" \n## [5,] \"agents\" \"price\" \"people\" \"gorbachev\" \n## [6,] \"states\" \"gas\" \"announced\" \"union\" \n## [7,] \"mrs\" \"business\" \"company\" \"gorbachevs\"\n## [8,] \"police\" \"rate\" \"officials\" \"economic\" \n## [9,] \"percent\" \"report\" \"orr\" \"congress\" \n## [10,] \"three\" \"average\" \"senate\" \"war\"\nlda_beta <- tidy(lda_output, matrix = \"beta\")\n\nlda_beta %>%\n arrange(-beta)## # A tibble: 104,730 × 3\n## topic term beta\n## \n## 1 8 percent 0.0287\n## 2 10 percent 0.0197\n## 3 1 bank 0.0171\n## 4 8 prices 0.0170\n## 5 10 soviet 0.0160\n## 6 1 new 0.0159\n## 7 9 state 0.0158\n## 8 4 bush 0.0144\n## 9 7 i 0.0129\n## 10 8 oil 0.0118\n## # ℹ 104,720 more rows\nlda_gamma <- tidy(lda_output, matrix = \"gamma\")\n\nlda_gamma %>%\n arrange(-gamma)## # A tibble: 1,000 × 3\n## document topic gamma\n## \n## 1 76 5 1.00\n## 2 81 3 1.00\n## 3 6 6 1.00\n## 4 43 4 1.00\n## 5 31 8 1.00\n## 6 95 7 1.00\n## 7 77 4 1.00\n## 8 29 10 1.00\n## 9 80 5 1.00\n## 10 57 10 1.00\n## # ℹ 990 more rows\nlda_beta %>%\n group_by(topic) %>%\n 
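  # for each topic, keep the (up to) ten terms with the largest beta values before plotting\n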
top_n(10, beta) %>%\n ungroup() %>%\n arrange(topic, -beta) %>%\n mutate(term = reorder_within(term, beta, topic)) %>%\n ggplot(aes(beta, term, fill = factor(topic))) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~ topic, scales = \"free\", ncol = 4) +\n scale_y_reordered() +\n theme_tufte(base_family = \"Helvetica\")"},{"path":"week-7-unsupervised-learning-word-embedding.html","id":"week-7-unsupervised-learning-word-embedding","chapter":"12 Week 7: Unsupervised learning (word embedding)","heading":"12 Week 7: Unsupervised learning (word embedding)","text":"week discussing second form “unsupervised” learning—word embeddings. previous weeks allowed us characterize complexity text, cluster text potential topical focus, word embeddings permit us expansive form measurement. essence, producing matrix representation entire corpus.reading Pedro L. Rodriguez Spirling (2022) provides effective overview technical dimensions technique. articles Garg et al. (2018) Kozlowski, Taddy, Evans (2019) two substantive articles use word embeddings provide insights prejudice bias manifested language time.Required reading:Garg et al. (2018)Kozlowski, Taddy, Evans (2019)Waller Anderson (2021)reading:P. Rodriguez Spirling (2021)Pedro L. Rodriguez Spirling (2022)Osnabrügge, Hobolt, Rodon (2021)Rheault Cochrane (2020)Jurafsky Martin (2021, ch.6): https://web.stanford.edu/~jurafsky/slp3/]Slides:Week 7 Slides","code":""},{"path":"week-7-demo.html","id":"week-7-demo","chapter":"13 Week 7 Demo","heading":"13 Week 7 Demo","text":"","code":""},{"path":"week-7-demo.html","id":"setup-5","chapter":"13 Week 7 Demo","heading":"13.1 Setup","text":"First, ’ll load packages ’ll using week’s brief demo. pre-loading already-estimated PMI matrix results singular value decomposition approach.work?Various approaches, including:\nSVD\n\nNeural network-based techniques like GloVe Word2Vec\n\nSVD\nSVDNeural network-based techniques like GloVe Word2Vec\nNeural network-based techniques like GloVe Word2VecIn approaches, :Defining context window (see figure )Looking probabilities word appearing near another wordsThe implementation technique using singular value decomposition approach requires following data structure:Word pair matrix PMI (Pairwise mutual information)PMI = log(P(x,y)/P(x)P(y))P(x,y) probability word x appearing within six-word window word yand P(x) probability word x appearing whole corpusand P(y) probability word y appearing whole corpusAnd resulting matrix object take following format:use “Singular Value Decomposition” (SVD) techique. 
another multidimensional scaling technique, first axis resulting coordinates captures variance, second second-etc…, simply need following.can collect vectors word inspect .","code":"\nlibrary(Matrix) #for handling matrices\nlibrary(tidyverse)\nlibrary(irlba) # for SVD\nlibrary(umap) # for dimensionality reduction\n\nload(\"data/wordembed/pmi_svd.RData\")\nload(\"data/wordembed/pmi_matrix.RData\")## 6 x 6 sparse Matrix of class \"dgCMatrix\"\n## the to and of https a\n## the 0.653259169 -0.01948121 -0.006446459 0.27136395 -0.5246159 -0.32557524\n## to -0.019481205 0.75498084 -0.065170433 -0.25694210 -0.5731182 -0.04595798\n## and -0.006446459 -0.06517043 1.027782342 -0.03974904 -0.4915159 -0.05862969\n## of 0.271363948 -0.25694210 -0.039749043 1.02111517 -0.5045067 0.09829389\n## https -0.524615878 -0.57311817 -0.491515918 -0.50450674 0.5451841 -0.57956404\n## a -0.325575239 -0.04595798 -0.058629689 0.09829389 -0.5795640 1.03048355## Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n## ..@ i : int [1:350700] 0 1 2 3 4 5 6 7 8 9 ...\n## ..@ p : int [1:21173] 0 7819 14360 20175 25467 29910 34368 39207 43376 46401 ...\n## ..@ Dim : int [1:2] 21172 21172\n## ..@ Dimnames:List of 2\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## .. ..$ : chr [1:21172] \"the\" \"to\" \"and\" \"of\" ...\n## ..@ x : num [1:350700] 0.65326 -0.01948 -0.00645 0.27136 -0.52462 ...\n## ..@ factors : list()\npmi_svd <- irlba(pmi_matrix, 256, maxit = 500)\nword_vectors <- pmi_svd$u\nrownames(word_vectors) <- rownames(pmi_matrix)\ndim(word_vectors)## [1] 21172 256\nhead(word_vectors[1:5, 1:5])## [,1] [,2] [,3] [,4] [,5]\n## the 0.007810973 0.07024009 0.06377615 0.03139044 -0.12362108\n## to 0.006889381 -0.03210269 0.10665925 0.03537632 0.10104552\n## and -0.050498380 0.09131495 0.19658197 -0.08136253 -0.01605705\n## of -0.015628371 0.16306386 0.13296127 -0.04087709 -0.23175976\n## https 0.301718525 0.07658843 -0.01720398 0.26219147 0.07930941"},{"path":"week-7-demo.html","id":"using-glove-or-word2vec","chapter":"13 Week 7 Demo","heading":"13.2 Using GloVe or word2vec","text":"neural network approach considerably involved, figure gives overview picture differing algorithmic approaches might use.","code":""},{"path":"week-8-sampling-text-information.html","id":"week-8-sampling-text-information","chapter":"14 Week 8: Sampling text information","heading":"14 Week 8: Sampling text information","text":"week ’ll thinking best sample text information, thinking different biases might inhere data-generating process, well representativeness generalizability text corpus construct.reading Barberá Rivero (2015) invesitgates representativeness Twitter data, give us pause thinking using digital trace data general barometer public opinion.reading Michalopoulos Xue (2021) takes entirely different tack, illustrates can think systematically text information broadly representative societies general.Required reading:Barberá Rivero (2015)Michalopoulos Xue (2021)Klaus Krippendorff (2004, chs. 5 6)reading:Martins Baumard (2020)Baumard et al. (2022)Slides:Week 8 Slides","code":""},{"path":"week-9-supervised-learning.html","id":"week-9-supervised-learning","chapter":"15 Week 9: Supervised learning","heading":"15 Week 9: Supervised learning","text":"Required reading:Hopkins King (2010)King, Pan, Roberts (2017)Siegel et al. (2021)Yu, Kaufmann, Diermeier (2008)Manning, Raghavan, Schtze (2007, chs. 
13,14, 15): https://nlp.stanford.edu/IR-book/information-retrieval-book.html]reading:Denny Spirling (2018)King, Lam, Roberts (2017)","code":""},{"path":"week-10-validation.html","id":"week-10-validation","chapter":"16 Week 10: Validation","heading":"16 Week 10: Validation","text":"week ’ll thinking validate techniques ’ve used preceding weeks. Validation necessary important part text analysis technique.Often speak validation context machine labelling large text data. validation need ——restricted automated classification tasks. articles Ying, Montgomery, Stewart (2021) Pedro L. Rodriguez, Spirling, Stewart (2021) describe ways approach validation unsupervised contexts. Finally, article Peterson Spirling (2018) shows validation accuracy might provide measure substantive significance.Required reading:Ying, Montgomery, Stewart (2021)Pedro L. Rodriguez, Spirling, Stewart (2021)Peterson Spirling (2018)Manning, Raghavan, Schtze (2007, ch.2: https://nlp.stanford.edu/IR-book/information-retrieval-book.html)reading:K. Krippendorff (2004)Denny Spirling (2018)Justin Grimmer Stewart (2013b)Barberá et al. (2021)Schiller, Daxenberger, Gurevych (2021)Slides:Week 10 Slides","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"exercise-1-word-frequency-analysis","chapter":"17 Exercise 1: Word frequency analysis","heading":"17 Exercise 1: Word frequency analysis","text":"","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"introduction","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.1 Introduction","text":"tutorial, learn summarise, aggregate, analyze text R:tokenize filter textHow clean preprocess textHow visualize results ggplotHow perform automated gender assignment name data (think possible biases methods may enclose)","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"setup-6","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.2 Setup","text":"practice skills, use dataset already collected Edinburgh Fringe Festival website.can try : obtain data, must first obtain API key. Instructions available Edinburgh Fringe API page:","code":""},{"path":"exercise-1-word-frequency-analysis.html","id":"load-data-and-packages","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.3 Load data and packages","text":"proceeding, ’ll load remaining packages need tutorial.tutorial, using data pre-cleaned provided .csv format. data come Edinburgh Book Festival API, provide data every event taken place Edinburgh Book Festival, runs every year month August, nine years: 2012-2020. many questions might ask data. tutorial, investigate contents event, speakers event, determine trends gender representation time.first task, , read data. can read_csv() function.read_csv() function takes .csv file loads working environment data frame object called “edbfdata.” can call object anything though. Try changing name object <- arrow. Note R allow names spaces , however. 
also good idea name object something beginning numbers, means call object within ` marks.’re working document computer (“locally”) can download Edinburgh Fringe data following way:","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(tidytext) # includes set of functions useful for manipulating text\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(babynames) #for gender predictions\nedbfdata <- read_csv(\"data/wordfreq/edbookfestall.csv\")## New names:\n## Rows: 5938 Columns: 12\n## ── Column specification\n## ───────────────────────────────────────────────────────── Delimiter: \",\" chr\n## (8): festival_id, title, sub_title, artist, description, genre, age_categ... dbl\n## (4): ...1, year, latitude, longitude\n## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ\n## Specify the column types or set `show_col_types = FALSE` to quiet this message.\n## • `` -> `...1`\nedbfdata <- read_csv(\"https://raw.githubusercontent.com/cjbarrie/RDL-Ed/main/02-text-as-data/data/edbookfestall.csv\")"},{"path":"exercise-1-word-frequency-analysis.html","id":"inspect-and-filter-data","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.4 Inspect and filter data","text":"next job cut dataset size, including columns need. first can inspect see existing column names , variable coded. can first call::can see description event included column named “description” year event “year.” now ’ll just keep two. Remember: ’re interested tutorial firstly representation gender feminism forms cultural production given platform Edinburgh International Book Festival. Given , first foremost interested reported content artist’s event.use pipe %>% functions tidyverse package quickly efficiently select columns want edbfdata data.frame object. pass data new data.frame object, call “evdes.”let’s take quick look many events time festival. 
, first calculate number individual events (row observations) year (column variable).can plot using ggplot!Perhaps unsurprisingly, context pandemic, number recorded bookings 2020 Festival drastically reduced.","code":"\ncolnames(edbfdata)## [1] \"...1\" \"festival_id\" \"title\" \"sub_title\" \"artist\" \n## [6] \"year\" \"description\" \"genre\" \"latitude\" \"longitude\" \n## [11] \"age_category\" \"ID\"\nglimpse(edbfdata)## Rows: 5,938\n## Columns: 12\n## $ ...1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…\n## $ festival_id \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"book\", \"b…\n## $ title \"Denise Mina\", \"Alex T Smith\", \"Challenging Expectations w…\n## $ sub_title \"HARD MEN AND CARDBOARD GANGSTERS\", NA, NA, \"WHAT CAUSED T…\n## $ artist \"Denise Mina\", \"Alex T Smith\", \"Peter Cocks\", \"Paul Mason\"…\n## $ year 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012…\n## $ description \"\\n\\tAs the grande dame of Scottish crime fiction, Deni…\n## $ genre \"Literature\", \"Children\", \"Children\", \"Literature\", \"Child…\n## $ latitude 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9519, 55.9…\n## $ longitude -3.206913, -3.206913, -3.206913, -3.206913, -3.206913, -3.…\n## $ age_category NA, \"AGE 4 - 7\", \"AGE 11 - 14\", NA, \"AGE 10 - 14\", \"AGE 6 …\n## $ ID \"Denise Mina2012\", \"Alex T Smith2012\", \"Peter Cocks2012\", …\n# get simplified dataset with only event contents and year\nevdes <- edbfdata %>%\n select(description, year)\n\nhead(evdes)## # A tibble: 6 × 2\n## description year\n## \n## 1 \"\\n\\tAs the grande dame of Scottish crime fiction, Denise Mina places… 2012\n## 2 \"
\\n\\tWhen Alex T Smith was a little boy he wanted to be a chef, a rab… 2012\n## 3 \"
\\n\\tPeter Cocks is known for his fantasy series Triskellion written … 2012\n## 4 \"
\\n\\tTwo books by influential journalists are among the first to look… 2012\n## 5 \"
\\n\\tChris d’Lacey tells you all about The Fire Ascending, the … 2012\n## 6 \"
\\n\\tIt’s time for the honourable, feisty and courageous young … 2012\nevtsperyr <- evdes %>%\n mutate(obs=1) %>%\n group_by(year) %>%\n summarise(sum_events = sum(obs))\nggplot(evtsperyr) +\n geom_line(aes(year, sum_events)) +\n theme_tufte(base_family = \"Helvetica\") + \n scale_y_continuous(expand = c(0, 0), limits = c(0, NA))"},{"path":"exercise-1-word-frequency-analysis.html","id":"tidy-the-text","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.5 Tidy the text","text":"Given data obtained API outputs data originally HTML format, text still contains HTML PHP encodings e.g. bold font paragraphs. ’ll need get rid , well punctuation analyzing data.set commands takes event descriptions, extracts individual words, counts number times appear years covered book festival data.","code":"\n#get year and word for every word and date pair in the dataset\ntidy_des <- evdes %>% \n mutate(desc = tolower(description)) %>%\n unnest_tokens(word, desc) %>%\n filter(str_detect(word, \"[a-z]\"))"},{"path":"exercise-1-word-frequency-analysis.html","id":"back-to-the-fringe","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.6 Back to the Fringe","text":"see resulting dataset large (~446k rows). commands first taken events text, “mutated” set lower case character string. “unnest_tokens” function taken individual string create new column called “word” contains individual word contained event description texts.terminology also appropriate . tidy text format, often refer data structures consisting “documents” “terms.” “tokenizing” text “unnest_tokens” functions generating dataset one term per row., “documents” collection descriptions events year Edinburgh Book Festival. way sort text “documents” depends choice individual researcher.Instead year, might wanted sort text “genre.” , two genres: “Literature” “Children.” done , two “documents,” contained words included event descriptions genre.Alternatively, might interested contributions individual authors time. case, sorted text documents author. case, “document” represent words included event descriptions events given author (many multiple appearances time festival given year).can yet tidy , though. First ’ll remove stop words ’ll remove apostrophes:see number rows dataset reduces half ~223k rows. natural since large proportion string contain many -called “stop words”. can see stop words typing:lexicon (list words) included tidytext package produced Julia Silge David Robinson (see ). see contains 1000 words. remove informative interested substantive content text (rather , say, grammatical content).Now let’s look common words data:can see one common words “rsquo,” HTML encoding apostrophe. Clearly need clean data bit . common issue large-n text analysis key step want conduct reliably robust forms text analysis. ’ll another go using filter command, specifying keep words included string words rsquo, em, ndash, nbsp, lsquo.’s like ! words feature seem make sense now (actual words rather random HTML UTF-8 encodings).Let’s now collect words data.frame object, ’ll call edbf_term_counts:year, see “book” common word… perhaps surprises . evidence ’re properly pre-processing cleaning data. Cleaning text data important element preparing text analysis. often process trial error text data looks alike, may come e.g. webpages HTML encoding, unrecognized fonts unicode, potential cause issues! 
finding errors also chance get know data…","code":"\ntidy_des <- tidy_des %>%\n filter(!word %in% stop_words$word)\nstop_words## # A tibble: 1,149 × 2\n## word lexicon\n## \n## 1 a SMART \n## 2 a's SMART \n## 3 able SMART \n## 4 about SMART \n## 5 above SMART \n## 6 according SMART \n## 7 accordingly SMART \n## 8 across SMART \n## 9 actually SMART \n## 10 after SMART \n## # ℹ 1,139 more rows\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,995 × 2\n## word n\n## \n## 1 rsquo 5638\n## 2 book 2088\n## 3 event 1356\n## 4 author 1332\n## 5 world 1240\n## 6 story 1159\n## 7 join 1095\n## 8 em 1064\n## 9 life 879\n## 10 strong 864\n## # ℹ 24,985 more rows\nremove_reg <- c(\"&\",\"<\",\">\",\"\", \"<\/p>\",\"&rsquo\", \"‘\", \"'\", \"\", \"<\/strong>\", \"rsquo\", \"em\", \"ndash\", \"nbsp\", \"lsquo\", \"strong\")\n \ntidy_des <- tidy_des %>%\n filter(!word %in% remove_reg)\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,989 × 2\n## word n\n## \n## 1 book 2088\n## 2 event 1356\n## 3 author 1332\n## 4 world 1240\n## 5 story 1159\n## 6 join 1095\n## 7 life 879\n## 8 stories 860\n## 9 chaired 815\n## 10 books 767\n## # ℹ 24,979 more rows\nedbf_term_counts <- tidy_des %>% \n group_by(year) %>%\n count(word, sort = TRUE)\nhead(edbf_term_counts)## # A tibble: 6 × 3\n## # Groups: year [6]\n## year word n\n## \n## 1 2016 book 295\n## 2 2018 book 283\n## 3 2019 book 265\n## 4 2012 book 254\n## 5 2013 book 241\n## 6 2015 book 239"},{"path":"exercise-1-word-frequency-analysis.html","id":"analyze-keywords","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.7 Analyze keywords","text":"Okay, now list words, number times appear, can tag words think might related issues gender inequality sexism. may decide list imprecise inexhaustive. , feel free change terms including grepl() function.","code":"\nedbf_term_counts$womword <- as.integer(grepl(\"women|feminist|feminism|gender|harassment|sexism|sexist\", \n x = edbf_term_counts$word))\nhead(edbf_term_counts)## # A tibble: 6 × 4\n## # Groups: year [6]\n## year word n womword\n## \n## 1 2016 book 295 0\n## 2 2018 book 283 0\n## 3 2019 book 265 0\n## 4 2012 book 254 0\n## 5 2013 book 241 0\n## 6 2015 book 239 0"},{"path":"exercise-1-word-frequency-analysis.html","id":"compute-aggregate-statistics","chapter":"17 Exercise 1: Word frequency analysis","heading":"17.8 Compute aggregate statistics","text":"Now tagged individual words relating gender inequality feminism, can sum number times words appear year denominate total number words event descriptions.intuition increase decrease percentage words relating issues capturing substantive change representation issues related sex gender.think measure? adequate measure representation issues cultural sphere?keywords used precise enough? , change?","code":"\n#get counts by year and word\nedbf_counts <- edbf_term_counts %>%\n group_by(year) %>%\n mutate(year_total = sum(n)) %>%\n filter(womword==1) %>%\n summarise(sum_wom = sum(n),\n year_total= min(year_total))\nhead(edbf_counts)## # A tibble: 6 × 3\n## year sum_wom year_total\n##
words feature seem make sense now (actual words rather random HTML UTF-8 encodings).Let’s now collect words data.frame object, ’ll call edbf_term_counts:year, see “book” common word… perhaps surprises . evidence ’re properly pre-processing cleaning data. Cleaning text data important element preparing text analysis. often process trial error text data looks alike, may come e.g. webpages HTML encoding, unrecognized fonts unicode, potential cause issues! finding errors also chance get know data…","code":"\ntidy_des <- tidy_des %>%\n filter(!word %in% stop_words$word)\nstop_words## # A tibble: 1,149 × 2\n## word lexicon\n## \", \"<\/p>\",\"&rsquo\", \"‘\", \"'\", \"\", \"<\/strong>\", \"rsquo\", \"em\", \"ndash\", \"nbsp\", \"lsquo\", \"strong\")\n \ntidy_des <- tidy_des %>%\n filter(!word %in% remove_reg)\ntidy_des %>%\n count(word, sort = TRUE)## # A tibble: 24,989 × 2\n## word n\n## tags, collect using “p” CSS selector. just grabbing text contained part page html_text() function.gives us one way capturing text, wanted get elements document, example date tags attributed document? Well can thing . Let’s take example getting date:see date identified “.calendar” CSS selector enter html_elements() function :course, well good, also need way scale—can’t just keep repeating process every page find wouldn’t much quicker just copy pasting. can ? Well need first understand URL structure website question.scroll page see listed number documents. directs individual pamphlet distributed protests 2011 Egyptian Revolution.Click one see URL changes.see starting URL :click March 2011, first month documents, see url becomes:, August 2011 becomes:, January 2012 becomes:notice month, URL changes addition month year back slashes end URL. next section, go efficiently create set URLs loop retrieve information contained individual webpage.going want retrieve text documents archived month. , first task store webpages series strings. manually , example, pasting year month strings end URL month March, 2011 January, 2012:wouldn’t particularly efficient…Instead, can wrap loop.’s going ? Well, first specifying starting URL . iterating numbers 3 13. telling R take new URL , depending number loop , take base starting url— https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/ — paste end string “2011/0”, number loop , “/”. , first “” loop—number 3—effectively calling equivalent :gives:, ifelse() commands simply telling R: (number loop ) less 10 paste0(url,\"2011/0\",,\"/\"); .e., less 10 paste “2011/0”, “” “/”. number 3 becomes:\"https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/03/\", number 4 becomes\"https://wayback.archive-.org/2358/20120130143023/http://www.tahrirdocuments.org/2011/04/\", however, >=10 & <=12 (greater equal 10 less equal 12) calling paste0(url,\"2011/\",,\"/\") need first “0” months.Finally, (else) greater 12 calling paste0(url,\"2012/01/\"). last call, notice, specify whether greater equal 12 wrapping everything ifelse() commands. ifelse() calls like , telling R x “meets condition” y, otherwise z. wrapping multiple ifelse() calls within , effectively telling R x “meets condition” y, x “meets condition” z, otherwise . , “otherwise ” part ifelse() calls saying: less 10, 10 12, paste “2012/01/” end URL.Got ? didn’t even get first reading… wrote . best way understand going run code look part .now list URLs month. next?Well go onto page particular month, let’s say March, see page multiple paginated tabs bottom. 
Let’s see happens URL click one :see starting point URL March, , :click page 2 becomes:page 3 becomes:can see pretty clearly navigate page, appears appended URL string “page/2/” “page/3/”. shouldn’t tricky add list URLs. want avoid manually click archive month figure many pagination tabs bottom page.Fortunately, don’t . Using “Selector Gadget” tool can automate process grabbing highest number appears pagination bar month’s pages. code achieves :’s going ? Well, first two lines, simply creating empty character string ’re going populate subsequent loop. Remember set eleven starting URLs months archived webpage.code beginning (seq_along(files) saying, similar , beginning url end url, following loop: first, read url url <- urls[] read html contains html <- read_html(url).line, getting pages character vector page numbers calling html_elements() function “.page” tag. gives series pages stored e.g. “1” “2” “3”.order able see many , need extract highest number appears string. , first need reformat “integer” object rather “character” object R can recognize numbers. call pageints <- .integer(pages). get maximum simply calling: npages <- max(pageints, na.rm = T).next part loop, taking new information stored “npages,” .e., number pagination tabs month, telling R: pages, define new url adding “page/” number pagination tab “j”, “/”. ’ve bound together, get list URLs look like :next?next step get URLs documents contained archive month. ? Well, can use “Selector Gadget” tool work . main landing pages month, see listed, , document list. documents, see title, links revolutionary leaflet question, two CSS selectors: “h2” “.post”.can pass tags html_elements() grab ’s contained inside. can grab ’s contained inside extracting “children” classes. essence, just means lower level tag: tags can tags within tags flow downwards like family tree (hence name, suppose).one “children” HTML tag link contained inside, can get calling html_children() followed specifying want specific attribute web link encloses html_attr(\"href\"). subsequent lines just remove extraneous information.complete loop, , retrieve URL page every leaflet contained website :gives us:see now collected 523 separate URLs every revolutionary leaflet contained pages. Now ’re great position able crawl page collect information need. final loop need go URL ’re interested collect relevant information document text, title, date, tags, URL image revolutionary literature .See can work part fitting together. 
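To make the structure of those steps concrete, here is a minimal sketch of the first one described above: building the eleven month URLs with paste0() and ifelse(). It assumes the base URL is the wayback.archive-it.org address printed earlier, and, following the object name referred to above, stores the result in files; the loop used in the course materials may differ in its details.\nurl <- \"https://wayback.archive-it.org/2358/20120130143023/http://www.tahrirdocuments.org/\"\n\nfiles <- character(0) # will hold one archive URL per month: 2011/03 through 2012/01\n\nfor (i in 3:13) {\n  month_url <- ifelse(i < 10, paste0(url, \"2011/0\", i, \"/\"),\n                ifelse(i >= 10 & i <= 12, paste0(url, \"2011/\", i, \"/\"),\n                       paste0(url, \"2012/01/\")))\n  files <- c(files, month_url)\n}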
NOTE: if you want to run the final loop over all of the documents on your own machine, be aware that it will take several hours to complete.And now… we’re pretty much there… back to where we started!","code":"\nlibrary(tidyverse) # loads dplyr, ggplot2, and others\nlibrary(ggthemes) # includes a set of themes to make your visualizations look nice!\nlibrary(readr) # more informative and easy way to import data\nlibrary(stringr) # to handle text elements\nlibrary(rvest) #for scraping\npamphdata <- read_csv(\"data/sampling/pamphlets_formatted_gsheets.csv\")## Rows: 523 Columns: 8\n## ── Column specification ─────────────────────────────────────────────────────────\n## Delimiter: \",\"\n## chr (6): title, text, tags, imageurl, imgID, image\n## dbl (1): year\n## date (1): date\n## \n## ℹ Use `spec()` to retrieve the full column specification for this data.\n## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\npamphdata <- read_csv(\"https://github.com/cjbarrie/CTA-ED/blob/main/data/sampling/pamphlets_formatted_gsheets.csv\")\nhead(pamphdata)## # A tibble: 6 × 8\n## title date year text tags imageurl imgID image"}