Compinion: Analysing complexity and subjectivity

Replication data and scripts for: The interplay of complexity and subjectivity in opinionated discourse. (version 1.0)

DOI

https://zenodo.org/badge/latestdoi/189996444

Description

This repository comprises the original data, scripts and extensive statistics for the analysis of text complexity and subjectivity described in the related publication.

This publication is a large-scale, quantitative analysis of text complexity and various markers of subjectivity in opinionated discourse. Specifically, the authors investigate how text complexity interacts with markers of subjectivity to characterise (i) opinion articles, (ii) reader comments, and (iii) news articles. Methodologically, conditional inference trees and random forests (as implemented in the R package partykit) are used to unravel the interactions between text complexity and subjectivity. Text complexity is defined in terms of Kolmogorov complexity, i.e., the complexity of a text is measured as the length of the shortest possible description necessary to regenerate the original text. Subjectivity is operationalised as the frequency of lexico-grammatical markers of subjectivity and argumentation which have been well-established in research on sentiment, evaluation, stance and Appraisal.
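Kolmogorov complexity itself is not computable, so in practice it is usually approximated by the size of a text after lossless compression. The following minimal sketch illustrates that idea with gzip from the Python standard library. The function name `compressed_size`, the choice of gzip and the sample strings are illustrative assumptions, not necessarily the procedure used in the publication.

```python
import gzip


def compressed_size(text: str) -> int:
    """Return the size in bytes of the gzip-compressed UTF-8 encoding of text.

    Assumption: compressed size is used as a proxy for Kolmogorov complexity.
    """
    return len(gzip.compress(text.encode("utf-8")))


# Made-up sample texts: one highly repetitive, one with more varied tokens.
repetitive = "the cat sat on the mat. " * 50
varied = " ".join(f"token{i * 7919 % 1000}" for i in range(300))

# A repetitive text admits a shorter description, so it compresses to fewer bytes.
print(compressed_size(repetitive), compressed_size(varied))
```

Under this approximation, a text that compresses to fewer bytes counts as less complex than a text of similar length that compresses poorly.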

The data published in this repository was retrieved from the Simon Fraser University opinion and comments corpus (SOCC) and a custom-made corpus of general news articles from the Canadian online newspaper The Globe and Mail.

Overview and description of folders and files

This repository contains the following resources (in alphabetical order):

Data

This folder contains the original dataset.

  • aggregate_totals_normalised.csv: The feature matrix with the individual file names as rows and textType, year, tokens, the raw and normalised feature frequencies, and the complexity scores as columns. The normalised feature frequencies of the subjectivity and argumentation markers were calculated by dividing the raw feature frequencies by the number of tokens per file and multiplying the result by 1000.

  • markerDistributions.csv: The raw frequencies of the individual subjectivity and argumentation markers per text type.
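The normalisation described for aggregate_totals_normalised.csv yields a frequency per 1000 tokens. A minimal sketch of that arithmetic follows; the function name `normalise_per_thousand` and the example counts are hypothetical and not taken from the dataset.

```python
def normalise_per_thousand(raw_count: int, n_tokens: int) -> float:
    """Return the frequency of a marker per 1000 tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    # Raw frequency divided by the token count, then scaled to 1000 tokens.
    return raw_count / n_tokens * 1000


# Hypothetical file: 12 modals in a text of 800 tokens.
rate = normalise_per_thousand(12, 800)
print(round(rate, 2))
```

Normalising in this way makes marker frequencies comparable across files of different lengths.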

Subjectivity

This folder comprises the complete lists of subjectivity and argumentation markers described in the related publication.

  • other_features: A folder containing the lists of the argumentation markers adverbials, connectives and modals.

  • socal_features: A folder with two subdirectories containing reduced feature lists of subjectivity markers from the Semantic Orientation CALculator (SO-CAL). Specifically, only subjectivity features with a valency of 4 and 5 are included.

    • socal_invariant: negative and positive adverbs.
    • socal_variant: negative and positive adjectives, nouns and verbs.

Scripts

This folder contains the scripts for data analysis and the retrieval of the subjectivity markers.

  • compinion.r: R commands for computing and visualising the statistics, conditional inference trees and forests presented in the related publication. Only tested on Debian GNU/Linux, using R version 3.6.2.

  • countFeat.py: A Python script for retrieving the subjectivity and argumentation markers (see Subjectivity).

  • countFeat.md: README with instructions on how to run countFeat.py.
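As a rough illustration of what counting markers from a word list involves, the sketch below tallies single-word markers in a text. It is not the code of countFeat.py: the function `count_markers`, the simple regular-expression tokenisation and the three-word marker list are all hypothetical, and multi-word markers (e.g. some connectives) would need additional handling. See countFeat.md for the actual usage.

```python
import re
from collections import Counter


def count_markers(text: str, markers: set[str]) -> Counter:
    """Count how often each single-word marker occurs in text (case-insensitive)."""
    # Assumption: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tok for tok in tokens if tok in markers)


# Hypothetical marker list and sentence, for illustration only.
modals = {"must", "should", "might"}
counts = count_markers("You should go; it might rain, and you should hurry.", modals)
print(counts)
```

The resulting raw counts per file are the kind of values that would then be normalised by text length.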

Statistics

This folder contains all statistics described in the related publication and additional statistics.

  • The confusion matrices of the training and test datasets for conditional inference forests with N = 500, 1000, 2000 trees, respectively. Confusion matrices are used to calculate model performance, i.e. prediction accuracy.

    • confMat_500.csv and confMatTest_500.csv
    • confMat_1000.csv and confMatTest_1000.csv
    • confMat_2000.csv and confMatTest_2000.csv
  • correlations.csv: The Pearson correlation coefficients between all predictor variables described in the related publication, i.e. year, morphological complexity, syntactic complexity, overall complexity, subjective negative markers, subjective positive markers, modals, connectives, adverbials.

  • tunegridTree.csv: A CSV file reporting the training and test accuracy for conditional inference trees grown with varying parameter settings. Three parameters were varied in tuning the tree: mincriterion, minbucket and maxsurrogate (for a detailed description of the parameters see https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf).

  • The rankings of the nine predictor variables according to the conditional permutation-importance measure, which indicates the importance of individual predictor variables. The measure was calculated for three differently sized conditional inference forests, i.e. forests with N = 500, 1000, 2000 trees, respectively.

    • varimp500.csv
    • varimp1000.csv
    • varimp2000.csv
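Prediction accuracy, as mentioned for the confusion matrices above, is the share of correctly classified texts: the sum of the diagonal cells divided by the sum of all cells. The sketch below shows this calculation on a made-up 3 × 3 matrix for the three text types. The counts are illustrative and not taken from the CSV files, and no assumption is made here about the exact layout of those files.

```python
def accuracy(conf_mat: list[list[int]]) -> float:
    """Return the prediction accuracy for a square confusion matrix."""
    total = sum(sum(row) for row in conf_mat)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Correct predictions lie on the main diagonal.
    correct = sum(conf_mat[i][i] for i in range(len(conf_mat)))
    return correct / total


# Hypothetical counts; rows and columns stand for the three text types
# (opinion articles, reader comments, news articles).
example = [
    [50, 3, 2],
    [4, 45, 6],
    [1, 5, 44],
]
print(round(accuracy(example), 3))
```

Comparing the accuracy on the training matrix with the accuracy on the test matrix gives a quick indication of how well a forest generalises.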