Scripts for: Through the compression glass: language complexity and the structure of algorithmically compressed strings (version 1.0)
https://zenodo.org/badge/latestdoi/299907289
This repository provides scripts and instructions for retrieving and processing gzip’s debugging output, which is analysed in the related publication:
- Ehret, Katharina. 2024. "Through the compression glass: language complexity and the linguistic structure of compressed strings". Linguistics Vanguard. DOI: https://doi.org/10.1515/lingvan-2022-0140
Against the background of the sociolinguistic-typological debate on language complexity, which centres on measuring and explaining variability in language complexity, the publication presents an in-depth analysis of algorithmically compressed texts. Specifically, it examines the formal and linguistic structure of compressed text sequences as retrieved from gzip’s debugging output (or lexicon). Compression algorithms like gzip are sometimes employed to approximate language complexity via the information content, or complexity, of texts. The publication focuses on the compression technique, an information-theoretic measure based on Kolmogorov complexity. Scripts for the implementation of the technique are available here.
All scripts were tested on Debian GNU/Linux 9. Additionally, the following open source programs were used: R, version 3.6.3 (2020-02-29), and gzip, version 1.6.
This repository contains the following resources (in alphabetical order):
- debugout_analysis.r
An R file listing the commands to format and process gzip’s debugging output. Specifically, commands are provided to extract the complete lexicon entries, the distance to the previous identical sequence, and the length and frequency of compressed strings, as well as commands for some basic distributional analyses. The file requires debugout_functions.r.
- debugout_functions.r
An R file containing custom-made functions for formatting and processing gzip’s debugging output. The file needs to be stored in the same directory as the file debugout_analysis.r.
- makedebug.sh
A shell script which calls gzip’s debug version (dgzip) and runs the commands for lexicon retrieval (see below).
To build the debugging version of gzip and save it as a separate program (dgzip), use the following shell commands:
apt-get source gzip
cd gzip-*
./configure
make --debug > dgzip
In order to retrieve the debug output (or lexicon of compressed strings), the input text is first compressed with gzip. The compressed stream is then piped to the debug version dgzip, which is called for verbose decompression (-d to decompress, -v given twice for increased verbosity):
gzip -f < input.txt | dgzip -d -v -v -f
The lexicon can be retrieved and saved using makedebug.sh:
makedebug.sh input.txt > output.txt
Note
For replication of the analysis presented in the related publication, the input text should be pre-processed as follows. All punctuation, superfluous whitespace, and non-ASCII characters (e.g. multi-byte UTF-8 characters) or similar should be removed, and the text should be converted to lowercase. Different formatting might result in differences in the frequency and length of compressed strings.
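The pre-processing described in this note could be sketched as a small shell filter. This is only an illustration under stated assumptions, not the pipeline used for the publication: the function name preprocess is hypothetical, "UTF-8 characters" is interpreted as any non-ASCII byte, and "additional whitespace" is interpreted as runs of spaces or tabs plus leading and trailing blanks on each line.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Reads text from standard input and writes the normalised text to
# standard output. Steps, in pipeline order:
#   1. keep only ASCII tab, newline and printable ASCII bytes
#      (assumption: "UTF-8 characters" means non-ASCII bytes)
#   2. convert uppercase letters to lowercase
#   3. delete ASCII punctuation
#   4. turn tabs into spaces and collapse runs of blanks into one space
#   5. trim the single leading or trailing space left on each line
preprocess() {
    LC_ALL=C tr -cd '\11\12\40-\176' |
    LC_ALL=C tr '[:upper:]' '[:lower:]' |
    LC_ALL=C tr -d '[:punct:]' |
    LC_ALL=C tr -s '[:blank:]' ' ' |
    sed -e 's/^ //' -e 's/ $//'
}
```

Assuming a raw text file called raw.txt (a placeholder name), the filter could be used as preprocess < raw.txt > input.txt before running makedebug.sh on input.txt. Note that deleting punctuation also removes apostrophes and hyphens inside words, which may or may not match the pre-processing used in the publication.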
To cite this resource please cite the related publication:
Ehret, Katharina. 2024. "Through the compression glass: language complexity and the linguistic structure of compressed strings". Linguistics Vanguard. DOI: https://doi.org/10.1515/lingvan-2022-0140