CorpusAid is a user-friendly software tool for preprocessing files during corpus compilation. It applies both standard and custom cleaning parameters across an entire corpus, so a small collection of 10 files and a large dataset of 10,000 documents are preprocessed in the same consistent way.
CorpusAid offers a robust set of features designed to streamline and enhance the corpus preprocessing workflow:
CorpusAid provides a rich array of text cleaning options, giving users fine-grained control over their preprocessing tasks:
- Lowercase Conversion: Standardizes text by converting all characters to lowercase, ensuring consistency in text analysis.
- Whitespace Normalization: Ensures uniform spacing throughout the text by removing redundant spaces, tabs, and other whitespace characters.
- Line Break Removal: Merges multiple lines to create continuous text, useful for certain types of analysis that require unbroken text streams.
- Unicode Normalization: Converts text to a standard Unicode normalization form (NFC, NFD, NFKC, or NFKD), so that the same character is always represented by the same code points across different languages and scripts.
- Punctuation Removal: Strips away punctuation marks to focus on core textual content.
- Number Removal: Eliminates numerical digits when they're not relevant to the analysis.
- Special Characters Removal: Removes symbols and characters that may interfere with text processing algorithms.
- Diacritic Removal: Strips accents and diacritical marks from characters, useful for certain types of cross-linguistic analysis.
- Greek and Cyrillic Character Removal: Selectively filters out specific scripts as required by the research parameters.
- HTML Tag Stripping: Removes HTML tags to extract plain text content from web-scraped or marked-up documents.
- Bibliographical Reference Removal: Automatically identifies and removes in-text bibliographical references (e.g., citations like `(Smith, 2020)`), cleaning up the text for analysis.
- Custom Regular Expression Filtering: Allows users to define and apply custom patterns for advanced text filtering and extraction.
- Lemmatization: Reduces words to their base or dictionary form (lemmas), aiding in linguistic analysis by grouping together different forms of a word.
- Sentence Tokenization: Splits text into individual sentences, fundamental for tasks like sentiment analysis and syntactic parsing.
- Word Tokenization: Divides text into individual words or tokens, essential for word-level analysis and processing.
- Stop Word Removal: Excludes common, non-informative words (e.g., "the", "and") to focus on meaningful content.
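
Several of the cleaning options above correspond to short, well-known text operations. The following is a minimal Python sketch of how a few of them could be chained. It is not CorpusAid's actual implementation: the function names, the regular expressions, and the order of the steps are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; CorpusAid's own rules may differ.
HTML_TAG = re.compile(r"<[^>]+>")
# Author-year citations such as "(Smith, 2020)" or "(Smith et al., 2020a)".
CITATION = re.compile(r"\s*\([A-Z][^()\d]*,\s*\d{4}[a-z]?\)")
WHITESPACE = re.compile(r"\s+")


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text: str) -> str:
    """Apply a fixed sequence of cleaning steps to one document."""
    text = unicodedata.normalize("NFC", text)  # Unicode normalization
    text = HTML_TAG.sub(" ", text)             # HTML tag stripping
    text = CITATION.sub("", text)              # bibliographical reference removal
    text = strip_diacritics(text)              # diacritic removal
    text = text.lower()                        # lowercase conversion
    return WHITESPACE.sub(" ", text).strip()   # whitespace normalization and line break removal


if __name__ == "__main__":
    print(clean("<p>The  Café study (Smith, 2020) was\nreplicated.</p>"))
```

The order matters in this sketch: citations are matched before lowercasing because the assumed pattern relies on a capitalized author name, and whitespace is collapsed last so that gaps left by removed tags and citations disappear.
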
- Process entire directories of text files in a single run.
- Apply consistent preprocessing across large corpora, ensuring uniformity in your dataset.
- Dramatically reduce processing time compared to manual or file-by-file approaches.
- User-friendly interface designed for researchers of all technical levels.
- Easy-to-navigate controls for selecting and applying preprocessing parameters.
- Real-time preview feature to see the effects of selected parameters on sample text.
- Progress indicators for tracking batch processing operations.
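
The batch processing idea described above (one set of parameters applied to every file in a directory) can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not CorpusAid's code: `process_directory` and its arguments are hypothetical names, files are assumed to be UTF-8, and they are handled one after another rather than in parallel.

```python
from pathlib import Path
from typing import Callable


def process_directory(src: Path, dst: Path, transform: Callable[[str], str]) -> int:
    """Apply `transform` to every .txt file under `src` and write results to `dst`.

    Source files are only read, never modified. Returns the number of files written.
    """
    count = 0
    # sorted() collects the full file list before anything is written.
    for path in sorted(src.rglob("*.txt")):
        target = dst / path.relative_to(src)  # mirror the source folder layout
        target.parent.mkdir(parents=True, exist_ok=True)
        text = path.read_text(encoding="utf-8")
        target.write_text(transform(text), encoding="utf-8")
        count += 1
    return count
```

For example, `process_directory(Path("raw"), Path("clean"), str.lower)` would write a lowercased copy of every `.txt` file in a hypothetical `raw` folder into `clean`, leaving the originals untouched.
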
After processing, CorpusAid generates a summary report on your corpus that includes:
- Word frequency distributions
- Sentence and token counts
- Type-token ratio analysis
- Corpus size statistics (pre- and post-processing)
- Applied preprocessing parameters summary
- Processing time and performance metrics
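
To make the report items above concrete, here is a small Python sketch of how word frequencies, token and sentence counts, and the type-token ratio (distinct word types divided by total tokens) can be computed. The tokenization is deliberately naive, and `corpus_stats` is a hypothetical helper; how CorpusAid computes its report is not specified here.

```python
import re
from collections import Counter


def corpus_stats(text: str, top_n: int = 5) -> dict:
    """Compute basic corpus statistics for a single text."""
    # Naive tokenization: tokens are runs of word characters,
    # sentences are split on ., ! or ?.
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    freq = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(freq),
        "sentences": len(sentences),
        "type_token_ratio": len(freq) / len(tokens) if tokens else 0.0,
        "top_words": freq.most_common(top_n),
    }


if __name__ == "__main__":
    print(corpus_stats("The cat sat. The cat ran!", top_n=2))
```

For the sample sentence pair, the text has 6 tokens and 4 types, which gives a type-token ratio of about 0.67.
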
- Save and load preprocessing profiles for consistent application across projects.
- Adjustable parameters to fine-tune preprocessing for specific research needs.
- Support for multiple input and output file formats (e.g., .txt, .csv, .json).
- Non-destructive processing: always keeps original files intact.
- Option to create backups automatically before processing.
- Detailed logging for audit trails and reproducibility.
- Efficiently handles corpora of all sizes, from small datasets to large-scale collections.
- Optimized for performance on both personal computers and high-performance computing environments.
- Export preprocessed data in formats compatible with popular corpus analysis tools.
- Integration capabilities with other NLP pipelines and workflows.
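
A saved preprocessing profile, as mentioned in the list above, is essentially a named set of parameter values that can be written to disk and reloaded later. The sketch below shows one way this could work using JSON. The `Profile` fields and the file format are assumptions made for illustration; CorpusAid's actual profile format is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Profile:
    """Hypothetical set of preprocessing switches; field names are illustrative."""
    lowercase: bool = False
    normalize_unicode: bool = False
    remove_diacritics: bool = False
    strip_html: bool = False
    custom_patterns: list[str] = field(default_factory=list)


def save_profile(profile: Profile, path: Path) -> None:
    """Write the profile to `path` as human-readable JSON."""
    path.write_text(json.dumps(asdict(profile), indent=2), encoding="utf-8")


def load_profile(path: Path) -> Profile:
    """Rebuild a profile from a JSON file written by save_profile."""
    return Profile(**json.loads(path.read_text(encoding="utf-8")))
```

Storing the parameters in a plain file like this is what makes a run reproducible: the same profile can be reapplied to a new corpus or shared alongside the results.
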
CorpusAid features an intuitive GUI that guides you through the text preprocessing workflow:
- Loading Files:
  - Click "Open Files" or "Open Directory" to load your corpus
  - Supports single or multiple .txt files
  - Drag and drop files directly into the application
- Configuring Parameters: Navigate to Settings > Processing Parameters to configure preprocessing options organized in four main categories:
  - Basic Cleanup:
    - Remove Page Delimiters: Eliminates PDF conversion artifacts like "--- Page X ---" that often appear between pages
    - Remove Page Numbers: Detects and removes standalone numbers commonly used for page numbering (both Arabic and Roman numerals)
    - Normalize Line Breaks: Fixes irregular line breaks and scattered characters that often result from PDF-to-text conversion
    - Join Break Lines: Combines separated lines into continuous paragraphs while preserving intentional paragraph breaks
  - Text Transformation:
    - Convert to Lowercase: Standardizes all text to lowercase for consistent analysis
    - Normalize Unicode: Standardizes different Unicode representations of the same character (e.g., combining diacritics vs. precomposed characters)
    - Remove Diacritics: Strips accent marks from characters (e.g., 'é' becomes 'e') while preserving the base letter
    - Word Tokenization: Splits text into individual words, handling contractions and special cases appropriately
    - Remove Stop Words: Eliminates common words (e.g., "the", "and", "in") that typically don't contribute to content analysis
  - Character Sets:
    - Remove Greek/Cyrillic characters: Removes non-Latin alphabet characters that might appear in mathematical formulas or references
    - Remove Superscript/Subscript: Eliminates raised or lowered characters often used in mathematical notation or footnotes
    - Strip HTML tags: Removes any HTML markup while preserving the text content
  - Advanced:
    - Custom Regex Patterns: Define custom search patterns to remove specific text structures (e.g., figure captions, table headers)
    - Remove Bibliographical References: Identifies and removes citation patterns like "(Author, Year)" or "[1]" from the text
- Processing:
  - Click "Process Files" in the toolbar
  - Monitor progress in real-time
  - Review results in the Preview tab
- Saving:
  - Review changes before saving
  - Original files are backed up automatically
  - Export processed text with the "Save" button
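
The Basic Cleanup options listed under Configuring Parameters target artifacts of PDF-to-text conversion. As a rough illustration of what such a step involves, the Python sketch below drops page delimiters and standalone page numbers, then joins broken lines while keeping blank-line paragraph breaks. The patterns and the `basic_cleanup` function are assumptions, not CorpusAid's own rules, and real converter output varies.

```python
import re

# Assumed artifact formats, e.g. "--- Page 12 ---".
PAGE_DELIMITER = re.compile(r"\s*-{3}\s*Page\s+\d+\s*-{3}\s*", re.IGNORECASE)
# A line holding only an Arabic number or Roman-numeral letters.
# Caveat: this also matches one-word lines such as "I" or "mix".
PAGE_NUMBER = re.compile(r"\s*(?:\d{1,4}|[ivxlcdm]+)\s*", re.IGNORECASE)


def basic_cleanup(text: str) -> str:
    """Remove page artifacts, then join broken lines within each paragraph."""
    kept = [
        line
        for line in text.splitlines()
        if not PAGE_DELIMITER.fullmatch(line) and not PAGE_NUMBER.fullmatch(line)
    ]
    # Blank lines mark intentional paragraph breaks; single line breaks
    # inside a paragraph are replaced by spaces.
    paragraphs = re.split(r"\n\s*\n", "\n".join(kept))
    joined = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in joined if p)
```

Filtering whole lines before joining means a page number that interrupts a paragraph does not split it in two. The Roman-numeral pattern is intentionally simple and can misfire on short lines, which is one reason a preview before saving is useful.
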
While CorpusAid is powerful for basic text preprocessing, users should be aware of its limitations:
- Language Support:
  - Basic support for other Latin-script languages
  - Limited handling of right-to-left scripts
- Processing Constraints:
  - Recommended corpus size: Up to 100,000 files
  - Memory usage scales with file size
- Text Analysis:
  - No semantic analysis capabilities
  - No sentiment analysis
  - No named entity recognition
  - Basic statistical reporting only
- File Formats:
  - Only processes plain text (.txt) files
  - No direct support for PDF, DOC, or other formats
  - HTML handling limited to tag removal
CorpusAid is distributed as a standalone executable installer for Windows.
- Download the `corpusaid_win_setup` file from https://github.com/jhlopesalves/CorpusAid/releases.
- Run the installer and follow the on-screen instructions.
- Once the installation is complete, you can launch CorpusAid from your desktop or Start Menu.