encoding/base32: Add RFC-compliant error handling and improve reliability #4641

zolk3ri · 2024-12-31T17:08:35Z

The current implementation of base32.decode() uses assertions for (incomplete) validation, which causes programs to crash on invalid input. This PR replaces assertions with proper error handling according to RFC 4648, allowing users of the package to handle invalid input gracefully.

Motivation: While implementing a pure Odin implementation of TOTP (Time-based One-Time Password), I discovered that base32.decode() would segfault on invalid input due to assertions. As a core library package, it should provide proper error handling to allow users to gracefully handle decode errors, which these changes enable.

Problem Description

Programs crash with segfaults on invalid base32 input due to assertions
Explicit disabling of bounds checking (#no_bounds_check) prevents proper error handling
Buffer allocation issues can cause out-of-bounds access
Non-compliance with RFC 4648 padding rules

The original implementation prioritized performance by disabling bounds checking, but this tradeoff is not appropriate for a core library package. The small or potentially negligible performance overhead from bounds checking is worth the improved safety and usability.

Implementation Details

The error enum:

Error :: enum {
  None,
  Invalid_Character, // Input contains characters outside the specified alphabet
  Invalid_Length,    // Input length is not valid for base32 (must be a multiple of 8 with proper padding)
  Malformed_Input,   // Input has improper structure (wrong padding position or incomplete groups)
}

Before (crashes on invalid input):

data := decode(input)
if data == nil { // Only checks for empty input
  return nil
}
defer delete(data)

After (error handling available):

data, err := decode(input)
if err != .None {
  #partial switch err {
    case .Invalid_Character:
    // Handle error where input contains characters not in base32 alphabet
    case .Malformed_Input:
    // Handle error where invalid padding structure or position
    case .Invalid_Length:
    // Handle error where input length is not valid for base32
  }
}
defer delete(data)

Key implementation changes:

Function now returns ([]byte, Error) instead of just []byte
Removed #no_bounds_check directive for safety
Proper buffer size calculation
Added bounds validation throughout
Proper padding validation
RFC 4648 compliant error handling
Support for custom alphabets with custom validation

Design Decisions

Bounds Checking:
- Original code used #no_bounds_check for performance
- Removed this directive despite small (if any) performance impact
- Proper error handling requires bounds validation
- Safety and usability outweigh minor performance cost
- Follows Odin's philosophy of being explicit and safe by default
Error Handling:
- Simple enum with distinct error cases
- Clear separation between character, length, and structure errors
- RFC 4648 compliant validation
- Empty input returns (nil, .None) by convention
Testing:
- RFC 4648 test vectors
- Tests all error conditions
- Verifies correct handling of edge cases

Breaking Changes

Function Signature:
- decode() now returns ([]byte, Error) instead of []byte
- Error handling available but not required
Performance:
- Minor overhead from bounds checking
- No additional allocations for error handling
Error Handling:
- Programs can handle errors if desired
- No more unexpected crashes on invalid input

Testing

Test coverage includes:

RFC 4648 test vectors and validation rules
Character validation (invalid characters and case sensitivity)
Padding validation (placement and length requirements)
Length validation
Roundtrip encoding/decoding
Custom alphabet support

All tests pass with 100% coverage of the new functionality and error cases.

Future Work

Add similar error handling improvements to encoding/base64 package
Consider adding error handling to encode() for input validation

Thank you for considering this contribution.

Happy New Year! :)

Replace assertions with proper error handling in base32.decode() to allow programs to handle invalid input gracefully rather than crashing. The function now returns ([]byte, Error) instead of just []byte.

Fix buffer allocation size calculation and add proper bounds checking to ensure output buffer has sufficient space. This fixes crashes that could occur with inputs like "AA" and other edge cases where the output buffer was too small. Remove #no_bounds_check as proper bounds checking is necessary for safe error handling. The small performance trade-off is worth the improved robustness.

Add test suite based on RFC 4648 test vectors and validation rules: - Add section 10 test vectors for valid encoding/decoding - Add test cases for invalid character handling (section 3.2) - Add test cases for padding validation (section 4) - Add test cases for length requirements (section 6) The test vectors verify that: - Empty string encodes/decodes correctly - Standard cases like "foo" -> "MZXW6===" work - Invalid characters are rejected - Missing or malformed padding is detected - Invalid lengths are caught

Rework base32.decode() to properly handle all cases per RFC 4648: - Fix error detection order: - Check minimum length first (Invalid_Length) - Check character validity (Invalid_Character) - Check padding and structure (Malformed_Input) - Fix padding validation: - Add required padding length checks (2=6, 4=4, 5=3, 7=1 chars) - Ensure padding only appears at end - Fix handling of unpadded inputs - Fix buffer handling: - Proper output buffer size calculation - Add bounds checking for buffer access - Add proper buffer validation For example: - "M" correctly returns Invalid_Length (too short) - "mzxq====" correctly returns Invalid_Character (lowercase) - "MZXQ=" correctly returns Malformed_Input (wrong padding) - Unpadded input lengths must be multiples of 8 These changes make the decode function fully compliant with RFC 4648 requirements while providing proper error handling.

Fix memory handling throughout base32 package: - Make padding map package-level constant (to avoid repeated allocs) - Use passed allocator in encode's make() call - Add defer delete for allocated memory in encode - Add proper cleanup in test cases - Fix memory cleanup of output buffers The changes ensure consistent allocator usage and cleanup in both implementation and tests.

Replace package-level map with a simple switch statement for padding validation. This eliminates allocations we can't properly free while maintaining the same validation logic.

Add support for custom alphabet validation through an optional validation function parameter. The default validation follows RFC 4648 base32 alphabet rules (A-Z, 2-7). This properly supports the documented ability to use custom alphabets.

Fix encoding to properly use provided encoding table parameter instead of hardcoded `ENC_TABLE`. This makes encode properly support custom alphabets as documented.

The decoding table was only 224 bytes which caused type mismatches when using custom alphabets, so expand with zeroes to cover full byte range while maintaining the same decoding logic.

Move existing test procedures to a dedicated test file for better code organization and maintainability.

Add test_base32_roundtrip() to verify the encode->decode roundtrip preserves data integrity. This test helps ensure our base32 implementation correctly handles the full encode->decode cycle without data loss or corruption.

Remove premature deallocation of the output buffer which was causing use-after-free behavior. The returned string needs to take ownership of this memory, but the defer delete was freeing it before the string could be used. This fixes issues with encoding that were introduced by overly aggressive memory cleanup in 93238db.

Add test case to verify custom alphabet support. The test uses a decimal-uppercase alphabet (0-9, A-V) to test both encoding and decoding with custom tables, including validation. This ensures the encode and decode functions work correctly with custom encoding tables and validation functions as documented.

Add defer delete for encoded strings across all test procedures to ensure proper cleanup and prevent memory leaks. This completes the memory management improvements started in 591dd87.

Fix incorrect RFC 4648 section references: - Add RFC URL reference at package level - Update Error enum documentation to reference correct sections: - Invalid_Character: Section 3.3 (non-alphabet characters) - Invalid_Length: Section 6 (base32 block size requirements) - Malformed_Input: Section 3.2 (padding) - Fix test file section references to match correct sections This ensures all RFC references are accurate and adds a link to the source RFC for reference.

bullyingteen · 2025-01-01T21:44:30Z

encoding/base64 needs the same treatment I guess?
See base64.decode_into

zolk3ri · 2025-01-01T21:47:27Z

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

bullyingteen · 2025-01-01T22:15:34Z

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

Missed it ngl 🙂