Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/base32: Add RFC-compliant error handling and improve reliability #4641

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

zolk3ri
Copy link

@zolk3ri zolk3ri commented Dec 31, 2024

The current implementation of base32.decode() uses assertions for (incomplete) validation, which causes programs to crash on invalid input. This PR replaces assertions with proper error handling according to RFC 4648, allowing users of the package to handle invalid input gracefully.

Motivation: While implementing a pure Odin implementation of TOTP (Time-based One-Time Password), I discovered that base32.decode() would segfault on invalid input due to assertions. As a core library package, it should provide proper error handling to allow users to gracefully handle decode errors, which these changes enable.

Problem Description

  • Programs crash with segfaults on invalid base32 input due to assertions
  • Explicit disabling of bounds checking (#no_bounds_check) prevents proper error handling
  • Buffer allocation issues can cause out-of-bounds access
  • Non-compliance with RFC 4648 padding rules

The original implementation prioritized performance by disabling bounds checking, but this tradeoff is not appropriate for a core library package. The small or potentially negligible performance overhead from bounds checking is worth the improved safety and usability.

Implementation Details

The error enum:

Error :: enum {
  None,
  Invalid_Character, // Input contains characters outside the specified alphabet
  Invalid_Length,    // Input length is not valid for base32 (must be a multiple of 8 with proper padding)
  Malformed_Input,   // Input has improper structure (wrong padding position or incomplete groups)
}

Before (crashes on invalid input):

data := decode(input)
if data == nil { // Only checks for empty input
  return nil
}
defer delete(data)

After (error handling available):

data, err := decode(input)
if err != .None {
  #partial switch err {
    case .Invalid_Character:
    // Handle error where input contains characters not in base32 alphabet
    case .Malformed_Input:
    // Handle error where invalid padding structure or position
    case .Invalid_Length:
    // Handle error where input length is not valid for base32
  }
}
defer delete(data)

Key implementation changes:

  • Function now returns ([]byte, Error) instead of just []byte
  • Removed #no_bounds_check directive for safety
  • Proper buffer size calculation
  • Added bounds validation throughout
  • Proper padding validation
  • RFC 4648 compliant error handling
  • Support for custom alphabets with custom validation

Design Decisions

  1. Bounds Checking:

    • Original code used #no_bounds_check for performance
    • Removed this directive despite small (if any) performance impact
    • Proper error handling requires bounds validation
    • Safety and usability outweigh minor performance cost
    • Follows Odin's philosophy of being explicit and safe by default
  2. Error Handling:

    • Simple enum with distinct error cases
    • Clear separation between character, length, and structure errors
    • RFC 4648 compliant validation
    • Empty input returns (nil, .None) by convention
  3. Testing:

    • RFC 4648 test vectors
    • Tests all error conditions
    • Verifies correct handling of edge cases

Breaking Changes

  1. Function Signature:

    • decode() now returns ([]byte, Error) instead of []byte
    • Error handling available but not required
  2. Performance:

    • Minor overhead from bounds checking
    • No additional allocations for error handling
  3. Error Handling:

    • Programs can handle errors if desired
    • No more unexpected crashes on invalid input

Testing

Test coverage includes:

  • RFC 4648 test vectors and validation rules
  • Character validation (invalid characters and case sensitivity)
  • Padding validation (placement and length requirements)
  • Length validation
  • Roundtrip encoding/decoding
  • Custom alphabet support

All tests pass with 100% coverage of the new functionality and error cases.

Future Work

  • Add similar error handling improvements to encoding/base64 package
  • Consider adding error handling to encode() for input validation

Thank you for considering this contribution.

Happy New Year! :)

Replace assertions with proper error handling in base32.decode() to allow
programs to handle invalid input gracefully rather than crashing.

The function now returns ([]byte, Error) instead of just []byte.
Fix buffer allocation size calculation and add proper bounds checking to
ensure output buffer has sufficient space. This fixes crashes that could
occur with inputs like "AA" and other edge cases where the output buffer
was too small.

Remove #no_bounds_check as proper bounds checking is necessary for safe
error handling. The small performance trade-off is worth the improved
robustness.
Add test suite based on RFC 4648 test vectors and validation rules:
- Add section 10 test vectors for valid encoding/decoding
- Add test cases for invalid character handling (section 3.2)
- Add test cases for padding validation (section 4)
- Add test cases for length requirements (section 6)

The test vectors verify that:
- Empty string encodes/decodes correctly
- Standard cases like "foo" -> "MZXW6===" work
- Invalid characters are rejected
- Missing or malformed padding is detected
- Invalid lengths are caught
Rework base32.decode() to properly handle all cases per RFC 4648:

- Fix error detection order:
  - Check minimum length first (Invalid_Length)
  - Check character validity (Invalid_Character)
  - Check padding and structure (Malformed_Input)

- Fix padding validation:
  - Add required padding length checks (2=6, 4=4, 5=3, 7=1 chars)
  - Ensure padding only appears at end
  - Fix handling of unpadded inputs

- Fix buffer handling:
  - Proper output buffer size calculation
  - Add bounds checking for buffer access
  - Add proper buffer validation

For example:
- "M" correctly returns Invalid_Length (too short)
- "mzxq====" correctly returns Invalid_Character (lowercase)
- "MZXQ=" correctly returns Malformed_Input (wrong padding)
- Unpadded input lengths must be multiples of 8

These changes make the decode function fully compliant with RFC 4648
requirements while providing proper error handling.
Fix memory handling throughout base32 package:
- Make padding map package-level constant (to avoid repeated allocs)
- Use passed allocator in encode's make() call
- Add defer delete for allocated memory in encode
- Add proper cleanup in test cases
- Fix memory cleanup of output buffers

The changes ensure consistent allocator usage and cleanup in both
implementation and tests.
Replace package-level map with a simple switch statement for padding
validation. This eliminates allocations we can't properly free while
maintaining the same validation logic.
Add support for custom alphabet validation through an optional validation
function parameter. The default validation follows RFC 4648 base32
alphabet rules (A-Z, 2-7).

This properly supports the documented ability to use custom alphabets.
Fix encoding to properly use provided encoding table parameter instead of
hardcoded `ENC_TABLE`.

This makes encode properly support custom alphabets as documented.
The decoding table was only 224 bytes which caused type mismatches when
using custom alphabets, so expand with zeroes to cover full byte range
while maintaining the same decoding logic.
Move existing test procedures to a dedicated test file for better
code organization and maintainability.
Add test_base32_roundtrip() to verify the encode->decode roundtrip
preserves data integrity. This test helps ensure our base32 implementation
correctly handles the full encode->decode cycle without data loss or
corruption.
Remove premature deallocation of the output buffer which was causing
use-after-free behavior. The returned string needs to take ownership
of this memory, but the defer delete was freeing it before the
string could be used. This fixes issues with encoding that were
introduced by overly aggressive memory cleanup in 93238db.
Add test case to verify custom alphabet support. The test uses a
decimal-uppercase alphabet (0-9, A-V) to test both encoding and decoding
with custom tables, including validation. This ensures the encode and
decode functions work correctly with custom encoding tables and
validation functions as documented.
Add defer delete for encoded strings across all test procedures to ensure
proper cleanup and prevent memory leaks. This completes the memory
management improvements started in 591dd87.
@zolk3ri zolk3ri force-pushed the fix/base32-error-handling branch from 60f794a to d6f4412 Compare December 31, 2024 18:00
@zolk3ri zolk3ri marked this pull request as draft December 31, 2024 18:49
Fix incorrect RFC 4648 section references:
- Add RFC URL reference at package level
- Update Error enum documentation to reference correct sections:
  - Invalid_Character: Section 3.3 (non-alphabet characters)
  - Invalid_Length: Section 6 (base32 block size requirements)
  - Malformed_Input: Section 3.2 (padding)
- Fix test file section references to match correct sections

This ensures all RFC references are accurate and adds a link to the
source RFC for reference.
@zolk3ri zolk3ri marked this pull request as ready for review December 31, 2024 22:48
@bullyingteen
Copy link

encoding/base64 needs the same treatment I guess?
See base64.decode_into

@zolk3ri
Copy link
Author

zolk3ri commented Jan 1, 2025

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

@bullyingteen
Copy link

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

Missed it ngl 🙂

Add `@(rodata)` attribute to `ENC_TABLE` and `DEC_TABLE` to mark
them as read-only data. This places these tables in the read-only
section of the executable, protecting them from modification
during program execution.
// Process input in 8-byte blocks
outi := 0
for i := 0; i < input_chars; i += 8 {
buf: [8]byte
Copy link
Contributor

@Waqar144 Waqar144 Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can skip 0 initialization here by using ---

block_size := min(8, input_chars - i)

// Decode block
for j := 0; j < block_size; j += 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: you can use range style loop: for j in 0 ..< block_size if you want

}

// Validate characters using provided validation function
for i := 0; i < len(data); i += 1 {
Copy link
Contributor

@Waqar144 Waqar144 Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: you can use range style loop: for c in data if you want.

alternatively perhaps bytes.index_byte from core:bytes can be used to find the index of first PADDING. and then use that for this loop and the one below to eliminate the PADDING branch.

Also #no_bounds_check can be used here

// Validate padding and length
data_len := len(data)
padding_count := 0
for i := data_len - 1; i >= 0; i -= 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#no_bounds_check can be used here

// Check for proper padding and length combinations
if padding_count > 0 {
// Verify no padding in the middle
for i := 0; i < data_len - padding_count; i += 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#no_bounds_check can be used here

}

@private
_encode :: proc(out, data: []byte, ENC_TBL := ENC_TABLE, allocator := context.allocator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can #no_bounds_check can be used for this proc?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants