Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude hex above max Unicode Scalar Value #456

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

eugenesvk
Copy link
Contributor

@eugenesvk eugenesvk commented Dec 25, 2024

Similar to #450 but explicitly modifies the escaped unicode grammar rule to exclude non Unicode Scalar Values which aren't surrogates, i.e., everything above 10FFFF

(also simplifies surrogate regex to use ranges)

includes 000F format as long as it's not exceeding 6 digits (though now the rule is in the comments)

simplify surrogate regex to use ranges
@zkat
Copy link
Member

zkat commented Dec 25, 2024

It was an imprecision, but what is this non canonical format thing? This change does seem to be delving into “actual spec change”

@eugenesvk
Copy link
Contributor Author

It's just a phrase, I don't really know what the canon is here, just noticed some parsers errorred out, so went with the conservative version for this.

  • n s="\u{00000F}" isn't as clean as
  • n s="\u{F}"

but on the other hand padding exists? But if padding is allowed, then why is it limited to just 6 in `{1,6}? After all, the numbers would be the same and these are accepted:

  • a n=000000000000000000001
  • a n=1
  • a n=0x000000000000000000001
  • a n=0x1

@eilvelia
Copy link
Contributor

eilvelia commented Dec 25, 2024

Leading zeros really should be allowed. For one, that is convenient and how code points are usually written (U+000B, etc.), and how it is usually implemented in programming languages. If an implementation doesn't allow leading zeros, it should be a bug in the implementation. Not limiting the notation to 6 digits could be done (as it works in javascript), but isn't really a big deal. For example, in Rust and OCaml unicode escapes work as they currently do in kdl, allowing only <=6 digits with possible leading zeros.

edit: By the way, there's one test case that uses a leading zero

test_cases/input/esc_unicode_in_string.kdl
1:node "hello\u{0a}world"

@eugenesvk
Copy link
Contributor Author

Also turns out the JS parser had an issue with 6-digit a b="\u{00000A}", not with leading 0s in general, so the canon is indeed for 0s
Added an explicit comment in the grammar and and a 7-digit leading 0 test

SPEC.md Outdated
surrogates := [dD][8-9a-fA-F]hex-digit{2}
// U+D800-DFFF: D 8 00
// D F FF
hex-unicode := [\u{0}-\u{10FFFF}] - surrogate // Unicode Scalar Value₁₆, leading 0s allowed as long as length ≤ 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think \u is used to indicate unicode symbols in this metalanguage (see bom := '\u{FEFF}' below), so this actually means that hex-unicode is any unicode scalar value, not hexadecimal digits. [...] itself can represent only one character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that could be more confusing, just thought that the name + comment would resolve it since this is still not a strict formal grammar?

Otherwise need to complicate the spec more with a few extra regex rules similar to the one removed for >10FFFF and for many 0s, which will hurt readability?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think it would be best to keep hex-digit{1,6} here (e.g. same as in https://doc.rust-lang.org/reference/tokens.html#character-literals) as it had been previously and then clearly state how it is allowed to resolve the unicode escape

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the lack of clarity is what prompted this! I realized that the syntax is wrong when FFFFFF was highlighted fine because it followed this simple rule of 6. If you think this alt is also not clear, I'd then just add a few more rules, let it be more complicated, but less error-prone

Also, the rust tokens are incomplete, there are rules that are not listed there but the compiler warns you about (think this happens after tokenization), so it's not a great reference for this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps clarity could be solved by placing a comment there and modifying the spec text? The grammar shouldn't necessarily block all escapes that cannot be resolved, I think, i.e. resolving escapes is not necessarily total.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent: if we filter out the surrogates in the grammar, then we should also filter out > 10ffff.

IMO either we end up with something like

hex-unicode := hex-digit{1, 6} - surrogates - above-max-scalar
surrogates := 0{0,2}[dD][8-9a-fA-F]hex-digit{2}
// U+D800-DFFF: D  8         00
//              D  F         FF
above-max-scalar = '1' [1-9a-fA-F] hex-digit{4} | [2-9a-fA-F] hex-digit{5}

or

hex-unicode := (hex-digit{1, 5} | '0' hex-digit{5} | '10' hex-digit{4}) - surrogates
surrogates := '0'{0,2} [dD] [8-9a-fA-F] hex-digit{2}
// U+D800-DFFF: D  8         00
//              D  F         FF

(in which case we should really define what {<number>} and {<number>,<number>} mean in the "Grammar language" section at the bottom), or we define hex-unicode via a reference to the prose where it is explained that unicode hex values can be left-padded with zeros up to a length of six, that you aren't allowed to encode non-scalar values, …

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should explicitly filter everything out, otherwise it's too easy to miss

What do you think about trying to define it via an explicit range "D800-DFFF" in some format instead of the multiple regexes that amount to the same? Or better regexes + range in the comments?

@zkat
Copy link
Member

zkat commented Dec 25, 2024

It’s the same as Rust Unicode escapes: https://doc.rust-lang.org/reference/tokens.html#character-literals

I don’t think there’s a need to change it from that at this stage.

@eugenesvk
Copy link
Contributor Author

Rust's rule is incomplete since tokenization doesn't impose limits, those are applied at a later parsing stage. It's not language grammar

@eugenesvk
Copy link
Contributor Author

Added the more explicit regexy variant to avoid confusion at a slight cost of readability, and documented {1,3} range syntax

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants