-
-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exclude hex above max Unicode Scalar Value #456
base: main
Are you sure you want to change the base?
Conversation
simplify surrogate regex to use ranges
It was an imprecision, but what is this non canonical format thing? This change does seem to be delving into “actual spec change” |
It's just a phrase, I don't really know what the canon is here, just noticed some parsers errorred out, so went with the conservative version for this.
but on the other hand padding exists? But if padding is allowed, then why is it limited to just 6 in `{1,6}? After all, the numbers would be the same and these are accepted:
|
Leading zeros really should be allowed. For one, that is convenient and how code points are usually written (U+000B, etc.), and how it is usually implemented in programming languages. If an implementation doesn't allow leading zeros, it should be a bug in the implementation. Not limiting the notation to 6 digits could be done (as it works in javascript), but isn't really a big deal. For example, in Rust and OCaml unicode escapes work as they currently do in kdl, allowing only <=6 digits with possible leading zeros. edit: By the way, there's one test case that uses a leading zero
|
Also turns out the JS parser had an issue with 6-digit |
SPEC.md
Outdated
surrogates := [dD][8-9a-fA-F]hex-digit{2} | ||
// U+D800-DFFF: D 8 00 | ||
// D F FF | ||
hex-unicode := [\u{0}-\u{10FFFF}] - surrogate // Unicode Scalar Value₁₆, leading 0s allowed as long as length ≤ 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think \u
is used to indicate unicode symbols in this metalanguage (see bom := '\u{FEFF}'
below), so this actually means that hex-unicode is any unicode scalar value, not hexadecimal digits. [...]
itself can represent only one character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that could be more confusing, just thought that the name + comment would resolve it since this is still not a strict formal grammar?
Otherwise need to complicate the spec more with a few extra regex rules similar to the one removed for >10FFFF and for many 0s, which will hurt readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think it would be best to keep hex-digit{1,6}
here (e.g. same as in https://doc.rust-lang.org/reference/tokens.html#character-literals) as it had been previously and then clearly state how it is allowed to resolve the unicode escape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the lack of clarity is what prompted this! I realized that the syntax is wrong when FFFFFF was highlighted fine because it followed this simple rule of 6. If you think this alt is also not clear, I'd then just add a few more rules, let it be more complicated, but less error-prone
Also, the rust tokens are incomplete, there are rules that are not listed there but the compiler warns you about (think this happens after tokenization), so it's not a great reference for this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps clarity could be solved by placing a comment there and modifying the spec text? The grammar shouldn't necessarily block all escapes that cannot be resolved, I think, i.e. resolving escapes is not necessarily total.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be consistent: if we filter out the surrogates in the grammar, then we should also filter out > 10ffff.
IMO either we end up with something like
hex-unicode := hex-digit{1, 6} - surrogates - above-max-scalar
surrogates := 0{0,2}[dD][8-9a-fA-F]hex-digit{2}
// U+D800-DFFF: D 8 00
// D F FF
above-max-scalar = '1' [1-9a-fA-F] hex-digit{4} | [2-9a-fA-F] hex-digit{5}
or
hex-unicode := (hex-digit{1, 5} | '0' hex-digit{5} | '10' hex-digit{4}) - surrogates
surrogates := '0'{0,2} [dD] [8-9a-fA-F] hex-digit{2}
// U+D800-DFFF: D 8 00
// D F FF
(in which case we should really define what {<number>}
and {<number>,<number>}
mean in the "Grammar language" section at the bottom), or we define hex-unicode
via a reference to the prose where it is explained that unicode hex values can be left-padded with zeros up to a length of six, that you aren't allowed to encode non-scalar values, …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should explicitly filter everything out, otherwise it's too easy to miss
What do you think about trying to define it via an explicit range "D800-DFFF" in some format instead of the multiple regexes that amount to the same? Or better regexes + range in the comments?
It’s the same as Rust Unicode escapes: https://doc.rust-lang.org/reference/tokens.html#character-literals I don’t think there’s a need to change it from that at this stage. |
Rust's rule is incomplete since tokenization doesn't impose limits, those are applied at a later parsing stage. It's not language grammar |
document {1,3} ranges
Added the more explicit regexy variant to avoid confusion at a slight cost of readability, and documented |
Similar to #450 but explicitly modifies the escaped unicode grammar rule to exclude non Unicode Scalar Values which aren't surrogates, i.e., everything above
10FFFF
(also simplifies surrogate regex to use ranges)
includes
000F
format as long as it's not exceeding 6 digits (though now the rule is in the comments)