Exclude hex above max Unicode Scalar Value #456

eugenesvk · 2024-12-25T16:04:59Z

Similar to #450, but explicitly modifies the escaped-Unicode grammar rule to exclude values that aren't surrogates yet still aren't Unicode Scalar Values, i.e., everything above 10FFFF

(also simplifies surrogate regex to use ranges)

Includes the zero-padded 000F format as long as it doesn't exceed 6 digits (though the rule now lives in the comments)

simplify surrogate regex to use ranges

zkat · 2024-12-25T17:06:06Z

It was an imprecision, but what is this non canonical format thing? This change does seem to be delving into “actual spec change”

eugenesvk · 2024-12-25T17:59:44Z

It's just a phrase, I don't really know what the canon is here, just noticed some parsers errored out, so went with the conservative version for this.

n s="\u{00000F}" isn't as clean as
n s="\u{F}"

but on the other hand padding exists? But if padding is allowed, then why is it limited to just 6 in `{1,6}`? After all, the numbers would be the same and these are accepted:

a n=000000000000000000001
a n=1
a n=0x000000000000000000001
a n=0x1

eilvelia · 2024-12-25T18:08:27Z

Leading zeros really should be allowed. For one, that is convenient and how code points are usually written (U+000B, etc.), and how it is usually implemented in programming languages. If an implementation doesn't allow leading zeros, that should be considered a bug in the implementation. Not limiting the notation to 6 digits could be done (that's how it works in JavaScript), but it isn't really a big deal. For example, in Rust and OCaml Unicode escapes work as they currently do in KDL, allowing only <=6 digits with possible leading zeros.

edit: By the way, there's one test case that uses a leading zero

test_cases/input/esc_unicode_in_string.kdl
node "hello\u{0a}world"

eugenesvk · 2024-12-25T19:20:36Z

Also, it turns out the JS parser had an issue with the 6-digit a b="\u{00000A}", not with leading 0s in general, so the canon is indeed to allow leading 0s.
Added an explicit comment in the grammar and a 7-digit leading 0 test

eilvelia · 2024-12-25T19:23:12Z

SPEC.md

-surrogates := [dD][8-9a-fA-F]hex-digit{2}
-// U+D800-DFFF: D  8         00
-//              D  F         FF
+hex-unicode := [\u{0}-\u{10FFFF}] - surrogate  // Unicode Scalar Value₁₆, leading 0s allowed as long as length ≤ 6


I think \u is used to indicate unicode symbols in this metalanguage (see bom := '\u{FEFF}' below), so this actually means that hex-unicode is any unicode scalar value, not hexadecimal digits. [...] itself can represent only one character.

Yes, that could be more confusing, just thought that the name + comment would resolve it since this is still not a strict formal grammar?

Otherwise we'd need to complicate the spec with a few extra regex rules, similar to the one removed, for >10FFFF and for many 0s, which will hurt readability?

Personally I think it would be best to keep hex-digit{1,6} here (e.g. same as in https://doc.rust-lang.org/reference/tokens.html#character-literals) as it had been previously and then clearly state how it is allowed to resolve the unicode escape

But the lack of clarity is what prompted this! I realized that the syntax is wrong when FFFFFF was highlighted fine because it followed this simple rule of 6. If you think this alt is also not clear, I'd then just add a few more rules, let it be more complicated, but less error-prone

Also, the Rust token rules are incomplete: there are rules that are not listed there but that the compiler warns you about (I think this happens after tokenization), so it's not a great reference for this case

Perhaps clarity could be solved by placing a comment there and modifying the spec text? The grammar shouldn't necessarily block all escapes that cannot be resolved, I think, i.e. resolving escapes is not necessarily total.
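To make the "permissive grammar, partial resolution" idea concrete, here is a minimal Python sketch. It assumes a lexical rule of `hex-digit{1,6}` and moves the scalar-value check into a separate resolution step. The names `ESCAPE` and `resolve_escapes` are hypothetical and are not taken from any KDL implementation.

```python
import re

# Stage 1 (lexical): accept any 1-6 hex digits inside \u{...},
# mirroring the simple hex-digit{1,6} rule.
ESCAPE = re.compile(r'\\u\{([0-9a-fA-F]{1,6})\}')


def resolve_escapes(text: str) -> str:
    """Stage 2 (semantic): replace each escape, rejecting non-scalar values."""

    def repl(m: re.Match) -> str:
        value = int(m.group(1), 16)  # leading zeros do not change the value
        # Unicode Scalar Values are U+0000..U+10FFFF minus U+D800..U+DFFF.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(
                f"\\u{{{m.group(1)}}} is not a Unicode Scalar Value"
            )
        return chr(value)

    return ESCAPE.sub(repl, text)
```

With this split, `\u{FFFFFF}` still tokenizes (and so could still be highlighted as a valid escape), but fails when resolved, which is the trade-off being debated in this thread.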

We should be consistent: if we filter out the surrogates in the grammar, then we should also filter out > 10ffff.

IMO either we end up with something like

hex-unicode := hex-digit{1, 6} - surrogates - above-max-scalar
surrogates := 0{0,2}[dD][8-9a-fA-F]hex-digit{2}
// U+D800-DFFF: D  8         00
//              D  F         FF
above-max-scalar = '1' [1-9a-fA-F] hex-digit{4} | [2-9a-fA-F] hex-digit{5}

or

hex-unicode := (hex-digit{1, 5} | '0' hex-digit{5} | '10' hex-digit{4}) - surrogates
surrogates := '0'{0,2} [dD] [8-9a-fA-F] hex-digit{2}
// U+D800-DFFF: D  8         00
//              D  F         FF

(in which case we should really define what {<number>} and {<number>,<number>} mean in the "Grammar language" section at the bottom), or we define hex-unicode via a reference to the prose where it is explained that unicode hex values can be left-padded with zeros up to a length of six, that you aren't allowed to encode non-scalar values, …
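As a sanity check on the second variant above, the rules can be transcribed into Python regexes and compared against a purely numeric definition of a scalar value. This is only a sketch: it assumes `a - b` means "matches `a` but not `b`", and the helper names (`matches_grammar`, `is_scalar`) are made up for illustration.

```python
import re

HEX = "[0-9a-fA-F]"

# surrogates := '0'{0,2} [dD] [8-9a-fA-F] hex-digit{2}
SURROGATE = re.compile(rf"0{{0,2}}[dD][89a-fA-F]{HEX}{{2}}")

# (hex-digit{1,5} | '0' hex-digit{5} | '10' hex-digit{4})
CANDIDATE = re.compile(rf"{HEX}{{1,5}}|0{HEX}{{5}}|10{HEX}{{4}}")


def matches_grammar(digits: str) -> bool:
    """Pattern-based check: a candidate that is not a surrogate."""
    return bool(CANDIDATE.fullmatch(digits)) and not SURROGATE.fullmatch(digits)


def is_scalar(digits: str) -> bool:
    """Reference check by numeric value instead of by digit pattern."""
    if not re.fullmatch(rf"{HEX}{{1,6}}", digits):
        return False
    value = int(digits, 16)
    return value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


# Boundary cases around the surrogate block and the maximum scalar value.
samples = ["F", "0a", "00000F", "D7FF", "D800", "00dfff", "E000",
           "1D800", "10FFFF", "110000", "FFFFFF", "0000000"]
for s in samples:
    assert matches_grammar(s) == is_scalar(s), s
```

The two checks agree on these boundary cases, which suggests the regex form does encode the intended range, at the cost of being harder to read than a prose rule.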

I agree we should explicitly filter everything out, otherwise it's too easy to miss

What do you think about trying to define it via an explicit range "D800-DFFF" in some format instead of the multiple regexes that amount to the same? Or better regexes + range in the comments?

zkat · 2024-12-25T23:01:46Z

It’s the same as Rust Unicode escapes: https://doc.rust-lang.org/reference/tokens.html#character-literals

I don’t think there’s a need to change it from that at this stage.

eugenesvk · 2024-12-26T04:13:57Z

Rust's rule is incomplete, since tokenization doesn't impose the limits; those are applied at a later parsing stage. It's not the language grammar

document {1,3} ranges

eugenesvk · 2024-12-26T07:50:33Z

Added the more explicit regexy variant to avoid confusion at a slight cost of readability, and documented {1,3} range syntax

Exclude hex above max Unicode Scalar Value

48caa40

simplify surrogate regex to use ranges

allow leading 0s, but still limit max length to 6

e4da6d1

eilvelia reviewed Dec 25, 2024


bgotink mentioned this pull request Dec 25, 2024

Six digit unicode escapes don't work bgotink/kdl#9

Closed

Add explicit regex-set rules to hex unicode

4325170

document {1,3} ranges

add space-separators between sets

394af25
