-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Translate non-UTF-8 byte sequences into replacement characters #60
Conversation
The UTF-8 parser knows how to handle invalid byte sequences, so don't preprocess the input in the main parser; just hand any non-ASCII byte to the UTF-8 parser to handle. This includes what were previously interpreted as 8-bit C1 control codes; they are now interpreted as UTF-8 continuation characters.
Follow the Unicode recommenatation, which is also consistent with Rust's `String::from_utf8_lossy`, concerning when and where to emit replacement characters. See https://hsivonen.fi/broken-utf-8/ for a detailed description of the issue.
In settings where an input stream has a logical end, allow it to be reported to the parser so that it can correctly process an incomplete UTF-8 sequence.
Add tests derived from https://hsivonen.fi/broken-utf-8/test.html .
This way if a transition function detects an invalid UTF-8 sequence, it can reset the main state to Ground.
DCS/SOS/PM/APC are terminated with an 8-bit ST, which isn't compatible with UTF-8, so add a flag to allow users to optionally disable recognition of these sequences. To make room in the `Action` list, remove `Ignore` as it's effectively the same as `None`.
I've now added a way to disable recognition of DCS, SOS, PM, and APC sequences, since they are all terminated by 8-bit ST, which isn't valid UTF-8. They are still recognized by default, but the patch adds a flag for optionally disabling them. This is now ready for review feedback. There are several changes here, though they are split up into self-contained commits. I'm curious whether these kinds of changes are in scope for vte, and whether the approach I took here makes sense. |
Have you run benchmarks against your changes? You can do that with |
I don't quite understand what you mean and having a flag to disable these doesn't make much sense to me for a vt parser. Correctly parsing escapes is the one thing we're supposed to do. First of all, only because an escape is terminated with ST, doesn't mean it has to be the C1 ST. The C0 one should work perfectly fine. Even if ST is listed, BEL often works just as well. All C1 control characters are also valid UTF-8. So I have absolutely no idea what you mean by this. |
There's currently about a 20% slowdown in that benchmark as a result of this PR. That said, that test contains quite a lot of byte sequences which are not valid UTF-8, and with this PR, those sequences are translated into replacement characters rather than being ignored, so it's doing more work. How important is performance on binary or otherwise non-UTF-8 input for vte? On test files containing valid UTF-8, it ranges from having no slowdown to about a 5% slowdown, depending on the content. I'll take a look to see if there are any opportunities to optimize.
Some users will want to parse DCS, SOS, PM, and APC. But many terminals these days don't do anything with DCS, and I don't know of any that do anything with SOS, PM, or APC. So if these codes interfere with things that other users want, it doesn't seem unreasonable to have a way to disable them.
In a C0 ST, the When they extended OSC to recognize BEL instead of ST, I don't know why they didn't do the same for DCS, SOS, PM, or APC. But it seems they didn't; I can't find anything that recognizes BEL for those. vte doesn't recognize the UTF-8 C1 encodings as control codes. In the |
Performance is very important.
A slowdown of 5% for input that isn't affected at all would probably block this PR. Though it depends a bit on how the performance is in Alacritty. But that's quite a big slowdown.
It doesn't work the same way, but it still works. So why would we remove support for things with no benefit of doing so? If the goal is only to remove C1 support, we could still support it with C0 just fine? |
My question was about binary or otherwise non-UTF-8 data specifically. If that's important, then it's unlikely that any amount of optimization will fix it, because the point of this PR is to do more work in those cases.
I haven't spent much time on optimization yet, so it's possible there are opportunities to improve performance. However, it sounds like you're not interested in this PR, so I'll look into other options.
It isn't just about whether there exists a byte sequence which has the desired effect; it's also about what applications are emitting, and if they make a change, how that change would work in other terminals they work in. From what I can tell, C0 ST causes DCS strings to be cancelled in other terminals, so if existing applications using DCS with C1 ST are going to make any change at all, it'd make more sense for them to stop using DCS altogether. Since DCS is obscure and not widely supported these days anyway, moving away from it in general seems to make more sense. |
Of course the performance of that matters too, but if there's no effect on other cases a small hit is fine there. It's not necessary to have absolutely perfect performance when it comes to parsing invalid input.
I haven't looked at the PR, I'm mostly curious about your motivation and methodology. Just to be absolutely clear that my statements were purely about what you said, not the code itself.
I'm not entirely sure what you're trying to say here. Are you saying that since C0 ST isn't supported by other terminals, applications that use the C1 ST have no reason to make any changes? I don't see why applications would make any change in the first place, as a parser we should support C1 ST just fine if the terminal chooses to use it. I don't see many reasons other than performance to drop it and based on my testing there is no performance benefit to dropping C1 support completely. Though we don't have C1 ST support for OSC either, which was necessary to fix a bug. |
For my use cases, I'm looking for a parser in which every invalid UTF-8 byte sequence can be replaced with U+FFFD. If the input contains a 0x9C anywhere that isn't in a UTF-8 continuation position, I'm going to replace it, so I won't recognize it at ST. To me, 8-bit C1 codes mean increased surface area and weaker invariants. Being able to say that every non-UTF-8 byte sequence gets translated to U+FFFD means I can categorically rule out a class of potential bugs, such as the one you mentioned. One of the properties of UTF-8 is that it's self-synchronizing, so if I ever have to start decoding in the middle of a stream, I can always at least know where the code-point boundaries are, but that property doesn't completely hold in "UTF-8 with 8-bit C1 codes". And, some terminal implementations today do UTF-8 decoding as an independent step before escape-sequence decoding, so even if I'm using a library that doesn't do that today, I want the option of doing so in the future without having to worry about backwards compatibility. Some of these risks are arguably minor, but the 8-bit C1 codes sufficiently obscure and insufficiently useful that, for my purposes at least, they not worth it. |
I don't see how the user of VTE would be able to rule out any bugs with the help of that. It's just dropped vs replaced for them.
The thing is that vte is a library, so the goal is to try and accommodate for a wide range of usecases, not just the one specific to Alacritty for example. We personally have removed all support for C1 escapes too, but I'm not sure if I want to do that to vte. Maybe the current downstream users of vte should also be evaluated. If it really causes so much overhead and nobody actually uses it, maybe giving up on C1 is the better option. There's currently not really proper C1 support in vte, so maybe we're just making things worse for everyone by clinging to some halfway state that doesn't help anyone who wants C1 support but holds everyone else back. I know that other terminals print replacement characters, are you aware of how these handle C1 escapes? |
It would be simpler if there were a spec to follow. While the VT100 state machine diagram is a great starting point, it doesn't incoporate UTF-8. VTE is currently making its own choices about when and where to support 8-bit C1 controls. Will these choices be compatible with the choices my users are making? If not, is it my (or VTE's) bug or theirs? If I switch to a different parser, will I have subtle incompatibilities?
I agree, this sounds like a good evaluation to do.
According to Wikipedia, 0x85 the 8-bit C1 code might be still in use. I checked 9 different virtual terminals, plus screen and tmux, and the behavior ranges from silently ignoring it, printing |
Based on my testing, there's enough that don't support it that you cannot rely on it. Which seems in line with what you've seen. IIRC by default neither URxvt, VTE, Kitty nor XTerm support it. So I'd say there are two questions mainly:
Since you've looked into this already, you can probably best evaluate what overhead a runtime configuration would have. Do you think that it would be any slower than a compile-time option? And is there any performance advantage to completely dropping C1 support, or does the compile-time option only add code complexity without affecting performance? |
Right now, it looks like VTE's state machine is a fairly literal translation of the DEC state machine diagram, with a separate state machine for UTF-8. As it is now, removing 8-bit C1 support probably wouldn't help performance much. There are several optimizations possible. For example, when in the The runtime option I experimented with for disabling DCS/SOS/PM/APC looks to have caused about a 5% slowdown. It seems likely that a runtime option to disable/enable 8-bit C1 support would be similar.
Of course, this is just my opinion, but for what it's worth, I suggest "no" here, unless you know of a specific user that needs it. UTF-8 is so important in so many contexts these days, it should take a really strong justification to support exceptions to UTF-8 in a protocol which is otherwise UTF-8. As an update to a comment I made earlier, it appears XTerm in UTF-8 mode (the default mode in many OS's today) does recognize |
That's a good point and I've considered this myself in the past. It seems like there would be some significant changes necessary, but with a tiny lib like vte that shouldn't be a massive problem either.
I think I mostly agree at this point. Especially with the primary user of VTE not using C1 escapes at all. Looking at the RedOX and strip-ansi-escape usages of VTE it seems like they do the same. So that does raise the question why we should bother with C1 escapes at all when we can make our library better for all currently known users by dropping it.
That does seem to be the most reasonable solution, yes. |
This is a work in progress; I'm posting it now to provide context for #58.
This patch series fixes all the cases I could find where vte would silently discard bytes in the presence of invalid UTF-8 byte sequences, and makes it emit replacement characters in a manner compatible with Rust's
String::from_utf8_lossy
and the resolution to https://hsivonen.fi/broken-utf-8/, and fixes #38.There are other ways I could achieve my goals here. Building this patch series was useful for me to understand all the different aspects of this. But if taking a different approach would be better, I'm open to ideas.