Decide if we want to support `.R` files containing non-UTF-8 characters #60

DavisVaughan · 2024-11-26T13:20:20Z

For example, https://github.com/wch/r-source/blob/trunk/tests/utf8-regex.R is a test R file in base R that directly contains Latin1 characters. We currently fail to read in this file.

For reference, ruff also refuses to parse/format non-UTF-8 files.
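To make the failure mode concrete, here is a minimal Python sketch of a "refuse non-UTF-8 input" check. It is not how we or ruff actually implement it; `check_utf8` and the sample strings are made up for illustration. It only shows that a Latin1-encoded `é` is not a valid UTF-8 byte sequence.

```python
from typing import Optional


def check_utf8(raw: bytes) -> Optional[str]:
    """Return None if `raw` is valid UTF-8, else a short error message.

    Hypothetical helper, for illustration only.
    """
    try:
        raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as err:
        return f"invalid UTF-8 at byte offset {err.start}"
    return None


# The same R source line, saved in two different encodings.
source = 'x <- "caf\u00e9"\n'
latin1_source = source.encode("latin-1")  # the accented character is the single byte 0xE9
utf8_source = source.encode("utf-8")      # the accented character is the two bytes 0xC3 0xA9

latin1_result = check_utf8(latin1_source)
utf8_result = check_utf8(utf8_source)
```

Here `utf8_result` is `None`, while `latin1_result` reports the offset of the stray `0xE9` byte.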

tree-sitter used to effectively require UTF-8 or UTF-16. As of very recently it gained support for custom encodings (tree-sitter/tree-sitter#3833), but I doubt we really want to get in the game of doing that.

I imagine if we did anything it would be:

- Read in as `OsString` with some locale
- Convert to UTF-8 as soon as possible
- Parse/format in UTF-8
- Convert back to the original locale

But that sounds tricky to get right.
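The steps above can be sketched roughly as follows, in Python for brevity. Everything here is an assumption made for illustration: `format_with_encoding`, the `fallback_encoding` parameter (standing in for "some locale"), and the toy `format_utf8` callback are hypothetical names, not part of our code or of tree-sitter.

```python
from typing import Callable


def format_with_encoding(
    raw: bytes,
    fallback_encoding: str,
    format_utf8: Callable[[str], str],
) -> bytes:
    """Decode, format as Unicode text, then re-encode in the original encoding.

    Hypothetical sketch: tries UTF-8 first and only falls back to
    `fallback_encoding` (e.g. the locale's encoding) when that fails.
    """
    try:
        text = raw.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)
        encoding = fallback_encoding

    formatted = format_utf8(text)

    # Raises UnicodeEncodeError if formatting introduced a character
    # that the original encoding cannot represent.
    return formatted.encode(encoding)


def toy_formatter(text: str) -> str:
    """Stand-in for the real formatter: adds spaces around `<-`."""
    return text.replace("x<-", "x <- ")


latin1_in = 'x<-"caf\u00e9"\n'.encode("latin-1")
latin1_out = format_with_encoding(latin1_in, "latin-1", toy_formatter)

utf8_in = 'x<-"caf\u00e9"\n'.encode("utf-8")
utf8_out = format_with_encoding(utf8_in, "latin-1", toy_formatter)
```

Part of what makes this tricky: Latin1 decoding never fails, because every byte maps to some character, so a wrong guess about the fallback encoding silently produces mojibake rather than an error.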

I imagine this is:

- A non-issue for Mac and Linux
- A super minor issue for Windows, where 99.9% of the time users have UTF-8 files, but 0.1% of the time they've copied some Latin1 characters into their file from some other system, or from R output. This has likely improved on R >= 4.2 though, since UTF-8 is now the default on Windows.
