Commit

Revise handling of non-XML characters
michaelhkay committed Dec 31, 2024
1 parent 4eaeba1 commit f42fc58
Showing 2 changed files with 71 additions and 19 deletions.
34 changes: 17 additions & 17 deletions specifications/xpath-functions-40/src/function-catalog.xml
@@ -27066,13 +27066,13 @@ return document {
<fos:default>false</fos:default>
<fos:values>
<fos:value value="false">
All characters in the input that are valid
in the version of XML supported by the implementation, whether or not they are represented
in the input by means of an escape sequence, are represented as unescaped characters in the result. Any
characters or codepoints that are not valid XML characters
(for example, unpaired surrogates) <phrase>are passed to the <code>fallback</code> function
as described below; in the absence of a fallback function, they are replaced by
<char>U+FFFD</char></phrase>.
Any <termref def="dt-permitted-character"/> in the input,
whether or not it is represented
in the input by means of an escape sequence, is represented as an unescaped character
in the result. Any other character or codepoint
(for example, an unpaired surrogate) is passed to the <code>fallback</code> function
as described below; in the absence of a fallback function, it is replaced by
<char>U+FFFD</char>.
</fos:value>
<fos:value value="true">
JSON escape sequences are used in the result to represent special characters in the JSON input, as defined below,
@@ -27081,7 +27081,8 @@ return document {
<ulist>
<item><p>all codepoints in the range <char>U+0000</char> to <char>U+001F</char>
or <char>U+007F</char> to <char>U+009F</char>;</p></item>
<item><p>all codepoints that do not represent characters that are valid in the version of XML supported by the processor,
<item><p>all codepoints that do not represent
<termref def="dt-permitted-character">permitted characters</termref>,
including codepoints representing unpaired surrogates;</p></item>
<item><p>the character <char>U+005C</char> itself.</p></item>
</ulist>
@@ -27097,21 +27098,16 @@ return document {
<fos:option key="fallback">
<fos:meaning>
Provides a function which is called when the input contains an escape sequence
that represents a character that is not valid in the version of XML
supported by the implementation.
that represents a character that is not a <termref def="dt-permitted-character"/>.
It is an error to supply the <code>fallback</code> option if the <code>escape</code>
option is present with the value <code>true</code>.
</fos:meaning>
<fos:type>(fn(xs:string) as xs:anyAtomicType)?</fos:type>
<fos:default>fn { char(0xFFFD) }</fos:default>
<fos:values>
<fos:value value="User-supplied function"
>
The function is called when the JSON input contains a special character
(as defined under the <code>escape</code> option) that is valid according to
the JSON grammar (whether the special character is represented in the input
directly or as an escape sequence), but which does not represent a
character that is valid in the version of XML supported by the processor.
<fos:value value="User-supplied function">
The function is called when the JSON input contains a character that
is not a <termref def="dt-permitted-character"/>.
It is called once for any surrogate
that is not properly paired with another surrogate. The untyped atomic item
supplied as the argument will always be a two- or six-character escape
@@ -27335,6 +27331,9 @@ return document {
</fos:example>
</fos:examples>
<fos:changes>
<fos:change issue="414" PR="546" date="2023-07-25">
<p>The rules regarding use of non-XML characters in JSON texts have been relaxed.</p>
</fos:change>
<fos:change issue="960" PR="1028" date="2024-02-20">
<p>An option is provided to control how the JSON <code>null</code> value should be handled.</p>
</fos:change>
@@ -27346,6 +27345,7 @@ return document {
specification gave the default value as <code>true</code>, but this appears to have been an error,
since it was inconsistent with examples given in the specification and with tests in the test suite.</p>
</fos:change>

</fos:changes>
</fos:function>

56 changes: 54 additions & 2 deletions specifications/xpath-functions-40/src/xpath-functions.xml
@@ -6633,7 +6633,17 @@ correctly in all browsers, depending on the system configuration.</emph></p>-->


<?local-function-index?>
<p>Note also that the function <function>fn:serialize</function> has an option to act as the inverse function to <function>fn:parse-json</function>.</p>

<p>Note also:</p>
<ulist>
<item><p>The function <function>fn:serialize</function> has an option to generate
JSON output from a structure of maps and arrays.</p>
</item>
<item><p>The function <function>fn:elements-to-maps</function> enables
arbitrary XML node trees to be converted to trees of maps and arrays
suitable for serializing as JSON.</p>
</item>
</ulist>



@@ -6875,7 +6885,49 @@ correctly in all browsers, depending on the system configuration.</emph></p>-->
<function>fn:json-to-xml</function> function is <termref def="implementation-dependent">implementation-dependent</termref>.</p>
</div3>


<div3 id="json-character-repertoire">
<head>JSON character repertoire</head>
<changes>
<change issue="414" PR="546" date="2023-07-25">
The rules regarding use of non-XML characters in JSON texts have been relaxed.
</change>
</changes>
<p>The set of characters that may appear in JSON texts is not the same as
the set of characters allowed in XML. Specifically:</p>

<ulist>
<item><p>As plain unescaped characters, JSON allows any codepoint in the
numeric range 0x20 to 0x10FFFF, with the exception of <char>U+0022</char>
and <char>U+005C</char>.</p></item>
<item><p>As a backslash-escaped character, JSON allows any codepoint in the
numeric range 0x00 to 0xFFFF.</p></item>
<item><p>Whether escaped or not, the JSON grammar allows codepoints in the surrogate
range to appear, and does not explicitly require that they be properly paired.
However, the JSON specifications recognize that unpaired surrogates are likely to lead
to interoperability problems.</p></item>
</ulist>

<p>Ignoring unpaired surrogates, this means that JSON allows codepoints that are not
allowed by XML:</p>
<ulist>
<item><p>Not allowed by XML 1.0: 0x00 to 0x1F (other than 0x09, 0x0A, and 0x0D); 0xFFFE; 0xFFFF.</p></item>
<item><p>Not allowed by XML 1.1: 0x00; 0xFFFE; 0xFFFF.</p></item>
</ulist>

<p>The XDM data model (see <xspecref spec="DM40" ref="xml-and-xsd-versions"/>) allows
an implementation to define the set of <termref def="dt-permitted-character">permitted characters</termref>
in the <code>xs:string</code> data type in such a way that any
Unicode codepoint assigned to a character (which excludes surrogates)
is allowed. However, this
is not required: a conformant implementation <rfc2119>may</rfc2119> restrict the set of
codepoints to those permitted by XML 1.0 or XML 1.1.</p>

<p>In consequence, parsing of conformant JSON texts may fail if they contain codepoints
that the implementation does not support. However, if such codepoints are represented
in the input using JSON escape sequences, these specifications define
mechanisms for dealing with them, for example by substituting a replacement character.</p>

</div3>

<div3 id="func-parse-json">
<head><?function fn:parse-json?></head>
Expand Down
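
The codepoint ranges listed in the "JSON character repertoire" section can be checked mechanically. The following Python sketch is illustrative only and is not part of the commit or the specification. It assumes that an implementation restricts permitted characters to the `Char` production of XML 1.0 or XML 1.1. The names `is_xml10_char`, `is_xml11_char`, and `replace_unpermitted` are hypothetical, and the helper always passes a six-character `\uXXXX` escape to the fallback, which simplifies the two- or six-character forms the specification describes.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the codepoint matches the Char production of XML 1.0."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if the codepoint matches the Char production of XML 1.1."""
    return (
        0x01 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def replace_unpermitted(text, is_permitted=is_xml10_char,
                        fallback=lambda escape: "\uFFFD"):
    """Hypothetical helper: keep permitted characters, and replace every
    other codepoint with fallback(escape), where escape is a JSON-style
    \\uXXXX string. The default fallback substitutes U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        if is_permitted(cp):
            out.append(ch)
        else:
            # Both productions accept everything above 0xFFFF, so a rejected
            # codepoint always fits in four hexadecimal digits.
            out.append(fallback("\\u%04X" % cp))
    return "".join(out)
```

Under these assumptions, a control character such as 0x01 is replaced when the XML 1.0 predicate is used and kept when the XML 1.1 predicate is used, while 0x00, 0xFFFE, 0xFFFF, and unpaired surrogates are rejected by both.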
