Merge pull request qt4cg#770 from ndw/decode-uri

566: Use fn:decode-from-uri in fn:parse-uri
michaelhkay · Oct 31, 2023 · 8649197
2 parents 7ddbe06 + a146191
commit 8649197
Showing 2 changed files with 96 additions and 87 deletions.
diff --git a/specifications/xpath-functions-40/src/function-catalog.xml b/specifications/xpath-functions-40/src/function-catalog.xml
@@ -27534,6 +27534,8 @@ declare function some(
          any backslashes (<code>\</code>), replace them with forward
          slashes (<code>/</code>).</p>
 
+         <p>Strip off the fragment identifier and any query:</p>
+
          <p>If the <emph>string</emph> matches <code>^(.*)#([^#]*)$</code>,
          the <emph>string</emph> is the first match group and the
          <emph>fragment</emph> is the second match group. Otherwise,
@@ -27546,6 +27548,8 @@ declare function some(
          the string is unchanged and the <emph>query</emph> is the empty
          sequence.</p>
 
+         <p>Attempt to identify the scheme:</p>
+
          <ulist>
             <item>
                <p>If the <emph>string</emph> matches <code>^[a-zA-Z][:|].*$</code>:</p>
@@ -27573,42 +27577,54 @@ declare function some(
          </item>
          </ulist>
 
-         <p>If the <emph>scheme</emph> is the empty sequence, the
-         <code>unc-path</code> option is <code>true</code>, and the <emph>string</emph>
-         matches <code>^//[^/].*$</code>, then the scheme is <code>file</code>
-         and the <emph>filepath</emph> is the <emph>string</emph>.
-         </p>
+         <p>Now that the scheme, if there is one, has been identified,
+         determine if the URI is hierarchical:</p>
 
+         <ulist>
+            <item>
          <p>If the <emph>scheme</emph> is known to be hierarchical, or known
          not to be hierarchical, then <emph>hierarchical</emph> is set accordingly.
-         Exactly which schemes are known to be hierarchical or
-         non-hierarchical is
-         <termref def="implementation-defined">implementation-defined</termref>.
          If the implementation does not know if a <emph>scheme</emph> is or is not
          hierarchical, the <emph>hierarchical</emph> setting depends on the
          <emph>string</emph>. If the <emph>string</emph> is the empty string,
          <emph>hierarchical</emph> is the empty sequence (<emph>i.e.</emph> not known),
          otherwise <emph>hierarchical</emph> is
-         <code>true</code> if <emph>string</emph> begins with <code>/</code> and <code>false</code> otherwise.</p>
+         <code>true</code> if <emph>string</emph> begins with <code>/</code> and
+         <code>false</code> otherwise.</p>
+            </item>
+         </ulist>
 
-         <p>If <phrase diff="add" at="2023-07-07">the scheme is not known or is known to be <code>file</code> and</phrase>
-         the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
-         the <emph>authority</emph> is empty and the <emph>string</emph> is
-         the first match group. Otherwise, if the <emph>string</emph>
-         matches <code>^///*([^/]+)(/.*)?$</code> then the <emph>authority</emph>
-         is the first match group and the <emph>string</emph> is the second
-         match group. If the <emph>string</emph> does not match either
-         regular expression, the <emph>authority</emph> is the empty sequence
-         and the <emph>string</emph> is unchanged.</p>
+         <p>Then examine the remaining parts of the string.</p>
 
-         <p>If the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
+         <ulist>
+            <item>
+         <p>If the <emph>scheme</emph> is the empty sequence, the
+         <code>unc-path</code> option is <code>true</code>, and the
+         <emph>string</emph> matches <code>^//[^/].*$</code>, then the
+         scheme is <code>file</code>, the <emph>authority</emph> is
+         empty, and the <emph>filepath</emph> is the
+         <emph>string</emph>.
+         </p>
+            </item>
+            <item>
+         <p>Otherwise:</p>
+
+         <ulist>
+            <item>
+         <p>If the scheme is not known or is known to be <code>file</code>
+         and the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
          the <emph>authority</emph> is empty and the <emph>string</emph> is
-         the first match group. Otherwise, if the <emph>string</emph>
-         matches <code>^///*([^/]+)(/.*)?$</code> then the <emph>authority</emph>
+         the first match group.</p></item>
+         <item><p>Otherwise, if the <emph>string</emph>
+         matches <code>^///*([^/]+)?(/.*)?$</code>, the <emph>authority</emph>
          is the first match group and the <emph>string</emph> is the second
-         match group. If the <emph>string</emph> does not match either
+         match group.</p></item>
+         <item><p>Finally, if the <emph>string</emph> does not match either
          regular expression, the <emph>authority</emph> is the empty sequence
-         and the <emph>string</emph> is unchanged.</p>
+         and the <emph>string</emph> is unchanged.</p></item>
+         </ulist>
+            </item>
+         </ulist>
 
          <p>If the <emph>authority</emph> matches
          <code>^(([^@]*)@)(.*)(:([^:]*))?$</code>,
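A rough Python sketch of the authority-splitting steps in the hunk above, for anyone who wants to try the two regular expressions outside an XPath processor. The function name `split_authority`, the use of `None` for the empty sequence, and the use of `""` both for an "empty" authority and for groups that do not participate in the match are assumptions of this sketch, not wording from the specification.

```python
import re

# Regular expressions copied from the added lines in the hunk above.
DRIVE_PATH = re.compile(r"^//*([a-zA-Z]:.*)$")
AUTHORITY = re.compile(r"^///*([^/]+)?(/.*)?$")


def split_authority(string, scheme=None):
    """Return (authority, remaining_string) for an already scheme-stripped string.

    Assumption: scheme=None stands for "the scheme is not known".
    """
    # Case 1: scheme unknown or "file", and the string looks like a drive path.
    if scheme in (None, "file"):
        m = DRIVE_PATH.match(string)
        if m:
            return "", m.group(1)
    # Case 2: double-slash form; either group may be absent.
    m = AUTHORITY.match(string)
    if m:
        return m.group(1) or "", m.group(2) or ""
    # Case 3: neither expression matches; nothing changes.
    return None, string
```

For example, `split_authority("//example.com/path")` yields the pair `("example.com", "/path")`, while a string with a single leading slash falls through to the last case.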
@@ -27657,23 +27673,23 @@ declare function some(
          <p>Similar care must be taken to match the port because an IPv6/IPvFuture
          address may contain a colon.</p>
 
-         <olist>
+         <ulist>
             <item>
                <p>If the <emph>authority</emph> matches
                <code>^(([^@]*)@)?(\[[^\]]*\])(:([^:]*))?$</code>,
-               then the <emph>port</emph> is match group 5, otherwise
+               then the <emph>port</emph> is match group 5.
                </p>
             </item>
             <item>
-               <p>If the <emph>authority</emph> matches
+               <p>Otherwise, if the <emph>authority</emph> matches
                <code>^(([^@]*)@)?([^:]+)(:([^:]*))?$</code>,
-               then the <emph>port</emph> is match group 5, otherwise
+               then the <emph>port</emph> is match group 5.
                </p>
             </item>
             <item>
-               <p>the <emph>port</emph> is the empty sequence.</p>
+               <p>Otherwise, the <emph>port</emph> is the empty sequence.</p>
             </item>
-         </olist>
+         </ulist>
 
          <p>If the <code>omit-default-ports</code> option is <code>true</code>,  the port
          is discarded and set to the empty sequence if the port number is the same
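The three-way port rule in the hunk above can be sketched in Python as follows. This is only an illustration: `port_of` is a hypothetical helper name, and `None` stands in for the empty sequence (including the case where the expression matches but group 5 is absent).

```python
import re

# Expressions copied from the list items in the hunk above.
BRACKETED_HOST = re.compile(r"^(([^@]*)@)?(\[[^\]]*\])(:([^:]*))?$")
PLAIN_HOST = re.compile(r"^(([^@]*)@)?([^:]+)(:([^:]*))?$")


def port_of(authority):
    """Try the bracketed (IPv6/IPvFuture) form first, then the plain form."""
    for pattern in (BRACKETED_HOST, PLAIN_HOST):
        m = pattern.match(authority)
        if m:
            # Group 5 is the text after the last colon, if any.
            return m.group(5)
    # Neither expression matches: no port.
    return None
```

Trying the bracketed form first matters because an address such as `[::1]:80` contains colons inside the brackets that the plain expression cannot handle.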
@@ -27697,20 +27713,8 @@ declare function some(
          separator</emph> and applying <emph>uri decoding</emph> on each
          token.</p>
 
-         <p>Applying <emph>uri decoding</emph> replaces all occurrences of
-         plus (<code>+</code>) with spaces and all occurrences of
-         <code>%[a-fA-F0-9][a-fA-F0-9]</code> with a single character with the
-         codepoint represented by the two digit hexadecimal number that
-         follows the <code>%</code> character. In other words, <code>"A%42C"</code> becomes
-         <code>"ABC"</code> If there are any occurrences of <code>%</code> followed
-         by up to two characters that are not hexadecimal digits, they are
-         replaced by the character sequence <code>0xef</code>, <code>0xbf</code>, <code>0xbd</code>
-         (that is, <code>0xfffd</code>, the Unicode replacement character, in UTF-8).
-         After replacing all of the percent-escaped characters, the character sequence is
-         interpreted as UTF-8 to get the string. In other words <code>"A%XYC%Z%F0%9F%92%A9"</code> becomes
-         <code>"A&#xfffd;C&#xfffd;💩"</code>. <phrase diff="add" at="2023-07-07">If the character sequence is
-         not a valid sequence of UTF-8 characters, any invalid characters are replaced with the
-         <code>0xfffd</code>.</phrase></p>
+         <p>Applying <emph>uri decoding</emph> is equivalent to
+         calling <code>fn:decode-from-uri</code> on the string.</p>
 
          <p>The <emph>query separator</emph> is the value of the
          <code>query-separator</code> option.
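For reference, the prose removed by this hunk can be approximated in Python as below. This sketch follows the removed description only; whether `fn:decode-from-uri` agrees with it in every edge case is not confirmed here. The name `uri_decode` is hypothetical, and treating a `%` that is followed by another `%` as a malformed escape of length one is an assumption made so that the removed example comes out as stated.

```python
import re

# Either a well-formed %HH escape, or a "%" followed by up to two
# characters that are neither hex digits nor another "%".
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})|%[^0-9A-Fa-f%]{0,2}")
_REPLACEMENT = "\ufffd".encode("utf-8")  # bytes EF BF BD


def uri_decode(text):
    """Decode "+" and percent escapes, then interpret the bytes as UTF-8."""
    text = text.replace("+", " ")
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:
            out.append(int(m.group(1), 16))  # well-formed escape
        else:
            out += _REPLACEMENT  # malformed escape
        pos = m.end()
    out += text[pos:].encode("utf-8")
    # Invalid UTF-8 sequences also become U+FFFD.
    return out.decode("utf-8", errors="replace")
```

With this sketch the two examples from the removed paragraph behave as described there: `"A%42C"` decodes to `"ABC"`, and `"A%XYC%Z%F0%9F%92%A9"` decodes to `A`, U+FFFD, `C`, U+FFFD, and the emoji.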
@@ -28292,20 +28296,26 @@ path with an explicit <code>file:</code> scheme.</p>
         <p>The components are derived from the contents of the <code>$parts</code>
         map in the following way:</p>
 
-        <p>If the <code>scheme</code> key is present in the map, the URI begins
-        with the value of that key. A URI is considered to be non-hierarchical
-        if either the <code>hierarchical</code> key is present in the 
-        <code>$parts</code> map with the value
-        <code>false()</code> or if the scheme is known to be non-hierarchical.
-        (In other words, schemes are hierarchical by default.)</p>
-
-        <p>If the <code>scheme</code> is <code>file</code> and the <code>unc-path</code>
-        option is <code>true</code>, the scheme is delimited by a trailing <code>:////</code>,
-        otherwise, if the URI is non-hierarchical, the scheme is delimited by
-        a trailing <code>:</code>. For all other schemes, it is delimited by
-        a trailing <code>://</code>. Exactly which schemes are known to be
-        non-hierarchical is
-        <termref def="implementation-defined">implementation-defined</termref>.</p>
+        <p>If the <code>scheme</code> key is present in the map,
+        the URI begins with the value of that key. A URI is considered to be
+        non-hierarchical if either the <code>hierarchical</code> key
+        is present in the <code>$parts</code> map with the value
+        <code>false()</code> or if the scheme is known to be
+        non-hierarchical. (In other words, schemes are hierarchical by
+        default.)</p>
+
+        <ulist>
+           <item><p>If the <code>scheme</code> is
+        known to be non-hierarchical, it is delimited by a trailing
+        <code>:</code>.</p>
+        </item>
+        <item><p>Otherwise, if the <code>scheme</code> is <code>file</code> and the <code>unc-path</code>
+        option is <code>true</code>, the scheme is delimited by a trailing <code>:////</code>.</p>
+        </item>
+        <item><p>Otherwise, the scheme is delimited by
+        a trailing <code>://</code>.</p>
+        </item>
+        </ulist>
 
         <p>For simplicity of exposition, we take the
         <code>userinfo</code>, <code>host</code>, and
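The reordered delimiter rules in the hunk above reduce to a short decision chain. The Python below is a sketch under stated assumptions: `scheme_delimiter` is a hypothetical name, the scheme set is the mandatory list from the `xpath-functions.xml` change in this same pull request (an implementation may know more), and the `hierarchical` argument reflects the preceding paragraph, which also treats a `hierarchical` key of `false()` as non-hierarchical.

```python
# Schemes that the xpath-functions.xml change requires to be non-hierarchical.
NON_HIERARCHICAL = {"jar", "mailto", "news", "tag", "tel", "urn"}


def scheme_delimiter(scheme, unc_path=False, hierarchical=None):
    """Pick the text that follows the scheme when building a URI string."""
    # Assumption: an explicit hierarchical=False counts as non-hierarchical.
    if hierarchical is False or scheme in NON_HIERARCHICAL:
        return ":"
    if scheme == "file" and unc_path:
        return ":////"
    # Schemes are hierarchical by default.
    return "://"
```

Checking the non-hierarchical case first mirrors the new list order, where that rule takes precedence over the `file` and `unc-path` rule.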
@@ -28501,4 +28511,4 @@ path with an explicit <code>file:</code> scheme.</p>
       </fos:history>
    </fos:function>
 
-</fos:functions>
+</fos:functions>
diff --git a/specifications/xpath-functions-40/src/xpath-functions.xml b/specifications/xpath-functions-40/src/xpath-functions.xml
@@ -3305,15 +3305,22 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
          URIs, to identify their structure, and construct URI strings
          from their structured representation.</p>
 
+         <p>Some URI schemes are hierarchical and some are non-hierarchical.
+         Implementations must treat the following schemes as non-hierarchical:
+         <code>jar</code>, <code>mailto</code>, <code>news</code>, <code>tag</code>,
+         <code>tel</code>, and <code>urn</code>. Whether additional schemes
+         are known to be non-hierarchical is
+         <termref def="implementation-defined">implementation-defined</termref>.
+         If a scheme is not known to be non-hierarchical, it must be
+         treated as hierarchical.</p>
+
          <?local-function-index?>
 
            <p>The structured representation of a URI is described by the
            <code>uri-structure-record</code>:</p>
 
             <?type uri-structure-record?>
 
-
-
            <p>The parts of this structure are:</p>
 
            <table border="0" role="data">
@@ -3361,7 +3368,7 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
                  <td>Parsed and unescaped path segments.</td>
                </tr>
                <tr>
-                 <td>query-segments</td>
+                 <td>query-parameters</td>
                  <td>Parsed and unescaped query terms</td>
                </tr>
                <tr>
@@ -3372,39 +3379,31 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
            </table>
 
            <p>The segmented forms of the path and query parameters provide
-           convenient access to commonly used information. They’re represented
-           in the map as arrays, instead of sequences, just for the convenience
-           of serializing the structure.</p>
+           convenient access to commonly used information.</p>
 
            <p>The path, if there is one, is tokenized on “/” characters and
-           each segment is unesaped. Consider the URI <code>http://example.com/path/to/a%2fb</code>. The path portion has to be returned as <code>/path/to/a%2fb</code> because
+           each segment is unescaped (as per the <code>fn:decode-from-uri</code> function). Consider the URI
+           <code>http://example.com/path/to/a%2fb</code>.
+           The path portion has to be returned as <code>/path/to/a%2fb</code> because
            decoding the <code>%2f</code> would change the nature of the path.
-           The unescaped form is easily accessible from the path-segments array:</p>
-
-<eg>[
-  "",
-  "path",
-  "to",
-  "a/b"
-]</eg>
+           The unescaped form is easily accessible from the path-segments list:</p>
+
+           <eg>("", "path", "to", "a/b")</eg>
+
            <p>Note that the presence or absence of a leading slash on the path
            will affect whether or not the array begins with an empty string.</p>
 
-           <p>The query parameters are similarly decoded. Consider the URI:
+           <p>The query parameters are decoded into a map. Consider the URI:
            <code>http://example.com/path?a=1&amp;b=2%264&amp;a=3</code>.
-           Here the decoded form in the query-segments gives quick access to
-           the parameter values:</p>
-
-           <eg>[
-  { "key": "a",
-    "value": "1" },
-  { "key": "b",
-    "value": "2&amp;4" },
-  { "key": "a",
-    "value": "3" }
-]</eg>
-           <p>Note that both keys and values are unescaped and that it’s an array
-           of maps because key values can be repeated, as seen for <code>a</code>
+           The decoded form in the query-parameters is the following map:</p>
+
+           <eg>{ "a": ("1", "3"),
+  "b": "2&amp;4"
+}
+</eg>
+           <p>Note that both keys and values are unescaped. If a key
+           is repeated in the query string, the map will contain a
+           sequence of values for that key, as seen for <code>a</code>
            in this example.</p>
 
            <div3 id="func-parse-uri">
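The two examples in the hunk above (path segments and the query-parameters map) can be reproduced with Python's standard `urllib.parse` module. This is an analogy, not the XPath function: `path_segments` and `query_parameters` are hypothetical helper names, Python lists stand in for XPath sequences, and `parse_qs` always returns a list even for a single value. In XML source the ampersands appear as `&amp;`; the actual URI uses `&`.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def path_segments(uri):
    """Split the raw path on "/" first, then unescape each segment."""
    raw_path = urlsplit(uri).path
    return [unquote(segment) for segment in raw_path.split("/")]


def query_parameters(uri):
    """Map each decoded key to the list of its decoded values."""
    return parse_qs(urlsplit(uri).query)


if __name__ == "__main__":
    print(path_segments("http://example.com/path/to/a%2fb"))
    print(query_parameters("http://example.com/path?a=1&b=2%264&a=3"))
```

Splitting before decoding is what keeps the `%2f` inside a single segment: the raw path stays `/path/to/a%2fb`, while the last decoded segment is `a/b`.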