Expound on the capabilities of collations
michaelhkay committed Jan 3, 2025
1 parent 4eaeba1 commit 2c3fe51
Showing 2 changed files with 89 additions and 48 deletions.
20 changes: 12 additions & 8 deletions specifications/xpath-functions-40/src/function-catalog.xml
@@ -24828,7 +24828,7 @@ return map:build($titles/title, fn($title) { $title/ix })
<fos:signatures>
<fos:proto name="collation-available" return-type="xs:boolean">
<fos:arg name="collation" type="xs:string"/>
<fos:arg name="usage" type="enum('equality', 'sort', 'substring')*"
<fos:arg name="usage" type="enum('compare', 'key', 'substring')*"
default="()"/>
</fos:proto>
</fos:signatures>
@@ -24838,7 +24838,8 @@ return map:build($titles/title, fn($title) { $title/ix })
<fos:property>focus-independent</fos:property>
</fos:properties>
<fos:summary>
<p>Asks whether a collation URI is recognized by the implementation.</p>
<p>Asks whether a collation URI is recognized by the implementation,
and whether it has required properties.</p>
</fos:summary>
<fos:rules>
<p>The first argument is a candidate collation URI.</p>
@@ -24847,12 +24848,12 @@ return map:build($titles/title, fn($title) { $title/ix })
is a sequence containing zero or more of the following:</p>

<ulist>
<item><p><code>equality</code> indicates that the intended purpose of the collation
URI is to compare strings for equality, for example in functions such as
<function>fn:index-of</function> or <function>fn:deep-equal</function>.</p></item>
<item><p><code>sort</code> indicates that the intended purpose of the collation
URI is to sort or compare different strings in a collating sequence, for example
in functions such as <function>fn:sort</function> or <function>fn:max</function>.</p></item>
<item><p><code>compare</code> indicates that the intended purpose of the collation
URI is to compare strings for equality or ordering, for example in functions such as
<function>fn:index-of</function>, <function>fn:deep-equal</function>,
<function>fn:compare</function>, and <function>fn:sort</function>.</p></item>
<item><p><code>key</code> indicates that the intended purpose of the collation
URI is to obtain collation keys for strings using the <function>fn:collation-key</function> function.</p></item>
<item><p><code>substring</code> indicates that the intended purpose of the collation
URI is to establish whether one string is a substring of another, for example
in functions such as <function>fn:contains</function> or <function>fn:starts-with</function>.</p></item>
@@ -24986,6 +24987,9 @@ return map:build($titles/title, fn($title) { $title/ix })
where <code>$collation</code> allows the collation to be chosen dynamically.</p>
<p>Note that <code>xs:base64Binary</code> becomes an ordered type
in XPath 3.1, making binary collation keys possible.</p>

<p>The <function>fn:collation-available</function> function can be used to ask whether a particular
collation is capable of delivering collation keys.</p>

</fos:notes>
<fos:examples>
117 changes: 77 additions & 40 deletions specifications/xpath-functions-40/src/xpath-functions.xml
@@ -2577,59 +2577,73 @@ string conversion of the number as obtained above, and the appropriate <var>suff
<?local-function-index?>
<div3 id="collations">
<head>Collations</head>
<p> A collation is a specification of the manner in which <termref def="string">strings</termref> are
compared and, by extension, ordered. When values whose type is
<code>xs:string</code> or a type derived from <code>xs:string</code> are
compared (or, equivalently, sorted), the comparisons are inherently
performed according to some collation (even if that collation is defined
entirely on codepoint values). The <bibref ref="charmod"/> observes that
some applications may require different comparison and ordering behaviors
than other applications. Similarly, some users having particular linguistic
expectations may require different behaviors than other users. Consequently,
the collation must be taken into account when comparing strings in any
context. Several functions in this and the following section make use of a
collation. </p>
<p>Collations can indicate that two different codepoints are, in fact, equal
for comparison purposes (e.g., “v” and “w” are considered equivalent in
<p><termdef id="dt-collation" term="collation"> A <term>collation</term>
is an algorithm that determines, for any two given strings
<var>S1</var> and <var>S2</var>, whether <var>S1</var> is less than,
equal to, or greater than <var>S2</var>. In this specification,
a collation is identified by an absolute URI.</termdef></p>

<p>The <bibref ref="charmod"/> observes that
different applications may require different comparison and ordering behaviors.
Similarly, different users with different linguistic
expectations may require different behaviors. Consequently,
the collation must be taken into account when comparing strings.</p>

<p>Collations can indicate that two different codepoints are to be considered equal
for comparison purposes (for example, “v” and “w” are considered equivalent in
some Swedish collations). Strings can be compared codepoint-by-codepoint or in a
linguistically appropriate manner, as defined by the collation. </p>
<p>Some collations, especially those based on the
Unicode Collation Algorithm (see <bibref ref="UNICODE-TR10"/>) can be “tailored” for various purposes. This
document does not discuss such tailoring, nor does it provide a mechanism to
perform tailoring. Instead, it assumes that the collation argument to the
various functions below is a tailored and named collation.</p>
<p>The <termref def="dt-codepoint-collation">Unicode codepoint collation</termref> is a collation
available in every implementation, which sorts based on codepoint values. For further details
linguistically appropriate manner.</p>
<note>
<p>Some sources, for example <bibref ref="UNICODE-TR10"/>, use the term <term>collation</term>
to refer more generically to a set of sorting rules that can be further parameterized
or “tailored”. In this specification the term is always used for a specific algorithm
in which all such parameters have defined values.</p>
</note>

<p>This specification defines some collation URIs that provide interoperable
sorting behavior across applications. Other collation URIs are defined only
partially (leaving some aspects implementation-defined). Implementations may
define further collation URIs, or may allow users or third parties to define them.</p>

<p>The <termref def="dt-codepoint-collation">Unicode codepoint collation</termref> is
available in every implementation. This collation sorts based on codepoint values. For further details
see <specref ref="codepoint-collation"/>.</p>


<p>Collations may or may not perform Unicode normalization on strings before comparing them.</p>
<p>This specification assumes that collations are named and that the collation
name may be provided as an argument to string functions. Functions that
allow specification of a collation do so with an argument whose type is
<code>xs:string</code> but whose lexical form must conform to an
<code>xs:anyURI</code>.
This specification also defines the manner in which a
default collation is determined if the collation argument is not specified
in calls of functions that use a collation but allow it to be omitted. </p>
<p diff="chg" at="2023-05-29">If the collation is specified using a relative URI reference,

<p>This specification allows a collation
name to be provided as an argument to many string functions. Although
collations are identified by URIs, they are supplied as instances of
<code>xs:string</code>.</p>

<p>The XQuery/XPath static context supplies a default collation
for use when the collation argument is not specified
(see <xspecref spec="XP31" ref="static_context"/>).
If the default collation is not specified by the
user or the system, the default collation is the
<termref def="dt-codepoint-collation">Unicode codepoint collation</termref>.</p>


<p>If the collation is specified using a relative URI reference,
it is resolved relative to an <termref def="impl-def">implementation-defined</termref> base URI.</p>
<note diff="chg" at="2023-05-29"><p>Previous versions of this specification stated that it must
<note><p>Previous versions of this specification stated that it must
be resolved against the <xtermref spec="XP40" ref="dt-static-base-uri"/>, but this is not always
operationally convenient. It is <rfc2119>recommended</rfc2119> that processors should provide
a means of setting the base URI for resolving collation URIs independently of the
<xtermref spec="XP40" ref="dt-static-base-uri"/>, though for backwards compatibility,
the <xtermref spec="XP40" ref="dt-static-base-uri">Static Base URI</xtermref> or
<xtermref spec="XP40" ref="dt-executable-base-uri">Executable Base URI</xtermref>
should be used as a default.</p></note>

<p>This specification does not define whether or not the collation URI is
dereferenced. The collation URI may be an abstract identifier, or it may
refer to an actual resource describing the collation. If it refers to a
resource, this specification does not define the nature of that resource.
One possible candidate is that the resource is a locale description
expressed using the Locale Data Markup Language: see <bibref ref="UNICODE-TR35"/>.
</p>
<p>Functions such as <function>fn:compare</function> and <function>fn:max</function> that
<!--<p>Functions such as <function>fn:compare</function> and <function>fn:max</function> that
compare <code>xs:string</code> values use a single collation URI to identify
all aspects of the collation rules. This means that any parameters such as
the strength of the collation must be specified as part of the collation
@@ -2644,12 +2658,7 @@ string conversion of the number as obtained above, and the appropriate <var>suff
<code>http://www.example.com/collations/French2</code>.
Note that some specifications use the term collation to refer to
an algorithm that can be parameterized, but in this specification, each
possible parameterization is considered to be a distinct collation.</p>
<p>The XQuery/XPath static context includes a provision for a default collation
that can be used for string comparisons and ordering operations. See the
description of the static context in <xspecref spec="XP31" ref="static_context"/>.
If the default collation is not specified by the
user or the system, the default collation is the <termref def="dt-codepoint-collation">Unicode codepoint collation</termref>.</p>
possible parameterization is considered to be a distinct collation.</p>-->
<note>
<p>XML allows elements to specify the <code>xml:lang</code> attribute to
indicate the language associated with the content of such an element.
@@ -2660,6 +2669,27 @@ string conversion of the number as obtained above, and the appropriate <var>suff
when a string is multilingual. </p>
</note>
</div3>
<div3 id="collation-capabilities">
<head>Collation Capabilities</head>
<p>All collations support the ability to compare two strings to decide
whether they are equal, and if not, which one should sort first. The
comparison must always define a total ordering, which implies that it
is transitive.</p>
<p>A collation may (or may not) support the ability to derive a <term>collation key</term>
for a given string. A collation key is a binary value obtained as a function
of a string <var>S</var> and a collation <var>C</var>,
such that the collation keys for two strings <var>S1</var> and <var>S2</var>
have the same ordering relationship (less than, equal, or greater than) as
the two strings themselves, when compared under the relevant collation.
Collation keys are useful for operations such as indexing, because they
can be used as keys in maps. They are available using the
<function>fn:collation-key</function> function.</p>
<p>Furthermore, a collation may (or may not) support the ability to determine whether
one string is a substring of another under that collation. The use of collations
in substring matching is described in <specref ref="substring.functions"/>.</p>
<p>The capabilities of a collation may be determined using the
<function>fn:collation-available</function> function.</p>
</div3>
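Reviewer's illustration (not part of the specification text): the collation-key property described in the new "Collation Capabilities" section can be sketched in Python. This is a toy model under stated assumptions. The "collation" here is simply case folding followed by codepoint comparison, and the names `toy_compare`, `toy_collation_key`, and `three_way` are hypothetical helpers, not anything defined by the specification. The sketch checks that the binary keys order the same way as the strings do under the toy collation, and that the keys can serve as map keys for indexing.

```python
from itertools import product


def three_way(x, y) -> int:
    """Return -1, 0, or 1 according to whether x is less than, equal to, or greater than y."""
    return (x > y) - (x < y)


def toy_compare(s1: str, s2: str) -> int:
    """Hypothetical collation: compare case-folded strings by codepoint."""
    return three_way(s1.casefold(), s2.casefold())


def toy_collation_key(s: str) -> bytes:
    """Hypothetical collation key: UTF-8 bytes of the case-folded string.

    UTF-8 byte order matches codepoint order, so comparing keys gives the
    same result as toy_compare.
    """
    return s.casefold().encode("utf-8")


samples = ["apple", "Apple", "APPLE", "banana", "Straße", "STRASSE", "éclair"]

# The keys must have the same ordering relationship as the strings themselves.
keys_agree = all(
    toy_compare(a, b) == three_way(toy_collation_key(a), toy_collation_key(b))
    for a, b in product(samples, repeat=2)
)

# Because keys are plain binary values, they can be used as keys in a map.
index: dict[bytes, list[str]] = {}
for word in samples:
    index.setdefault(toy_collation_key(word), []).append(word)
```

In this sketch, strings that the toy collation treats as equal (for example "Straße" and "STRASSE") share a single index entry. A real collation that does not support keys would have no counterpart to `toy_collation_key`, which is the situation the `key` usage value of `fn:collation-available` is meant to detect.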
<div3 id="codepoint-collation">
<head>The Unicode Codepoint Collation</head>
<p><termdef id="dt-codepoint-collation" term="Unicode codepoint collation">The collation URI
@@ -2693,6 +2723,10 @@ string conversion of the number as obtained above, and the appropriate <var>suff
<note><p>While the Unicode codepoint collation does not produce results suitable for quality publishing of
printed indexes or directories, it is adequate for many purposes where a restricted alphabet
is used, such as sorting of vehicle registrations.</p></note>

<note><p>The Unicode codepoint collation differs from the
default sort order used in programming languages that sort strings
based on UTF-16 code units, which may include surrogate pairs.</p></note>
</div3>
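Reviewer's illustration (not part of the specification text): the added note about UTF-16 code units can be made concrete with a small Python sketch. Python compares `str` values by codepoint, so `sorted` without a key behaves like the Unicode codepoint collation; the helper `utf16_sort_key` is a hypothetical name introduced here to emulate languages that order strings by UTF-16 code units. The two sample characters are chosen so that one lies in the BMP above the surrogate range and the other needs a surrogate pair.

```python
def utf16_sort_key(s: str) -> bytes:
    """Key that orders strings by UTF-16 code units.

    Big-endian encoding without a BOM makes bytewise comparison equivalent
    to comparing the sequence of 16-bit code units.
    """
    return s.encode("utf-16-be")


bmp_char = "\uff61"          # U+FF61, a BMP character above the surrogate range
astral_char = "\U00010000"   # U+10000, encoded in UTF-16 as the pair D800 DC00

words = [astral_char, bmp_char]

# Codepoint order: U+FF61 sorts before U+10000.
by_codepoint = sorted(words)

# UTF-16 code unit order: the lead surrogate D800 sorts before FF61.
by_utf16_units = sorted(words, key=utf16_sort_key)
```

The two orderings disagree only when a supplementary character is compared with a BMP character in the range U+E000 to U+FFFF; for all other pairs the results coincide.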
<div3 id="uca-collations">
<head>The Unicode Collation Algorithm</head>
@@ -3011,7 +3045,10 @@ string conversion of the number as obtained above, and the appropriate <var>suff
compare two strings, but that does not have the capability to split the string
into collation units. Such a collation may cause the function to fail, or to
give unexpected results, or it may be rejected as an unsuitable argument. The
ability to decompose strings into collation units is an <termref def="implementation-defined"/> property of the collation.</p>
ability to decompose strings into collation units is an
<termref def="implementation-defined"/> property of the collation.
The <function>fn:collation-available</function> function can be used to ask
whether a particular collation has this property.</p>
<?local-function-index?>
<div3 id="func-contains">
<head><?function fn:contains?></head>