From 2c3fe514e09ab088952bcc39b746a34e1d3f4645 Mon Sep 17 00:00:00 2001
From: Michael Kay
Date: Fri, 3 Jan 2025 20:05:52 +0000
Subject: [PATCH] Expound on the capabilities of collations

---
 .../src/function-catalog.xml |  20 +--
 .../src/xpath-functions.xml  | 117 ++++++++++++------
 2 files changed, 89 insertions(+), 48 deletions(-)

diff --git a/specifications/xpath-functions-40/src/function-catalog.xml b/specifications/xpath-functions-40/src/function-catalog.xml
index 3b44fb539..198583d80 100644
--- a/specifications/xpath-functions-40/src/function-catalog.xml
+++ b/specifications/xpath-functions-40/src/function-catalog.xml
@@ -24828,7 +24828,7 @@ return map:build($titles/title, fn($title) { $title/ix })
-
@@ -24838,7 +24838,8 @@ return map:build($titles/title, fn($title) { $title/ix })
focus-independent
-

Asks whether a collation URI is recognized by the implementation.

+

Asks whether a collation URI is recognized by the implementation, + and whether it has required properties.

The first argument is a candidate collation URI.

@@ -24847,12 +24848,12 @@ return map:build($titles/title, fn($title) { $title/ix }) is a sequence containing zero or more of the following:

-

equality indicates that the intended purpose of the collation - URI is to compare strings for equality, for example in functions such as - fn:index-of or fn:deep-equal.

-

sort indicates that the intended purpose of the collation - URI is to sort or compare different strings in a collating sequence, for example - in functions such as fn:sort or fn:max.

+

compare indicates that the intended purpose of the collation + URI is to compare strings for equality or ordering, for example in functions such as + fn:index-of, fn:deep-equal, + fn:compare, and fn:sort.

+

key indicates that the intended purpose of the collation + URI is to obtain collation keys for strings using the fn:collation-key function.

substring indicates that the intended purpose of the collation URI is to establish whether one string is a substring of another, for example in functions such as fn:contains or fn:starts-with.

@@ -24986,6 +24987,9 @@ return map:build($titles/title, fn($title) { $title/ix }) where $collation allows the collation to be chosen dynamically.

Note that xs:base64Binary becomes an ordered type in XPath 3.1, making binary collation keys possible.

+ +

The fn:collation-available function can be used to ask whether a particular + collation is capable of delivering collation keys.

diff --git a/specifications/xpath-functions-40/src/xpath-functions.xml b/specifications/xpath-functions-40/src/xpath-functions.xml index 26dd5fe92..d15c75a4d 100644 --- a/specifications/xpath-functions-40/src/xpath-functions.xml +++ b/specifications/xpath-functions-40/src/xpath-functions.xml @@ -2577,44 +2577,57 @@ string conversion of the number as obtained above, and the appropriate suff Collations -

A collation is a specification of the manner in which strings are - compared and, by extension, ordered. When values whose type is - xs:string or a type derived from xs:string are - compared (or, equivalently, sorted), the comparisons are inherently - performed according to some collation (even if that collation is defined - entirely on codepoint values). The observes that - some applications may require different comparison and ordering behaviors - than other applications. Similarly, some users having particular linguistic - expectations may require different behaviors than other users. Consequently, - the collation must be taken into account when comparing strings in any - context. Several functions in this and the following section make use of a - collation.

-

Collations can indicate that two different codepoints are, in fact, equal - for comparison purposes (e.g., “v” and “w” are considered equivalent in +

A collation + is an algorithm that determines, for any two given strings + S1 and S2, whether S1 is less than, + equal to, or greater than S2. In this specification, + a collation is identified by an absolute URI.

+ +

The observes that + different applications may require different comparison and ordering behaviors. + Similarly, different users with different linguistic + expectations may require different behaviors. Consequently, + the collation must be taken into account when comparing strings.

+ +

Collations can indicate that two different codepoints are to be considered equal + for comparison purposes (for example, “v” and “w” are considered equivalent in some Swedish collations). Strings can be compared codepoint-by-codepoint or in a - linguistically appropriate manner, as defined by the collation.

-

Some collations, especially those based on the - Unicode Collation Algorithm (see ) can be “tailored” for various purposes. This - document does not discuss such tailoring, nor does it provide a mechanism to - perform tailoring. Instead, it assumes that the collation argument to the - various functions below is a tailored and named collation.

-

The Unicode codepoint collation is a collation - available in every implementation, which sorts based on codepoint values. For further details + linguistically appropriate manner.

+ +

Some sources, for example use the term collation + to refer more generically to a set of sorting rules that can be further parameterized + or “tailored”. In this specification the term is always used for a specific algorithm + in which all such parameters have defined values.

+
+ +

This specification defines some collation URIs that provide interoperable + sorting behavior across applications. Other collation URIs are defined only + partially (leaving some aspects implementation-defined). Implementations may + define further collation URIs, or may allow users or third parties to define them.

+ +

The Unicode codepoint collation is + available in every implementation. This collation sorts based on codepoint values. For further details see .

Collations may or may not perform Unicode normalization on strings before comparing them.

-

This specification assumes that collations are named and that the collation - name may be provided as an argument to string functions. Functions that - allow specification of a collation do so with an argument whose type is - xs:string but whose lexical form must conform to an - xs:anyURI. - This specification also defines the manner in which a - default collation is determined if the collation argument is not specified - in calls of functions that use a collation but allow it to be omitted.

-

If the collation is specified using a relative URI reference, + +

This specification allows a collation + name to be provided as an argument to many string functions. Although + collations are identified by URIs, these are supplied as instances of + xs:string.

+ +

The XQuery/XPath static context supplies a default collation + for use when the collation argument is not specified + (see ). + If the default collation is not specified by the + user or the system, the default collation is the + Unicode codepoint collation.

+ + +

If the collation is specified using a relative URI reference, it is resolved relative to an implementation-defined base URI.

-

Previous versions of this specification stated that it must +

Previous versions of this specification stated that it must be resolved against the , but this is not always operationally convenient. It is recommended that processors should provide a means of setting the base URI for resolving collation URIs independently of the @@ -2622,6 +2635,7 @@ string conversion of the number as obtained above, and the appropriate suff the Static Base URI or Executable Base URI should be used as a default.

+

This specification does not define whether or not the collation URI is dereferenced. The collation URI may be an abstract identifier, or it may refer to an actual resource describing the collation. If it refers to a @@ -2629,7 +2643,7 @@ string conversion of the number as obtained above, and the appropriate suff One possible candidate is that the resource is a locale description expressed using the Locale Data Markup Language: see .

-

Functions such as fn:compare and fn:max that +

XML allows elements to specify the xml:lang attribute to indicate the language associated with the content of such an element. @@ -2660,6 +2669,27 @@ string conversion of the number as obtained above, and the appropriate suff when a string is multilingual.

+ + Collation Capabilities +

All collations support the ability to compare two strings to decide + whether they are equal, and if not, which one should sort first. This comparison + must always define a total ordering, which implies that it + is transitive.

+

A collation may (or may not) support the ability to derive a collation key + for a given string. A collation key is a binary value obtained as a function + of a string S and a collation C, + such that the collation keys for two strings S1 and S2 + have the same ordering relationship (less than, equal, or greater than) as + the two strings themselves, when compared under the relevant collation. + Collation keys are useful for operations such as indexing, because they + can be used as keys in maps. They are available using the + fn:collation-key function.

+

Furthermore, a collation may (or may not) support the ability to determine whether + one string is a substring of another under that collation. The use of collations + in substring matching is described in .

+

The capabilities of a collation may be determined using the + fn:collation-available function.

+
The Unicode Codepoint Collation

The collation URI @@ -2693,6 +2723,10 @@ string conversion of the number as obtained above, and the appropriate suff

While the Unicode codepoint collation does not produce results suitable for quality publishing of printed indexes or directories, it is adequate for many purposes where a restricted alphabet is used, such as sorting of vehicle registrations.

+ +

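The Unicode codepoint collation differs from the + default sort order used in programming languages that sort strings + based on UTF-16 code units. Characters above U+FFFF are represented in UTF-16 as + surrogate pairs, whose code units sort before those of the characters U+E000 to U+FFFF.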
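
The difference between codepoint order and UTF-16 code-unit order can be sketched outside XPath. The following Python fragment is an illustration only, not part of the specification. The function names are hypothetical. It assumes that Python's built-in string comparison (which compares by codepoint) stands in for the Unicode codepoint collation, and that comparing big-endian UTF-16 bytes stands in for a language that sorts by UTF-16 code units.

```python
def codepoint_order(strings):
    """Sort by Unicode codepoint; Python compares str values this way."""
    return sorted(strings)


def utf16_code_unit_order(strings):
    """Sort by UTF-16 code units.

    Comparing the big-endian UTF-16 bytes lexicographically gives the
    same result as comparing the sequences of 16-bit code units.
    """
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


# U+FF5E lies in the Basic Multilingual Plane (one code unit, 0xFF5E).
bmp_char = "\uff5e"
# U+1F600 lies above U+FFFF (surrogate pair 0xD83D 0xDE00 in UTF-16).
astral_char = "\U0001f600"

by_codepoint = codepoint_order([astral_char, bmp_char])
by_code_unit = utf16_code_unit_order([bmp_char, astral_char])
```

Under these assumptions `by_codepoint` places U+FF5E first, while `by_code_unit` places U+1F600 first, because its leading surrogate 0xD83D is smaller than 0xFF5E.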

The Unicode Collation Algorithm @@ -3011,7 +3045,10 @@ string conversion of the number as obtained above, and the appropriate suff compare two strings, but that does not have the capability to split the string into collation units. Such a collation may cause the function to fail, or to give unexpected results, or it may be rejected as an unsuitable argument. The - ability to decompose strings into collation units is an property of the collation.

+ ability to decompose strings into collation units is an + property of the collation. + The fn:collation-available function can be used to ask + whether a particular collation has this property.