From 210e0f067f2091ead7d6aa881053de7769ee9339 Mon Sep 17 00:00:00 2001
From: Michael Kay
Date: Fri, 6 Dec 2024 21:46:42 +0000
Subject: [PATCH] Normalize line endings in CSV prior to parsing

---
 .../src/function-catalog.xml | 54 +++++++++----------
 .../src/xpath-functions.xml | 28 ++++++----
 2 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/specifications/xpath-functions-40/src/function-catalog.xml b/specifications/xpath-functions-40/src/function-catalog.xml
index 34d704160..f6f1fd196 100644
--- a/specifications/xpath-functions-40/src/function-catalog.xml
+++ b/specifications/xpath-functions-40/src/function-catalog.xml
@@ -25867,7 +25867,8 @@ return json-to-xml($json, $options)]]>
The character used to delimit rows within the CSV string. An instance of xs:string whose length is exactly one.
- Defaults to a single newline character (U+000A).
+ Defaults to a single newline character (U+000A).
+ Note that this is tested after line endings are normalized.
xs:string char('\n')
@@ -25891,7 +25892,7 @@ return json-to-xml($json, $options)]]>
-
+
Determines whether the first row of the CSV should be treated as a list of column names, or whether column names are being supplied by the caller.
@@ -25978,11 +25979,10 @@ return json-to-xml($json, $options)]]>
- The default row delimiter is a single newline character U+000A. If the content
- is read using the unparsed-text function, alternative line endings
- such as CR and CRLF will have been normalized to a single
- newline. In other cases, this normalization can be achieved by setting the
- normalize-newlines option.

+ The default row delimiter is a single newline character U+000A.
+ Alternative line endings
+ such as CR and CRLF will already have been normalized to a single
+ newline.

All fields are returned as xs:string values.

Quoted fields in the input are returned without the quotes.

For more discussion of the returned data, see .

@@ -26234,7 +26234,9 @@ return (

The $value argument is CSV data, as defined in , in the form of an
- xs:string value. The function parses this string.
+ xs:string value. The function parses this string,
+ after normalizing newlines so that U+000D and (U+000D, U+000A)
+ sequences are converted to U+000A.
The result of the function is a sequence of arrays of strings, that is array(xs:string)*; each array represents one row of the CSV input.

@@ -26289,7 +26291,7 @@ return (
- +

An empty field is represented by a zero-length string. An empty field is deemed to exist
@@ -26322,8 +26324,8 @@ return (
contain no rows; while if $value consists of a single row delimiter, it is considered to contain a single blank row. The presence or absence of a final row delimiter generally has no effect on the result,
- except in the situation described in the previous paragraph where it causes a
- blank row to exist.
+ except when it appears at the start of the input, in which case it causes a
+ single blank row to exist.

@@ -26339,12 +26341,10 @@ return (
quote-character.
- The default row delimiter is a single newline character U+000A. If the content
- is read using the unparsed-text function, alternative line endings
- such as CR and CRLF will have been normalized to a single
- newline. In other cases, this normalization can be achieved by setting the
- option normalize-newlines. This option does not affect CR or CRLF
- sequences occurring within quoted fields.
+ The default row delimiter is a single newline character U+000A.
+ Alternative line endings
+ such as CR and CRLF will already have been normalized to a single
+ newline.

All fields are returned as xs:string values.

Quoted fields in the input are returned without the quotes.

The first row is not treated specially.

@@ -26405,8 +26405,7 @@ return (
return csv-to-arrays(
  `name,city{ $CRLF }` ||
  `Bob,Berlin{ $CRLF }` ||
-  `Alice,Aachen{ $CRLF }`,
-  { "normalize-newlines": true() }
+  `Alice,Aachen{ $CRLF }`
)
[ "name", "city" ],
[ "Bob", "Berlin" ],
@@ -26618,7 +26617,7 @@ return document {

With defaults for delimiters and quotes, recognizing headers:

csv-to-xml($csv-string,
-  { "header": true(), "normalize-newlines": true() })
+  { "header": true() })
@@ -26651,8 +26650,7 @@ return document {
csv-to-xml(
  $csv-uneven-cols,
  { "header": true(),
-    "select-columns": (2, 1, 4),
-    "normalize-newlines": true()
+    "select-columns": (2, 1, 4)
  }
)
csv-to-xml(
  $csv-uneven-cols,
-  { "header": true(), "normalize-newlines": true() }
+  { "header": true() }
)
@@ -26737,8 +26735,7 @@ return document {
csv-to-xml(
  $csv-uneven-cols,
  { "header": true(),
-    "trim-rows": true(),
-    "normalize-newlines": true()
+    "trim-rows": true()
  }
)
csv-to-xml(
  $csv-uneven-cols,
  { "header": true(),
-    "select-columns": 1 to 6,
-    "normalize-newlines": true()
+    "select-columns": 1 to 6
  }
)

-->

This specification uses the term row where RFC 4180 uses record.

- Row delimiters other than CRLF are recognized.
+ Line endings are normalized: specifically, the character sequences
+ U+000D, or U+000D followed by U+000A, are converted
+ to a single U+000A character. This applies whether or not the line ending
+ appears within a quoted string, and whether or not U+000A is the chosen
+ row delimiter.
+ Row delimiters other than newline are recognized.

Field delimiters other than comma (",") are recognized.

Quote characters other than the double quotation mark ('"') are recognized.
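The line-ending normalization described in the added text can be illustrated with a minimal Python sketch. This is only an illustration of the rule as worded in the patch, not the specification's or any processor's implementation; the function name `normalize_line_endings` is hypothetical.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Convert CRLF (U+000D, U+000A) and lone CR (U+000D) to LF (U+000A).

    Hypothetical helper: applied to the whole input before any CSV parsing,
    so it also affects line endings inside quoted fields.
    """
    # The optional "\n" makes a CRLF pair collapse to one LF rather than two.
    return re.sub(r"\r\n?", "\n", text)
```

Because the substitution runs over the raw string before rows or quotes are recognized, it behaves the same whether or not U+000A is later chosen as the row delimiter.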

@@ -6963,18 +6968,18 @@ correctly in all browsers, depending on the system configuration.

-->

Rows in CSV files are typically delimited with CRLF (U+000D, U+000A), LF (U+000A), or CR (U+000D) line endings,
- although RFC 4180 specifies CRLF. By contrast, the fn:unparsed-text
- function normalizes these line endings to LF (U+000A).
- The CSV parsing functions therefore use LF by default. An option is available
- to normalize line endings so that CR and CRLF are converted to U+000A (except
- when they appear in quote fields). This option is off by default, because
- line ending normalization will usually have been carried out earlier: for
- example, the fn:unparsed-text function does it automatically.
+ although RFC 4180 specifies CRLF. The CSV parsing functions
+ normalize these line endings to LF (U+000A).
+ They therefore use LF as the default row delimiter.

- The last row in the file may or may not be followed by a row delimiter.
+ The last row in the file may or may not be followed by a row delimiter.
+ An empty file is treated as containing zero rows, while a file consisting solely
+ of a row delimiter is treated as containing one empty row. In all other cases,
+ a file that does not end with a row delimiter is treated as if a row delimiter were
+ added at the end.
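The rule above for a trailing row delimiter can be sketched in Python as follows. This is a deliberately simplified illustration: the function name `split_rows` is hypothetical, and the sketch ignores quoting entirely, so a row delimiter inside a quoted field would be split incorrectly.

```python
def split_rows(text: str, row_delimiter: str = "\n") -> list[str]:
    """Split already-normalized CSV text into raw rows.

    Hypothetical helper; quoting is NOT handled (an assumption made to keep
    the sketch short), only the end-of-input rule is illustrated.
    """
    # An empty input contains zero rows.
    if text == "":
        return []
    # An input that does not end with a row delimiter is treated as if one
    # were added at the end.
    if not text.endswith(row_delimiter):
        text += row_delimiter
    # Every row is now terminated, so the final empty piece produced by
    # split() is an artifact rather than a row.
    return text.split(row_delimiter)[:-1]
```

Under these assumptions, an input with and without a final row delimiter yields the same rows, while an input consisting solely of a row delimiter yields one empty row.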

Fields in CSV are frequently delimited with a comma. Other field delimiters are useful, for @@ -6982,7 +6987,7 @@ correctly in all browsers, depending on the system configuration.

--> chosen field delimiter is then often U+003B or U+0009.

- The column delimiter defaults to U+002C.
+ The column delimiter thus defaults to U+002C.
The value may be any single Unicode character. An error is raised if the column-delimiter option is set to a multi-character string.

@@ -6991,7 +6996,8 @@ correctly in all browsers, depending on the system configuration.

--> Field quoting -

CSVs, as specified in , require that fields be wrapped with a quote character if they
+ CSVs, as specified in , require that fields be wrapped
+ with a quote character if they
contain either the row or column delimiter. For example:

"A single field, containing a comma","another field containing CRLF