[SPARK-11753][SQL][test-hadoop2.2] Make allowNonNumericNumbers option work #9759
Conversation
Can you also add nan, infinity, -infinity, inf, -inf to the test case? And also turn it on by default.
Also update the documentation in `DataFrameReader` and readwrite.py to include this option.
ok.
Test build #46056 has finished for PR 9759 at commit
retest this please.
Test build #46067 has finished for PR 9759 at commit
Test build #46093 has finished for PR 9759 at commit
val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.option("allowNonNumericNumbers", "true").json(rdd)
test("allowNonNumericNumbers on") {
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
can we still read them if they are quoted?
No. If we don't set `JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS` to false, we can't read them normally.
Test build #46407 has finished for PR 9759 at commit
@@ -100,34 +101,27 @@ object JacksonParser {
parser.getFloatValue
case (VALUE_STRING, FloatType) =>
// Special case handling for NaN and Infinity.
why are we removing the special handling for float types here?
Yeah, we should revert it. BTW, did we actually test "inf" and "-inf" before? Because `"inf".toFloat` is not legal.
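The point about `toFloat` can be checked directly. A minimal sketch in plain Scala (no Spark involved), showing which spellings `java.lang.Float.parseFloat` actually accepts:

```scala
// "NaN", "Infinity" and "-Infinity" are legal inputs to toFloat
// (java.lang.Float.parseFloat); "inf" and "-inf" are not.
def parsesAsFloat(s: String): Boolean =
  try { s.toFloat; true } catch { case _: NumberFormatException => false }

assert(parsesAsFloat("NaN"))
assert(parsesAsFloat("Infinity") && parsesAsFloat("-Infinity"))
assert(!parsesAsFloat("inf") && !parsesAsFloat("-inf"))
```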
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
"""{"age": -Infinity}""", """{"age": "NaN"}""", """{"age": "Infinity"}""",
"""{"age": "-Infinity"}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity,
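The quoted test pairs each JSON input with a `Double` predicate at the same position. A minimal self-contained sketch of that pairing pattern (plain Scala, no Spark):

```scala
// Each parsed Double is validated by the predicate at the same index,
// mirroring how testCases and tests line up in the quoted suite.
val parsed: Seq[Double] = Seq(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)
val checks: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity)
parsed.zip(checks).foreach { case (v, ok) => assert(ok(v)) }
```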
Besides, I found that "Inf" and "-Inf" seem not to work even when `JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS` is enabled.
We need to upgrade the jackson library version in order to support "INF" and "-INF" (case-sensitive).
Test build #46812 has finished for PR 9759 at commit
Test build #58469 has finished for PR 9759 at commit
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@rxin I have revisited this recently. This should be useful. Please check whether it looks good to you now. Thanks.
case VALUE_STRING => StringType
case VALUE_STRING =>
// If there is only one row, the following non-numeric numbers will be incorrectly
// recognized as StringType.
E.g., the two tests in `JsonParsingOptionsSuite`.
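A hedged sketch of the inference tweak being discussed (the names below are illustrative, not Spark's actual API): a string token that spells one of the accepted non-numeric numbers should be inferred as a double rather than a string.

```scala
// Illustrative-only types standing in for Spark's inferred schema types.
sealed trait Inferred
case object InferredString extends Inferred
case object InferredDouble extends Inferred

// A string token spelling a non-numeric number infers as a double,
// so a one-row input like {"age": "NaN"} is not stuck with StringType.
def inferStringToken(value: String): Inferred = value match {
  case "NaN" | "Infinity" | "-Infinity" | "INF" | "-INF" => InferredDouble
  case _                                                 => InferredString
}

assert(inferStringToken("NaN") == InferredDouble)
assert(inferStringToken("twenty") == InferredString)
```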
Test build #58626 has finished for PR 9759 at commit
Test build #58627 has finished for PR 9759 at commit
lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
if (value.equals("NaN") ||
`"infinity".toDouble` and `"inf".toDouble` are not legal. These non-numeric numbers are case-sensitive, both for Jackson and Scala.
I think we should also allow `INF`, `-INF` here, to be consistent with the legal inputs of `allowNonNumericNumbers`.
ok. added.
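What the thread converges on can be sketched as a single case-sensitive matcher (this is a plain-Scala sketch, not the actual Spark code): the `Infinity` spellings plus `INF`/`-INF`, falling back to ordinary parsing otherwise.

```scala
// Case-sensitive special-case branch for non-numeric numbers, accepting
// INF/-INF alongside Jackson's +INF; anything else goes through toDouble.
def parseSpecialDouble(value: String): Double = value match {
  case "NaN"                       => Double.NaN
  case "Infinity" | "INF" | "+INF" => Double.PositiveInfinity
  case "-Infinity" | "-INF"        => Double.NegativeInfinity
  case other                       => other.toDouble
}

assert(parseSpecialDouble("NaN").isNaN)
assert(parseSpecialDouble("INF").isPosInfinity)
assert(parseSpecialDouble("-INF").isNegInfinity)
assert(parseSpecialDouble("1.5") == 1.5)
```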
Test build #58628 has finished for PR 9759 at commit
ping @rxin also cc @cloud-fan @yhuai
value.toFloat
} else if (value.equals("+INF")) {
Should we support `INF` (without `+`)? And it's weird that we don't need `+` for `Infinity`, but need it for `INF`.
Although jackson supports `Infinity` and `+Infinity`, it only supports `+INF`. I am neutral about this, but being consistent with it seems better?
I think from an end user's point of view, we should be more consistent with ourselves, not some random rule by jackson.
ok. fixed.
Test build #59119 has finished for PR 9759 at commit
The rest LGTM
Is this ready to merge now?
Thanks, merging to master and 2.0!
… work

## What changes were proposed in this pull request?

Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.

## How was this patch tested?

`JsonParsingOptionsSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9759 from viirya/fix-json-nonnumric.

(cherry picked from commit c24b6b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@rxin, could we revert this one? Just hit a regression issue in jackson-module-scala 2.7.{1, 2, 3} about Option json serialization (FasterXML/jackson-module-scala#240) in my PR (#13335). We can remerge this one once jackson-module-scala fixes the issue.
Darn. Is there any workaround? Downgrade to 2.6.5?
2.6.5 doesn't have the … Edited: That's why I need to revert the whole patch.
What's the impact of the problem right now, then? Because it sounds like downgrading would simply cause another problem. It sounds like you have a workaround there?
The problem is that if a class has an Option parameter, jackson-module-scala may not be able to generate the correct json automatically (unless you define a custom serializer yourself). This also impacts users' code. Reverting this PR just drops an unreleased feature, which I think is not a big deal.
Yea I'm OK with reverting this, if the other thing is difficult to work around.
Sent #13417 to revert it.
@srowen @rxin @zsxwing The Option json serialization issue (FasterXML/jackson-module-scala#240) looks like it is fixed now. Do you think it is ok if I try to upgrade Jackson now?
@viirya to what version? If it's a maintenance release, that's usually fine.
@srowen Unfortunately, this fix is only included since 2.8.1. Even the latest maintenance release, 2.7.8, doesn't contain it.
Hi all. There are security issues in jackson-dataformat-xml prior to 2.7.4 and 2.8.0. Here are the links: FasterXML/jackson-dataformat-xml#199, FasterXML/jackson-dataformat-xml#190. Even though Spark itself doesn't use this module, this dependency forces Spark users onto an affected version in order to have a consistent set of jackson libraries.
…s in JSON

## What changes were proposed in this pull request?

This PR is based on apache#16199 and extracts the valid change from apache#9759 to resolve SPARK-18772. It avoids an additional conversion attempt with `toFloat` and `toDouble`. To see the avoided conversions, compare the outputs below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes apache#16199

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes apache#17956 from HyukjinKwon/SPARK-18772.
(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.

How was this patch tested?

`JsonParsingOptionsSuite`.