[BUG][Spark] Unable to operate on Delta table after applying large delete expression (>20MB) #2565
Which Delta project/connector is this regarding?
Spark
Describe the problem
Spark 3.5 depends on Jackson 2.15, which introduces limits on processed JSON, namely a 20 MB (default) limit on the length of JSON strings.
When using Delta with Spark 3.5 (Jackson 2.15), it is possible to generate delta log entries containing larger strings, and it is then not possible to read those log entries back with the same version of Spark/Jackson.
In our case, we hit this when applying a horribly large deletion expression.
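To make the failure mode concrete, here is a minimal stdlib-Python sketch (not Spark) showing how an `INSET` predicate over a couple of million ids serializes into a single JSON string well past the 20 MB default limit described above. The id format, the entry shape, and the counts are illustrative, not the exact Delta log schema:

```python
import json

# Build a predicate resembling the one in this report: a large INSET list.
# 1.5M synthetic ids of 15 chars each lands well past 20 MB.
ids = ", ".join(f"id_{n:012d}" for n in range(1_500_000))
predicate = f"(tenant#75 = some_tenant) AND some_id#83 INSET {ids}"

# A delta log entry embeds the whole predicate as one JSON string value
# (illustrative shape only, not the real Delta schema).
log_entry = json.dumps({"operationParameters": {"predicate": predicate}})

LIMIT = 20 * 1024 * 1024  # the 20 MB default string-length limit
print(len(predicate) > LIMIT)
```

Writing this entry succeeds, since the limit applies on the *read* path; it is only when the log entry is parsed back that the oversized string is rejected.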
Steps to reproduce
Apply a delete with a very large (>20 MB) predicate, e.g.:

```
(tenant#75 = some_tenant) AND some_id#83 INSET lots, of, comma, separated, ids, ...
```

This produces a delta log entry whose `predicate` field holds the whole expression as a single JSON string.
Observed results
Subsequent operations on the table fail: the log entry cannot be read back, because Jackson rejects the oversized `predicate` string.
Expected results
The Delta table loads and the job proceeds as normal.
Further details
Jackson 2.15 introduced the `StreamReadConstraints` class as a way to change the default limits (see "Add StreamReadConstraints limit for longest textual value to allow (default: 5M)", FasterXML/jackson-core#863 (comment)). Spark 3.5 exposes this for JSON reads via `spark.read.option("maxStringLength", LARGE_VALUE).json(fileName)`, but taking advantage of it requires explicitly integrating with those versions of Spark/Jackson. I'm not aware of a global configuration that would allow overriding the default limit.

It's already an action on our side to avoid the use of such long filter expressions 😓. That said, upgrading to the latest Spark/Jackson will bring this lingering issue that can affect others, and it may leave limited options for recovery, as the upgrade may be far removed in time from the surfacing of the issue.
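The real mechanism is Jackson's Java `StreamReadConstraints` API; since Python's `json` module imposes no such limit, the following is a language-neutral simulation of its maxStringLength behavior (class and function names here are hypothetical, and the default mirrors the 20 MB figure from this report). It shows why an entry that was writable fails on read-back under the default cap, and parses fine once the cap is raised per read:

```python
import json

class StringLengthExceeded(ValueError):
    """Raised when a JSON string value exceeds the configured limit."""

DEFAULT_MAX_STRING_LENGTH = 20 * 1024 * 1024  # mirrors the 20 MB default

def parse_with_limit(text, max_string_length=DEFAULT_MAX_STRING_LENGTH):
    """Parse JSON, rejecting over-long string values -- a stand-in for
    Jackson's StreamReadConstraints.maxStringLength (Python's json module
    itself enforces no such limit)."""
    doc = json.loads(text)

    def check(node):
        if isinstance(node, str) and len(node) > max_string_length:
            raise StringLengthExceeded(
                f"string of {len(node)} chars exceeds limit {max_string_length}")
        elif isinstance(node, dict):
            for value in node.values():
                check(value)
        elif isinstance(node, list):
            for value in node:
                check(value)

    check(doc)
    return doc

# A ~30 MB predicate, as in this report: writable, but unreadable by default.
entry = json.dumps({"predicate": "x" * (30 * 1024 * 1024)})
try:
    parse_with_limit(entry)                   # default 20 MB cap: rejected
except StringLengthExceeded:
    pass
parse_with_limit(entry, max_string_length=64 * 1024 * 1024)  # raised cap: ok
```

The per-read override is analogous to passing `maxStringLength` on a specific `spark.read` call; the pain point reported here is exactly that there is no global equivalent of the raised-cap parameter.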
In our case, we upgraded this particular job over a month ago and were only now hit with the issue. Looking back, our horrible filter expressions managed to stay just under the 20 MB limit until the latest one blew up to 30 MB.
We were able to fix our job by reverting to Spark 3.4 / Jackson 2.14 / Delta 2.4. We're looking into what can be done in Spark/Delta to avoid the issue or make the string length configurable when operating on Delta tables with such large predicates in transaction entries.
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?