Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent inefficient internal conversion from BigDecimal to BigInteger wrt ultra-large scale #968

Closed
cowtowncoder opened this issue Apr 1, 2023 · 12 comments
Labels
performance Issue related to performance problems or enhancements
Milestone

Comments

@cowtowncoder
Copy link
Member

(note: somewhat related to #967)

Although we have reasonable protections against direct parsing/decoding of both BigDecimal (as of 2.15 release candidates), regarding "too long" numbers (by textual representation), it appears there may be one performance problem that only occurs if:

  1. Incoming number is large JSON floating-point number, using scientific notation (i.e. not long textually); decoded internally as BigDecimal (or double, depending)
  2. Due to target type being BigInteger, there is coercion (BigDecimal.toBigInteger())

but if so, performance can deteriorate significantly.
If this turns out to be true, we may need to limit magnitude (scale) of floating-point numbers that are legal to convert; this could be configurable limit (either new value in StreamReadConstraints, or derivative of max number length?) or, possibly just hard-coded value.

@cowtowncoder
Copy link
Member Author

/cc @pjfanning 2/2 of issues discussed separately.

@cowtowncoder cowtowncoder added performance Issue related to performance problems or enhancements 2.15 labels Apr 1, 2023
@pjfanning
Copy link
Member

pjfanning commented Apr 1, 2023

It might be useful to clone the BigDecimalParser code but have the methods use BigInteger instead. This would avoid creating a BigDecimal and converting that to a BigInteger. Small duplication of code but it should be more performant.

@cowtowncoder
Copy link
Member Author

@pjfanning Not sure it'd work since input uses engineering notation... is that legal for BigInteger?

@pjfanning
Copy link
Member

pjfanning commented Apr 4, 2023

The FastDoubleParser lib won't accept '1e20000000'. Do we need to support this value for BigInteger or do we need to ask the maintainer of FastDoubleParser lib to support this as a valid BigInteger?

new BigInteger("1e20000000") also fails.

Are we better off to modify jackson-core to fail if an Integer has 'e' notation?

@cowtowncoder
Copy link
Member Author

@pjfanning It's little bit different than that: if e notation is used, we will always get JsonToken.VALUE_NUMBER_FLOAT, not VALUE_NUMBER_INT. So we do not really (try to) parse BigInteger from E notation ever; it will go via BigDecimal. And I don't think we want to try to change this because it then gets into complications of variations (whether engineering value is integral or not).

But it seems to me that since the problem is conversion of BigDecimal into BigInteger we could impose limit on maximum scale -- from little I tried, it seemed that that's the key.
Whether to make maximum scale magnitude (I am assuming both 20000000 and -20000000 are problematic although haven't tested) configurable or just hard-coded is a question.

@cowtowncoder
Copy link
Member Author

One interesting note: I can only reproduce this with 2.15 -- 2.14 and 3.0 fail with different error; probably value overflow (I think given value exceeds Double range). That's not a real solution of course but fwtw specific performance problem is N/A for pre-2.15 I think.

@pjfanning
Copy link
Member

Would it make sense to add a StreamReadConstraints setting for max absolute BigInt exponent? We can add a sensible limit but lets people, who know the risks and need to support big exponents, go ahead and change the config to suit themselves.

@cowtowncoder
Copy link
Member Author

cowtowncoder commented Apr 4, 2023

@pjfanning That sounds reasonable. One question I have is whether it should only add to this conversion (BigDec->BigInt) or BigDec in general. It seems like it's not necessarily dangerous for general BigDecimal.

And the other angle is that with scale of 1000 you get numeric string of ~1000 characters so in a way we could actually simply use existing value maxNumberLength() for conversion case: especially since we do not allow engineering notation for integer anyway?

@pjfanning
Copy link
Member

I would suggest just adding it for (big) ints.

cowtowncoder added a commit that referenced this issue Apr 4, 2023
cowtowncoder added a commit that referenced this issue Apr 4, 2023
@cowtowncoder
Copy link
Member Author

@pjfanning Makes sense. But I am still wondering if a new limit is even needed. Given that this is sort of an edge case (from floating-point number to integer), and since problematic scale magnitude is orders of magnitude bigger than maximum number length... that is,

1e999999

is 1 meg string when written out as "plain" BigInteger, and we by default only allow number strings of 1000 characters, we could consider one of:

  1. Use a limit that is some multiple of maximum-number-length (10x ?)
  2. Use a fixed but higher limit

since users can work around the issue of big scale by using BigDecimal target and handle post-processing on their own, if limit becomes onerous.

It is not that I couldn't add a new limit constant, but there is some maintenance overhead.

Also: I think that validation of scale limit, whatever it is, could be done via StreamReadConstraints, making it bit easier for us to add explicit override if needed.

I guess I can cobble up a PR to show what I am thinking, as PoC.

@pjfanning
Copy link
Member

@plokhotnyuk - feel free to ignore this but we ran into an edge case where deserialization to BigInteger can be very slow if the input has a large exponent (eg '1e20000000'). jackson-core parses numbers like this as BigDecimals and then uses the .toBigInteger method on BigDecimal because new BigInteger(str) can't parse numbers with e notation. It is the .toBigInteger method on BigDecimal that is very slow.

You have shown great diligence about these problematic inputs in the past. I'm just wondering if you have any thoughts on the best way to handle them.

@cowtowncoder cowtowncoder changed the title Investigate performance problem wrt internal conversion from BigDecimal to BigInteger wrt large exponents Prevent inefficient internal conversion from BigDecimal to BigInteger wrt ultra-large scale Apr 5, 2023
@cowtowncoder cowtowncoder added this to the 2.15.0-rc3 milestone Apr 5, 2023
@cowtowncoder
Copy link
Member Author

Quick note: there's a PR (#980) to block specific issue but it would be good to know if there are other approaches to avoid having to block this
(although TBH allowing this notation could to asymmetric processing on output -- we force non-E ("plain") notation on output so it could easily explode output size until/unless we start limiting output sizes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issue related to performance problems or enhancements
Projects
None yet
Development

No branches or pull requests

2 participants