Regex compilation cache #3374
Replies: 21 comments 4 replies
-
How would this be implemented? |
Beta Was this translation helpful? Give feedback.
-
High level my idea was that On node this seems to be happening already, but if we want to do it there anyway to protect against potential implementation behaviour in JS runtimes we could very easily use a More relevant though would be Erlang, and this is where I was hoping you might have some input. I'm far less familiar with Erlang having only started looking into it at all as a necessity for dealing with some edge cases in Gleam so I've really not done any Erlang code at all yet. Looking into this issue I found out first hand why it seems you can't do the Given all the above, and some cursory searching through the Erlang documentation I think it should be possible to implement using an ETS table as the cache mechanism, again keyed by the regex string. This would get us per-process caching at least, which while not universally applicable should at least help. I haven't tried implementing a proof of concept yet though, I was reaching out in part to get some input on feasibility of options from someone who knew Erlang a bit more than "copied and pasted some gleam-generated erlang code and called it FFI". |
Beta Was this translation helpful? Give feedback.
-
Never mind me, I just looked into ETS more deeply and I could have sworn that you could fetch a table by name but I must have been imagining things because I can't see anything like that which would mean you'd need to keep a reference to the table around somewhere which brings us back to the same root problem. |
Beta Was this translation helpful? Give feedback.
-
Yes you can, you give it a name and pass in the named |
Beta Was this translation helpful? Give feedback.
-
I wonder if persistent_term would make more sense here than ETS if we expect compiled regexes to remain unchanged for the lifetime of the program. That's what @lpil pointed me to when I asked about something similar a while ago (#3182). In terms of the API, I think that having a separate method for cached regexes makes more sense here instead of caching all of them by default. There are times where it might not make sense to cache some regexes for the lifetime of the program. |
Beta Was this translation helpful? Give feedback.
-
@apainintheneck thanks for the heads up, that does look like an interesting option. An opt-in version using that might be a good fallback, but from an API design perspective I think a transparent cache that just does the right thing is still the ideal design. I have some ideas to address Louis's feedback on my PR that I want to try out first. To their first point, your proposal can still lead to a memory leak if people use dynamically generated regex strings with the opt-in cache. We can document that you shouldn't do that, but an API design which you can't use wrong is superior and one that will automatically improve a bunch of already written code is even better. |
Beta Was this translation helpful? Give feedback.
-
Given this is the standard library banning dynamically created regex is not an option. |
Beta Was this translation helpful? Give feedback.
-
Yeah, definitely. The two options being discussed were the transparent cache style like what I put into the PR, or an optional second way to compile a regex (eg. |
Beta Was this translation helpful? Give feedback.
-
Seeing as we don't have a good regex specific solution I think this may be more suited to being in another library; especially since the current approaches involve adding the concept of mutation to the standard library, something which Gleam deliberately avoids. |
Beta Was this translation helpful? Give feedback.
-
This is part of why I thought the stdlib might be a good place for this. If we can provide an immutable-looking API for an implementation that requires mutability internally (for the cache) then it prevents pushing that concern down to user code and we can keep pushing Gleam as an immutable language more easily. The other reason I was considering the stdlib was discoverability. Specifically, if it's in the stdlib then the JS and Erlang targets become more consistent in that neither need app-level caching of regexes unless they are very performance sensitive. If it's a library, then users need to be aware it's something that they should be worried about, and then go looking for a solution. That said, I'm happy to move this off to a library if you think that's best. |
Beta Was this translation helpful? Give feedback.
-
If you have an idea for an immutable looking API maybe that would make it more appropriate. What might that look like? |
Beta Was this translation helpful? Give feedback.
-
The current PR doesn't require any changes to the API interface at all. This solution might be a bit heavy for the stdlib, but the thoughts I had for addressing your concerns on that PR were basically:
Just handle the exception, if it's already been created on another process, then proceed as normal.
To prevent the ETS from disappearing, start a new process to own the ETS. This would have very little in the handler, but not nothing because...
If we have our own process managing the ETS table, then we can send a message there whenever something is fetched or inserted. Importantly, it wouldn't use the message to do the fetch or insert for performance reasons, but it would send a notification to the process. The process can use these notifications to update a last used timestamp for each entry on fetch, and trim the oldest entries on insert to keep the table to a fixed size. I definitely understand if this proposal is a bit heavy for the stdlib though. |
Beta Was this translation helpful? Give feedback.
-
That would result in the cache process being unsupervised and unlinked, and it would have to do work every time a regex is used by any other process in the system. Would it be able to keep up in regex heavy systems? The programmer would need to be able to opt-out of the cache on a per-regex basis, would need to be able to fine-tune the expiration to control memory usage of the cache, and need to be able to flush the cache. Given these and the fact the API is clearly built around implicit global mutable state, something we have deliberately avoided in Gleam, I don't find this design very compelling. |
Beta Was this translation helpful? Give feedback.
-
I don't have the experience with Erlang to answer that well, but I suspect the answer is "yes, unless you are using dynamic regex strings extensively" since the major work would only happen on insertion of a new item into the cache. The fetch update could be handled inline in the function rather than being posted over to the cache process - it would really only be one extra call to Out of curiosity, why would you ever want or need to flush the cache entirely? Keeping memory usage under control I understand, but regex are deterministic. Flushing the cache would only ever result in recomputing the exact same thing you've already computed. Ditto for expiration, I'd expect people would want to perhaps control a maximum amount of memory to use, but not care enough to expire individual regexes more or less quickly. Given the deterministic nature of regex compilation I was planning on treating this as a most-recently-used cache, ie. keep the Per regex opt-out would require a new option in the regex options at least, I'm not sure if you'd consider exposing the fact there is a cache to opted out of to be breaking the immutable style or not. It's borderline for me personally.
Unsupervised doesn't feel like a big issue here, we could restart that process with a fresh blank cache if it's not existing when we try to use it? I feel like that's a surmountable issue since the first thing any compilation would need to do is check the ETS table exists, and by proxy if the process exists. I don't have the Erlang experience to know how bad an issue an unlinked process would be though. Honestly, I don't love the design either - this is getting further and further from the simple option I was hoping we could use. I've been writing too much JavaScript for too long and forgotten a lot of the complexities that actually parallel code needs to deal with, and the lack of module-level values in Erlang doesn't make things any easier. I'm definitely very open to other ideas or approaches. But I can't shake the feeling that some kind of solution is needed, even if it a library. I expected regex compilation to be a performance hit, but I didn't expect 10 lines of code to be ~50% of the runtime of an algorithm that is a couple thousands of lines long and that barely calls those 10 lines of code. Adding to this the difficulty of actually deduplicating the regex compilation calls inside Gleam code, for all the same reasons it's hard in Erlang, this adds up to a difficult problem to solve. The solution I linked in my original message is contained entirely inside the library and prevents recompiling the regexes for each commonmark document that it parses, but in order to completely remove the overhead I'd have to push that new let parser = commonmark.new_parser()
commonmark.parse(parser, document_string) And then ask them to hold onto that |
Beta Was this translation helpful? Give feedback.
-
@mscharley If this is out of the scope of the standard library, another option would be to use a library like glemo to memoize regexes globally throughout the program and that would work cross-platform too. I also initially considered creating a persistent regex library when I first ran into this but as I wasn't working on anything that used a bunch of regexes so I lost interest. It doesn't surprise me that you've run into this problem with the markdown parser you're working on though. |
Beta Was this translation helpful? Give feedback.
-
@apainintheneck Unfortunately, it looks like glemo only has a JS implementation for now, which is a shame. If you haven't seen it already, gleam-lang/stdlib#656 does have a basic implementation of an ETS cache for regexes, but Louis rightly pointed out a few different shortcomings with that option in the comments on the PR which is what I was trying to address in my last wall of text. This does very much feel like it's getting above the scope of the stdlib, and i know Louis has expressed some reservations about having regex in the stdlib at all in the past. It could well be this should all be it's own separate dedicated thing. |
Beta Was this translation helpful? Give feedback.
-
It uses ETS tables on Erlang by way of the carpenter library. I'm not sure if the implementation runs into the same problems discussed in your PR though. |
Beta Was this translation helpful? Give feedback.
-
Creation of a timestamp is not necessarily that cheap, which is why BEAM web servers use a cache for that rather than computing it per request. Here a single process would be doing this work and competing with many thousands of other processes, while also performing the GC scanning work. I don't feel as confident that the single process won't have performance problems. Flushing and configuring the cache are requirements in order to give the programmer full control of memory usage. The standard library cannot make assumptions about what the resource constraints of Gleam programs might be. |
Beta Was this translation helpful? Give feedback.
-
Yep, Glemo is basically a slightly more generalised version of what I wrote. Re @lpil 's feedback, that's all fair enough I guess. If timestamps aren't cheap then any other increasing value over time, even a counter would suffice - all that really matters for the algorithm is relative ordering, not specific differences. Anyway, I think the takeaway from all this is that this is a very hard problem to solve in a generic way and a relatively simple problem to solve in the specific case. If you know which regexes you need to cache then you can throw them all in a single The Erlang re module includes the following warning:
But the Gleam stdlib documentation doesn't include anything. Would you be fine with either reproducing this warning or something akin to it Louis? If nothing else, I think that documenting regex compilation as being slow will help. Of particular note is the second sentence. The Gleam stdlib forces you to compile your regex, but the Gleam language doesn't encourage you to reuse it because storing it isn't simple so I suspect that a lot of people will just use |
Beta Was this translation helpful? Give feedback.
-
Adding more documentation sounds good! It's worth nothing that regex compilation is not slow, it's that it's needless work. Few languages have compile time regular expressions so most people know you should not use them in a loop, so it slipped my mind to document this. It would also be good to look at other designs for this that do not have the problems of this cache dictionary design. |
Beta Was this translation helpful? Give feedback.
-
I played around with the idea of creating a library for caching regexes globally and my findings matched what @mscharley already posted above. Basically, there seems to be no clear benefit to caching regexes in JavaScript when using Node.js 22.4.0. It seems like the V8 JavaScript runtime already does that internally. Erlang 27.0 ended up being almost an order of magnitude faster when caching regexes using persistent term internally though which lines up with what we've seen here. I'm not sure having a library which writes to persistent term is a good API though because it could behave unexpectedly if users try to save large numbers of regexes. It is also heavily dependent on assumptions about the current target implementations which might not hold true over time. It was fun to look at it though out of an abundance of curiosity. |
Beta Was this translation helpful? Give feedback.
-
Hi, looks like discussions aren't enabled for this repo so I'm opening an issue for this.
I recently added library-side compilation caching for regexes and wanted to report back the improvements because they were stark.
On JavaScript (node specifically) it seems like the runtime does this already, there was no improvement.
On Erlang however, there is a ~45% improvement in the overall benchmark speed for the whole parsing algorithm (2.44ms -> ~1.31ms on my hardware) despite regex compilation being a tiny proportion of the overall code and not even on a particularly hot path (only used for parsing blocks, not inline elements).
Given that it is impossible to fully solve on the application side using only Gleam since Gleam doesn't support creating
const
s with aRegex
, it feels like there would be a lot of benefit to adding compilation caching at the stdlib level, perhaps viaets
? I'm happy to take a look at adding this, I just wanted to confirm that this is a good idea or if there's perhaps a better way to tackle the problem.Beta Was this translation helpful? Give feedback.
All reactions