Implement CreatePreservationTasks with CreateBulk #1048
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1048      +/-   ##
==========================================
- Coverage   54.68%   54.64%   -0.05%
==========================================
  Files         104      105       +1
  Lines        7631     7683      +52
==========================================
+ Hits         4173     4198      +25
- Misses       3201     3221      +20
- Partials      257      264       +7
```

View full report in Codecov by Sentry.
@djjuhasz yes, I've been looking at it and will provide feedback. I'm having a hard time wrapping my head around the iterator code flow.
The syntax is a bit confusing 😢... I found the docstring of the compiler's rangefunc package somewhat useful: https://github.com/golang/go/blob/master/src/cmd/compile/internal/rangefunc/rewrite.go. There are a couple of areas where I'm unsure if I'm doing things correctly:
I haven't yet found an implementation like this in other codebases; I should probably look around more. I assume other people are also trying to solve the problem of bulk inserts with iterators? A project that is slightly related is https://github.com/achille-roussel/sqlrange.
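For anyone else puzzling over the rangefunc syntax, here's a minimal, self-contained sketch (not from this PR) of the shape under discussion: a producer returns an iter.Seq, and returning false from yield is how the consumer's break propagates back to the producer:

```go
package main

import (
	"fmt"
	"iter"
)

// count returns an iterator over the integers [0, n).
func count(n int) iter.Seq[int] {
	return func(yield func(int) bool) {
		for i := 0; i < n; i++ {
			// yield returns false when the consumer breaks out of its
			// range loop, so the producer must stop too.
			if !yield(i) {
				return
			}
		}
	}
}

func main() {
	for v := range count(3) {
		fmt.Println(v) // 0, 1, 2
	}
}
```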
Thanks @sevein, this is pretty cool. I made a couple of comments, but neither is a blocker to merging.
I think passing an iterator to CreatePreservationTasks() instead of a slice is going to make it a bit harder to understand for those unfamiliar with iterators, but it's more flexible, so I think it's worth it. Adding a call to CreatePreservationTasks() in the code would be helpful as an example, but that can come in a later commit.
It would be cool to add some benchmark tests at some point to compare batch vs. single inserts, though I guess the performance will depend a lot on the DBMS used and deployment particulars.
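For illustration, a hypothetical call site might look like the sketch below; the service value svc, the createAndPublish and publishEvent names, and the error handling are assumptions, not code from this PR:

```go
// Hypothetical call site: adapt a slice to iter.Seq with
// slices.Values so the persistence layer can batch the inserts.
func createAndPublish(ctx context.Context, svc persistence.Service, tasks []*datatypes.PreservationTask) error {
	created, err := svc.CreatePreservationTasks(ctx, slices.Values(tasks))
	if err != nil {
		return err
	}
	for _, pt := range created {
		publishEvent(pt) // hypothetical: publish with the MySQL-generated ID.
	}
	return nil
}
```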
name: "Creates multiple preservation tasks exceeding batch size", | ||
tasks: func(entc *db.Client) []*datatypes.PreservationTask { | ||
pa1, _ := addDBFixtures(t, entc) | ||
pts := make([]*datatypes.PreservationTask, 0, 300) // Three batches. |
This seems like a good place to use a generator instead of creating a big slice of PreservationTasks.
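A generator for the fixture could look something like this sketch; the field names on datatypes.PreservationTask are assumptions for illustration:

```go
// taskSeq lazily produces n preservation tasks instead of
// materializing a 300-element slice up front. Field names are
// illustrative, not copied from the repo.
func taskSeq(n int) iter.Seq[*datatypes.PreservationTask] {
	return func(yield func(*datatypes.PreservationTask) bool) {
		for i := 0; i < n; i++ {
			pt := &datatypes.PreservationTask{
				Name: fmt.Sprintf("Task %d", i), // assumed field
			}
			if !yield(pt) {
				return
			}
		}
	}
}
```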
I need the full slice for assert.DeepEqual - what testing approach did you have in mind with an iterator?
Hmm, I didn't realize CreatePreservationTasks() returned the full slice of created preservation tasks. I guess my counter question is: what's the advantage of accepting an iterator when you return the full slice of results anyway? I'm trying to imagine a scenario where you would want to generate 3 tranches of data 100 rows at a time (for example), yield the rows to CreatePreservationTasks(), but then get back all three hundred results in one slice. 🤔
It also just seems weird to me to create a slice of 300 rows, convert that to an iterator (with slices.Values(tasks)), then get back a slice of 300 results. I get that this is just a test, so the data flow is necessarily simplified, but I just wonder if we're going to end up using this same flow in production. 🤔
> What's the advantage of accepting an iterator when you return the full slice of results anyway? I'm trying to imagine a scenario where you would want to generate 3 tranches of data 100 rows at a time (for example), yield the rows to CreatePreservationTasks(), but then get back all three hundred results in one slice.
Fair point. My use case was savePreservationTasks, where I thought that it would be preferable to eventually load tasks from the data source (a3m) in batches while doing bulk inserts, but I was probably overthinking it. I also realized that packageImpl.CreatePreservationTasks needs the slice back from MySQL so we can publish the corresponding events with the MySQL-generated identifiers.
I'm not sure if it's a good idea to accept an iterator as a generic form of an iterable, so that the user gets to decide whether to pass a slice wrapped by slices.Values or an iterator function. Honestly, I don't know what the performance implications are; I've been assuming that iterators are fast because the compiler inlines all the code, but that may not be true.
I can append new commits to this PR introducing the simpler form that accepts a slice, and doing all the work end to end (a3m activity » package service » persistence » ent) to see how it feels.
There are all sorts of packages emerging using rangefunc and generics, e.g. https://github.com/BooleanCat/go-functional introduces the idea of chainable methods. I need to keep exploring this space.
Yeah, iterators are quite interesting. Because I'm new to coroutines and generators, I keep referring back to https://research.swtch.com/coro, which I've found really helpful for conceptualizing the mechanics and use-cases of coroutines, but I'm still not very confident of my understanding.
To me the batch() function fits well with my concept of a coroutine: the batch function receives data from an iterator until it hits the batch threshold, then yields a tranche of records back to the caller, which does some work with that tranche before resuming the batch coroutine.
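In code, a batching adapter along those lines could be sketched like this (a hand-rolled version, not necessarily identical to the PR's batch function):

```go
// batch consumes an input iterator and yields tranches of up to n
// items; the final tranche may be smaller.
func batch[T any](seq iter.Seq[T], n int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		buf := make([]T, 0, n)
		for v := range seq {
			buf = append(buf, v)
			if len(buf) == n {
				if !yield(buf) {
					return // consumer stopped early
				}
				buf = make([]T, 0, n) // fresh buffer; the consumer may retain the old one
			}
		}
		if len(buf) > 0 {
			yield(buf) // final partial tranche
		}
	}
}
```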
I think the use case for CreatePreservationTasks() where the iterator parameter is sending batched rows from a MySQL query is an interesting one. I guess in that case the hypothetical ListPreservationTasks() would do a SELECT query with a limit on the rows returned, and then would return the results as an iterator. When the batching function in ListPreservationTasks() got to the last row in the tranche, it would return control back to its caller to SELECT the next tranche of rows. Meanwhile, CreatePreservationTasks() would be batching the rows from its input iterator and inserting them into the database when it hit the batch limit. I guess my naive expectation at that point is that CreatePreservationTasks() would yield back the results from that tranche of inserted rows to its caller, so the caller could do whatever it needs to do with those results (e.g. publish events); then the results could be expired before receiving the next tranche of results (and collected by the garbage collector whenever that ran). This scenario took a while for me to puzzle through because it's a pretty long chain of coroutines to switch between, with batching on both the SELECT and INSERT operations. 🤯
@sevein is this close to the flow you were envisioning, or were you thinking of something very different?
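A rough sketch of that chained flow, with every name hypothetical (batch as sketched above, bulkInsert standing in for the CreateBulk call):

```go
// createTasks consumes an input iterator in tranches of 100,
// bulk-inserts each tranche, and yields the inserted rows back so
// the caller can publish events before the tranche becomes garbage.
func createTasks(ctx context.Context, in iter.Seq[*datatypes.PreservationTask]) iter.Seq2[[]*datatypes.PreservationTask, error] {
	return func(yield func([]*datatypes.PreservationTask, error) bool) {
		for tranche := range batch(in, 100) {
			created, err := bulkInsert(ctx, tranche) // hypothetical helper
			if !yield(created, err) || err != nil {
				return
			}
		}
	}
}
```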
What you're saying resonates with me as well, but I'm avoiding adopting the mental model where we think of iterators as coroutines because that's not how they ended up implementing it - Russ' article explores the broader coroutine space, which differs from go1.23's implementation. Unlike Python (run this example if interested), in Go iterators we have a single stack and there is no pausing/resuming; the compiler is inlining the code (these examples are clarifying to me).
> I guess in that case the hypothetical ListPreservationTasks() would do a SELECT query with a limit on the rows returned, and then would return the results as an iterator. When the batching function in ListPreservationTasks() got to the last row in the tranche, it would return control back to its caller to SELECT the next tranche of rows.
Depending on the underlying database driver, I suspect you could leverage server-side cursors so the database can stream the result set in chunks efficiently. See sqlrange; its Query function is a good example. That probably means you wouldn't need to handle the LIMIT clause manually. But they're not supported in MySQL, as far as I know.
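As a sketch of that streaming shape using plain database/sql (roughly what sqlrange's Query does; the taskRow type and the table/column names here are assumptions):

```go
type taskRow struct {
	ID   int64
	Name string
}

// queryTasks streams SELECT results as an iterator. Whether the
// rows arrive from the server in chunks is up to the driver; the
// MySQL driver buffers the full result set client-side.
func queryTasks(ctx context.Context, db *sql.DB) iter.Seq2[taskRow, error] {
	return func(yield func(taskRow, error) bool) {
		rows, err := db.QueryContext(ctx, "SELECT id, name FROM preservation_task") // assumed schema
		if err != nil {
			yield(taskRow{}, err)
			return
		}
		defer rows.Close()
		for rows.Next() {
			var r taskRow
			if err := rows.Scan(&r.ID, &r.Name); err != nil {
				yield(taskRow{}, err)
				return
			}
			if !yield(r, nil) {
				return
			}
		}
		if err := rows.Err(); err != nil {
			yield(taskRow{}, err)
		}
	}
}
```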
> I guess my naive expectation at that point is that CreatePreservationTasks() would yield back the results from that tranche of inserted rows to its caller, so the caller could do whatever it needs to do with those results (e.g. publish events); then the results could be expired before receiving the next tranche of results (and collected by the garbage collector whenever that ran).
Yeah, that makes sense. I chose not to return an iterator, concerned that I wouldn't be able to manage the complexity or that the API would become too cryptic. But it looks like that approach would result in fewer allocations and less pressure on the garbage collector? I'll start with the simpler version where we just work with slices, then we can explore other options further.
> What you're saying resonates with me as well, but I'm avoiding adopting the mental model where we think of iterators as coroutines because that's not how they ended up implementing it - Russ' article explores the broader coroutine space, which differs from go1.23's implementation. Unlike Python (run this example if interested), in Go iterators we have a single stack and there is no pausing/resuming; the compiler is inlining the code (these examples are clarifying to me).
Man, https://go.dev/wiki/RangefuncExperiment#can-range-over-a-function-perform-as-well-as-hand-written-loops is interesting, but am I the only one that has a really hard time parsing functions used as function params or return values? 🤯 I had to read those examples at least three times to figure out what was going on. 😆
I'm not really clear on the practical differences between Go's inlined iterator functions and Python's coroutines, but I did try to reimplement your Python coroutine in Go. I did get the Go coroutine output to match the Python output, but it was much harder in Go, and I had to use iter.Pull to get access to a next() function. The exercise highlighted just how shaky my understanding is of how to write and use iterators in Go. 😬
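For reference, a minimal illustration of the iter.Pull shape (standard library API; the data is made up):

```go
package main

import (
	"fmt"
	"iter"
	"slices"
)

func main() {
	// iter.Pull converts a push iterator into a pull-style
	// next()/stop() pair.
	next, stop := iter.Pull(slices.Values([]string{"a", "b", "c"}))
	defer stop()
	for {
		v, ok := next()
		if !ok {
			break
		}
		fmt.Println(v)
	}
}
```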
> Depending on the underlying database driver, I suspect you could leverage server-side cursors so the database can stream the result set in chunks efficiently. See sqlrange; its Query function is a good example. That probably means you wouldn't need to handle the LIMIT clause manually. But they're not supported in MySQL, as far as I know.
Hmm, server-side cursors sound cool. We could use whatever method works best for chunking the query results here.
> Yeah, that makes sense. I chose not to return an iterator, concerned that I wouldn't be able to manage the complexity or that the API would become too cryptic. But it looks like that approach would result in fewer allocations and less pressure on the garbage collector? I'll start with the simpler version where we just work with slices, then we can explore other options further.
Doing the simpler version of passing slices sounds good to me. 👍
> I did get the Go coroutine output to match the Python output, but it was much harder in Go, and I had to use iter.Pull to get access to a next() function.
Thanks for doing this; it made me realize that I misunderstood a few things, and this conversation has been helpful in clarifying them. It turns out that Go used a lightweight/constrained version of coroutines to implement iter.Pull. This post was very clarifying: https://unskilled.blog/posts/coroutines-in-go/.
I feel like I've finally reached the point where I know I know nothing, so now I can get back to the task I was supposed to be doing in the first place and debug it like a monkey, haha.
Force-pushed from 8323dc6 to 3a64500
Force-pushed from 3a64500 to 25f1c7e
This PR explores a new pattern for handling bulk database operations using Go 1.23's range-over-function feature. It introduces a new CreatePreservationTasks method in the persistence package, which accepts a generator of preservation tasks (iter.Seq[*datatypes.PreservationTask]). The method uses the batch function to group tasks into manageable batches and leverages CreateBulk to perform efficient bulk inserts in a single INSERT operation. To view the main idea of this proposal in isolation, visit: https://go.dev/play/p/y7tYvD7NO28.
In future PRs, this new function could be used in areas like CreateAIPActivity and JobTracker to improve bulk task creation.
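As a sketch of the pattern (the ent-generated identifiers and the converter helper below are assumptions based on this description, not code copied from the PR):

```go
// CreatePreservationTasks consumes the input iterator in tranches
// via batch() and persists each tranche with a single CreateBulk
// INSERT, collecting the rows returned by ent.
func (c *client) CreatePreservationTasks(ctx context.Context, tasks iter.Seq[*datatypes.PreservationTask]) ([]*datatypes.PreservationTask, error) {
	var created []*datatypes.PreservationTask
	for tranche := range batch(tasks, 100) {
		builders := make([]*db.PreservationTaskCreate, 0, len(tranche))
		for _, pt := range tranche {
			builders = append(builders, c.ent.PreservationTask.Create().SetName(pt.Name)) // assumed builder API
		}
		rows, err := c.ent.PreservationTask.CreateBulk(builders...).Save(ctx)
		if err != nil {
			return nil, err
		}
		for _, r := range rows {
			created = append(created, convertPreservationTask(r)) // hypothetical converter
		}
	}
	return created, nil
}
```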