Implement CreatePreservationTasks with CreateBulk #1048
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1048      +/-   ##
==========================================
- Coverage   54.68%   54.64%   -0.05%
==========================================
  Files         104      105       +1
  Lines        7631     7683      +52
==========================================
+ Hits         4173     4198      +25
- Misses       3201     3221      +20
- Partials      257      264       +7
```

View full report in Codecov by Sentry.
@djjuhasz yes, I've been looking at it and will provide feedback. I'm having a hard time wrapping my head around the iterator code flow.
The syntax is a bit confusing 😢... I found the docstring of the compiler's rangefunc package somewhat useful: https://github.com/golang/go/blob/master/src/cmd/compile/internal/rangefunc/rewrite.go. There are a couple of areas where I'm unsure if I'm doing things correctly:
I haven't yet found an implementation like this in other codebases; I should probably look around more. I assume other people are also trying to solve the problem of bulk inserts with iterators? A project that is slightly related is https://github.com/achille-roussel/sqlrange.
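For anyone else puzzling over the rangefunc syntax, here's a minimal, self-contained sketch (not from this PR) of the shape under discussion: a producer returns an iter.Seq, and returning false from yield is how the consumer's break propagates back to the producer:

```go
package main

import (
	"fmt"
	"iter"
)

// count returns an iterator over the integers [0, n).
func count(n int) iter.Seq[int] {
	return func(yield func(int) bool) {
		for i := 0; i < n; i++ {
			// yield returns false when the consumer breaks out of its
			// range loop, so the producer must stop too.
			if !yield(i) {
				return
			}
		}
	}
}

func main() {
	for v := range count(3) {
		fmt.Println(v) // 0, 1, 2
	}
}
```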
Thanks @sevein, this is pretty cool. I made a couple of comments, but neither is a blocker to merging.
I think passing an iterator to CreatePreservationTasks() instead of a slice is going to make it a bit harder to understand for those unfamiliar with iterators, but it's more flexible, so I think it's worth it. Adding a call to CreatePreservationTasks() in the code would be helpful as an example, but that can come in a later commit.
It would be cool to add some benchmark tests at some point to compare batch vs. single inserts, though I guess the performance will depend a lot on the DBMS used and deployment particulars.
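For illustration, a hypothetical call site might look like the sketch below; the service value svc, the createAndPublish and publishEvent names, and the error handling are assumptions, not code from this PR:

```go
// Hypothetical call site: adapt a slice to iter.Seq with
// slices.Values so the persistence layer can batch the inserts.
func createAndPublish(ctx context.Context, svc persistence.Service, tasks []*datatypes.PreservationTask) error {
	created, err := svc.CreatePreservationTasks(ctx, slices.Values(tasks))
	if err != nil {
		return err
	}
	for _, pt := range created {
		publishEvent(pt) // hypothetical: publish with the MySQL-generated ID.
	}
	return nil
}
```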
name: "Creates multiple preservation tasks exceeding batch size", | ||
tasks: func(entc *db.Client) []*datatypes.PreservationTask { | ||
pa1, _ := addDBFixtures(t, entc) | ||
pts := make([]*datatypes.PreservationTask, 0, 300) // Three batches. |
This seems like a good place to use a generator instead of creating a big slice of PreservationTasks.
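A generator for the fixture could look something like this sketch; the field names on datatypes.PreservationTask are assumptions for illustration:

```go
// taskSeq lazily produces n preservation tasks instead of
// materializing a 300-element slice up front. Field names are
// illustrative, not copied from the repo.
func taskSeq(n int) iter.Seq[*datatypes.PreservationTask] {
	return func(yield func(*datatypes.PreservationTask) bool) {
		for i := 0; i < n; i++ {
			pt := &datatypes.PreservationTask{
				Name: fmt.Sprintf("Task %d", i), // assumed field
			}
			if !yield(pt) {
				return
			}
		}
	}
}
```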
I need the full slice for assert.DeepEqual - what testing approach did you have in mind with an iterator?
Hmm, I didn't realize CreatePreservationTasks() returned the full slice of created preservation tasks. I guess my counter question is: what's the advantage of accepting an iterator when you return the full slice of results anyway? I'm trying to imagine a scenario where you would want to generate 3 tranches of data 100 rows at a time (for example), yield the rows to CreatePreservationTasks(), but then get back all three hundred results in one slice. 🤔
It also just seems weird to me to create a slice of 300 rows, convert that to an iterator (with slices.Values(tasks)), then get back a slice of 300 results. I get that this is just a test, so the data flow is necessarily simplified, but I just wonder if we're going to end up using this same flow in production. 🤔
> What's the advantage of accepting an iterator when you return the full slice of results anyway? I'm trying to imagine a scenario where you would want to generate 3 tranches of data 100 rows at a time (for example), yield the rows to CreatePreservationTasks(), but then get back all three hundred results in one slice.
Fair point. My use case was savePreservationTasks, where I thought that it would be preferable to eventually load tasks from the data source (a3m) in batches while doing bulk inserts, but I was probably overthinking it. I also realized that packageImpl.CreatePreservationTasks needs the slice back from MySQL so we can publish the corresponding events with the MySQL-generated identifiers.
I'm not sure if it's a good idea to accept an iterator as a generic form of an iterable, so that the user gets to decide whether to pass a slice wrapped by slices.Values or an iterator function. Honestly, I don't know what the performance implications are; I've been assuming that iterators are fast because the compiler inlines all the code, but that may not be true.
I can append new commits to this PR introducing the simpler form that accepts a slice, and doing all the work end to end (a3m activity » package service » persistence » ent) to see how it feels.
There are all sorts of packages emerging using rangefunc and generics, e.g. https://github.com/BooleanCat/go-functional introduces the idea of chainable methods. I need to keep exploring this space.
Yeah, iterators are quite interesting. Because I'm new to coroutines and generators, I keep referring back to https://research.swtch.com/coro, which I've found really helpful for conceptualizing the mechanics and use-cases of coroutines, but I'm still not very confident of my understanding.
To me the batch() function fits well with my concept of a coroutine: the batch function receives data from an iterator until it hits the batch threshold, then yields a tranche of records back to the caller, which does some work with that tranche before resuming the batch coroutine.
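In code, a batching adapter along those lines could be sketched like this (a hand-rolled version, not necessarily identical to the PR's batch function):

```go
// batch consumes an input iterator and yields tranches of up to n
// items; the final tranche may be smaller.
func batch[T any](seq iter.Seq[T], n int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		buf := make([]T, 0, n)
		for v := range seq {
			buf = append(buf, v)
			if len(buf) == n {
				if !yield(buf) {
					return // consumer stopped early
				}
				buf = make([]T, 0, n) // fresh buffer; the consumer may retain the old one
			}
		}
		if len(buf) > 0 {
			yield(buf) // final partial tranche
		}
	}
}
```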
I think the use case for CreatePreservationTasks() where the iterator parameter is sending batched rows from a MySQL query is an interesting one. I guess in that case the hypothetical ListPreservationTasks() would do a SELECT query with a limit on the rows returned, and then would return the results as an iterator. When the batching function in ListPreservationTasks() got to the last row in the tranche, it would return control back to its caller to SELECT the next tranche of rows. Meanwhile, CreatePreservationTasks() would be batching the rows from its input iterator and inserting them into the database when it hit the batch limit. I guess my naive expectation at that point is that CreatePreservationTasks() would yield back the results from that tranche of inserted rows to its caller, so the caller could do whatever it needs to do with those results (e.g. publish events); then the results could be expired before receiving the next tranche of results (and collected by the garbage collector whenever that ran). This scenario took a while for me to puzzle through because it's a pretty long chain of coroutines to switch between, with batching on both the SELECT and INSERT operations. 🤯
@sevein is this close to the flow you were envisioning, or were you thinking of something very different?
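A rough sketch of that chained flow, with every name hypothetical (batch as sketched above, bulkInsert standing in for the CreateBulk call):

```go
// createTasks consumes an input iterator in tranches of 100,
// bulk-inserts each tranche, and yields the inserted rows back so
// the caller can publish events before the tranche becomes garbage.
func createTasks(ctx context.Context, in iter.Seq[*datatypes.PreservationTask]) iter.Seq2[[]*datatypes.PreservationTask, error] {
	return func(yield func([]*datatypes.PreservationTask, error) bool) {
		for tranche := range batch(in, 100) {
			created, err := bulkInsert(ctx, tranche) // hypothetical helper
			if !yield(created, err) || err != nil {
				return
			}
		}
	}
}
```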
What you're saying resonates with me as well, but I'm avoiding adopting the mental model where we think of iterators as coroutines because that's not how they ended up implementing it - Russ' article explores the broader coroutine space, which differs from go1.23's implementation. Unlike Python (run this example if interested), in Go iterators we have a single stack and there is no pausing/resuming; the compiler is inlining the code (these examples are clarifying to me).
> I guess in that case the hypothetical ListPreservationTasks() would do a SELECT query with a limit on the rows returned, and then would return the results as an iterator. When the batching function in ListPreservationTasks() got to the last row in the tranche, it would return control back to its caller to SELECT the next tranche of rows.
Depending on the underlying database driver, I suspect you could leverage server-side cursors so the database can stream the result set in chunks efficiently. See sqlrange; its Query function is a good example. That probably means you wouldn't need to handle the LIMIT clause manually. But they're not supported in MySQL, as far as I know.
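As a sketch of that streaming shape using plain database/sql (roughly what sqlrange's Query does; the taskRow type and the table/column names here are assumptions):

```go
type taskRow struct {
	ID   int64
	Name string
}

// queryTasks streams SELECT results as an iterator. Whether the
// rows arrive from the server in chunks is up to the driver; the
// MySQL driver buffers the full result set client-side.
func queryTasks(ctx context.Context, db *sql.DB) iter.Seq2[taskRow, error] {
	return func(yield func(taskRow, error) bool) {
		rows, err := db.QueryContext(ctx, "SELECT id, name FROM preservation_task") // assumed schema
		if err != nil {
			yield(taskRow{}, err)
			return
		}
		defer rows.Close()
		for rows.Next() {
			var r taskRow
			if err := rows.Scan(&r.ID, &r.Name); err != nil {
				yield(taskRow{}, err)
				return
			}
			if !yield(r, nil) {
				return
			}
		}
		if err := rows.Err(); err != nil {
			yield(taskRow{}, err)
		}
	}
}
```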
> I guess my naive expectation at that point is that CreatePreservationTasks() would yield back the results from that tranche of inserted rows to its caller, so the caller could do whatever it needs to do with those results (e.g. publish events); then the results could be expired before receiving the next tranche of results (and collected by the garbage collector whenever that ran).
Yeah, that makes sense. I chose not to return an iterator, concerned that I wouldn't be able to manage the complexity or that the API would become too cryptic. But it looks like that approach would result in fewer allocations and less pressure on the garbage collector? I'll start with the simpler version where we just work with slices, then we can explore other options further.
> What you're saying resonates with me as well, but I'm avoiding adopting the mental model where we think of iterators as coroutines because that's not how they ended up implementing it - Russ' article explores the broader coroutine space, which differs from go1.23's implementation. Unlike Python (run this example if interested), in Go iterators we have a single stack and there is no pausing/resuming; the compiler is inlining the code (these examples are clarifying to me).
Man, https://go.dev/wiki/RangefuncExperiment#can-range-over-a-function-perform-as-well-as-hand-written-loops is interesting, but am I the only one that has a really hard time parsing functions used as function params or return values? 🤯 I had to read those examples at least three times to figure out what was going on. 😆
I'm not really clear on the practical differences between Go's inlined iterator functions and Python's coroutines, but I did try to reimplement your Python coroutine in Go. I did get the Go coroutine output to match the Python output, but it was much harder in Go, and I had to use iter.Pull to get access to a next() function. The exercise highlighted just how shaky my understanding is of how to write and use iterators in Go. 😬
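For reference, a minimal illustration of the iter.Pull shape (standard library API; the data is made up):

```go
package main

import (
	"fmt"
	"iter"
	"slices"
)

func main() {
	// iter.Pull converts a push iterator into a pull-style
	// next()/stop() pair.
	next, stop := iter.Pull(slices.Values([]string{"a", "b", "c"}))
	defer stop()
	for {
		v, ok := next()
		if !ok {
			break
		}
		fmt.Println(v)
	}
}
```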
> Depending on the underlying database driver, I suspect you could leverage server-side cursors so the database can stream the result set in chunks efficiently. See sqlrange; its Query function is a good example. That probably means you wouldn't need to handle the LIMIT clause manually. But they're not supported in MySQL, as far as I know.
Hmm, server-side cursors sound cool. We could use whatever method works best for chunking the query results here.
> Yeah, that makes sense. I chose not to return an iterator, concerned that I wouldn't be able to manage the complexity or that the API would become too cryptic. But it looks like that approach would result in fewer allocations and less pressure on the garbage collector? I'll start with the simpler version where we just work with slices, then we can explore other options further.
Doing the simpler version of passing slices sounds good to me. 👍
> I did get the Go coroutine output to match the Python output, but it was much harder in Go, and I had to use iter.Pull to get access to a next() function.
Thanks for doing this; it made me realize that I misunderstood a few things, and this conversation has been helpful in clarifying them. It turns out that Go used a lightweight/constrained version of coroutines to implement iter.Pull. This post was very clarifying: https://unskilled.blog/posts/coroutines-in-go/.
I feel like I've finally reached the point where I know I know nothing, so now I can get back to the task I was supposed to be doing in the first place and debug it like a monkey, haha.
Force-pushed from 8323dc6 to 3a64500
Force-pushed from 3a64500 to 25f1c7e
This PR explores a new pattern for handling bulk database operations using Go 1.23's range-over-function feature. It introduces a new CreatePreservationTasks method in the persistence package, which accepts a generator of preservation tasks (iter.Seq[*datatypes.PreservationTask]). The method uses the batch function to group tasks into manageable batches and leverages CreateBulk to perform efficient bulk inserts in a single INSERT operation. To view the main idea of this proposal in isolation, visit: https://go.dev/play/p/y7tYvD7NO28.
In future PRs, this new function could be used in areas like CreateAIPActivity and JobTracker to improve bulk task creation.
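As a sketch of the pattern (the ent-generated identifiers and the converter helper below are assumptions based on this description, not code copied from the PR):

```go
// CreatePreservationTasks consumes the input iterator in tranches
// via batch() and persists each tranche with a single CreateBulk
// INSERT, collecting the rows returned by ent.
func (c *client) CreatePreservationTasks(ctx context.Context, tasks iter.Seq[*datatypes.PreservationTask]) ([]*datatypes.PreservationTask, error) {
	var created []*datatypes.PreservationTask
	for tranche := range batch(tasks, 100) {
		builders := make([]*db.PreservationTaskCreate, 0, len(tranche))
		for _, pt := range tranche {
			builders = append(builders, c.ent.PreservationTask.Create().SetName(pt.Name)) // assumed builder API
		}
		rows, err := c.ent.PreservationTask.CreateBulk(builders...).Save(ctx)
		if err != nil {
			return nil, err
		}
		for _, r := range rows {
			created = append(created, convertPreservationTask(r)) // hypothetical converter
		}
	}
	return created, nil
}
```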