Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: [Rhythm] Block-builder consumption loop #4480

Draft
wants to merge 4 commits into
base: main-rhythm
Choose a base branch
from

Conversation

mdisibio
Copy link
Contributor

What this PR does:
This is an alternate block-builder consumption loop that I think has benefits and would like to get feedback on.

The curent loop can be thought of as top-down. It calculates the total time range that the block builder is lagging, splits it into smaller sections (i.e. 5 minutes), consume/flush/commit each section.

This new loop is: while more data, start at last commit and consume/flush/commit another chunk of data (i.e. 5 minutes).

The benefits are:

  • less state - Each consume/flush/commit cycle is independent, doesn't require knowledge of the overall state of the queue, or side effects from previous loops
  • fewer kafka apis are involved, (i.e. no CalculateGroupLag)
  • doesn't require custom commit metadata
  • I think core loop is simpler and will make it easier to iterate. For example there is a TODO to round-robin when the block-builder is assigned to multiple partitions and they are all lagging. I think this is more complex to try in the current design.

The drawbacks are:

  • Mainly around how we want to metric "lag". The current metric is number of messages, but I think finding a way to determine length of time (i.e. "the block-builder is 15 minutes behind") is more useful. However you have to read a record to determine that, so this needs more work.

TODO

  • This is a draft, and needs more finishing touches before it can be merged. What I want is to think about the overall loop structure. Logs/cleanup/finishing touches would be added before any merge.
  • Disregard the test failures, they are because the "fake" kafka is not mimicing real behavior. Updated the first test TestBlockbuilder_lookbackOnNoCommit to show an example of how it can be fixed. Specifically, it needs to run with consumer group support, and remove:
k.ControlKey(kmsg.OffsetCommit.Int16(), func(kmsg.Request) (kmsg.Response, error, bool) {
		kafkaCommits.Add(1)
		return nil, nil, false
	})

because this means the commits are not actually committed.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@javiermolinar javiermolinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

case <-time.After(waitTime):
err := b.consume(ctx)
if err != nil {
b.logger.Log("msg", "consumeCycle failed", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
b.logger.Log("msg", "consumeCycle failed", "err", err)
level.Error(b.logger).Log("msg", "consumeCycle failed", "err", err)

return false, err
}

lastCommit, ok := commits.Lookup(topic, partition)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lastCommit, ok := commits.Lookup(topic, partition)
lastCommit, exists := commits.Lookup(topic, partition)
if exists && lastCommit.At >= 0 {
startOffset = startOffset.At(lastCommit.At)
} else {
startOffset = kgo.NewOffset().AtStart()
}

https://pkg.go.dev/github.com/twmb/franz-go/pkg/kadm@v1.14.0#OffsetResponses.Lookup

}

err := b.pushTraces(rec.Key, rec.Value, writer)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if something is wrong with the WAL? I guess it will enter in a loop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants