Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106
Changes from 10 commits
1e4d5b3
0b9bcc1
25957a6
092b587
8f6ac0e
c8cc7a4
962905f
82dc615
266f735
f1e11bd
a2aa762
817e69f
e9a4957
6efb732
dd8dc17
4da8534
e2d52ec
1c39e98
0195098
8d503d1
7e62985
786ad51
33abec0
0b3e742
6113a71
f6340f2
ae08526
5da092a
bf4bfba
fc4ed8b
bcfc1db
3c00d97
71e3be2
9d15913
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting DB size should be a relatively cheap operation so we can consider lowering ticker to be more frequent. This should prevent a configuration where TTL is low sub-minute to swap DB frequently after just 2 limit hits in a row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand here.
if TTL < 1m, DB will be swapped after 2*TTL. (I actually consider this pathological as badger db size reporting comes with 1m delay)
if TTL >=1m, DB will be swapped after at max TTL+1m.
Do you think we should fix TTL <1m? It will increase test run time to 1m, but will fix swapping DB frequently for sub-minute TTL (which I don't think has much value in reality). I'm inclined to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, I didn't realize the size is getting recalculated every 1m. Then I think the current implementation is fine.
I was mostly afraid of the edge case situation when a TTL is set to a low value like 5s and DB is running close to the max disk limit, then even if compaction is making progress we might perform DB swap every other minute if we are unlucky and check twice in a row when the DB is full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually afraid of the situation that TTL is way lower than a minute e.g. 5s, causing storage limit check to be done on the same 1m cached reported db size. I have the
min()
so that the test doesn't take a minute to run, but this edge case worries me a bit.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 0b3e742 but TestDropLoop tests will take 1m to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change the subcriber position directory to be a separate directory than badger data? All the orchestration required to move this file is just not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll still need to maintain backwards compatibility. e.g. even if we ship a migration in 8.16.3, we'll need to keep backwards compatibility code at apm-server startup time to read from a pre-8.16.3 directory structure and move to a 8.16.3 directory structure. Then the dropAndRecreate implementation can be much cleaner. I was originally thinking about doing the move in 9.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is it critical for us to keep the subscriber position data for restarts? Will it result in starting from zero and rereading all sampled traces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure. Need to dig deeper.
Either a performance hit to reread all sampling decisions, or data loss if it does not reread sampling decisions made during apm-server restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a blocker? I think that the scope of this PR is fairly large already, not sure if we should further expand the scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not a blocker though I don't personally like either of the renames (this one + the one we do earlier to the directory). But, feel free to go ahead with merge if you have approving reviews as I might not be able to get to this today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I'm raising this is that in an ideal world instead of subscriber position being in a subdirectory, I think we should move badger into a subdirectory. And the risk factor may be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Personally I hate all the fs interactions. There are too many ways to fail - think about owner, file mode, file systems)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, too many pieces for comfort for me.