
Clarify "resourceVersion unset" semantics in Watch #26064

Closed
embano1 opened this issue Jan 12, 2021 · 23 comments
Closed

Clarify "resourceVersion unset" semantics in Watch #26064

embano1 opened this issue Jan 12, 2021 · 23 comments
Labels: lifecycle/rotten · needs-triage · sig/api-machinery

Comments

embano1 (Member) commented Jan 12, 2021

This is a Feature Request

I wasn't sure whether to file this as a bug or a feature request (more clarity), so I decided to use "feature" here. I also understand that there have been lots of discussions around watch behavior [1,2], but I wanted to seek clarity from a documentation perspective.

[1] #16944
[2] kubernetes/kubernetes#74022

What would you like to be added

The current documentation on resource version for Watches in the API concepts states:

[screenshot of the API concepts documentation describing resourceVersion semantics for watch]

This is correct for API servers without a cache. However, the kube-apiserver default is --watch-cache=true. From the Cacher code path for Watch(), I read that no matter which RV is passed, the watch will always be served from the cache:

```go
func (c *Cacher) Watch(ctx context.Context, key string, opts storage.ListOptions) (watch.Interface, error) {
	pred := opts.Predicate
	// ParseResourceVersion returns 0 on an empty string ("")
	watchRV, err := c.versioner.ParseResourceVersion(opts.ResourceVersion)
	if err != nil {
		return nil, err
	}
	// ...
```

```go
func (a APIObjectVersioner) ParseResourceVersion(resourceVersion string) (uint64, error) {
	if resourceVersion == "" || resourceVersion == "0" {
		return 0, nil
	}
	// ...
```

If my understanding is correct, any client Watch() with an empty RV ("") against an API server with the cache enabled will not be served via a quorum read. If so, the documentation should point this out, e.g. by amending the following paragraph:

> Start a watch at the most recent resource version, which must be consistent (i.e. served from etcd via a quorum read). [add something about clusters with API server cache enabled]
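
For illustration, here is a minimal client-go sketch of the case in question: a watch started with ResourceVersion unset. The kubeconfig source, namespace, and resource are assumptions for the example. Against a kube-apiserver running with the default --watch-cache=true, this watch is served from the watch cache, not via an etcd quorum read:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption for this sketch: credentials come from the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// ResourceVersion is left unset (""). Per the Cacher code path above,
	// it is parsed as 0, so the watch is served from the (possibly stale)
	// watch cache rather than via a quorum read from etcd.
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Printf("event: %s %T\n", ev.Type, ev.Object)
	}
}
```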

Why is this needed

A user (controller developer) might be under the false assumption that their Watches are strongly consistent when they don't specify a RV and don't know whether the actual cluster runs with API server caching (e.g. managed environments, separate teams, etc.).

Comments

If my understanding is correct, the time-travel bug for lists could potentially exist in many custom controllers/CRDs if users rely on the strongly consistent RV == "" semantics as described in the documentation.

cc/ @liggitt @jpbetz @lavalamp @wojtek-t @tnozicka @marshtompsxd @lalithsuresh

@k8s-ci-robot added the needs-triage label Jan 12, 2021
k8s-ci-robot (Contributor)

@embano1: This issue is currently awaiting triage.

SIG Docs takes a lead on issue triage for this website, but any Kubernetes member can accept issues by applying the triage/accepted label.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lavalamp (Member)

I think @jpbetz knows the current state on this; this might be a dup.

embano1 (Member, Author) commented Jan 12, 2021

Thx, yeah... I checked the docs (the last PR there was from him, actually) and also the issues, but could not find clarification around cache/no-cache Watch() behavior in the docs.

wojtek-t (Member)

Yes - this is a bug in the documentation, which we should fix.
/assign @jpbetz
Can you please take care of it?

That said - using watch with RV = 0 or RV unset isn't really a recommended pattern. Generally, "list + watch from there" is the pattern we recommend everywhere.
It might make sense to also make this very explicit in this doc.
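
To make that recommendation concrete, here is a minimal sketch of the list+watch pattern; the function name and namespace are made up for illustration:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// listThenWatch sketches the "list + watch from there" pattern: a consistent
// list followed by a watch started at the list's resourceVersion.
func listThenWatch(ctx context.Context, client kubernetes.Interface) (watch.Interface, error) {
	// Leaving ResourceVersion unset ("") on a *list* requests a consistent
	// (quorum) read from etcd.
	list, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	// Start the watch at the resourceVersion the consistent list returned,
	// rather than at "0" or unset.
	return client.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
}
```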

embano1 (Member, Author) commented Jan 13, 2021

> Generally, "list + watch from there" is the pattern we recommend everywhere.
> It might make sense to also make this very explicit in this doc.

Agree.

Follow-up question, since most users will likely use a ListerWatcher with SharedIndexInformer (reflector) or controller-runtime (which does this implicitly): these controllers don't enforce a quorum list on start, so they are also vulnerable to time-travel issues even though they use list+watch, right? The reason I am asking is that this behavior is likely not known to the majority of the community. Should we also document it?
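
For context, this is roughly the setup in question; a minimal informer sketch in which the function name, resync period, and handler are illustrative assumptions. Note that nothing here lets the caller control the resource version of the reflector's initial list:

```go
package example

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// runPodInformer assumes an already-constructed clientset and stop channel.
func runPodInformer(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// Handlers elided; the point is that no option here influences the
		// resource version used by the underlying reflector's initial list.
		AddFunc: func(obj interface{}) {},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```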

jpbetz (Contributor) commented Jan 13, 2021

The documentation absolutely needs to be updated for this case. I'll try to get to that. I agree that watch with RV="" should be avoided. I'll try to incorporate that into the documentation.

The ListerWatcher, SharedIndexInformer, and whatever other documentation doesn't capture this also really need to be updated. I'm less familiar with that documentation, so if anyone is willing to own getting it updated, that would be a huge help. I believe the Reflector documentation is OK, since it just delegates to the ListerWatcher provided to NewReflector, but it might be good to see if it can be improved in any way as well.

embano1 (Member, Author) commented Jan 13, 2021

> I'm less familiar with that documentation, so if anyone is willing to own getting it updated, that would be a huge help

Happy to help. I'm wondering whether this would only go in the client-go docs, or is there any public documentation for the SDK w.r.t. ListWatch semantics?

> I believe the Reflector documentation is OK

Please help me check my understanding: as you said, the reflector uses the configured ListerWatcher, typically passed in during SharedIndexInformer construction. Most (all?) of the provided typed/dynamic ListerWatchers don't explicitly specify a RV, thus using the one injected during ListAndWatch() in the reflector:

```go
func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
	klog.V(3).Infof("Listing and watching %v from %s", r.expectedTypeName, r.name)
	var resourceVersion string

	options := metav1.ListOptions{ResourceVersion: r.relistResourceVersion()}

	if err := func() error {
		initTrace := trace.New("Reflector ListAndWatch", trace.Field{"name", r.name})
		defer initTrace.LogIfLong(10 * time.Second)
		var list runtime.Object
		var paginatedResult bool
		var err error
		listCh := make(chan struct{}, 1)
		panicCh := make(chan interface{}, 1)
		go func() {
			defer func() {
				if r := recover(); r != nil {
					panicCh <- r
				}
			}()
			// Attempt to gather list in chunks, if supported by listerWatcher, if not, the first
			// list request will return the full response.
			pager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
				return r.listerWatcher.List(opts)
			}))
			// ...
```

IMHO options.ResourceVersion is set to "0", based on this code in r.relistResourceVersion():

```go
if r.lastSyncResourceVersion == "" {
	// For performance reasons, initial list performed by reflector uses "0" as resource version to allow it to
	// be served from the watch cache if it is enabled.
	return "0"
}
```

Thus, the reflector will List from any API server and use the returned RV (if valid) to start the Watch. Couldn't this also lead to staleness issues, e.g. when the controller starts working from events on an initial ListWatch returned at, say, RV 10 from API server A, then crashes, gets reconnected to server B, and starts the same logic from RV 8 because server B is lagging?

jpbetz (Contributor) commented Jan 13, 2021

> Thus, the reflector will List from any API server and use the returned RV (if valid) to start the Watch. Couldn't this also lead to staleness issues, e.g. when the controller starts working from events on an initial ListWatch returned at, say, RV 10 from API server A, then crashes, gets reconnected to server B, and starts the same logic from RV 8 because server B is lagging?

Oh right. So the background here is that we did attempt (quite a long time ago now) to update Reflectors to do the initial list using RV="". The PR (kubernetes/kubernetes#83520) got rolled back due to excessive list calls to etcd (kubernetes/kubernetes#86824). I haven't had the bandwidth since then to tackle this again, but maybe it would be reasonable to provide an option when constructing reflectors to specify that initial lists should be consistent.
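
Purely as a hypothetical sketch of such an option (no such API exists in client-go today; all names below are made up): a ListerWatcher wrapper that forces consistent lists. Note that, as written, it would make every relist a quorum read, which is the etcd load concern that got kubernetes/kubernetes#83520 rolled back:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/cache"
)

// consistentListerWatcher is hypothetical: it wraps a cache.ListerWatcher
// and forces ResourceVersion="" on every List, so the reflector's lists are
// served via quorum reads instead of the watch cache.
type consistentListerWatcher struct {
	cache.ListerWatcher
}

func (c consistentListerWatcher) List(opts metav1.ListOptions) (runtime.Object, error) {
	// Overwrite the reflector's default of "0" with "" to force a
	// consistent read. Watch() is inherited unchanged from the embedded
	// ListerWatcher.
	opts.ResourceVersion = ""
	return c.ListerWatcher.List(opts)
}
```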

For documentation, you're absolutely right, we have the "reflectors can go back in time problem" and we should document that.

embano1 (Member, Author) commented Jan 13, 2021

Perfect. Let me know where I can help and I'll file PR(s); I just don't know all the places from a docs perspective.

sftim (Contributor) commented Jan 14, 2021

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label Jan 14, 2021
jpbetz (Contributor) commented Jan 15, 2021

@embano1 Updating the client-go godocs seems appropriate to me.

@roycaihw Any thoughts on how to best document this?

fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 15, 2021
dprotaso (Contributor)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 15, 2021
fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 14, 2021
k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 13, 2021
wojtek-t (Member) commented Sep 9, 2021

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Sep 9, 2021
sftim (Contributor) commented Sep 9, 2021

I could revise #25299 if someone suggests new wording.

embano1 (Member, Author) commented Sep 19, 2021

@sftim can you please point me to the section where you would like to add the different Watch() semantics? Then I can propose a paragraph on the different behavior with cached API servers, client-go optimizations, the different RV semantics, and the issues this might cause (e.g. time-travel anomalies; see https://github.com/sieve-project/sieve/blob/main/docs/bugs.md and https://github.com/sieve-project/sieve/blob/main/docs/paper-hotos.pdf).

k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 18, 2021
sftim (Contributor) commented Jan 8, 2022

@embano1 I'd add the extra details about watch actions into https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 7, 2022
k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this: the /close comment from @k8s-triage-robot above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
