-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add taint to user and worker nodes #2605
base: main
Are you sure you want to change the base?
Conversation
@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base): | |||
kube_context: str | |||
|
|||
|
|||
class DigitalOceanNodeGroup(schema.Base): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate class, so I deleted it
This method works as intended when tested on GCP. However, One issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with rook ceph csi-cephfslplugin from my rook PR, but I expect it would also be an issue for the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets. |
@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" { | |||
}, | |||
csi = { | |||
enableRbdDriver = false, # necessary to provision block storage, but saves some cpu and memory if not needed | |||
provisionerReplicas : 1, # default is 2 on different nodes | |||
pluginTolerations = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs csi-driver on all nodes, even those with NoSchedule taints. Doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does so I copied it here.
effect = "NoSchedule" | ||
}, | ||
{ | ||
operator = "Exists" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs promtail on all nodes, even those with NoSchedule taints. Doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does so I copied it here. Promtail is what exports logs from the node so we still want it to run on the user and worker nodes.
{ | ||
key = "node-role.kubernetes.io/master" | ||
operator = "Exists" | ||
effect = "NoSchedule" | ||
}, | ||
{ | ||
key = "node-role.kubernetes.io/control-plane" | ||
operator = "Exists" | ||
effect = "NoSchedule" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These top 2 are the default value for this helm chart.
Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit to see what I tried in a quick test. I do see the new scheduler_pod_extra_config value in
so I think possibly the merge isn't going as expected, but I need to verify. The docs say that "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase." |
I managed to get the taints applied to the scheduler pod in this commit. I would have expected the
I still need to apply the toleration to the dask workers. |
@@ -227,18 +229,23 @@ def base_username_mount(username, uid=1000, gid=100): | |||
} | |||
|
|||
|
|||
def worker_profile(options, user): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this function since it affects the scheduler as well and not just the worker
Okay things were working as expected for the jupyterlab pod and the dask worker and scheduler pods on GKE. I need to test on:
I also need to test:
|
Okay, I think things are working on AWS. I resolved some issues on Azure, but still need to test Azure. |
I tested on GCP, Azure, and AWS and works well on all. I did the following in my test:
I also tested removing the taints on Azure and AWS and saw that the taints were removed successfully. |
Update: done now - #2824 |
@@ -150,6 +201,22 @@ class AWSNodeGroupInputVars(schema.Base): | |||
permissions_boundary: Optional[str] = None | |||
ami_type: Optional[AWSAmiTypes] = None | |||
launch_template: Optional[AWSNodeLaunchTemplate] = None | |||
node_taints: list[dict] | |||
|
|||
@field_validator("node_taints", mode="before") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is repeated (see line 233 in this file) for GCP and AWS NodeGroupInputVars classes, but that's b/c the format expected by GCP and AWS terraform modules for taints happens to be the same. I think the required formats for the different modules could evolve separately and so I chose to duplicate the code in this case.
We should add some instructions to the docs about adding other node groups (e.g. gpus). Users should add the user taint to other user node profiles in order to prevent the same issue this PR prevents. |
Rather than making the user make sure to put a consistent taint on each node group, maybe we should just have a "type" field on node groups to simplify this. Less flexible, but I think it's flexible enough for the use cases we expect. google_cloud_platform:
project: qhub-279316
region: us-central1
kubernetes_version: 1.28.9-gke.1289000
tags:
- "nebari-quansight-dev"
node_groups:
general:
instance: n1-standard-8
min_nodes: 1
max_nodes: 1
type: general <--------------------------NEW-------------------------------
user:
instance: n1-standard-4
min_nodes: 0
max_nodes: 200
type: user <----------------------------- NEW -----------------------------
large:
instance: n1-standard-8
min_nodes: 0
max_nodes: 200
type: user <----------------------------- NEW -----------------------------
worker:
instance: n1-standard-4
min_nodes: 0
max_nodes: 1000
type: worker <----------------------------- NEW ----------------------------- ^ We discussed this in a group meeting. This is not the ideal solution. Instead, we should just:
|
Okay, this PR is ready for review! |
Reference Issues or PRs
Fixes #2507
What does this implement/fix?
Put a
x
in the boxes that applyTesting
How to test this PR
A few possible ways to test:
Things working is defined as:
Any other comments?