Under development!
I'm still thinking over improvements to fluxnetes, fluence, and related projects, and this is the direction I'm currently taking. I've been thinking of a design where Flux works as a controller, as follows:
- The controller has an admission webhook that intercepts jobs and pods as they are submitted. Jobs are suspended; all other abstractions get scheduling gates.
- The jobs are wrapped as `FluxJob`, parsed into Flux job specs, and passed to a part of the controller, the Flux Queue.
- The Flux Queue, which runs in a loop, moves through the queue and interacts with a Fluxion service to schedule work.
- When a job is scheduled, it is unsuspended and/or targeted for the fluxqueue custom scheduler plugin, which assigns it exactly to the nodes it was intended for.
- We will need an equivalent cleanup process that is notified when pods are done, tells Fluxion, and updates the queue. Those steps will likely happen in the same operation.
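As an illustration of the admission step above, this is roughly what a scheduling-gated Pod looks like after interception. This is a sketch only: the gate name shown is hypothetical, and fluxqueue's actual gate name may differ.

```yaml
# Illustrative only: a Pod after admission, held by a scheduling gate.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  schedulingGates:
    # Hypothetical gate name; fluxqueue's real gate may be named differently
    - name: fluxqueue.converged-computing.org/scheduled
  containers:
    - name: app
      image: busybox
```

The kube-scheduler will not consider a Pod for scheduling until all of its scheduling gates are removed, which is what lets the controller hold work until Fluxion grants an allocation.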
This project comes out of fluxnetes, which was similar in design but did the implementation entirely inside of Kubernetes. Fluxnetes was a combination of Kubernetes and Fluence, both of which use the HPC-grade Fluxion scheduler to schedule pod groups to nodes. For our queue, we use River backed by a Postgres database. The database is deployed alongside fluxqueue and could be customized to use an operator instead.
**Important:** This is an experiment, and is under development. I will change this design a million times - it's how I tend to learn and work. I'll share updates when there is something to share. It deploys but does not work yet! See the docs for some detail on design choices.
Fluxqueue builds three primary containers:

- `ghcr.io/converged-computing/fluxqueue`: contains the webhook and operator with a flux queue for pods and groups that interacts with fluxion
- `ghcr.io/converged-computing/fluxqueue-scheduler`: (TBA) will provide a simple scheduler plugin
- `ghcr.io/converged-computing/fluxqueue-postgres`: holds the worker queue and provisional queue tables

And we use `ghcr.io/converged-computing/fluxion` for the fluxion service.
Create a kind cluster. You need more than a control plane; the kind config adds worker nodes.

```bash
kind create cluster --config ./examples/kind-config.yaml
```
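For reference, a minimal kind config with worker nodes might look like the following. This is a sketch: the actual contents of `examples/kind-config.yaml` in the repository may differ in node count or settings.

```yaml
# Hypothetical kind config: one control plane plus worker nodes
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
```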
Install the certificate manager:

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
```
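Before deploying, you may want to block until cert-manager is ready, since the webhook depends on its certificates. This uses the standard `kubectl wait` command; the `cert-manager` namespace is the default for the manifest above.

```shell
# Wait for all cert-manager deployments to become available
kubectl wait --for=condition=Available deployment --all \
    -n cert-manager --timeout=120s
```

If this times out, check `kubectl get pods -n cert-manager` for image pull or scheduling issues before continuing.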
Then you can deploy as follows:

```bash
./hack/quick-build-kind.sh
```
You'll then have the fluxqueue service running and a Postgres database (for the job queue), along with (TBA) the scheduler-plugins controller, which we currently need for PodGroup.
```console
$ kubectl get pods -n fluxqueue-system
NAME                                                 READY   STATUS    RESTARTS   AGE
fluxqueue-chart-controller-manager-6dd6f95c6-z9qdk   0/1     Running   0          9s
postgres-5dc8c6b49d-llv2s                            0/1     Running   0          9s
```
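To block until both pods from the example output above are ready rather than polling by hand, you can again use `kubectl wait`:

```shell
# Wait for everything in the fluxqueue-system namespace to be ready
kubectl wait --for=condition=Ready pods --all \
    -n fluxqueue-system --timeout=180s
```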
You can then create a job or a pod:

```bash
kubectl apply -f test/job.yaml
kubectl apply -f test/pod.yaml
```
Each will currently be suspended (job) or scheduling-gated (pod) to prevent scheduling. A FluxJob to wrap each one is also created:
```console
$ kubectl get fluxjobs.jobs.converged-computing.org
NAME      AGE
job-pod   4s
pod-pod   6s
```
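To confirm the webhook did its work, you can inspect the intercepted objects directly. The `job-pod` name below comes from the example output above; the jsonpath queries just list fields across all objects in the default namespace, so they don't assume particular names.

```shell
# Jobs should have been suspended by the admission webhook (prints "true")
kubectl get jobs -o jsonpath='{.items[*].spec.suspend}'
echo
# Pods should carry a scheduling gate
kubectl get pods -o jsonpath='{.items[*].spec.schedulingGates}'
echo
# The generated FluxJob can be inspected as YAML
kubectl get fluxjobs.jobs.converged-computing.org job-pod -o yaml
```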
Next, I'm going to figure out how we can add a queue that receives these jobs and asks Fluxion to schedule them.
It is often helpful to shell into the Postgres container to see the database directly:

```bash
kubectl exec -n fluxqueue-system -it postgres-597db46977-9lb25 -- bash
psql -U postgres
```

Then, inside psql:

```sql
-- Connect to the database
\c
-- List databases
\l
-- Show tables
\dt
-- Test a query
SELECT group_name, group_size FROM pods_provisional;
```
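Alternatively, you can run one-off queries without an interactive shell. This assumes the Postgres Deployment is named `postgres` (consistent with the pod names shown above, but an assumption nonetheless):

```shell
# List tables in the default database non-interactively
kubectl exec -n fluxqueue-system deploy/postgres -- \
    psql -U postgres -c '\dt'
```

Using `deploy/postgres` avoids having to look up the current pod's hashed name after each restart.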
- Figure out how to add the queue
- Figure out how to add fluxion
- kubectl plugin to get fluxion state?
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614