Missing result: CycleCloud CPU for mt-gemm #47

Open
vsoch opened this issue Sep 15, 2024 · 3 comments

Comments

@vsoch
Member

vsoch commented Sep 15, 2024

I didn't find Azure CycleCloud for CPU when parsing the mt-gemm results. For example, it should be in the plot here:

https://github.com/converged-computing/performance-study/tree/main/analysis/mt-gemm#performance-gflops-per-second-for-cpu

And I don't see it in any of the directories here:

https://github.com/converged-computing/performance-study/tree/main/experiments/azure/cyclecloud/cpu/size32/results

Maybe it was an artifact that needs to be pulled? Here are the tags we have:

for tag in $(oras repo tags ghcr.io/converged-computing/metrics-operator-experiments/performance); do
    echo $tag | grep gem
done
eks-cpu-32-mt-gemm
eks-cpu-64-mt-gemm
gke-cpu-mt-gemm
eks-efa-cpu-32-mt-gemm
eks-cpu-64-efa-mt-gemm
eks-cpu-128-mt-gemm
gke-cpu-64-mt-gemm
gke-cpu-128-mt-gemm
gke-cpu-256-mt-gemm
aks-infiniband-cpu-32-mt-gemm
gke-cpu-mt-gemm-1k
aks-infiniband-cpu-64-mt-gemm
compute-engine-cpu-32-mt-gemm
eks-gpu-8-mt-gemm
compute-engine-cpu-64-mt-gemm
compute-engine-cpu-128-mt-gemm
compute-engine-cpu-256-mt-gemm
eks-gpu-16-mt-gemm
aks-infiniband-cpu-32-mt-gemm-placement
gke-gpu-8-2-mt-gemm
gke-gpu-4-mt-gemm
aws-parallelcluster-cpu-32-node-mt-gemm
gke-gpu-16-mt-gemm
gke-gpu-32-mt-gemm
aks-gpu-4-mt-gemm
aks-gpu-8-mt-gemm
aks-gpu-16-mt-gemm
aks-gpu-32-mt-gemm
eks-cpu-256-mt-gemm
compute-engine-gpu-4-mt-gemm
compute-engine-gpu-8-mt-gemm
compute-engine-gpu-16-mt-gemm
compute-engine-gpu-32-mt-gemm

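For reference, if one of those tags did hold the missing data, pulling it would look roughly like this (the tag is just an example picked from the list above; there is no cyclecloud CPU tag to pull, which is the problem):

# sketch: pull one artifact from the registry (example tag taken from the list above)
oras pull ghcr.io/converged-computing/metrics-operator-experiments/performance:eks-cpu-32-mt-gemm
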
It's not under @asarkar-parsys or @amarathe84 either, from what I can see. Likely I'm not looking in all places. Let's try to find it.

cc @asarkar-parsys @milroy, who ran the CPU experiments on CycleCloud.

@asarkar-parsys
Collaborator

I had run into issues running mt-gemm on cyclecloud. The following is from the notes I had made available before leaving.

From Slack 8/28:
Oras uploaded for 32 and 64 node azure cyclecloud results. Benchmarks that worked with reasonable success on both: amg, kripke, lammps, quicksilver, osu
Need attention: laghos, minife, mt_gemm

Excerpt from the document I had shared.

mt-gemm jobs: 2 jobs timed out; could be a simple error with running the module load before submitting the job
module load mpi/hpcx-pmix-2.18
for i in {1..5}; do sbatch --output=../../data/mt-gemm/%x-%j-iter-${i}.out --error=../../data/mt-gemm/%x-%j-iter-${i}.err slurm-mt-gemm-32n.sh; done
Error: /var/spool/slurmd/job00276/slurm_script: 9: module: not found
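
For what it's worth, that "module: not found" error usually means the module command was never initialized in the batch shell (or the script ran under plain sh rather than bash). A possible fix, sketched only; the init paths below are assumptions, and this is not the actual slurm-mt-gemm-32n.sh:

#!/bin/bash
# sketch of a fix (assumed init paths, not the original batch script):
# make the `module` function available in the non-interactive batch shell
source /etc/profile.d/modules.sh 2>/dev/null || source /usr/share/lmod/lmod/init/bash
module load mpi/hpcx-pmix-2.18
# ... existing mt-gemm launch line goes here ...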

So, maybe someone fixed it and ran mt-gemm on cyclecloud? @milroy @amarathe84

@vsoch
Member Author

vsoch commented Sep 16, 2024

Thanks @asarkar-parsys for sharing (likely again) the note from Slack. If it turns out that we didn't get the data because of slurm/module errors, that's OK, and we can document that. But maybe it was tried again and the result is sitting in an artifact or on someone's local machine. We likely need to do a full inventory across apps, including counts of runs; I've started doing that in my head as I process the analysis results, but a more thorough look is needed.
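
As a starting point for that inventory, something like this could count what is in the registry per benchmark (a rough sketch; it assumes tags end with the benchmark name, as in the mt-gemm tag list above):

# rough inventory sketch: count registry tags per benchmark (tag naming is assumed)
oras repo tags ghcr.io/converged-computing/metrics-operator-experiments/performance > tags.txt
for app in mt-gemm amg kripke lammps quicksilver osu laghos minife; do
    echo "$app: $(grep -c "$app" tags.txt)"
done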

@vsoch
Member Author

vsoch commented Sep 16, 2024

I also want to clarify that there is no urgency here. I am just documenting what I find as I go, because I won't remember otherwise. Maybe a colored spreadsheet would work better so I don't open an issue for each one.
