Am I really using the GPU? If so, why is the wall time longer than on my local machine? #3929

experiment-23 · 2024-06-29T04:17:01Z

My university HPC cluster has a TensorFlow dependency error. The system administrator asked me to install it on the HPC the same way I did on my local machine. I installed it with Miniforge and used the submission script below.

history:
302 conda create -n deepmd deepmd-kit lammps horovod -c conda-forge
303 conda activate deepmd

(deepmd) [user@login04 gpu]$ cat submit.sh
#!/bin/bash
#SBATCH --job-name=Job #Job name
#SBATCH -N 1 #Number of nodes
#SBATCH --ntasks-per-node=1 #Number of tasks per node
#SBATCH --gres=gpu:2 #Number of GPUs
#SBATCH --error=job.%J.err #Name of error file
#SBATCH --output=job.%J.out #Name of output file
#SBATCH --time=72:00:00 #Time limit for the job
#SBATCH --partition=gpu #specifies queue name(standard is the default partiti>

module load openmpi/4.1.4

conda init bash
conda activate deepmd

export OMP_NUM_THREADS=4
export TF_INTRA_OP_PARALLELISM_THREADS=4
export TF_INTER_OP_PARALLELISM_THREADS=4

mpirun -np 1 dp train input.json > output.txt

(deepmd)[user@login04 gpu]$ dp --version
DeePMD-kit v2.2.10

from job error file:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: gpu008
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 4
DEEPMD INFO num_inter_threads: 4
DEEPMD INFO -----------------------------------------------------------------

Am I really using the GPU? If so, why is the wall time longer than on my local machine?
Please help.

njzjz (Maintainer) · 2024-07-02T20:02:28Z

It looks like the GPU is not being used. First, check the output of conda list to see whether the CPU or the GPU build of deepmd-kit is installed.
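
The check above can be sketched as a small shell helper. This is only an illustration: the function name classify_deepmd_build is hypothetical, and it assumes that GPU builds carry "cuda" or "gpu" in the build-string column of conda list, which should be verified against the channel the package actually came from.

```shell
# Hypothetical helper: classify the installed deepmd-kit build from
# `conda list` output read on stdin (columns: name, version, build, channel).
# Assumption: GPU builds have "cuda" or "gpu" in the build string.
classify_deepmd_build() {
  awk '
    $1 == "deepmd-kit" {
      found = 1
      if ($3 ~ /cuda|gpu/) print "gpu"; else print "cpu"
    }
    END { if (!found) print "missing" }
  '
}

# Usage (run inside the job environment):
#   conda list -n deepmd | classify_deepmd_build
```

A result of "cpu" would be consistent with the training summary reporting "computing device: cpu:0" and "Count of visible GPU: 0".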

experiment-23 (Author) · 2024-07-04T07:52:08Z

(deepmd) [jayaprakash@login01 gpu3]$ conda create -n deepmd_gpu deepmd-kit=*=gpu libdeepmd==*gpu lammps cudatoolkit=11.6 horovod -c https://conda.deepmodeling.com -c defaults

ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/main/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e230>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/main/notices.json
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927de70>: Failed to establish a new connection: [Errno -2] Name or service not known')': /notices.json

ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/r/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e1a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/r/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='conda.deepmodeling.com', port=443): Max retries exceeded with url: /notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e200>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: conda.deepmodeling.com url: https://conda.deepmodeling.com/notices.json
done
Channels:
 - https://conda.deepmodeling.com
 - defaults
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): / Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b819d7e0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /linux-64/repodata.json.zst

Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c1f0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /conda-forge/noarch/repodata.json.zst

Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c6d0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /pkgs/r/noarch/repodata.json.zst

failed

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.deepmodeling.com/linux-64/repodata.json
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https//conda.deepmodeling.com/linux-64'
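
The repeated "[Errno -2] Name or service not known" messages indicate that host names could not be resolved on the node where conda ran. A minimal sketch for checking this directly, assuming getent is available on that node (the function name check_hosts is hypothetical):

```shell
# Hypothetical helper: report whether each given host name resolves
# from the current node. Prints "OK <host>" or "FAIL <host>" per host.
check_hosts() {
  for host in "$@"; do
    if getent hosts "$host" > /dev/null 2>&1; then
      echo "OK $host"
    else
      echo "FAIL $host"
    fi
  done
}

# Usage, with the hosts that appear in the log above:
#   check_hosts conda.deepmodeling.com repo.anaconda.com
```

If every host reports FAIL, the node likely has no outbound name resolution, and retrying the same conda command there would not be expected to help.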
