Am I really using the GPU? If so, why is the wall time longer than on my local machine? #3929
-
It looks like the GPU is not used, and you may first check in
-
(deepmd) [jayaprakash@login01 gpu3]$ conda create -n deepmd_gpu deepmd-kit=*=gpu libdeepmd==*gpu lammps cudatoolkit=11.6 horovod -c https://conda.deepmodeling.com -c defaults
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/main/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e230>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/main/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/r/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e1a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/r/notices.json
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c1f0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /conda-forge/noarch/repodata.json.zst
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c6d0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /pkgs/r/noarch/repodata.json.zst
failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.deepmodeling.com/linux-64/repodata.json
An HTTP error occurred when trying to retrieve this URL.
-
My university HPC has a TensorFlow dependency error. The system admin asked me to install DeePMD-kit on the HPC the same way I did on my local machine. I installed it using Miniforge and used the submission script below.
history:
302 conda create -n deepmd deepmd-kit lammps horovod -c conda-forge
303 conda activate deepmd
(deepmd) [user@login04 gpu]$ cat submit.sh
#!/bin/bash
#SBATCH --job-name=Job #Job name
#SBATCH -N 1 #Number of nodes
#SBATCH --ntasks-per-node=1 #Number of tasks per node
#SBATCH --gres=gpu:2 #Number of GPUs
#SBATCH --error=job.%J.err #Name of error file
#SBATCH --output=job.%J.out #Name of output file
#SBATCH --time=72:00:00 #Time limit for the job
#SBATCH --partition=gpu #specifies queue name(standard is the default partiti>
module load openmpi/4.1.4
conda init bash
conda activate deepmd
export OMP_NUM_THREADS=4
export TF_INTRA_OP_PARALLELISM_THREADS=4
export TF_INTER_OP_PARALLELISM_THREADS=4
mpirun -np 1 dp train input.json > output.txt
(deepmd)[user@login04 gpu]$ dp --version
DeePMD-kit v2.2.10
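Before training starts, it can help to confirm that the compute node itself exposes a GPU to the job. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH of the GPU nodes; `count_visible_gpus` is a hypothetical helper name and is not part of DeePMD-kit or Slurm.

```shell
# Hypothetical pre-flight helper (not part of DeePMD-kit or Slurm).
# Prints the number of GPUs that nvidia-smi can list from the current shell.
# Prints 0 when nvidia-smi is missing or lists nothing, so it is safe to
# call on a CPU-only login node as well.
count_visible_gpus() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo 0
        return 0
    fi
    # `nvidia-smi -L` prints one line per device, starting with "GPU <index>:".
    # grep -c exits non-zero when it counts 0 lines, hence the `|| true`.
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
}

# Possible use inside submit.sh, just before the mpirun line:
#   echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
#   echo "GPUs seen by nvidia-smi: $(count_visible_gpus)"
```

If this reports GPUs on the compute node while the training summary still shows a CPU device, the node is probably fine and the installed TensorFlow build is the more likely place to look.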
From the job error file:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: gpu008
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 4
DEEPMD INFO num_inter_threads: 4
DEEPMD INFO -----------------------------------------------------------------
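The two summary lines that answer the question are `computing device` and `Count of visible GPU`. As a rough sketch, they can be pulled out of the job error file with `sed`; the function names below are hypothetical, and the patterns assume the summary wording shown above.

```shell
# Hypothetical helpers for reading the DeePMD-kit training summary.
# $1 is the path to the job error/log file (for example job.<jobid>.err).

# Print the value after "Count of visible GPU:" (last occurrence in the file).
deepmd_visible_gpus() {
    sed -n 's/.*Count of visible GPU:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Print the value after "computing device:" (last occurrence in the file).
deepmd_device() {
    sed -n 's/.*computing device:[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" | tail -n 1
}
```

For the summary above, these would report `0` visible GPUs and the device `cpu:0`, which means the training ran on the CPU even though the job was scheduled on a GPU node.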
Am I really using the GPU? If so, why is the wall time longer than on my local machine?
Please help me.