GPU Cluster
The Department of Diagnostic Radiology manages five servers colloquially referred to as the "GPU cluster".
They were set up by IT Services and handed over in February 2020. The rack is physically located at the Hong Kong Jockey Club Building for Interdisciplinary Research, 1/F.
Hardware Specifications
There are four computing nodes, and one storage node that shares storage across all nodes.
Name | Public IP address | Physical CPU cores | GPU | RAM (GB) | Storage (TB) | Storage mount point | URL |
---|---|---|---|---|---|---|---|
gpu1 | 147.8.193.173 | 16 | V100 (16G) x4 | 64 | 1.6 | /scratch | NA |
gpu2 | 147.8.193.172 | 16 | V100 (16G) x4 | 64 | 1.6 | /scratch | NA |
gpu3 | 147.8.193.175 | 16 | V100 (16G) x4 | 64 | 1.6 | /scratch | NA |
cpu1 | 147.8.193.174 | 80 | NA | 512 | 1.6 | /scratch | NA |
storage1 | 147.8.193.171 | 12 | NA | 16 | 100 | exported as NFS to /home | hpc.radiology.hku.hk |
Usage
Users are currently expected to use the GPU cluster in the following ways:
- Access to this wiki
- Access to a GitLab account for code sharing and collaboration
- Shell access to the compute nodes
  - Users get their own user-specific home folder.
  - The home folder comes with a pre-installed Python environment (Anaconda), which can in turn be used to host a Jupyter Notebook server.
  - Other software may be installed upon request.
- To use the shell access features of the GPU cluster, users need to obtain a server account.
- All users must be on the HKU network / VPN to access anything.
Choosing a server
Refer to the specs table for IP address information. `storage1` is not intended for direct shell access, so password login for non-admins is disabled.
For running code and scripts
Choose `cpu1` or any of the three GPU servers. Optionally, check local resource usage with the `top`, `ps`, or `nvidia-smi` commands.
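As a quick sketch, those checks can be combined into a one-off snapshot like the following (`nvidia-smi` is only present on the GPU nodes, so it is guarded):

```shell
# One snapshot of overall load (batch mode, single iteration)
top -b -n 1 | head -n 5

# Your own processes, with CPU and memory usage
ps -u "$(id -un)" -o pid,pcpu,pmem,comm

# GPU utilisation, only where an NVIDIA driver is installed
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi
else
  echo "no NVIDIA GPU on this node"
fi
```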
Apart from GPU-related differences, the four servers should have a similar software stack. Installed software includes various scripting languages (Lua, Python, PHP, Perl), a C/C++ compiler (`gcc`), and a Git client.
For file transfer/storage
It doesn't matter which server you choose, since the user home folder (`/home/[userID]`) is shared across servers. For example, if you upload a file to `cpu1`, it will also be available when you connect to `gpu2`. Additionally, users may create files and folders in `/home/shared_data` to allow other users to access the same data. To check available storage, use the `df` or `du` commands.
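As a sketch, the workflow might look like the following; the user ID `myuser` and the file names are placeholders, and the `scp` line is commented out since it requires a real account:

```shell
# Upload from your own machine to your shared home folder via cpu1;
# because /home is shared, the file is then visible from every node.
# scp ./results.tar.gz myuser@147.8.193.174:/home/myuser/

# Free space on the filesystem backing your home folder
df -h "$HOME"

# Size of one specific folder
mkdir -p "$HOME/example_dir"
du -sh "$HOME/example_dir"
```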
If required for performance reasons, users may write to the `/scratch` directory. Unlike the `/home` folder, files stored under `/scratch` are purely local to each server. As the `/scratch` folder is accessible to all users on a server, it is advisable to restrict access by running `chmod 700` on your files and subfolders. `/scratch` folders may be cleared without notice, so do not use them for long-term storage.
Security
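A minimal sketch of locking down a private working directory (the path under `/scratch` is a placeholder; a temporary directory stands in here so the commands run anywhere):

```shell
# On the cluster this would be e.g. /scratch/myuser/experiment1
workdir="${TMPDIR:-/tmp}/my_experiment"
mkdir -p "$workdir"

# Owner gets read/write/execute; group and others get nothing
chmod 700 "$workdir"

# Verify: the listing should start with drwx------
ls -ld "$workdir"
```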
Users may have several sets of login credentials:
- One for the Linux shell
- One for this wiki
- One for their GitLab account
All of the above passwords are stored as salted hashes, so in the event of a data breach, plaintext passwords would not be directly exposed. It also means passwords cannot be retrieved, not even by admins. Password resets are (ideally) handled by automated email.
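To illustrate the idea (a sketch of the general technique, not necessarily the cluster's exact mechanism): the same password hashed with two different salts produces two unrelated hashes, and neither can be reversed back to the password:

```shell
# SHA-512 crypt hashes of the same password with two different salts
openssl passwd -6 -salt saltone 'example-password'
openssl passwd -6 -salt salttwo 'example-password'
```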
Ports
Users may run servers listening on ports, e.g. a VNC/X11 server or a Jupyter Notebook. Such connections should go through an SSH tunnel, both for security and to pass the firewall.
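As a sketch, a tunnel for a Jupyter Notebook assumed to listen on port 8888 on `gpu1` could be built like this; `myuser` is a placeholder, and the command is echoed rather than executed since it needs a real account:

```shell
user="myuser"          # placeholder: your server account
host="147.8.193.173"   # gpu1, from the specs table
port=8888              # Jupyter's default port

# -N: no remote command; -L: forward a local port to the remote port.
# After running the printed command on your own machine (HKU network
# or VPN required), open http://localhost:8888 in a local browser.
echo ssh -N -L "${port}:localhost:${port}" "${user}@${host}"
```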
Connections to websites on hpc.radiology.hku.hk should use SSL/TLS (i.e. https in the browser). This means traffic to the (proxy) server is encrypted.
Technical Overview
This section contains information that may be useful for admins.
- All GPU cluster systems run on CentOS 7.
- User creation is done on cpu1 and automatically synchronised across servers via NIS.
- Normal users do not have package manager installation privileges.
- The storage from storage1 is mounted as an NFS share at /home on the other four servers.
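For reference, the client-side entry in `/etc/fstab` on the compute nodes might look roughly like this (a sketch; the actual export path and mount options are not documented here):

```
# storage1 exports its storage; clients mount it at /home
147.8.193.171:/home  /home  nfs  defaults  0 0
```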
- The root password is not available; current sudoers are hpcadmin, itsupport, richard, and jurgen.
Known issues
See here for a to-do list of things that should be fixed.
Reverse proxy
- Running as a container on `storage1`, redirecting HTTP to HTTPS.
- TLS certificates (Let's Encrypt CA) are scripted to auto-renew.
- proxy overview
GitLab Service
- Running as 3 containers on `storage1`, behind the proxy.
- Currently not blocking non-department users: registration is set to require an email with the hku.hk domain, but no confirmation system is active.
Mediawiki Service
- Running as 2 containers on `storage1`, behind the proxy.
- Currently not blocking non-department users; open to anonymous editors.
- Several extensions and templates installed.
Emails
Todo. Required for password resets and admin alerts.
Backups
Todo.
- Updated tar to v1.32.
Firewall
- Disabled firewalld. Using iptables defaults from IT services.
- Confirmed all HKU IPs (public and private IP ranges) are whitelisted.
- In addition to the defaults, our opened ports are 22, 80, 443, and 9171.
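A quick way to cross-check which TCP ports a node is actually listening on (needs no root):

```shell
# -t TCP sockets, -l listening state only, -n numeric ports
ss -tln
```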