Latest revision as of 05:55, 9 November 2020

The department of Diagnostic Radiology manages five servers colloquially referred to as the "GPU cluster".

They were set up by IT services and handed over in February 2020. The rack is [[Physical Access|physically]] located at The Hong Kong Jockey Club Building For Interdisciplinary Research, 1/F.

== Hardware Specifications ==

The servers are typically called ''nodes''. There are four computing nodes and one storage node.

{| class="wikitable"
|-
! Name !! Public IP address !! Physical CPU cores !! GPU !! RAM (GB) !! Storage (TB) !! Storage mount point !! [[Domain registration|URL]]
|-
| gpu1 || 147.8.193.173 || 16 || V100 (16 GB) x4 || 64 || 1.6 || `/scratch` || NA
|-
| gpu2 || 147.8.193.172 || 16 || V100 (16 GB) x4 || 64 || 1.6 || `/scratch` || NA
|-
| gpu3 || 147.8.193.175 || 16 || V100 (16 GB) x4 || 64 || 1.6 || `/scratch` || NA
|-
| cpu1 || 147.8.193.174 || 80 || NA || 512 || 1.6 || `/scratch` || NA
|-
| storage1 || 147.8.193.171 || 12 || NA || 16 || 100 || exported as NFS to `/home` || `hpc.radiology.hku.hk`
|}

Real-time performance and usage metrics can be found at [[HPC Diagnostics and Statistics]].

== Usage ==

Users are currently expected to use the GPU cluster in the following ways:
# Access to this wiki
# Access to a [[GitLab Service|GitLab]] account for code sharing and collaboration
# [[SSH/SFTP|Shell]] access to the compute nodes
#* Users get their own user-specific home folder.

A few [[guides|introductory guides]] are available to help users. Other software may be installed upon request, but users should note they can manually install any software in their own home directory without needing admin privileges.

{{Note|
* To use the shell access features of the GPU cluster, users need to get a [[User account|server account]].
* All users must be on the HKU network / [https://www.its.hku.hk/documentation/guide/network/remote/hkuvpn2fa VPN] to access anything.|warn}}
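Shell access works over standard SSH. A minimal sketch, assuming your server account name is `userID` (a placeholder; the IP addresses are from the specs table):

```shell
# Log in to gpu1 by IP address (replace userID with your own server account)
ssh userID@147.8.193.173

# The same account and shared home folder are available on every node, e.g. cpu1
ssh userID@147.8.193.174
```

Remember that you must already be on the HKU network or VPN for the connection to go through.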

=== Choosing a server ===

Refer to the [[GPU_Cluster#Hardware_Specifications|specs table]] for IP address information.
`storage1` is not intended for direct shell access, so password login for non-admins is disabled. The storage node transparently makes its storage capacity available to all nodes as the `/home` directory.

==== For running code and scripts ====

Choose cpu1 or any of the 3 GPU servers. Optionally check local resource usage with the <code>top</code>, <code>ps</code> or <code>nvidia-smi</code> commands. Apart from GPU-related differences, the 4 servers should have a similar software stack: various scripting languages (Lua, Python, PHP and Perl), a C/C++ compiler (<code>gcc</code>) and a git client. The home folder comes with a pre-installed Python environment ([[anaconda]]).
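The resource checks mentioned above can be run directly on a node after logging in; `nvidia-smi` is only present on the three GPU servers:

```shell
# Snapshot of CPU and memory usage (batch mode, single iteration)
top -b -n 1 | head -n 15

# The most CPU-hungry processes
ps aux --sort=-%cpu | head -n 6

# GPU utilisation and memory (GPU nodes only; not available on cpu1)
nvidia-smi
```

If a node looks busy, simply pick one of the other compute nodes; your home folder is the same everywhere.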

==== For file transfer/storage ====

It doesn't matter which server you choose, since the user home folder (`/home/[userID]`) is shared across servers. E.g. if you upload a file to `cpu1`, it will also be available when you connect to `gpu2`. Additionally, users may create files and folders in `/home/shared_data` to allow other users to access the same data. To check available storage, see the `df` or `du` commands.
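As a sketch, with `userID` and the file name as placeholders (the IP address is cpu1's, from the specs table):

```shell
# From your local machine: upload a file into your home folder via cpu1
scp ./results.csv userID@147.8.193.174:~/

# On any node afterwards, the file is visible everywhere; check storage with:
df -h /home     # free space on the shared NFS storage
du -sh ~        # total size of your own home folder
```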

If required for performance reasons, users may write in the <code>/scratch</code> directory. Unlike the <code>/home</code> folder, files stored under <code>/scratch</code> are purely '''local to a server'''. As the <code>/scratch</code> folder is accessible to all users on the server, it is advisable to restrict other users' access with <code>chmod 700</code> on your files and subfolders.
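For example, to keep a working directory under `/scratch` private (the directory name here is made up):

```shell
# Create a private working area on the node-local scratch disk
mkdir -p /scratch/$USER/experiment1
chmod -R 700 /scratch/$USER/experiment1

# Verify: only the owner has read/write/execute access (drwx------)
ls -ld /scratch/$USER/experiment1
```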

{{Note|* The `/scratch` folders may be cleared without notice, so do '''not''' use them for long-term storage.|warn}}

== Security ==

Users may have several sets of login credentials:
* One for the Linux shell
* One for this wiki
* One for their [[GitLab Service|GitLab]] account

All the passwords above are stored as salted hashes. This means that, in the event of a data breach, plaintext passwords are not directly exposed. It also means they are not retrievable, not even by admins; password resets are (ideally) handled by automated email.

{{Note|
* Please use a strong password and protect it.
* Non-anonymous patient data should be stored in '''encrypted''' format.|warn}}
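Salted hashing can be illustrated with the standard `openssl passwd` tool (the salts and password below are made up; this is only a demonstration of the concept, not how the cluster services store credentials internally):

```shell
# The same password hashed with two different salts gives two unrelated hashes,
# so identical passwords cannot be spotted in a leaked database
openssl passwd -6 -salt saltAAAA 'correct horse battery staple'
openssl passwd -6 -salt saltBBBB 'correct horse battery staple'
```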

=== Ports ===

Users may run servers listening on ports, e.g. a VNC/X11 server or a Jupyter notebook. Such connections should go through an SSH tunnel, both for security and to get past the firewall.
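A typical tunnel for a Jupyter notebook running on gpu1 might look like this (`userID` and the port number 8888 are only examples; use whatever port your notebook actually listens on):

```shell
# On your local machine: forward local port 8888 to port 8888 on gpu1
ssh -N -L 8888:localhost:8888 userID@147.8.193.173

# Then open http://localhost:8888 in your local browser;
# all notebook traffic travels inside the encrypted SSH connection.
```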

Connections to websites on hpc.radiology.hku.hk should use SSL/TLS (i.e. https in the browser). This means traffic to the (proxy) server is encrypted.

== Administration ==

See [[gpu cluster technical overview|here]] for logs and technical details.