Admin guide

From radwiki

Latest revision as of 12:55, 22 May 2022

This page describes the software stack that GPU cluster 1 runs to provide its end-user services.

For management purposes the main components to know about are:

  • Basic Linux sysadmin practices
  • RAID and NFS storage
  • Docker and docker-compose
  • Some HKU networking architecture

Basic cluster management

Currently the cluster is managed using the command line over SSH. A large part of the system was configured by CPOS sysadmins, so there is no exact record of the software stack.

The nodes synchronise user information via NIS. If the NIS client or the server stops, SSH login may fail. The main NIS node is cpu1, so all new user accounts should be created on that node. Common tasks include creating users, setting their password and group, and changing file permissions.
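As an illustration of the user-creation task, here is a minimal sketch to run on cpu1. The username alice and the group students are placeholders. The final map rebuild assumes cpu1 is a standard NIS master with its maps under /var/yp, which this page does not confirm. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: create a user on the NIS master (cpu1) and publish the change.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_cluster_user() {
    # Usage: create_cluster_user <username> <groupname>
    run sudo useradd -m -G "$2" "$1"   # -m creates the home directory, -G adds a supplementary group
    run sudo passwd "$1"               # prompts for the new password
    run sudo make -C /var/yp           # assumption: rebuilds the NIS maps on a standard NIS master
}

create_cluster_user alice students
```

Check the printed commands first, then rerun with DRY_RUN=0 once they look right.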

Installation log

  • X11/Xfce on cpu1 as part of a TigerVNC installation
  • MATLAB on cpu1

Setup VNC

create vncconfig in /etc/systemd/system (optional, for permanent server)
as user run vncpasswd
cp /etc/X11/Xresources ~/.Xresources
cp /home/utility/xstartup ~/.vnc/xstartup  (for custom xfce)
systemctl daemon-reload && systemctl start vncservice@:N.service && systemctl enable vncservice@:N.service (optional, for permanent server)

Use an SSH tunnel, then connect on port 5900+N, where N is the VNC display number. Optionally try rm -R /tmp/.X* to clear stale lock files. Consider restricting the number of connections per user.
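The tunnel step could look roughly like the sketch below, run on the client machine. The display number 2, the login alice and the host cpu1 are placeholders for illustration. The sketch only prints the ssh command so that it can be checked before use.

```shell
#!/bin/sh
# Sketch: work out the VNC port for display :N and print the matching SSH tunnel command.

vnc_port() {
    # A VNC server on display :N listens on TCP port 5900 + N.
    echo $((5900 + $1))
}

vnc_tunnel_cmd() {
    # Usage: vnc_tunnel_cmd <display> <user> <host>
    port="$(vnc_port "$1")"
    # -N: run no remote command; -L: forward a local port to the same port on the remote side.
    echo "ssh -N -L ${port}:localhost:${port} $2@$3"
}

vnc_tunnel_cmd 2 alice cpu1
```

With the tunnel running, point the VNC viewer at localhost on the same port.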

Common admin problems

Locked out of server

When adjusting iptables, first schedule a timed reset of the rules using cron or at, so that a bad rule cannot lock you out permanently. In the worst case you will need physical access. If you haven't touched iptables, the lockout could be caused by the NIS server. Also make sure firewalld stays disabled.
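One way to arrange the timed reset is sketched below, using at. The backup path /root/iptables.backup and the 10-minute delay are arbitrary choices, and the functions are meant to be run as root. The sketch only defines the helpers and changes nothing until safe_iptables_edit is called.

```shell
#!/bin/sh
# Sketch: save the current rules and queue an automatic restore before editing iptables.

rollback_job() {
    # Print the command that the at job will run to restore the saved rules.
    echo "iptables-restore < $1"
}

safe_iptables_edit() {
    # Usage: safe_iptables_edit [backup file] [minutes until rollback]
    backup="${1:-/root/iptables.backup}"
    minutes="${2:-10}"
    iptables-save > "$backup"
    # at reads the job from standard input and prints a job number.
    rollback_job "$backup" | at now + "$minutes" minutes
}

# Example (run as root): safe_iptables_edit /root/iptables.backup 10
# Once you have confirmed you can still log in, cancel the job with: atrm <job number>
```

If the new rules do lock you out, the saved rules come back when the job fires and SSH access should return.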

Lost a docker container or volume

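Keep in mind docker-compose down stops and removes the containers rather than only stopping them. Data in anonymous volumes is then no longer attached to any container, and with the -v option the volumes are deleted as well. Likewise, running a container with the --rm option removes its anonymous volumes when the container exits.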
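To see which volumes are still on disk but no longer referenced by any container, a sketch like the following may help. It relies on Docker naming anonymous volumes with a 64-character hexadecimal ID. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch: list dangling volumes and mark which ones are anonymous.

is_anonymous_volume() {
    # Anonymous volumes are named with a 64-character lowercase hex ID.
    [ "${#1}" -eq 64 ] || return 1
    case "$1" in
        *[!0-9a-f]*) return 1 ;;
    esac
    return 0
}

list_dangling_volumes() {
    # dangling=true selects volumes that no container references.
    sudo docker volume ls --quiet --filter dangling=true | while read -r name; do
        if is_anonymous_volume "$name"; then
            echo "anonymous  $name"
        else
            echo "named      $name"
        fi
    done
}

# Example: list_dangling_volumes
```

A dangling volume that still holds wanted data can be mounted into a new container by name. For anything worth keeping, declare a named volume or bind mount in the compose file.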

Docker

Try to use Docker for non-trivial services. Using docker containers allows easier setup, per-user configuration, scaling, and cleanup. The learning curve for basic usage is not steep, and Docker is a useful skill in software development.

Most docker services are split into multiple containers that talk to each other, with one running the service webserver that faces end-users. Docker-compose is a useful tool to configure and start/stop these sets of containers.
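A typical start/stop cycle for such a set of containers might look like the sketch below. The project directory /srv/wiki is hypothetical and is assumed to contain a docker-compose.yml. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: wrap docker-compose so every call targets one service's compose file.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

compose() {
    # Usage: compose <project dir> <docker-compose arguments...>
    dir="$1"
    shift
    run sudo docker-compose -f "$dir/docker-compose.yml" "$@"
}

compose /srv/wiki up -d   # create and start all containers of the service in the background
compose /srv/wiki ps      # show their status
compose /srv/wiki stop    # stop them but keep the containers and their volumes
```

Using stop rather than down for routine restarts keeps the containers and any anonymous volumes in place.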

To list all docker containers on a host, including stopped ones:

  sudo docker container ls --all --format 'table {{.Names}}\t{{.Image}}\t{{.Ports}}'

Services

In short, the major additions we've installed since the handover are:

  • Gitlab omnibus edition as a container on storage1
  • Mediawiki with mariadb as a container on storage1
  • Nginx reverse proxy to access the various web services as a container on storage1
  • Rstudio server as a container on cpu1

Currently we think it is best to run gitlab and wiki servers locally on storage1. New services would have to be evaluated on a per-case basis, as the 12 GB / 12-core hardware might not be enough to host everything.

HKU Network architecture

HKU ITS have registered the domain hpc.radiology.hku.hk to the storage1 IP, so storage1 hosts a reverse proxy (nginx container) that uses path-based routing to reach the upstream services, i.e. hpc.radiology.hku.hk/<servicename>.

Proxy upstreams include services like gitlab, mediawikis, etc.
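
To check that the proxy answers for a given service path, something like the sketch below could be used. The http scheme and the path name gitlab are assumptions, since the actual service names and TLS setup are not listed on this page.

```shell
#!/bin/sh
# Sketch: build the proxied URL for a service and fetch its HTTP status line.

service_url() {
    # Path-based routing: every service lives under its own path on the same domain.
    echo "http://hpc.radiology.hku.hk/$1"
}

check_service() {
    # --head requests headers only; the first line of the reply is the status line.
    curl --silent --head --max-time 5 "$(service_url "$1")" | head -n 1
}

service_url gitlab
# Example: check_service gitlab
```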