Install
To install a new node or reinstall an existing one:
If the node is already in the cluster, first make sure it is not part of Ceph and is not acting as a gateway (MetalLB, ingress, etc.). Ceph nodes can only be taken out one at a time, allowing Ceph time to recover after each node is brought back.
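One way to check is to list the pods scheduled on the node and look for Ceph, MetalLB, or ingress components (a sketch; adjust the grep pattern to the workloads in your cluster):
kubectl get pods -A -o wide --field-selector spec.nodeName=<node> | grep -Ei 'ceph|metallb|ingress'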
Find the network settings: IP, subnet, gateway, DNS (if not Google or Cloudflare)
Note the current disk setup, in particular whether the node has matching OS drives for an MD RAID mirror
Drain the node
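For example (node name is a placeholder; --ignore-daemonsets lets the drain proceed past DaemonSet pods, and --delete-emptydir-data allows evicting pods that use emptyDir volumes):
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data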
Log in to the node’s IPMI screen
Attach the Ubuntu 22.04 image via virtual media
Reboot the node
Trigger the boot menu (usually F10), choose to boot from virtual media
Start the install with media check off
Agree to everything it asks
Set up the network:
DNS can be 1.1.1.1, 8.8.8.8
Disable unused networks
You can use a subnet calculator to figure out the subnet
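If you only have the IP and netmask, a quick way to get the CIDR form the installer expects is Python's ipaddress module (the address and mask below are illustrative):
python3 -c 'import ipaddress; print(ipaddress.ip_network("192.0.2.17/255.255.255.192", strict=False))'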
For disks: if the node has an OS drive mirror, use the custom storage layout:
Delete all existing MD arrays
Click the drives you’re going to use and choose reformat
Add unformatted GPT partitions to the drives
Create an MD array with those partitions
For the 2nd disk choose “Add as another boot device”
Create an ext4 GPT partition on the created MD array
Proceed with the installation (the mirror can be verified after the first boot, as shown below)
For username choose nautilus
Choose to install SSH server, optionally import key from GitHub
Don’t install any additional packages
At the end, disconnect the virtual media and reboot
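After the reboot, if you configured an OS drive mirror, you can confirm it came up with standard tools (device names will vary by hardware):
cat /proc/mdstat
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT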
After the node boots, make the nautilus user a passwordless sudoer: run sudo visudo and change the %sudo line to %sudo ALL=(ALL:ALL) NOPASSWD:ALL
Add mtu: 9000 to /etc/netplan/00-installer-config.yaml and run netplan apply. The mtu setting goes under the ethernets device.
Steps for NRP Administrators
The steps below are meant for NRP administrators and do not need to be performed by site system administrators.
Make changes to the Ansible inventory file if needed. The node should be in the proper region and zone section, with zone labels added.
Generate a join_token by logging into the controller and running:
kubeadm token create
Run the Ansible playbook according to the docs:
ansible-playbook setup.yml -l <node> -e join_token=...
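Once the playbook finishes, a quick sanity check that the node joined and is Ready (node name is a placeholder):
kubectl get node <node> -o wide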
Labels added by Ansible:
Check that proper labels were added by Ansible:
host-endpoint: "true"
mtu: "9000"
nautilus.io/network: "10000" - network speed (10000/40000/100000) (needed for perfSONAR MaDDash)
netbox.io/site: UNL - slug of the NetBox site (the site should already exist)
topology.kubernetes.io/region: us-central - region (us-west, us-east, etc)
topology.kubernetes.io/zone: unl - zone
To set all labels run:
kubectl label node node_name nautilus.io/network="10000" netbox.io/site="UNL" topology.kubernetes.io/region="us-central" topology.kubernetes.io/zone="unl"
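To double-check the result:
kubectl get node node_name --show-labels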
Cluster firewall
The node’s CIDR should be in the https://gitlab.nrp-nautilus.io/prp/calico/-/blob/master/networksets.yaml list for the node to be accessible by other cluster nodes.
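For orientation, a Calico network set entry generally has this shape (kind, name, and CIDR below are illustrative; follow the format of the existing entries in the repository):
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: example-site
spec:
  nets:
    - 192.0.2.0/24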
Verifying the connectivity
New zone connectivity
When deploying a node in a new zone, ensure that the zone has properly configured WAN access. Wait for it to appear in perfSONAR and for the corresponding row and column in every dashboard to turn green for this zone.
It might take up to 6 hours for the node to appear and be fully connected to the mesh.
New node in existing zone
Test the new node’s storage access by running the storage test task from the Ansible playbook.
./run.sh storage-test <new_node>
All tests should succeed.
Manually checking the node connectivity
Using the pstest and pstesth commands from the Ansible zsh script, log in to the perfSONAR pod on the new node and on another node that you know is working properly (pstesth uses the host network, pstest uses the overlay; both should work fine).
Run the iperf3 test between the two nodes.
iperf3 -s
iperf3 -c <ip_of_node_above>
The throughput should be close to the node’s maximum NIC bandwidth. Throughput in the megabits range usually means there is a firewall on the path; no data passing at all usually indicates MTU or other network issues.
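If you suspect MTU problems, a quick check is ping with fragmentation disabled (8972 = 9000 bytes minus 28 bytes of IP and ICMP headers; the target IP is a placeholder):
ping -M do -s 8972 <ip_of_node_above>
If jumbo frames are not passing end to end, these pings will fail while smaller ones (e.g. -s 1472) succeed.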
