Our cluster combines various hardware resources from multiple universities and other organizations. By default you can only use the production nodes (see the resources page of the portal).
Here’s the full list of taints on the nodes. Some of the matching tolerations are set automatically on deployed jobs, and some can only be used by privileged users. Please refer to this list and don’t set the ones you’re not allowed to use.
|Taint||Purpose||Normal users can set manually|
|nautilus.io/bafna=true:NoSchedule||Private Vineet Bafna’s node.||No|
|nautilus.io/ceph=true:NoSchedule||Don’t run any user jobs on ceph storage nodes||No|
|nautilus.io/ece=true:NoSchedule||Private Dinesh Bharadia nodes||No|
|nautilus.io/haosu=true:NoSchedule||Private Hao Su cluster.||No|
|nautilus.io/large-gpu=true:NoSchedule||Node accepts 4- and 8-GPU jobs only. Set automatically.||No|
|nautilus.io/noceph=true:NoSchedule||Ceph is not working on the node. Otherwise the node is fine.||Yes|
|nautilus.io/science-dmz=true:NoSchedule||Node can only access the science DMZ network, and not the public Internet.||Yes|
|nautilus.io/stashcache=true:NoSchedule||Private OSG nodes||No|
|nautilus.io/testing=true:NoSchedule||Node is broken||No|
|nautilus.io/oldkernel=true:NoSchedule||Kernel is below v4||Yes|
|nvidia.com/gpu=Exists:PreferNoSchedule||Fence GPU nodes from CPU jobs (Preferred! Jobs can still go on the node if there are no free CPU nodes)||No|
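As an illustration of how a user-settable taint from the table can be tolerated, here is a minimal sketch of a pod that tolerates nautilus.io/noceph. The pod name, image, and command are placeholders and not taken from the cluster documentation; the pod is assumed not to mount any Ceph volumes, since Ceph does not work on those nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: noceph-example        # hypothetical name, for illustration only
spec:
  tolerations:
  # Allow scheduling on nodes tainted nautilus.io/noceph=true:NoSchedule
  - key: "nautilus.io/noceph"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: main
    image: busybox            # placeholder image
    command: ["sleep", "3600"]
    # No Ceph-backed volumes are mounted: Ceph is unavailable on these nodes
```

A toleration only permits scheduling on the tainted nodes; it does not force the pod onto them.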
Priorities in group namespaces
Our cluster contains several sets of nodes dedicated to certain groups. All nodes of group group1 are labelled with nautilus.io/group=group1, and some of them are additionally tainted.
Pods in such a group’s namespaces are automatically modified to carry the corresponding toleration, so that they can be scheduled on these nodes. If a user targets ONLY THE GROUP NODES with a nodeAffinity such as:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nautilus.io/group
            operator: In
            values:
            - group1
the pods will also automatically get the higher non-preemptible owner-no-preempt (+10) priority. This means such pods will have priority for scheduling on ALL group nodes (tainted and untainted on the image above). General users have lower priority on untainted nodes and are not able to run on tainted nodes.
Non-preemptible priority means that lower-priority pods will not be killed right away when a higher-priority pod appears, but the higher priority will be taken into account when choosing candidate pods for a free resource.
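For readers unfamiliar with the mechanism, Kubernetes expresses non-preemptible priority through a PriorityClass with preemptionPolicy set to Never. The sketch below is only a guess at what such a class could look like: the name is taken from the text above, but the value of 10 and the description are assumptions, and the cluster's actual definition may differ.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: owner-no-preempt
value: 10                     # assumed value, illustrating the "+10" mentioned above
preemptionPolicy: Never       # pods of this class never evict running lower-priority pods
globalDefault: false
description: "Illustrative sketch: higher scheduling priority without preemption."
```

With preemptionPolicy: Never, pending pods of this class are placed ahead of lower-priority pods in the scheduling queue, but pods that are already running are left alone.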
If your project is related to one of these persons:
|Ken Kreutz-Delgado||Tajana Simunic Rosing||Amit K. Roy Chowdhury||Walid Najjar|
|Nikil Dutt||Trevor Darrell||Lise Getoor||Anshul Kundaje|
|Gary Cottrell||Frank Wuerthwein||Hao Su||Dinesh Bharadia|
|YangQuan Chen||Jeff Krichmar||Charless Fowlkes||Padhraic Smyth|
|James Demmel||Yisong Yue||Shawfeng Dong||Rajesh Gupta|
|Todd Hylton||Falko Kuester||Jurgen Schulze||Arun Kumar|
|Ron Dror||Ravi Ramamoorthi||John Sheppard||Nuno Vasconcelos|
|Ramakrishna Akella||Manmohan Chandraker||Baris Aksanli||Dimitris Achlioptas|
|Ilkay Altintas||Brad Smith||Christopher Paolini||Jerry Sheehan|
(that is, you have specified the person as a PI in the namespace description), it will also be assigned to nodes tainted as Chase-CI (“chaseci”). This gives you more GPU nodes and shorter wait times.
XILINX FPGA nodes
The current installation of Xilinx FPGAs requires running the nodes with an older kernel. Pods scheduled on those nodes need the following toleration:
spec:
  tolerations:
  - key: "nautilus.io/oldkernel"
    operator: "Exists"
    effect: "NoSchedule"
Some nodes in the cluster don’t have access to the public Internet and can only access the educational network. They can still pull images from Docker Hub through a proxy.
If your workload does not use public Internet resources, you can tolerate the nautilus.io/science-dmz taint and get access to additional nodes.
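A toleration for these nodes might look like the following sketch. It assumes the same key, operator, and effect layout as the other tolerations on this page, with the effect taken from the taint table (NoSchedule).

```yaml
spec:
  tolerations:
  # Allow scheduling on nodes that can only reach the science DMZ network
  - key: "nautilus.io/science-dmz"
    operator: "Exists"
    effect: "NoSchedule"
```

Only add this toleration if the workload itself never needs to reach hosts on the public Internet at runtime.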