Running Jobs on Hopper HEAD NODES
Head Nodes Usage Policy
The head (login) nodes are shared by all account holders on the cluster and may be used only for submitting jobs and for light development, testing, or debugging. Be mindful of how you use them: any long-running, memory-intensive, or compute-intensive job run directly on a head node is considered a policy violation.
ARBITER2 for monitoring Head Node Usage
To improve the fairness and efficiency of resource allocation, we have installed Arbiter2 on the head nodes. Arbiter2 helps us monitor and enforce the resource usage policy by managing the quotas assigned to each user on the Hopper head nodes, so that the head nodes stay responsive for everyone. Currently, each logged-in user gets a default quota of 4 CPU cores and 8 GB of RAM.
Usage Violations Details
- All users start in the "normal" status with their default resource quotas (4 CPUs and 8 GB RAM).
- First Violation: Upon the first violation of the resource quotas, the user's quota is reduced to 80% of normal for 30 minutes. After this timeout, the quota automatically reverts to the normal status.
- Second Violation within 3 Hours: If a user violates the quotas again within 3 hours of their previous violation, their quota is reduced to 50% of normal for 1 hour. Once the penalty period elapses, the quota returns to normal.
- Third Violation within 3 Hours: If a user violates the quotas a third time within 3 hours of their previous violation, their quota is reduced to 30% of normal for 2 hours. After this period, the quota is restored to the normal status.
- Subsequent Violations: Any subsequent violation within 3 hours of the previous violation places the user back at the 30% quota, without further timeout.
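The escalation steps above can be summarized as a small lookup. The following Python sketch is only an illustration of the policy as described here, not Arbiter2's own code or configuration. It assumes that each percentage is the fraction of the normal quota that remains available during the penalty, and the names `NORMAL_QUOTA`, `PENALTIES`, and `penalized_quota` are hypothetical.

```python
# Illustrative only: values are copied from the policy text above.
# Assumption: each percentage is the share of the normal quota that remains.
NORMAL_QUOTA = {"cpus": 4, "mem_gb": 8}

# (fraction of normal quota, penalty duration in minutes) per violation level
PENALTIES = [
    (0.8, 30),   # first violation
    (0.5, 60),   # second violation within 3 hours
    (0.3, 120),  # third violation within 3 hours
]


def penalized_quota(violation_count):
    """Return (cpus, mem_gb, minutes) after the nth violation in the 3-hour window.

    minutes is 0 in the normal status and None for violations beyond the
    third, because the policy text gives no duration for that case.
    """
    if violation_count < 1:
        return NORMAL_QUOTA["cpus"], NORMAL_QUOTA["mem_gb"], 0
    level = min(violation_count, len(PENALTIES))
    fraction, minutes = PENALTIES[level - 1]
    if violation_count > len(PENALTIES):
        minutes = None
    cpus = round(NORMAL_QUOTA["cpus"] * fraction, 1)
    mem_gb = round(NORMAL_QUOTA["mem_gb"] * fraction, 1)
    return cpus, mem_gb, minutes


if __name__ == "__main__":
    for n in range(5):
        print(n, penalized_quota(n))
```

Under these assumptions, a second violation within the window would leave a user with 2 CPU cores and 4 GB of RAM for one hour.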
Arbiter2's Badness Score:
Arbiter utilizes a "badness" score to determine whether a user's resource usage warrants a violation. This means that minor spikes in resource consumption will not immediately trigger penalties. The badness score accumulates over time based on resource usage patterns.
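To see why a short spike does not trigger a penalty while sustained heavy usage does, consider the toy model below. It is not Arbiter2's actual formula: the function `simulate_badness` and the parameters `threshold`, `gain`, `decay`, and `limit` are hypothetical values chosen only to show a score that builds up during heavy usage and drains otherwise.

```python
def simulate_badness(usage_fractions, threshold=1.0, gain=10.0,
                     decay=5.0, limit=100.0):
    """Toy badness model (hypothetical, not Arbiter2's real scoring).

    usage_fractions: per-interval usage relative to the user's quota
    (1.0 means the full quota is in use). The score rises by gain * usage
    in intervals at or above the threshold, falls by decay otherwise, and
    is clamped to the range [0, limit]. Returns the score after each interval.
    """
    score = 0.0
    history = []
    for usage in usage_fractions:
        if usage >= threshold:
            score += gain * usage
        else:
            score -= decay
        score = max(0.0, min(limit, score))
        history.append(score)
    return history


if __name__ == "__main__":
    # A brief spike: two busy intervals surrounded by light usage.
    spike = [0.2] * 3 + [1.0] * 2 + [0.2] * 5
    # Sustained heavy usage: the quota is saturated for twelve intervals.
    sustained = [1.0] * 12
    print("spike peak:", max(simulate_badness(spike)))
    print("sustained final:", simulate_badness(sustained)[-1])
```

In this sketch the spike peaks well below the limit and drains back to zero, whereas the sustained load reaches the limit, which is the point at which a violation would be recorded in this toy model.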
IMPORTANT NOTE: Arbiter does not terminate processes; rather, it enforces resource quotas by throttling the user's resource allocation. Users who violate the head-node policy will receive an email with details so that they can submit their jobs through SLURM instead.
It is crucial that we all adhere to the established usage policies to maintain a smooth and efficient computing environment for everyone.