Class: OodCore::Job::Adapters::Slurm::Batch Private
- Inherits: Object
  - Object
  - OodCore::Job::Adapters::Slurm::Batch
- Defined in: lib/ood_core/job/adapters/slurm.rb
Overview
This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.
Object used for simplified communication with a Slurm batch server
Defined Under Namespace
Classes: Error, SlurmTimeoutError
Constant Summary
- UNIT_SEPARATOR = "\x1F"
  This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
- RECORD_SEPARATOR = "\x1E"
  This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Instance Attribute Summary
- #bin ⇒ Pathname readonly private
  The path to the Slurm client installation binaries.
- #bin_overrides ⇒ Object readonly private
  Optional overrides for Slurm client executables.
- #cluster ⇒ String? readonly private
  The cluster of the Slurm batch server.
- #conf ⇒ Pathname? readonly private
  The path to the Slurm configuration file.
- #strict_host_checking ⇒ Bool readonly private
  Whether to use strict host checking when connecting to submit_host over ssh.
- #submit_host ⇒ String readonly private
  The login node where the job is submitted via ssh.
Instance Method Summary
- #accounts ⇒ Object private
- #all_sinfo_node_fields ⇒ Object private
- #all_squeue_fields ⇒ Object private
  Fields requested from a formatted `squeue` call. Note that the order of these fields is important.
- #delete_job(id) ⇒ void private
  Delete a specified job from the batch server.
- #get_cluster_info ⇒ ClusterInfo private
  Get a ClusterInfo object containing information about the given cluster.
- #get_jobs(id: "", owner: nil, attrs: nil) ⇒ Array<Hash> private
  Get a list of hashes detailing each of the jobs on the batch server.
- #hold_job(id) ⇒ void private
  Put a specified job on hold.
- #initialize(cluster: nil, bin: nil, conf: nil, bin_overrides: {}, submit_host: "", strict_host_checking: true) ⇒ Batch constructor private
  A new instance of Batch.
- #nodes ⇒ Object private
- #queues ⇒ Object private
- #release_job(id) ⇒ void private
  Release a specified job that is on hold.
- #sacct_info(job_ids, states, from, to, show_steps) ⇒ Object private
- #sacct_info_fields ⇒ Object private
  Job info fields requested from a formatted `sacct` call.
- #squeue_args(id: "", owner: nil, options: []) ⇒ Object private
  TODO: write some barebones test for this? like 2 options and id or no id.
- #squeue_fields(attrs) ⇒ Object private
- #squeue_required_fields ⇒ Object private
- #submit_string(str, args: [], env: {}) ⇒ String private
  Submit a script expanded as a string to the batch server.
Constructor Details
#initialize(cluster: nil, bin: nil, conf: nil, bin_overrides: {}, submit_host: "", strict_host_checking: true) ⇒ Batch
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Returns a new instance of Batch.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 100
    def initialize(cluster: nil, bin: nil, conf: nil, bin_overrides: {}, submit_host: "", strict_host_checking: true)
      @cluster = cluster && cluster.to_s
      @conf = conf && Pathname.new(conf.to_s)
      @bin = Pathname.new(bin.to_s)
      @bin_overrides = bin_overrides
      @submit_host = submit_host.to_s
      @strict_host_checking = strict_host_checking
    end
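The coercions in the constructor can be tried in isolation. The following is a minimal sketch, not part of ood_core: `normalize_batch_options` is a hypothetical helper that mirrors only the `cluster`, `conf`, `bin` and `submit_host` handling shown above, and the values `cluster_a` and `/usr/bin` are made up for illustration.

```ruby
require "pathname"

# Hypothetical helper mirroring the coercions in Batch#initialize.
def normalize_batch_options(cluster: nil, bin: nil, conf: nil, submit_host: "")
  {
    cluster: cluster && cluster.to_s,       # nil stays nil, anything else becomes a String
    conf: conf && Pathname.new(conf.to_s),  # nil stays nil, anything else becomes a Pathname
    bin: Pathname.new(bin.to_s),            # nil becomes an empty Pathname
    submit_host: submit_host.to_s           # always a String
  }
end

opts = normalize_batch_options(cluster: :cluster_a, bin: "/usr/bin")
puts opts[:cluster]      # prints "cluster_a"
puts opts[:bin]          # prints "/usr/bin"
puts opts[:conf].inspect # prints "nil"
```

Note the asymmetry: an omitted `conf` stays `nil`, while an omitted `bin` becomes an empty `Pathname` rather than `nil`.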
Instance Attribute Details
#bin ⇒ Pathname (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The path to the Slurm client installation binaries
    # File 'lib/ood_core/job/adapters/slurm.rb', line 69
    def bin
      @bin
    end
#bin_overrides ⇒ Object (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Optional overrides for Slurm client executables
    # File 'lib/ood_core/job/adapters/slurm.rb', line 75
    def bin_overrides
      @bin_overrides
    end
#cluster ⇒ String? (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The cluster of the Slurm batch server
    # File 'lib/ood_core/job/adapters/slurm.rb', line 57
    def cluster
      @cluster
    end
#conf ⇒ Pathname? (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The path to the Slurm configuration file
    # File 'lib/ood_core/job/adapters/slurm.rb', line 63
    def conf
      @conf
    end
#strict_host_checking ⇒ Bool (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Whether to use strict host checking when connecting to submit_host over ssh
    # File 'lib/ood_core/job/adapters/slurm.rb', line 85
    def strict_host_checking
      @strict_host_checking
    end
#submit_host ⇒ String (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The login node where the job is submitted via ssh
    # File 'lib/ood_core/job/adapters/slurm.rb', line 80
    def submit_host
      @submit_host
    end
Instance Method Details
#accounts ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 183
    def accounts
      user = Etc.getlogin
      args = ['-nP', 'show', 'users', 'withassoc', 'format=account,cluster,partition,qos', 'where', "user=#{user}"]

      [].tap do |accts|
        call('sacctmgr', *args).each_line do |line|
          acct, cluster, queue, qos = line.split('|')
          next if acct.nil? || acct.chomp.empty?

          args = {
            name: acct,
            qos: qos.to_s.chomp.split(','),
            cluster: cluster,
            queue: queue.to_s.empty? ? nil : queue
          }
          info = OodCore::Job::AccountInfo.new(**args) unless acct.nil?
          accts << info unless acct.nil?
        end
      end
    end
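To see what the per-line parsing in `#accounts` yields, here is a small standalone sketch. `parse_account_line` is a hypothetical helper (not in ood_core) that returns a plain Hash instead of an `OodCore::Job::AccountInfo`, and the sample line is invented to match the pipe-delimited `format=account,cluster,partition,qos` layout requested above.

```ruby
# Hypothetical helper: parse one pipe-delimited sacctmgr line into a Hash.
def parse_account_line(line)
  acct, cluster, queue, qos = line.split("|")
  return nil if acct.nil? || acct.chomp.empty?  # skip blank lines

  {
    name: acct,
    qos: qos.to_s.chomp.split(","),             # comma-separated QOS list, newline removed
    cluster: cluster,
    queue: queue.to_s.empty? ? nil : queue      # empty partition column becomes nil
  }
end

# Invented sample: account "proj01" on "cluster_a", no partition, two QOS values.
info = parse_account_line("proj01|cluster_a||normal,debug\n")
puts info[:name]          # prints "proj01"
puts info[:qos].join(" ") # prints "normal debug"
puts info[:queue].inspect # prints "nil"
```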
#all_sinfo_node_fields ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 376
    def all_sinfo_node_fields
      {
        procs: '%c',
        name: '%n',
        features: '%f'
      }
    end
#all_squeue_fields ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Fields requested from a formatted `squeue` call. Note that the order of these fields is important.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 272
    def all_squeue_fields
      {
        account: "%a",
        job_id: "%A",
        exec_host: "%B",
        min_cpus: "%c",
        cpus: "%C",
        min_tmp_disk: "%d",
        nodes: "%D",
        end_time: "%e",
        dependency: "%E",
        features: "%f",
        array_job_id: "%F",
        group_name: "%g",
        group_id: "%G",
        over_subscribe: "%h",
        sockets_per_node: "%H",
        array_job_task_id: "%i",
        cores_per_socket: "%I",
        job_name: "%j",
        threads_per_core: "%J",
        comment: "%k",
        array_task_id: "%K",
        time_limit: "%l",
        time_left: "%L",
        min_memory: "%m",
        time_used: "%M",
        req_node: "%n",
        node_list: "%N",
        command: "%o",
        contiguous: "%O",
        qos: "%q",
        partition: "%P",
        priority: "%Q",
        reason: "%r",
        start_time: "%S",
        state_compact: "%t",
        state: "%T",
        user: "%u",
        user_id: "%U",
        reservation: "%v",
        submit_time: "%V",
        wckey: "%w",
        licenses: "%W",
        excluded_nodes: "%x",
        core_specialization: "%X",
        nice: "%y",
        scheduled_nodes: "%Y",
        sockets_cores_threads: "%z",
        work_dir: "%Z",
        gres: "%b", # must come at the end to fix a bug with Slurm 18
      }
    end
#delete_job(id) ⇒ void
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
This method returns an undefined value.
Delete a specified job from the batch server

    # File 'lib/ood_core/job/adapters/slurm.rb', line 254
    def delete_job(id)
      call("scancel", id.to_s)
    end
#get_cluster_info ⇒ ClusterInfo
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Get a ClusterInfo object containing information about the given cluster
    # File 'lib/ood_core/job/adapters/slurm.rb', line 111
    def get_cluster_info
      node_cpu_info = call("sinfo", "-aho %A/%D/%C").strip.split('/')
      gres_length = call("sinfo", "-o %G").lines.map(&:strip).map(&:length).max + 2
      gres_lines = call("sinfo", "-ahNO ,nodehost,gres:#{gres_length},gresused:#{gres_length}")
                   .lines.uniq.map(&:split)
      ClusterInfo.new(active_nodes: node_cpu_info[0].to_i,
                      total_nodes: node_cpu_info[2].to_i,
                      active_processors: node_cpu_info[3].to_i,
                      total_processors: node_cpu_info[6].to_i,
                      active_gpus: gres_lines.sum { |line| Slurm.gpus_from_gres(line[2]) },
                      total_gpus: gres_lines.sum { |line| Slurm.gpus_from_gres(line[1]) }
      )
    end
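The indices 0, 2, 3 and 6 used above follow from the `sinfo` format string: `%A` prints allocated/idle node counts, `%D` the total node count, and `%C` allocated/idle/other/total CPU counts, so splitting on `/` gives seven values. A minimal sketch of that step, with a hypothetical helper name and invented numbers (the GPU handling via `Slurm.gpus_from_gres` is not reproduced here):

```ruby
# Hypothetical helper: split `sinfo -aho %A/%D/%C` output into named counts.
# Positions: 0 alloc nodes, 1 idle nodes, 2 total nodes,
#            3 alloc CPUs, 4 idle CPUs, 5 other CPUs, 6 total CPUs.
def parse_node_cpu_info(raw)
  parts = raw.strip.split("/")
  {
    active_nodes: parts[0].to_i,
    total_nodes: parts[2].to_i,
    active_processors: parts[3].to_i,
    total_processors: parts[6].to_i
  }
end

# Invented sample: 12 of 16 nodes and 480 of 640 CPUs allocated.
info = parse_node_cpu_info("12/4/16/480/160/0/640\n")
puts info[:active_nodes]     # prints 12
puts info[:total_processors] # prints 640
```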
#get_jobs(id: "", owner: nil, attrs: nil) ⇒ Array<Hash>
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Get a list of hashes detailing each of the jobs on the batch server
    # File 'lib/ood_core/job/adapters/slurm.rb', line 147
    def get_jobs(id: "", owner: nil, attrs: nil)
      fields = squeue_fields(attrs)
      args = squeue_args(id: id, owner: owner, options: fields.values)

      #TODO: switch mock of Open3 to be the squeue mock script
      # then you can use that for performance metrics
      StringIO.open(call("squeue", *args)) do |output|
        advance_past_squeue_header!(output)

        jobs = []
        output.each_line(RECORD_SEPARATOR) do |line|
          # TODO: once you can do performance metrics you can test zip against some other tools
          # or just small optimizations
          # for example, fields is ALREADY A HASH and we are setting the VALUES to
          # "line.strip.split(unit_separator)" array
          #
          # i.e. store keys in an array, do Hash[[keys, values].transpose]
          #
          # or
          #
          # job = {}
          # keys.each_with_index { |key, index| job[key] = values[index] }
          # jobs << job
          #
          # assuming keys and values are same length! if not we have an error!
          line = line.encode('UTF-8', invalid: :replace, undef: :replace)
          values = line.chomp(RECORD_SEPARATOR).strip.split(UNIT_SEPARATOR)
          jobs << Hash[fields.keys.zip(values)] unless values.empty?
        end
        jobs
      end
    rescue SlurmTimeoutError
      # TODO: could use a log entry here
      return [{ id: id, state: 'undetermined' }]
    end
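The record and unit separators make the `squeue` output unambiguous to split even when a field contains spaces or newlines. The sketch below reproduces only the splitting and zipping step; `RecordParseSketch` is a hypothetical module, the sample output is invented, and header skipping (`advance_past_squeue_header!`), re-encoding and the `SlurmTimeoutError` fallback are left out.

```ruby
module RecordParseSketch
  UNIT_SEPARATOR = "\x1F"   # between fields of one job
  RECORD_SEPARATOR = "\x1E" # before each job record

  # keys: attribute names in the same order as the requested format fields.
  def self.parse(raw, keys)
    raw.each_line(RECORD_SEPARATOR).map do |record|
      values = record.chomp(RECORD_SEPARATOR).strip.split(UNIT_SEPARATOR)
      keys.zip(values).to_h unless values.empty? # empty chunks yield nil
    end.compact
  end
end

# Invented output for the fields %A (job_id) and %t (state_compact).
raw = "\x1E101\x1FR\n\x1E102\x1FPD\n"
jobs = RecordParseSketch.parse(raw, [:job_id, :state_compact])
jobs.each { |job| puts "#{job[:job_id]} #{job[:state_compact]}" }
# prints "101 R" then "102 PD"
```

Because the values are paired with the keys by position, the order of the requested format fields has to match the order of the keys, which is why the field order in `all_squeue_fields` matters.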
#hold_job(id) ⇒ void
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
This method returns an undefined value.
Put a specified job on hold
    # File 'lib/ood_core/job/adapters/slurm.rb', line 234
    def hold_job(id)
      call("scontrol", "hold", id.to_s)
    end
#nodes ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 384
    def nodes
      args = all_sinfo_node_fields.values.join(UNIT_SEPARATOR)
      output = call('sinfo', '-ho', "#{RECORD_SEPARATOR}#{args}")

      output.each_line(RECORD_SEPARATOR).map do |line|
        values = line.chomp(RECORD_SEPARATOR).strip.split(UNIT_SEPARATOR)

        next if values.empty?
        data = Hash[all_sinfo_node_fields.keys.zip(values)]
        data[:name] = data[:name].to_s.split(',').first
        data[:features] = data[:features].to_s.split(',')
        NodeInfo.new(**data)
      end.compact
    end
#queues ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 366
    def queues
      info_raw = call('scontrol', 'show', 'part', '-o')

      [].tap do |ret_arr|
        info_raw.each_line do |line|
          ret_arr << str_to_queue_info(line)
        end
      end
    end
#release_job(id) ⇒ void
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
This method returns an undefined value.
Release a specified job that is on hold
    # File 'lib/ood_core/job/adapters/slurm.rb', line 244
    def release_job(id)
      call("scontrol", "release", id.to_s)
    end
#sacct_info(job_ids, states, from, to, show_steps) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 400
    def sacct_info(job_ids, states, from, to, show_steps)
      # https://slurm.schedmd.com/sacct.html
      fields = sacct_info_fields
      args = ['-P'] # Output will be delimited
      args.concat ['--delimiter', UNIT_SEPARATOR]
      args.concat ['-n'] # No header
      args.concat ['--units', 'G'] # Memory units in GB
      args.concat ['--allocations'] unless show_steps # Show statistics relevant to the job, not taking steps into consideration
      args.concat ['-o', fields.values.join(',')] # Required data
      args.concat ['--state', states.join(',')] unless states.empty? # Filter by these states
      args.concat ['-j', job_ids.join(',')] unless job_ids.empty? # Filter by these job ids
      args.concat ['-S', from] if from # Filter from this date
      args.concat ['-E', to] if to # Filter until this date

      jobs_info = []
      StringIO.open(call('sacct', *args)) do |output|
        output.each_line do |line|
          # Replace blank values with nil
          values = line.strip.split(UNIT_SEPARATOR).map{ |value| value.to_s.empty? ? nil : value }
          jobs_info << Hash[fields.keys.zip(values)] unless values.empty?
        end
      end
      jobs_info
    end
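The blank-to-nil step in the parsing loop can be illustrated on a single line. This is a sketch only: `parse_sacct_line` is a hypothetical helper, and the sample lines and key lists are invented rather than real `sacct` output.

```ruby
# Hypothetical helper: parse one unit-separated sacct line into a Hash,
# turning blank fields into nil.
def parse_sacct_line(line, keys, separator = "\x1F")
  values = line.strip.split(separator).map { |value| value.to_s.empty? ? nil : value }
  values.empty? ? nil : keys.zip(values).to_h
end

# A blank field in the middle is mapped to nil explicitly.
middle = parse_sacct_line("alice\x1F\x1F123\n", [:user, :group_name, :job_id])
puts middle[:group_name].inspect # prints "nil"

# A blank field at the end is dropped by String#split; zip then pads with nil.
trailing = parse_sacct_line("alice\x1F123\x1F\n", [:user, :job_id, :group_name])
puts trailing[:group_name].inspect # prints "nil"
```

Both cases end up as `nil`, but by different routes: `String#split` discards trailing empty strings, and `Array#zip` fills the missing positions with `nil`.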
#sacct_info_fields ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Job info fields requested from a formatted `sacct` call
    # File 'lib/ood_core/job/adapters/slurm.rb', line 327
    def sacct_info_fields
      {
        # The user name of the user who ran the job.
        user: 'User',
        # The group name of the user who ran the job.
        group_name: 'Group',
        # Job Id for reference
        job_id: 'JobId',
        # The name of the job or job step
        job_name: 'JobName',
        # The job's elapsed time.
        elapsed: 'Elapsed',
        # Minimum required memory for the job
        req_mem: 'ReqMem',
        # Count of allocated CPUs
        alloc_cpus: 'AllocCPUS',
        # Number of requested CPUs.
        req_cpus: 'ReqCPUS',
        # What the timelimit was/is for the job
        time_limit: 'Timelimit',
        # Displays the job status, or state
        state: 'State',
        # The sum of the SystemCPU and UserCPU time used by the job or job step
        total_cpu: 'TotalCPU',
        # Maximum resident set size of all tasks in job.
        max_rss: 'MaxRSS',
        # Identifies the partition on which the job ran.
        partition: 'Partition',
        # The time the job was submitted. In the same format as End.
        submit_time: 'Submit',
        # Initiation time of the job. In the same format as End.
        start_time: 'Start',
        # Termination time of the job.
        end: 'End',
        # Trackable resources. These are the minimum resource counts requested by the job/step at submission time.
        gres: 'ReqTRES'
      }
    end
#squeue_args(id: "", owner: nil, options: []) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
TODO: write some barebones test for this? like 2 options and id or no id
    # File 'lib/ood_core/job/adapters/slurm.rb', line 220
    def squeue_args(id: "", owner: nil, options: [])
      args = ["--all", "--states=all", "--noconvert"]
      args.concat ["-o", "#{RECORD_SEPARATOR}#{options.join(UNIT_SEPARATOR)}"]
      args.concat ["-u", owner.to_s] unless owner.to_s.empty?
      args.concat ["-j", id.to_s] unless id.to_s.empty?
      args
    end
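One possible shape for the barebones check the TODO asks for is sketched below. Since the real method is private to `Batch`, the sketch runs against a standalone copy inside a hypothetical `SqueueArgsSketch` module; it is not the project's actual test.

```ruby
# Standalone copy for illustration; in ood_core this is a private method on Batch.
module SqueueArgsSketch
  UNIT_SEPARATOR = "\x1F"
  RECORD_SEPARATOR = "\x1E"

  def self.squeue_args(id: "", owner: nil, options: [])
    args = ["--all", "--states=all", "--noconvert"]
    args.concat ["-o", "#{RECORD_SEPARATOR}#{options.join(UNIT_SEPARATOR)}"]
    args.concat ["-u", owner.to_s] unless owner.to_s.empty?
    args.concat ["-j", id.to_s] unless id.to_s.empty?
    args
  end
end

# Two options with an id: the id is appended as "-j <id>".
with_id = SqueueArgsSketch.squeue_args(id: 123, options: ["%A", "%t"])
puts with_id.last(2).join(" ")     # prints "-j 123"

# Two options without an id: no "-j" flag is emitted.
without_id = SqueueArgsSketch.squeue_args(options: ["%A", "%t"])
puts without_id.include?("-j")     # prints "false"
```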
#squeue_fields(attrs) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 204
    def squeue_fields(attrs)
      if attrs.nil?
        all_squeue_fields
      else
        all_squeue_fields.slice(*squeue_attrs_for_info_attrs(Array.wrap(attrs) + squeue_required_fields))
      end
    end
#squeue_required_fields ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
    # File 'lib/ood_core/job/adapters/slurm.rb', line 212
    def squeue_required_fields
      #TODO: does this need to include ::array_job_task_id?
      #TODO: does it matter that order of the output can vary depending on the arguments and if "squeue_required_fields" are included?
      # previously the order was "fields.keys"; i don't think it does
      [:job_id, :state_compact]
    end
#submit_string(str, args: [], env: {}) ⇒ String
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Submit a script expanded as a string to the batch server
    # File 'lib/ood_core/job/adapters/slurm.rb', line 264
    def submit_string(str, args: [], env: {})
      args = args.map(&:to_s) + ["--parsable"]
      env = env.to_h.each_with_object({}) { |(k, v), h| h[k.to_s] = v.to_s }
      call("sbatch", *args, env: env, stdin: str.to_s).strip.split(";").first
    end
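With `--parsable`, `sbatch` prints only the job id, followed by a semicolon and the cluster name when one applies, which is why the result is split on `;` and the first element kept. The sketch below isolates that step and the environment normalization; both helper names are hypothetical and the sample values are invented.

```ruby
# Hypothetical helper: extract the job id from `sbatch --parsable` output.
def job_id_from_parsable(output)
  output.strip.split(";").first
end

# Hypothetical helper: coerce environment keys and values to Strings,
# as submit_string does before invoking sbatch.
def stringify_env(env)
  env.to_h.each_with_object({}) { |(k, v), h| h[k.to_s] = v.to_s }
end

puts job_id_from_parsable("12345;cluster_a\n") # prints "12345"
puts job_id_from_parsable("12345\n")           # prints "12345"
puts stringify_env(THREADS: 4)["THREADS"]      # prints "4"
```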