
Conversation

@m4r1k
Contributor

@m4r1k m4r1k commented Sep 26, 2025

The Slurm v6 controller setup may take quite a long time, causing the cluster deployment to fail because of a rather stringent hardcoded munge mount timeout. My proposal is to bump this timeout to 1200s (rather than the default 120s).
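
For reference, the mount logic is essentially a retry-until-deadline loop around the NFS mount command. The sketch below is only a minimal illustration, with hypothetical names and a simplified backoff; the actual implementation lives in the slurm-gcp setup scripts.

import subprocess
import time

# Minimal sketch of a retry-until-deadline NFS mount; hypothetical names and
# backoff policy, not the actual slurm-gcp code.
MUNGE_MOUNT_TIMEOUT = 1200  # proposed value, up from the current 120s

def mount_munge(server: str, timeout: int = MUNGE_MOUNT_TIMEOUT) -> None:
    cmd = [
        "mount", "--types=nfs",
        "--options=defaults,hard,intr,_netdev",
        f"{server}:/etc/munge", "/mnt/munge",
    ]
    deadline = time.monotonic() + timeout
    wait = 1.0
    while True:
        try:
            subprocess.run(cmd, check=True, capture_output=True)
            return  # mounted successfully
        except subprocess.CalledProcessError:
            if time.monotonic() >= deadline:
                raise  # give up once the total timeout is exceeded
            time.sleep(wait)
            wait = min(wait * 1.5, 10.0)  # capped backoff between attempts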

I confirm that I have tested the version in my fork, and it now works as intended. The relevant portion of the log follows:

2025-09-26 09:49:11,124 ERROR: Command '['mount', '--types=nfs', '--options=defaults,hard,intr,_netdev', 'slurmh4d-controller:/etc/munge', '/mnt/munge']' returned non-zero exit status 32.
2025-09-26 09:49:11,124 ERROR: stderr: mount.nfs: access denied by server while mounting slurmh4d-controller:/etc/munge
2025-09-26 09:49:11,124 ERROR: munge mount failed: '['mount', '--types=nfs', '--options=defaults,hard,intr,_netdev', 'slurmh4d-controller:/etc/munge', '/mnt/munge']' Command '['mount', '--types=nfs', '--options=defaults,hard,intr,_netdev', 'slurmh4d-controller:/etc/munge', '/mnt/munge']' returned non-zero exit status 32., try 122, waiting 3.59s
2025-09-26 09:49:14,747 INFO: Copy munge.key from: /mnt/munge
2025-09-26 09:49:14,751 INFO: Restrict permissions of munge.key
2025-09-26 09:49:14,751 INFO: Unmount /mnt/munge
2025-09-26 09:49:14,984 INFO: running script ghpc_startup.sh with timeout=300
2025-09-26 09:49:14,986 INFO: ghpc_startup.sh returncode=0
stdout=stderr=
2025-09-26 09:49:14,986 INFO: Check status of cluster services
2025-09-26 09:49:15,004 INFO: Done setting up login

Attached are the full logs from the login and controller nodes:
login-setup.log
controller-setup.log

My blueprint follows:

---
blueprint_name: slurm-h4d
vars:
  project_id: REDACTED
  deployment_name: slurm-h4d
  region: europe-west4
  zone: europe-west4-b
  rdma_net_range: 192.168.128.0/18

  # Provisioning model: select one; if neither is selected, on-demand capacity will be used
  h4d_dws_flex_enabled: false
  h4d_reservation_name: "REDACTED"

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md
deployment_groups:
- group: primary
  modules:

  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /

  - id: h4d-slurm-net-0
    source: modules/network/vpc

  - id: h4d-rdma-net
    source: modules/network/vpc
    settings:
      network_name: $(vars.deployment_name)-rdma-net-0
      mtu: 8896
      network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon
      network_routing_mode: REGIONAL
      enable_cloud_router: false
      enable_cloud_nat: false
      enable_internal_traffic: false
      subnetworks:
      - subnet_name: $(vars.deployment_name)-rdma-sub-0
        subnet_region: $(vars.region)
        subnet_ip: $(vars.rdma_net_range)
        region: $(vars.region)

  - id: private_service_access
    source: community/modules/network/private-service-access
    use: [h4d-slurm-net-0]

  - id: hpc_dash
    source: modules/monitoring/dashboard

  - id: spack-setup
    source: community/modules/scripts/spack-setup
    settings:
      install_dir: /app/spack

  - id: spack-build
    source: community/modules/scripts/spack-execute
    use: [spack-setup]

  - id: homefs
    source: modules/file-system/filestore
    use: [h4d-slurm-net-0, private_service_access]
    settings:
      filestore_tier: BASIC_SSD
      size_gb: 2560
      filestore_share_name: homeshare
      local_mount: /home

  - id: appsfs
    source: modules/file-system/filestore
    use: [h4d-slurm-net-0, private_service_access]
    settings:
      filestore_tier: BASIC_HDD
      size_gb: 1024
      filestore_share_name: appsshare
      local_mount: /opt/apps

  - id: h4d_startup
    source: modules/scripts/startup-script
    settings:
      set_ofi_cloud_rdma_tunables: true

  - id: h4d_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [h4d_startup, h4d-slurm-net-0, homefs, appsfs]
    settings:
      bandwidth_tier: gvnic_enabled
      machine_type: h4d-standard-192
      node_count_static: 3
      node_count_dynamic_max: 0
      enable_placement: false

      #Provisioning models
      reservation_name: $(vars.h4d_reservation_name)
      dws_flex:
        enabled: $(vars.h4d_dws_flex_enabled)

      disk_type: hyperdisk-balanced
      on_host_maintenance: TERMINATE
      additional_networks:
        $(concat(
          [{
            network=null,
            subnetwork=h4d-rdma-net.subnetwork_self_link,
            subnetwork_project=vars.project_id,
            nic_type="IRDMA",
            queue_count=null,
            network_ip=null,
            stack_type=null,
            access_config=null,
            ipv6_access_config=[],
            alias_ip_range=[]
          }]
        ))

  - id: h4d_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [h4d_nodeset]
    settings:
      exclusive: false
      partition_name: h4d
      is_default: true
      partition_conf:
        ResumeTimeout: 900
        SuspendTimeout: 600

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [h4d-slurm-net-0]
    settings:
      machine_type: n4-standard-8
      disk_type: hyperdisk-balanced
      enable_login_public_ips: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [h4d-slurm-net-0, h4d_partition, homefs, appsfs, spack-build, slurm_login]
    settings:
      machine_type: n4-standard-8
      disk_type: hyperdisk-balanced
      enable_controller_public_ips: true
      controller_state_disk:
        type: hyperdisk-balanced
        size: 50
      cloud_parameters:
        unkillable_step_timeout: 900

@m4r1k m4r1k requested review from a team and samskillman as code owners September 26, 2025 09:54
Member

@cboneti cboneti left a comment

Hi Federico,

Thank you for this PR. I left one comment asking about the smallest timeout that could fix the issue; alternatively, perhaps we should not hardcode this timeout and instead use 1) a new variable or 2) one of the already existing timeouts, such as controller_startup_scripts_timeout (if available to this script).

@m4r1k
Contributor Author

m4r1k commented Sep 26, 2025

Hi Federico,

Thank you for this PR. I left one comment asking about the smallest timeout that could fix the issue; alternatively, perhaps we should not hardcode this timeout and instead use 1) a new variable or 2) one of the already existing timeouts, such as controller_startup_scripts_timeout (if available to this script).

While a variable would certainly make the code cleaner, the aim was to get this fixed quickly by leveraging the existing logic and simply extending the timeout.
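
For completeness, option 1) from the review could look roughly like the sketch below, where the timeout is read from the fetched config with a fallback instead of being a hardcoded constant. The attribute names are hypothetical and only illustrate the idea; none of them exist in slurm-gcp today.

# Hypothetical sketch of a configurable munge mount timeout.
DEFAULT_MUNGE_MOUNT_TIMEOUT = 120

def munge_mount_timeout(cfg) -> int:
    # Prefer an explicit new setting, then an existing startup-script timeout,
    # then the historical default.
    return (
        getattr(cfg, "munge_mount_timeout", None)
        or getattr(cfg, "controller_startup_scripts_timeout", None)
        or DEFAULT_MUNGE_MOUNT_TIMEOUT
    )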

@m4r1k m4r1k force-pushed the main branch 2 times, most recently from 73f05a4 to 3e9e190 Compare September 27, 2025 12:04
@m4r1k
Contributor Author

m4r1k commented Sep 27, 2025

Adding a new commit: the entire deployment process takes longer the older the HPC image gets.
Inside the controller settings I personally had to define:

      controller_startup_scripts_timeout: 1200
      login_startup_scripts_timeout: 1200
      compute_startup_scripts_timeout: 1200

This is particularly evident when Spack is also part of the deployment; in that case the controller's Spack install takes a long time. The hardcoded 40 retries on the login node do not allow enough time to mount the /opt/apps NFS share, and the setup fails.

2025-09-27 11:17:01,443 ERROR: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2025-09-27 11:17:01,443 ERROR: stderr: mount.nfs: access denied by server while mounting slurmc4-controller:/opt/apps
2025-09-27 11:17:01,443 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2025-09-27 11:17:01,444 ERROR: Aborting setup...

With this new commit, the hardcoded retry count is much higher, set to 400 attempts.

2025-09-27 11:52:16,965 INFO: Starting setup, fetching config
2025-09-27 11:52:17,030 WARNING: config is not ready yet: slurmc4-files/config.yaml not found in bucket, sleeping for 5s
[SNIP]
2025-09-27 11:54:33,174 WARNING: config is not ready yet: nodeset c4nodeset not defined in config, sleeping for 5s
2025-09-27 11:54:38,620 INFO: Config fetched
2025-09-27 11:54:38,625 ERROR: Command '['systemctl', 'is-active', '--quiet', 'google-cloud-ops-agent.service']' returned exit status 3.
2025-09-27 11:54:38,626 INFO: Setting up login
2025-09-27 11:54:38,713 INFO: installing custom script: login.d/slurm/login/ghpc_startup.sh from slurmc4-files/slurm-login-slurm-login-script-ghpc_startup_sh
2025-09-27 11:54:38,818 INFO: Set up network storage
2025-09-27 11:54:38,819 INFO: Setting up mount (nfs) 10.244.0.2:/homeshare to /home
2025-09-27 11:54:38,819 INFO: Setting up mount (nfs) slurmc4-controller:/opt/apps to /opt/apps
2025-09-27 11:54:38,819 INFO: Setting up mount (nfs) 10.244.0.10:/appsshare to /apps
2025-09-27 11:54:38,820 INFO: Waiting for '/home' to be mounted...
2025-09-27 11:54:38,820 INFO: Waiting for '/opt/apps' to be mounted...
2025-09-27 11:54:38,821 INFO: Waiting for '/apps' to be mounted...
2025-09-27 11:54:41,999 INFO: Mount point '/apps' was mounted.
2025-09-27 11:54:42,002 INFO: Mount point '/home' was mounted.
2025-09-27 11:54:44,191 INFO: Waiting for '/opt/apps' to be mounted...
2025-09-27 11:54:44,207 ERROR: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2025-09-27 11:54:44,207 ERROR: stderr: mount.nfs: access denied by server while mounting slurmc4-controller:/opt/apps
2025-09-27 11:54:44,207 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
[SNIP]
2025-09-27 12:03:05,993 INFO: Waiting for '/opt/apps' to be mounted...
2025-09-27 12:03:06,029 INFO: Mount point '/opt/apps' was mounted.
2025-09-27 12:03:06,031 INFO: Mounting munge share to: /mnt/munge
2025-09-27 12:03:06,056 INFO: Copy munge.key from: /mnt/munge
2025-09-27 12:03:06,059 INFO: Restrict permissions of munge.key
2025-09-27 12:03:06,059 INFO: Unmount /mnt/munge
2025-09-27 12:03:06,294 INFO: running script ghpc_startup.sh with timeout=1200
2025-09-27 12:03:06,298 INFO: ghpc_startup.sh returncode=0
stdout=stderr=
2025-09-27 12:03:06,298 INFO: Check status of cluster services
2025-09-27 12:03:06,320 INFO: Done setting up login

As you can see, the controller became ready and allowed the login node to mount the /opt/apps folder only after more than 9 minutes.

PS: if needed, I can create a dedicated PR.

@m4r1k
Contributor Author

m4r1k commented Sep 30, 2025

In my testing, the 120s munge mount timeout needs to be at least doubled. setup_network_storage.py, on the other hand, is very dependent on whether Spack is installed (just the Spack setup itself, with no spack install whatsoever). If it is in the blueprint, the controller setup can take as long as 15 minutes. The current 40 iterations result in a waiting time of about 10 minutes; to be on the safe side, I would triple that to 120.
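
As a rough sanity check of that math (assuming roughly 15s per iteration on average, which is what "40 iterations ~ 10 minutes" implies; the actual backoff schedule may differ):

# Back-of-the-envelope: total wait for N mount attempts at ~15s per attempt.
AVG_WAIT_PER_ATTEMPT_S = 15

for attempts in (40, 120):
    total_min = attempts * AVG_WAIT_PER_ATTEMPT_S / 60
    print(f"{attempts} attempts -> ~{total_min:.0f} minutes of waiting")
# 40 attempts -> ~10 minutes of waiting
# 120 attempts -> ~30 minutes of waiting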

@m4r1k
Contributor Author

m4r1k commented Oct 1, 2025

Adding an updated blueprint, which includes allow_automatic_updates: false. It has shaved the deployment time down a bit, but I believe this patch is still needed.
slurm-cluster.yaml

  • Controller node: 802 seconds / 13.3 minutes. Most of the time is spent in ghpc_startup.sh, which takes 541 seconds (9 minutes):
2025-10-01 08:51:39,214 INFO: Starting setup, fetching config
[SNIP]
2025-10-01 08:55:50,175 INFO: running script ghpc_startup.sh with timeout=1200
2025-10-01 09:04:52,249 INFO: ghpc_startup.sh returncode=0
[SNIP]
2025-10-01 09:05:01,798 INFO: Done setting up controller
  • Login node: 790 seconds / 13.1 minutes. Most of the time is spent waiting for the controller so that /opt/apps can be mounted. This step took 587 seconds over 42 attempts (by default the setup fails after 40):
2025-10-01 08:52:01,233 INFO: Starting setup, fetching config
2025-10-01 08:55:24,338 ERROR: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2025-10-01 09:05:11,497 INFO: Mount point '/opt/apps' was mounted.
2025-10-01 09:05:11,795 INFO: Done setting up login
cat /var/log/slurm/setup.log| grep "Waiting for '/opt/apps' to be mounted..."|wc -l
42

Given the impact of setting allow_automatic_updates to false, I updated the commits as follows (see the sketch after this list):

  • extend the munge mount timeout by 2x (from 120s to 240s)
  • extend the login node /opt/apps mount retry count by 3x (from 40 to 120 retries)
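
In code terms, the change boils down to two hardcoded values along the lines of the sketch below. The identifiers are hypothetical (the real constants live in the slurm-gcp setup scripts), and the login-side wait loop is simplified.

import subprocess
import time

# Hypothetical sketch of the login-side "wait for /opt/apps" loop with the
# higher retry count; identifiers, interval, and structure are illustrative only.
MUNGE_MOUNT_TIMEOUT_S = 240   # was 120
NFS_MOUNT_RETRIES = 120       # was 40

def wait_for_fstab_mount(path: str, retries: int = NFS_MOUNT_RETRIES) -> None:
    for _ in range(retries):
        # Relies on an /etc/fstab entry for the path, as in the logs above.
        if subprocess.run(["mount", path], capture_output=True).returncode == 0:
            return
        time.sleep(5)
    raise TimeoutError(f"{path} was not mounted after {retries} attempts")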

cboneti
cboneti previously approved these changes Oct 1, 2025
Member

@cboneti cboneti left a comment

Directionally approved. I would like someone else on the CTK team to run the relevant Slurm tests here to make sure everything is still fine.

@cboneti
Member

cboneti commented Oct 1, 2025

/gcbrun

@cboneti cboneti added the release-bugfix Added to release notes under the "Bug fixes" heading. label Oct 1, 2025
@m4r1k
Contributor Author

m4r1k commented Oct 1, 2025

The sample blueprint used to reproduce the original issues follows:

---
blueprint_name: slurm-cluster
vars:
  project_id: azimuthpc
  deployment_name: <PROJECT>
  region: europe-west4
  zone: europe-west4-b
  gcs_bucket: <UNIQUE GCS BUCKET NAME>

  # H4D RDMA VPC
  rdma_net_range: 192.168.128.0/18

  # H4D provisioning model
  h4d_dws_flex_enabled: false
  h4d_reservation_name: ""

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md
deployment_groups:
- group: primary
  modules:

  - id: slurm-net-0
    source: modules/network/vpc

  - id: h4d-rdma-net
    source: modules/network/vpc
    settings:
      network_name: $(vars.deployment_name)-rdma-net-0
      mtu: 8896
      network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon
      network_routing_mode: REGIONAL
      enable_cloud_router: false
      enable_cloud_nat: false
      enable_internal_traffic: false
      subnetworks:
      - subnet_name: $(vars.deployment_name)-rdma-sub-0
        subnet_region: $(vars.region)
        subnet_ip: $(vars.rdma_net_range)
        region: $(vars.region)

  - id: private_service_access
    source: community/modules/network/private-service-access
    use: [slurm-net-0]

  - id: hpc_dash
    source: modules/monitoring/dashboard

  - id: spack-setup
    source: community/modules/scripts/spack-setup
    settings:
      install_dir: /apps/spack
      spack_ref: "v0.23.1"

  - id: spack-build
    source: community/modules/scripts/spack-execute
    use: [spack-setup]
      #settings:
      #  commands: |
      #    spack install [email protected] %gcc ^intel-oneapi-mpi
      #    spack install [email protected]

  - id: homefs
    source: modules/file-system/filestore
    use: [slurm-net-0, private_service_access]
    settings:
      filestore_tier: BASIC_SSD
      size_gb: 2560
      filestore_share_name: homeshare
      local_mount: /home

  - id: appsfs
    source: modules/file-system/filestore
    use: [slurm-net-0, private_service_access]
    settings:
      filestore_tier: BASIC_HDD
      size_gb: 1024
      filestore_share_name: appsshare
      local_mount: /apps

  - id: data-bucket
    source: modules/file-system/pre-existing-network-storage
    settings:
      remote_mount: $(vars.gcs_bucket)
      local_mount: /data-bucket
      fs_type: gcsfuse
      mount_options: defaults,_netdev,implicit_dirs,allow_other

  - id: c4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [slurm-net-0, homefs, appsfs, data-bucket]
    settings:
      bandwidth_tier: gvnic_enabled
      machine_type: c4-highcpu-48
      node_count_static: 0
      node_count_dynamic_max: 2
      enable_placement: true
      placement_max_distance: 1
      advanced_machine_features:
        threads_per_core: 1
        turbo_mode: "ALL_CORE_MAX"
      disk_type: hyperdisk-balanced
      on_host_maintenance: TERMINATE
      allow_automatic_updates: false

  - id: c4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [c4_nodeset]
    settings:
      exclusive: false
      partition_name: c4
      is_default: true
      partition_conf:
        ResumeTimeout: 900
        SuspendTimeout: 600

  - id: c4_gnr_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [slurm-net-0, homefs, appsfs, data-bucket]
    settings:
      bandwidth_tier: gvnic_enabled
      machine_type: c4-standard-288
      node_count_static: 0
      node_count_dynamic_max: 2
      enable_placement: true
      placement_max_distance: 1
      advanced_machine_features:
        threads_per_core: 1
        turbo_mode: "ALL_CORE_MAX"
      disk_type: hyperdisk-balanced
      on_host_maintenance: TERMINATE
      allow_automatic_updates: false

  - id: c4_gnr_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [c4_gnr_nodeset]
    settings:
      exclusive: false
      partition_name: c4gnr
      is_default: false
      partition_conf:
        ResumeTimeout: 900
        SuspendTimeout: 600

  - id: h4d_startup
    source: modules/scripts/startup-script
    settings:
      set_ofi_cloud_rdma_tunables: true

  - id: h4d_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [h4d_startup, slurm-net-0, homefs, appsfs, data-bucket]
    settings:
      bandwidth_tier: gvnic_enabled
      machine_type: h4d-highmem-192-lssd
      node_count_static: 0
      node_count_dynamic_max: 2
      enable_placement: true
      placement_max_distance: 1

      #Provisioning models
      reservation_name: $(vars.h4d_reservation_name)
      dws_flex:
        enabled: $(vars.h4d_dws_flex_enabled)

      disk_type: hyperdisk-balanced
      on_host_maintenance: TERMINATE
      allow_automatic_updates: false
      additional_networks:
        $(concat(
          [{
            network=null,
            subnetwork=h4d-rdma-net.subnetwork_self_link,
            subnetwork_project=vars.project_id,
            nic_type="IRDMA",
            queue_count=null,
            network_ip=null,
            stack_type=null,
            access_config=null,
            ipv6_access_config=[],
            alias_ip_range=[]
          }]
        ))

  - id: h4d_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [h4d_nodeset]
    settings:
      exclusive: false
      partition_name: h4d
      is_default: true
      partition_conf:
        ResumeTimeout: 900
        SuspendTimeout: 600

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [slurm-net-0]
    settings:
      machine_type: n4-standard-4
      disk_type: hyperdisk-balanced
      enable_login_public_ips: true
      allow_automatic_updates: false

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [slurm-net-0, c4_partition, c4_gnr_partition, h4d_partition, homefs, appsfs, data-bucket, spack-build, slurm_login]
    settings:
      machine_type: n4-standard-4
      disk_type: hyperdisk-balanced
      enable_controller_public_ips: true
      allow_automatic_updates: false
      controller_state_disk:
        type: hyperdisk-balanced
        size: 50
      cloud_parameters:
        unkillable_step_timeout: 900
      controller_startup_scripts_timeout: 1200
      login_startup_scripts_timeout: 1200
      compute_startup_scripts_timeout: 1200

@cboneti cboneti changed the base branch from main to develop October 14, 2025 08:13
@cboneti cboneti dismissed their stale review October 14, 2025 08:13

The base branch was changed.

@cboneti cboneti enabled auto-merge October 14, 2025 08:14
@cboneti cboneti assigned cboneti, bytetwin and sarthakag and unassigned cboneti and bytetwin Oct 14, 2025
@sarthakag
Contributor

/gcbrun

@sarthakag sarthakag self-requested a review October 16, 2025 15:54
@cboneti cboneti merged commit 1b1a9bb into GoogleCloudPlatform:develop Oct 27, 2025
33 of 69 checks passed