Linux Research Computing Resources at CMU

Overview:

CMU runs, or has access to, several shared Linux computing environments with varying access methods and capabilities.  This article is a high-level overview of those compute environments and their major considerations, and a starting point for you to talk with your local techs.  ARC techs encourage you to reach out for a consult about your research needs so that we can help you find the appropriate compute for your workloads.

MSU HPCC:

CMU has an ongoing contract with Michigan State University for access to their HPCC (High Performance Computing Center).  Each time MSU purchases a new cluster, we are given the opportunity to "buy in" by purchasing compute nodes for that cluster.  We have priority access to our nodes and standard access to the rest of the cluster.  Under this agreement, both CMU and MSU researchers have access to a greater depth of compute hardware than either would alone.  When clusters are retired, CMU receives the nodes that we purchased, and they are often repurposed internally for other research or teaching workloads.  The ARC k8s cluster (more on this below) is partially composed of retired MSU nodes.  MSU HPCC clusters are guaranteed for at least 5 years after purchase, and are generally only cycled out to make room in the MSU datacenter for newer clusters.

COMPUTE:

Active clusters with CMU buy-in:

Cluster      | Contract                 | # Nodes | Node Type | CPU                | RAM   | GPU
"amd24"      | 1/1/2025 - 1/1/2030      | 4       | "CPU-C"   | 2x AMD EPYC 9654   | 2.3TB | N/A
"amd22"      | 10/14/2022 - 10/31/2027  | 5       | "CPU-A"   | 2x AMD EPYC 7763   | 512GB | N/A
"amd20"      | 9/28/2020 - 10/7/2025    | 19      | "CPU-A"   | 2x AMD EPYC 7H12   | 512GB | N/A
"amd20"      | 9/28/2020 - 10/7/2025    | 3       | "CPU-B"   | 2x AMD EPYC 7H12   | 1TB   | N/A
"amd20-v100" | 9/28/2020 - 10/7/2025    | 1       | "GPU"     | 2x Intel Xeon 8260 | 565GB | 4x Nvidia V100S

STORAGE:

Each user under our contract receives 100GB of persistent "home" storage and up to 3TB of persistent "project" storage; additional storage can be rented from MSU for a monthly fee.  Please contact your local techs if your needs exceed the user default, and we can get a quote on your behalf.

50TB of "scratch" storage is also available for running jobs, and is automatically purged by the system to reclaim space as new jobs run.

JOB QUEUE:

Resource sharing, job queuing, and scheduling are all managed through SLURM.  SLURM imposes limits on jobs and ensures that cluster resources are shared "fairly".  In practice, this means that jobs on the MSU HPCC are scheduled more quickly the smaller their stated resource requirements are.  There is also a maximum runtime, after which incomplete jobs are automatically culled by the scheduler.  Jobs run on the MSU HPCC should therefore be scripted so that they regularly "checkpoint" (save their intermediate state); this keeps resource requests modest enough to schedule quickly and ensures that work is not lost if the scheduler culls an incomplete job at the end of its time slot.
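For illustration, here is a minimal, hypothetical Python sketch of a resumable job loop.  It is not ICER-provided code, and the file name, save interval, and "work" inside the loop are placeholders; the pattern it shows (save intermediate state atomically, resume from the latest checkpoint on restart) is what lets a job survive being culled and requeued without losing completed work:

    # Hypothetical checkpointing sketch: file name, interval, and "work" are placeholders.
    import os
    import pickle

    CHECKPOINT = "checkpoint.pkl"   # keep this in your home, project, or scratch space
    SAVE_EVERY = 100                # iterations between saves

    def load_state():
        # Resume from the newest checkpoint if one exists, otherwise start fresh.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as fh:
                return pickle.load(fh)
        return {"step": 0, "total": 0.0}

    def save_state(state):
        # Write to a temporary file first so an interrupted save never corrupts the checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(state, fh)
        os.replace(tmp, CHECKPOINT)

    def main(n_steps=10_000):
        state = load_state()
        for step in range(state["step"], n_steps):
            state["total"] += step          # stand-in for the real computation
            state["step"] = step + 1
            if state["step"] % SAVE_EVERY == 0:
                save_state(state)           # safe to be culled and requeued after this point
        save_state(state)
        print("finished at step", state["step"], "with total", state["total"])

    if __name__ == "__main__":
        main()

A job written this way can simply be resubmitted after it is culled; it picks up from the last saved step rather than starting over.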

USEFUL LINKS:

  • SLURM overview:
  • To see all available clusters and nodes that we have standard access to:
  • To see the current queue times for the entire HPCC:
  • To request up to 3TB of project storage:
  • ICER offers research support when using their cluster; for more information, please visit the following links:
      • Quick-start guide for "Open OnDemand" (a web GUI for job submissions)

SE-K8S:

CMU runs an on-prem compute cluster, custom built by S&E techs for ARC research, teaching, and learning workloads.  This cluster generally lacks the raw throughput of the MSU HPCC, and is notably lacking in modern GPU compute, but it is much more flexible and can be reconfigured on the fly to support bespoke workloads with complex, lengthy, or difficult-to-define requirements.  It can also be used as a staging ground to validate and test workloads before moving to the MSU HPCC or another cluster made available through grant funding.  This cluster is primarily composed of servers that have been retired from other workloads (Citrix, SAP, HPCC, etc.).  The servers are generally given new, reliable internal storage and are then reconfigured for cluster workloads as required.

Workloads on this cluster are not inherently shared environments; instead, S&E techs will work with you directly to define, design, and automate the creation of environments specially tailored to your research needs.  If a job needs to run longer, or requires more resources, than SLURM will allow on the MSU HPCC, SE-K8S might be a more appropriate place to stage your workflows, even if they take a little longer to run.

COMPUTE:

Active nodes in SE-K8S:

# Nodes | Donating Dept | CPU                | RAM   | GPU
1       | MSU HPCC      | 2x Xeon E5-2670 v2 | 128GB | N/A
5       | MSU HPCC      | 2x Xeon E5-2670 v2 | 256GB | N/A
1       | CMU EAS       | 2x Xeon Gold 6330  | 1TB   | N/A
2       | CMU BIS       | 4x Xeon E7-8880 v3 | 3TB   | N/A
3       | CMU S&E       | 2x Xeon E5-2695 v3 | 512GB | 2x Nvidia P40


STORAGE:

We advertise SE-K8S as having "no" default storage.  Everything in the cluster, including the environments themselves, is intentionally designed to be both portable and ephemeral.  Persistent storage can be attached programmatically from other locations, and needs to be discussed at environment creation time.  Small datasets (10-100GB) can often be accommodated internally, but backups will always incur a monthly cost beyond the otherwise "best-effort" storage.  If you have an active grant, or otherwise have irreplaceable datasets, please discuss those needs with us as soon as possible so that we can bake those costs into your grant or cover them via other sources.  "Free" storage is allocated as-available and as-needed, but we strongly encourage you to be proactive in managing your own data.  S&E techs can assist in the creation and curation of automation to make these processes easier.

PORTABILITY:

SE-K8S environments are designed from the ground up to be portable, and in many cases can be easily modified to run on any environment that supports Docker (see the sketch after the list below).  If you need that kind of portability, environment automation can be packaged and exported for your use by S&E techs.  Some examples of where this might be useful to you:

  • Having access to identical environments under wsl2 on your local laptop/desktop
  • Sharing the environment with students or research colleagues at another institution
  • Tinkering with a copy of your research environment without disrupting existing work
  • Parallelizing your work by running it across multiple nodes at once
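
As a rough illustration of that portability, the sketch below uses the Docker SDK for Python (the "docker" package) to pull and run a packaged research image on a local machine, for example under WSL2.  The image name, command, and mount paths are hypothetical placeholders; S&E techs would supply the actual image and run instructions for your environment.

    # Hypothetical sketch: the image name, command, and paths are placeholders.
    import docker

    IMAGE = "registry.example.edu/research/my-env:latest"   # placeholder image reference

    client = docker.from_env()        # talks to the local Docker daemon (Docker Desktop, WSL2, etc.)
    client.images.pull(IMAGE)         # fetch the exported environment

    container = client.containers.run(
        IMAGE,
        command="python run_analysis.py",                              # placeholder entry point
        volumes={"/home/me/data": {"bind": "/data", "mode": "rw"}},    # mount local data into the container
        detach=True,
    )
    container.wait()                  # block until the workload finishes
    print(container.logs().decode())  # show the container's output

The same image could be run unchanged on a colleague's machine or scaled out across several nodes, which is what makes these environments easy to share and parallelize.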

USEFUL LINKS:

  • Overview and example usage of an SE-K8S research pod:
  • Primary on-prem storage backend for SE-K8S (SE-ZFS):
      • [article will be placed here when the service is generally available]
  • Other on-prem storage options:
  • IRB document on research storage involving human subject data:

Details

Article ID: 37451
Created: Tue 4/8/25 4:36 PM
Modified: Wed 6/4/25 9:26 AM