Information Technology Services
Requisition #105455

Stanford Research Computing is looking for a talented system administrator to join our team of collaborative and innovative professionals who help Stanford’s faculty and students use advanced computing and data tools to explore new frontiers in knowledge and solve some of humanity’s most urgent problems. Our staff work directly with some of the world's top researchers in a broad range of disciplines across all seven of Stanford’s schools, while also supporting and learning from each other in cross-project endeavors. We maintain and steadily improve an advanced research computing facility, and we support a variety of environments for Stanford research. In Stanford Research Computing, you’ll have a rare opportunity to contribute to discoveries and inventions that have global reach and positive impact, and to share in the curiosity and commitment of the scholars and scientists who lead these projects.

This new position will support Stanford’s world-class data science and AI-focused research by managing and administering an NVIDIA DGX SuperPOD instrument. You and another HPC administrator will partner closely with a team of data scientists from Stanford Data Science to ensure that the GPU cluster environment is configured and operated to maximize research productivity. We’d love to have you join us on this exciting journey.

 

RESPONSIBILITIES

This role is primarily systems-facing. In this position, you will draw daily on your in-depth knowledge of Slurm and Linux, your HPC cluster administration experience, and your passion for supporting groundbreaking research. You will play a crucial role in optimizing, improving, and sustaining our advanced computing infrastructure.

  • HPC Infrastructure Maintenance: Help manage the day-to-day system administration of an NVIDIA DGX SuperPOD and its associated storage, management, and networking infrastructure, in alignment with applicable university, regulatory agency, and/or contractual security and privacy requirements, as well as decisions of the instrument's governance group.
  • Slurm: With your peer administrator, configure and manage Slurm for efficient resource allocation and job scheduling across the cluster, consistent with faculty guidance on system resource usage and utilization.
  • GPU Resource Management: Manage GPU resources within the cluster, optimizing utilization for compute-intensive tasks while maintaining a balance between user requirements and system stability. Provide automated, easily accessible resource utilization metrics.
  • User Support: Collaborate with Stanford Data Science team members and system users to understand their computing needs, provide technical assistance, and troubleshoot issues related to system performance and job execution. Provide user consultation and training in system use as needed.
  • Performance Monitoring: Monitor system performance, diagnose bottlenecks, and take the necessary actions to resolve them.
  • Documentation: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and team collaboration. Develop user-facing documentation in coordination with colleagues from Stanford Data Science.
  • Planning: Meet regularly with stakeholders to understand existing challenges, anticipated needs, and opportunities for closer collaboration.
  • Vendor Engagement: Liaise with system vendors and other external partners as needed to ensure that system issues are triaged and resolved promptly and correctly.

 

MINIMUM REQUIREMENTS

Education and Experience:

  • Bachelor’s degree and eight years of relevant experience, or a combination of education and relevant experience. Eight years of increasingly technical work experience preferred.
  • In-depth experience managing complex multiuser HPC clusters and storage environments is necessary, as is experience managing GPU-based infrastructure.

Qualifications:

This position requires in-depth knowledge of and substantial hands-on experience with:

  • HPC cluster system administration, preferably in an academic/research environment
  • GPU technologies and their integration into HPC environments (driver management, software stack tools, monitoring)
  • InfiniBand (driver management, software stack tools, monitoring)
  • Container platforms (e.g., Apptainer)
  • Slurm configuration and management
  • NFS-based storage management and configuration
  • High-performance parallel filesystem (Lustre) management and configuration
  • Scripting for system management, monitoring and task automation
  • Installing and repairing servers and associated cluster hardware
  • Complex technical problem-solving and troubleshooting, with a proactive approach to system optimization and issue resolution
  • Security practices and compliance standards in a computing environment
  • Collaborating effectively across teams and with researchers

Additional desired skills and experience include:

  • AI/ML software and frameworks, deep learning, and LLM training
  • Bright Cluster Manager
  • Pyxis/enroot
  • CUDA
  • System and storage benchmarking
  • DataDirect Networks (DDN) SFA high-performance storage systems

Working Conditions

This is a hybrid position, in which you will work on-site at the Stanford campus for a minimum of 3 days a week through the first 9 months of employment, and at least 2 days a week thereafter.

You will be expected to travel to the research data center (3 miles away, on the SLAC campus) as needed to meet with and escort vendor technicians; inspect and troubleshoot hardware; receive and install field-replaceable units (FRUs) for the system; and manage any return merchandise authorizations (RMAs). Stanford service vehicles can typically be checked out for travel between the Stanford campus and the data center. You must be available and able to travel to and from the data center on all work days (not only on-site days) and during off-hours emergencies.

Our core work hours are 9 am to 5 pm Pacific. This role will occasionally require extended hours and weekend work, and you will participate in a rotation of on- and off-site responsibilities during the annual winter closure. The data center is periodically shut down for required maintenance, and all team members with system responsibilities are expected to be physically on-site to return services to production status at the end of any planned facility outage.

 

The expected pay range for this position is $148,162 to $168,602 per annum.

Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.

At Stanford University, base pay represents only one aspect of the comprehensive rewards package. The Cardinal at Work website (https://cardinalatwork.stanford.edu/benefits-rewards) provides detailed information on Stanford’s extensive range of benefits and rewards offered to employees. Specifics about the rewards package for this position may be discussed during the hiring process.

 

Why Stanford is for You:

Imagine a world without search engines or social platforms. Consider lives saved through first-ever organ transplants and research to cure illnesses. Stanford University has revolutionized the way we live and enriched the world. Supporting this mission is our diverse and dedicated staff of 17,000. We seek talent driven to impact the future of our legacy. Our culture and unique perks empower you with:

  • Freedom to grow. We offer career development programs, tuition reimbursement, and course auditing. Join a TED Talk, watch a film screening, or listen to renowned authors or global leaders speak.
  • A caring culture. We provide superb retirement plans, generous time off, and family care resources.
  • A healthier you. Choose from hundreds of health or fitness classes at our world-class exercise facilities. We provide excellent health care benefits.
  • Discovery and fun. Stroll past historic sculptures, along trails, and through museums.
  • Enviable resources. Enjoy free commuter programs, ridesharing incentives, discounts and more.

We look forward to receiving your application and cover letter.

 

The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.

Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources by submitting a contact form.

Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.


Global Impact
We believe in having a global impact

Climate and Sustainability

Stanford's deep commitment to sustainability practices has earned us a Platinum rating and inspired a new school aimed at tackling climate change.

Medical Innovations

Stanford's Innovative Medicines Accelerator is currently focused entirely on helping faculty generate and test new medicines that can slow the spread of COVID-19.

Technology

From Google and PayPal to Netflix and Snapchat, Stanford has housed some of the most celebrated innovations in Silicon Valley.

Advancing Education

Through rigorous research, model training programs and partnerships with educators worldwide, Stanford is pursuing equitable, accessible and effective learning for all.

Working Here
We believe you matter as much as the work


I love that Stanford is supportive of learning, and as an education institution, that pursuit of knowledge extends to staff members through professional development, wellness, financial planning and staff affinity groups.

Nora Cata

School of Engineering


I get to apply my real-world experiences in a setting that welcomes diversity in thinking and offers support in applying new methods. In my short time at Stanford, I've been able to streamline processes that provide better and faster information to our students.

Phillip Cheng

Office of the Vice Provost for Student Affairs


Besides its contributions to science, health, and medicine, Stanford is also the home of pioneers across disciplines. Joining Stanford has been a great way to contribute to our society by supporting emerging leaders.

Denisha Clark

School of Medicine


I like working in a place where ideas matter. Working at Stanford means being part of a vibrant, international culture in addition to getting to do meaningful work.

Laura Lind

Office of the President and Provost

Getting Started
We believe that you can love your job

Join Stanford in shaping a better tomorrow for your community, humanity and the planet we call home.
