Previous Meeting Minutes from XSEDE SP Forum
Meeting: July 21, 2022 @1PM Pacific/4 pm Eastern
Welcome
ACCESS Coordinating Office - John Towns and Dina Meek
XSEDE and SP Coordination News
Next meeting: August 18, 2022
Meeting: July 7, 2022 @1PM Pacific/4 pm Eastern
Welcome
ACCESS Track 4 (Monitoring & Measurement Services): Tom Furlani and team
XSEDE and SP Coordination News
Next meeting: July 21, 2022
Meeting: June 23, 2022 @1PM Pacific/4 pm Eastern
Welcome
Introducing ACCESS's operations and user services - The CONECT project: Amy Schuele, Tim Boerner, Winona Snapp-Childs, Kathy Benninger and Alex Withers
XSEDE and SP Coordination News
Next meeting: July 7, 2022
Meeting: June 9, 2022 @1PM Pacific/4 pm Eastern
Welcome
Introducing ACCESS's approach to user services - The MATCH project: Alan Chalker
XSEDE and SP Coordination News
Next meeting: June 23, 2022
Meeting: May 25, 2022 @1PM Pacific/4 pm Eastern
Welcome
Introducing ACCESS's approach to allocation services - The RAMPS project: Dave Hart
XSEDE and SP Coordination News
Next meeting: June 9, 2022
Meeting: May 12, 2022 @1PM Pacific/4 pm Eastern
Welcome
Open Storage Network current activities and future plans: John Goodhue
XSEDE and SP Coordination News
Next meeting: May 25, 2022
Meeting: March 31, 2022 @1PM Pacific/4 pm Eastern
Attendees: 21
Agenda:
Welcome
Davide Del Vento is leaving NCAR, welcome Ben Kirk as NCAR's SP Forum representative
Introducing ACES, a new NSF-funded resource (Honggao Liu, Texas A&M)
New system: ACES program (NSF ACSS grant #2112356; $5M + $1.25M for 5 years); program has been archived
2-3 awards/year over 5 years
User Portal: uses Slurm and Liqid software + Kubernetes
Composable resources
11,500+ CPU cores; dual Intel Sapphire Rapids 48-core processors
Liqid composable infrastructure; 120 nodes; supports up to 10 PCIe cards (GPU, FPGA, VE)
Multiple accelerator technologies: GPU, FPGA, VE, Intel Optane memory
Deployment: test phase: Sept 2022; early user: Q4 2022; allocations: Q1/Q2 2023
Roundtable
Next meeting: April 14, 2022 - SPs' Training activities and approaches
Meeting: March 17, 2022 @1PM Pacific/4 pm Eastern
Attendees: Tim Boerner, Kevin Colby, Davide Del Vento, Craig Earley, Jeremy Fischer, Jim Griffioen, David Hancock, Doug Jennewein, Ruth Marinshaw, Kenton McHenry, Tim Middelkoop, Sudhakar Pamidighantam, Tabitha Samuel, Sergiu Sanielevici, Eva Siegmann, Bob Sinkovits, Dan Stanzione, Mary Thomas
Agenda:
Welcome
Discussion of Campus Champions SP allocations post-XSEDE, with Champions program leadership team member Tim Middelkoop and SP Forum member Doug Jennewein (also part of the CC leadership team)
Campus Champion program was introduced in 2009, and it has grown steadily over the years. Post-XSEDE, the CC program will continue and the leadership team has been crafting a sustainability plan
Champions find value in the CC allocations (about 40 CCs out of 600 have active allocations) from Service Providers as well as significant value from the XSEDE allocations reports
Champions would like clarity regarding whether the individual SPs will continue to provide allocations post-XSEDE.
SPs noted that the ACCESS awardees should be known soon, but the SPs are committed to continuing to provide allocations in some form. For example, TACC noted that it would be happy to keep allocations going on its systems using local accounts if necessary. There is hope that the allocations infrastructure and process will continue in ACCESS, rather than each SP having to manage parallel allocations. In general, CCs' usage of the allocations they have received has been quite modest. SPs persist across XSEDE and ACCESS, and CCs will too.
Discussion of how best to manage allocations if the number of champions per campus/institution grows significantly, as there IS overhead for the SPs in terms of account management. One recommendation was to create one shared project per institution.
Another suggestion was to separate the types of CC allocations into (a) kick the tires: low CPU utilization, mostly debugging, and (b) testing codes and running science, which are more intensive.
XSEDE program news: Tim reported that the XSEDE Quarterly meetings had been held earlier in March, in a hybrid mode. A future meeting topic for the SP Forum will be: SP experiences over XSEDE and Lessons Learned, to help inform XSEDE's wrap up reports and insights to the NSF and community.
Next meeting: March 31, 2022
Meeting: February 17, 2022 @1PM Pacific/4 pm Eastern
Attendees: Jaime Combariza, Davide Del Vento, Craig Earley, Jeremy Fischer, Vikram Gazula, Jim Griffioen, Robert Harrison, John Huffman, Doug Jennewein, Honggao Liu, Ruth Marinshaw, Linh Ngo, Sudhakar Pamidighantam, Sheri Sanders, Sergiu Sanielevici, Joe Schoonover, Anita Schwartz, Eva Siegmann, Preston Smith, Carol Song, Dan Stanzione, Mary Thomas, Grigori Yourganov
Agenda:
Welcome
SP Coordination Updates (Tabitha Samuel) - Ookami will soon be an allocable resource through the XRAC process
Ookami presentation and discussion (Robert Harrison and Eva Siegmann)
Ookami is a testbed system, up for 2 years, with 1 year operating as a testbed (5-year project, plus 1 additional year of ops)
mid-2022 move into production
Move to L2 SP (Oct’22) — 90% allocations dedicated to XSEDE
slides on Ookami (Japanese for wolf); motivated by Fugaku
located at Stony Brook, NSF funded
1.5 million node-hrs/year; 32GB@1TB/sec
211 users (90% US, 93% academic). List of the 71 current projects @ https://www.stonybrook.edu/ookami/projects
support:
website (Home | Ookami)
email
ticketing system (https://iacs.supportsystem.com);
biweekly office hours (Tues 10am, Thur 2);
slack
Qs:
SVE compilers are getting better; performance is not what was expected
TACC: still waiting for the Fujitsu compiler; how does it compare to the ARM compiler?
Fujitsu and Cray are generating good SVE code
Suggested by Robert: Fujitsu ARM compiler is more standard, so better place to start
Lack of vectorized math is a factor of 40x difference
e.g. memory bound code not looking great
ARM compiler getting better for TACC: everything compiles now; they are working on optimization
apps: which work best?
thread scaling is good for large memory BW
issue: short vectors add to instruction latency and power consumption
good with apps operating on long vectors
how are legacy codes doing (in particular chemistry)?
most "out of the box" codes are working fine
for well-vectorized codes, they have reasonably competitive code
vectorized scalar not doing that well.
How is NWChem working?
very well; Fujitsu compiler is giving good performance
expect different perf between mat-mat-mult and vectorization of FP operations
training
webinars on ARM vectorization and profiling tools
Data and Testbed?
each node has SSD (512GB)
high-bandwidth/local storage (talking with NSF (Bob Chadduck) to expand to multi-terabytes)
Modern (2008) Fortran support / what features will Fujitsu support?
ARM compiler (C/C++) is good
Cray is best (C++17 support)
Fujitsu — challenging
Are commercial vendors working on compiling software for the system?
they have 1: Parallelware Analyzer; testing on Ookami
VAPS, working with XDMoD team to get “top 10 or 15”
Are they expecting a lot of startup allocation requests?
Yes. They expect a lot of exploratory applications
Roundtable
Joe Schoonover noted that Fluid Numerics is working with AMD to have a co-located "lunch & learn" on HIP and OpenMP at PEARC22
Next meeting: March 3, 2022 – seeking volunteers to present.
March 17, 2022 meeting - Mary Thomas is organizing a panel or discussion around training
Meeting: January 6, 2022 @1PM Pacific/4 pm Eastern
Agenda:
Welcome
SP Forum leadership for remainder of XSEDE project
Roundtable discussion
Next meeting: January 20, 2022
Meeting: December 8, 2021 @1PM Pacific/4 pm Eastern
Welcome
XSEDE Evaluation Team Activities - Julie Wernert
Roundtable
Next meeting: January 6, 2022
Meeting: October 29, 2021 @1PM Pacific/4 pm Eastern
Welcome
XSEDE's YouTube channel (Susan Mehringer)
Featured L3: Arizona State (Doug Jennewein)
XSEDE Program Updates (John Towns/Tim B)
ROI Survey Reminder
Next meeting: November 11, 2021
Meeting: July 8, 2021 @1PM Pacific/4 pm Eastern
Agenda:
Welcome
Dana Brunson, Internet2: New Center of Excellence
John or Tim: XSEDE annual review recap
Roundtable (all)
Next meeting: August 5: Terminology Training
Meeting: June 10, 2021 @1PM Pacific/4 pm Eastern
Welcome
David Wheeler, XSEDE Data Transfer Service (DTS) Engagement
Ruth Marinshaw: Fluid Numerics' SP Forum membership application update
Newest SPs: updates on onboarding and allocation processes
John or Tim: XSEDE annual review recap
Roundtable (all)
Meeting: May 27, 2021 @1PM Pacific/4 pm Eastern
Welcome
Honggao Liu: Texas A&M's FASTER system and HPC activities
Ruth Marinshaw: Fluid Numerics' SP Forum membership application
Tim: XSEDE program updates
Meeting: April 15, 2021 @1PM Pacific/4 pm Eastern
Attendees: 20
Welcome
CC* CyberTeam from Utah, Colorado and Colorado State (RMACC): “Creating a Community of Regional Data and Workflow Cyberinfrastructure Facilitators”. (Brett Milash, Andrew Monaghan, Mara Sedlins)
SP Coordination: working towards a single source of SP info (Tabitha Samuel)
XSEDE News (Tim Boerner)
Reminder to sign up to review XSEDE's plans for the coming year (https://docs.google.com/document/d/1R--DEdoo0_A7a-d2n4fYiqpa9Gp7dQpe7AZ8MnfMmBM/edit)
Roundtable (all)
Meeting: April 1, 2021 @1PM Pacific/4 pm Eastern
Attendees: 30
Welcome
XSEDE Terminology Task Force - Linda Akli & Susan Mehringer
Status: list, docs, outreach (Slides at events)
Subtasks:
intake/review/add new terms;
publicity/post list;
training/orientation; internal communications; tracking metrics/records
Next: XSEDE rollout
Each event has pre-slides to show
Terminology Statement is now linked on bottom of home site
Replacement terms:
master branch —> main branch
white paper —> position paper, publications, etc.
other words: disabled, brown bag, sanity check
NIPS project —> changed acronym —> NaIP
PY11 Planning - Tim Boerner
Annual project planning exercise (Sept 1 - Aug 31), start 6 months before; in time for June NSF meetings
Goals: what would people do/not do with +- 5% change in budget?
looking for input from L3/L2 program areas… what should XSEDE do next year for the SP?
Leslie Froeschl will put out a link to a google doc describing what they are seeking
XSEDE transition and spin down - John Towns
NSF ACCESS does not cover all XSEDE activities, and provides reduced support for some existing activities
creates risks for XSEDE and SPs in final deliverables
May be a slightly higher risk near the end of XSEDE.
how to prepare for transition?
L1 SPs: having discussions with program officers about what to do after XSEDE services disappear (Sergiu)
Concern for areas without an obvious continuation
XSEDE may be understaffed — staff may move on to other positions
could SPs possibly find staff who can step in and finish project tasks?
will impact deliverables and support
How are L1 SPs planning for end of XSEDE?
what are the gaps in service — can we identify them?
support
training —> what will replace this?
user portal
allocation systems
ECSS —> end of ECSS will be a big gap in funding
Campus Champions
PEARC
what types of mitigation plans should be put in place?
Next Meeting agenda: SP data cleanup and RMACC CC* CyberTeam presentation
Roundtable (all)
Meeting: March 18, 2021 @1PM Pacific/4 pm Eastern
Rockfish - a resource coming to XSEDE mid-2021 (Jaime Combariza, JHU)
XSEDE News (Tim B or John T)
Roundtable (all)
(To add: interim meetings)
Meeting: January 7, 2021 @1PM Pacific/4 pm Eastern
Welcome new SP Forum members
Site Update: Stanford
XSEDE Project news (Tim B)
SP Coordination news (Tabitha Samuel)
SP Forum elections upcoming (Ruth)
Roundtable (all)
Meeting: December 17, 2020 @1PM Pacific/4 pm Eastern
Attendees: Jon Anderson, Tim Boerner, Jaime Combariza, Keith Crabb, Matt Deaton, David Hancock, Andrew Keen, Ruth Marinshaw, Kenton McHenry, Sudhakar Pamidighantam, Tabitha Samuel, Carol Song, Sergiu Sanielevici, Dan Stanzione, Mary Thomas, John Towns, Jim Wilgenbusch, Paul Williams
Agenda
Announcements (Ruth Marinshaw)
Chet Langin is retiring from SIU; Matt Deaton and/or Jerry Richards will represent SIU as an L3 SP Forum member.
L3 Update: The Minnesota Supercomputing Institute (Jim Wilgenbusch, Director of Research Computing)
MSI’s Research Computing team provides access to diverse data & computing services; training and support via tutorials and other means across 2 campuses and 6 colleges; and dedicated staff expertise and support written into grants. It also deploys and tests novel services. In addition, they provide seed funding for informatics research projects as well as a graduate assistantship program for Informatics students.
Key service area of focus is U-Spatial (https://thespatialuniversity.umn.edu) with expertise in GIS, remote sensing and spatial computing
HPC resources include 2 HPC clusters, storage (6 PB high performance, 6 PB second tier, 30 PB tape library), interactive computing (Citrix, DCS Nice, OpenStack-based secure cloud), Galaxy, JupyterHub and custom web interfaces, databases and applications
75% of compute resources are used by Engr, Chem and Physics; 50% of the storage is used by Biology, Genetics & Health Sciences.
New major systems are purchased every 5 years. Next system coming spring 2021. Also have a condo-like option to buy in with a one time fee that covers equipment purchase as well as staffing to support those systems for 5 years.
Secure computing environment has 42TB VM block storage, 1 PB NFS storage, and nodes have 2TB SSD per node for job-allocated local scratch.
MSI has 45 FT staff; UM’s Informatics Institute (UMII) has 8 FTE; U-Spatial has 10.
Jim reports to the VPR; the MSI, UMII and U-Spatial programs report to him. There is a substantial governance structure associated with CI services.
XSEDE News (Tim Boerner)
The XSEDE Quarterly Staff meetings were held last week
Working on diagrams to illustrate the interconnections between the various XSEDE program areas
Coming in January: SP Forum Elections
Next meeting: Thursday, January 7, 2021.
Meeting: October 29, 2020 @1PM Pacific/4 pm Eastern
Attendees: Jon Anderson, Tim Boerner, Jaime Combariza, Alan Chalker, Kevin Colby, Keith Crabb, Jeremy Fischer, David Hancock, Ron Hawkins, Dave Hudak, John Huffman, Andrew Keen, Chet Langin, Dan Lapine, Lee Liming, Ruth Marinshaw, JP Navarro, Sudhakar Pamidighantam, Tabitha Samuel, Scott Sakai, Sergiu Sanielevici, Anita Schwartz, Bob Settlage, Shava Smallen, Preston Smith, Dan Stanzione, John Towns, Brian Voss
Agenda
The primary agenda topic was to return to the discussion of ways to deliver Open On Demand for XSEDE SPs. Since discussed by the SP Forum earlier in 2020, the OOD development team and the XSEDE Requirements Analysis & Capability Delivery (RACD) team have been meeting to discuss possibilities and options. Their recommendation was as follows:
“Our recommendation for serving OOD to XSEDE users is to create a service provider specific OOD portal that served researchers from that SP but served as a portal to BOTH the local resources (simply PSU for instance) AND the XSEDE resources (local and remote, but moderated by what the user had access to), ie the user would navigate to ood.psu.edu and be able to see and hit any resource they have access to including local and XSEDE. We do think there should be some consistency across SPs.
For completeness, the options we discussed included:
create SP specific OOD portals managed by the site to server their XSEDE resources only, ie if a VT investigator had access to PSU XSEDE resources, they would go to
create a master XSEDE OOD portal that contained the connectors to the various SP resources, ie a VT investigator would go to ood.xsede.org and be able to submit to the resource they were allocated on, ie xsede.psu
create a SP specific OOD portal that served researchers from that SP but served as a portal to BOTH the local resources (simply PSU for instance) AND the XSEDE resources (local and remote, but moderated by what the user had access to), ie the user would navigate to ood.psu.edu and be able to see and hit any resource they have access to including local and XSEDE.”
Draft use cases were reviewed and discussed based on this document: https://docs.google.com/document/d/1Vs87wUkRP9gPLP3ZC4JdE6YEdXhKo5RUGWZLqlQaEdY/edit
Key questions explored were what would OOD do for users? Should XSEDE do this and, if so, how? It was noted that several SPs have already adopted OOD. Among the use cases that are compelling is that of training - participants don't have to download and configure a tool like PuTTY, etc. SDSC noted that it has configured systems to use federated authentication. OOD is also a powerful way to provide access to Jupyter notebooks and to applications like MATLAB and Ansys. The OOD developers are discussing integration with Globus.
A vigorous discussion was held, with Forum members ultimately supportive of the recommendation from OOD and RACD.
The forum will not meet again until December, given Supercomputing and Thanksgiving.
Meeting: October 15, 2020 @1PM Pacific/4 pm Eastern
Attendees: Jon Anderson, Jaime Combariza, Kevin Colby, Jeremy Fischer, Dave Hart, Dave Hancock, John Huffman, Andrew Keen, Chet Langin, Rob Light, John Lowe, Ruth Marinshaw, Tabitha Samuel, Sergiu Sanielevici, Anita Schwartz, Carol Song, Mary Thomas, Nathan Tolbert, 14 others
Agenda
XSEDE's RAS team discusses an updated AMIE environment for usage data and pushing project info out to SPs (Dave Hart, Rob Light)
AMIE: XSEDE's Account Management Information Exchange. Work is in progress to overhaul its infrastructure and database and to improve analytics
current: data transport using XML via SSH or RabbitMQ
new features – target date is the end of this program year
new: REST API
transactions are mostly the same:
e.g. request_project_create
notify_project_usage now has its own API
some SPs are starting to use
legacy AMIE required SPs to maintain a local DB
new version does not need a local DB
use new API to query
jobs sent to XDCDB individually via AMIE, very slow (transmission and processing —> bottleneck)
e.g.: 100,000 tiny jobs in 1 day; and this impacted all jobs
new usage API, RESTful; different URL; but same AMIE auth system
non-blocking transactions; uses AWS Lambda, so functions scale easily to handle different loads
currently process 1 million jobs/month (5400 jobs/hour)
new API can process this in 2 hours
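As a rough check of the figures quoted above, the sketch below compares the two rates. It assumes that "5400 jobs/hour" is the legacy processing rate and that "this" refers to the full month of 1 million jobs; both readings are assumptions, not statements from the presenters.

```python
# Back-of-the-envelope comparison of the throughput figures in the notes.
JOBS_PER_MONTH = 1_000_000     # from the minutes
LEGACY_JOBS_PER_HOUR = 5_400   # from the minutes; assumed to be the legacy rate
NEW_API_HOURS = 2              # from the minutes; assumed to cover a month of jobs

# Hours the legacy path would need to work through a month of jobs.
legacy_hours = JOBS_PER_MONTH / LEGACY_JOBS_PER_HOUR

# Implied rate of the new API and its ratio to the legacy rate.
new_jobs_per_hour = JOBS_PER_MONTH / NEW_API_HOURS
speedup = new_jobs_per_hour / LEGACY_JOBS_PER_HOUR

print(f"legacy: about {legacy_hours:.0f} hours for a month of jobs")
print(f"new API: {new_jobs_per_hour:,.0f} jobs/hour, about {speedup:.0f}x the legacy rate")
```

Under these assumptions the legacy path needs roughly 185 hours for a month of jobs, and the new API is on the order of 90 times faster.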
new: POST /usage API
post many jobs at once
queues all jobs for async. processing
jobs that fail return with error msg
POSTing jobs already in the database will be replaced automatically (no longer manual)
GET /usage/status
From and To dates; count of job records; sum of charges for each SP
provides list of records that did not load —> error log table
if job fails, SP can repost again
Usage API — avail soon
GET usage_summary, ….
different job attributes (CPU, GPU, storage, …)
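The batch behavior described above (post many records at once, replace records that already exist, return failures so the SP can repost) can be illustrated with a small in-memory sketch. This is not the AMIE client library or the real service: the class, method names, and record fields (`job_id`, `charge`) are hypothetical, and records are processed synchronously here, whereas the notes say the real API queues them for asynchronous processing.

```python
from dataclasses import dataclass, field


@dataclass
class UsageStore:
    """In-memory stand-in for the usage database (hypothetical, not the AMIE schema)."""
    records: dict = field(default_factory=dict)  # job_id -> record
    errors: list = field(default_factory=list)   # (job_id, message) error log

    def post_usage(self, batch):
        """Accept many job records at once; return the ones that failed."""
        failed = []
        for rec in batch:
            job_id = rec.get("job_id")
            charge = rec.get("charge")
            if job_id is None or not isinstance(charge, (int, float)) or charge < 0:
                msg = "missing job_id or invalid charge"
                self.errors.append((job_id, msg))
                failed.append({**rec, "error": msg})
                continue
            # A record that already exists is replaced, not duplicated.
            self.records[job_id] = rec
        return failed

    def status(self):
        """Summarize what loaded: record count, total charge, logged errors."""
        return {
            "count": len(self.records),
            "total_charge": sum(r["charge"] for r in self.records.values()),
            "errors": len(self.errors),
        }


store = UsageStore()
batch = [
    {"job_id": "a1", "charge": 10.0},
    {"job_id": "a2", "charge": -3.0},  # invalid: negative charge
    {"job_id": "a1", "charge": 12.5},  # same job again: replaces the first record
]
failed = store.post_usage(batch)

# Correct the failed record and repost only that one.
retry = [{"job_id": r["job_id"], "charge": 3.0} for r in failed]
store.post_usage(retry)
summary = store.status()
print(summary)
```

Keying records by job identifier is what makes a repost safe in this sketch: sending the same job twice overwrites the earlier record rather than double-charging.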
AMIE Client library posted on GitHub:
Other tasks in redesign:
redesign entire system
simplifying: remove use of/mention of the Grid term
schema design changes to make action processing simpler;
keep a balance for each allocation, not each person (like a bank account); job storage —> more flexible and faster
schemas for people and organizations—> right now they are embedded in accounting
changes will mostly be invisible to SPs
RDR redesign: resource description repository
simpler resource entry; better schema for describing resources
Overall, everyone was very pleased with the new system. The SP Forum gave the presenters the closest thing to a standing ovation that could be done via Zoom.
SP Coordination (Tabitha Samuel) - Working on SP checklist updates as well as a software/services baseline. There is also a Slack channel now for SP sys admins.
Meeting: September 17, 2020 @1PM Pacific/4 pm Eastern
Attendees: Jon Anderson, Tim Boerner, Jeremy Fischer, Ron Hawkins, John Huffman, Kyle Hutson, Ruth Marinshaw, Addis O’Connor, Sudhakar Pamidighantam, Tabitha Samuel, Anita Schwartz, Carol Song, Sergiu Sanielevici, Derek Simmel, Preston Smith, Dan Stanzione, Barr von Oehsen
Agenda
1. Cybersecurity Updates (Derek Simmel)
No XSEDE systems were broken into but a couple of user accounts were compromised
Cooperation with/info sharing with European HPC community
Use Zeek security monitoring (The Zeek Network Security Monitor); XSEDE monitors and coordinates communication between allocated SPs. L3s are, though, welcome to participate in XSEDE’s security community and meetings; contact Derek (dsimmel@psc.edu) to be added to the meetings and email list
Question about gateway users and potential vulnerabilities/concerns; vigilance in all aspects of information security was recommended
Discussion of persistent intruder threats
Derek gave a brief overview of REN-ISAC (https://www.ren-isac.net/) for those who weren’t familiar and strongly encouraged sites to be sure their campus participates in some way
XSEDE Update (Tim Boerner)
Relatively quiet right now