What Data Does Everyone Have to Drive Data-Based Decisions
Data Driven Decision Making
Session Lead: Shelley Knuth, Support
Slides: fall2025_quarterly_data_driven_decision_making.pptx
This is an interactive session.
What data are we collecting? Raw Notes
Support
OOD through XDMoD
ACCESS Pegasus
Usage
Findings/issues
Resources running on
Institutions
Focus Groups 2023 website review
User experience
Webinar (NAIRR/ACCESS)
Names
Institutions
ACCESS ID
Email
How they found us
NAIRR UEWG surveys on underutilization
November 2024
What resource?
Why?
What can we do to help?
NAIRR Underutilized allocations by 2000 SU
Currently ongoing 2025
1:1 meetings
Resource?
Why?
What can we do?
Survey data (CU) from Workshops
Pre and post workshop surveys
AI centered
Institution
Why they want to attend
Questions about their experience
CCEP awards (Community Grant)
Who applied
Institution and $$ awarded
Where they went with the funds
Announcements
Who is making announcements (RPs)
Support Digest
How many are clicking links
Open rates
CSSN Roles/Interests
Events and training (ACCESS/NAIRR)
Registrations for events
What types of events/trainings
Ticketing Information (ACCESS System only)
Everything!
Generate Tags
Generating new data sets
SDS
We know what they search for
All the software collected from CiDeR/IPFTool
Chatbot
Questions
Answers
Ratings
MATCH Services
Who is requesting MATCH
Office Hours Tracking
Who attends
Concerns they raise
Website data/weblogs
Institution
Who is Logged-in/logged-out
Carnegie classification
Geographic location
Session clicks/origination website
Affinity Group tracking
How they connect to ACCESS (e.g. CampusChampions)
Metrics
XDMoD Data Sources
From the ACCESS allocations database & XRAS:
Jobs submitted and run.
Queues and system accounts.
Users, PIs, and organizations.
Projects and allocations.
Fields of science.
Science Gateway users.
From CiDeR
Resource specifications.
From NSF.gov:
NSF awards.
Direct from RPs:
Slurm (and other resource manager) accounting logs.
Compute node-level performance data.
OpenStack logs.
Open OnDemand logs.
Network logs (stored in NetSage).
CloudBank data.
From NetSage
Data for every flow and Globus record, source and destination (see the record sketch after this list):
Flow/task size and rate (and unique ID)
Organization, ASN
Country, Lat, Long
Subnet, Port
ScienceRegistry Project Name, Discipline
Resource name
Community membership
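For illustration, a NetSage flow record carrying the fields listed above could be held in a structure like the following; the field names are assumptions for this sketch, not the actual NetSage schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowRecord:
    """Illustrative container for one NetSage flow/Globus record.

    Fields follow the list above; real NetSage field names and units
    may differ.
    """
    flow_id: str                 # unique flow/task ID
    size_bytes: int              # flow/task size
    rate_bps: float              # transfer rate
    src_org: str                 # source organization
    src_asn: int                 # source autonomous system number
    src_country: str
    src_lat: float
    src_lon: float
    src_subnet: str
    src_port: int
    dst_org: str                 # destination organization
    dst_asn: int
    dst_country: str
    dst_lat: float
    dst_lon: float
    dst_subnet: str
    dst_port: int
    project_name: Optional[str] = None   # ScienceRegistry project name
    discipline: Optional[str] = None     # ScienceRegistry discipline
    resource_name: Optional[str] = None
    community: Optional[str] = None      # community membership
```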
Allocations
Metrics Framework
https://link.springer.com/article/10.1007/s42979-024-02787-4
KPI-based project improvement metrics (a worked example follows this list)
Democratization Index
Ecosystem access time
RP Satisfaction
XRAS uptime
Feedback responsiveness
Ticket resolution
Staff satisfaction
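These KPIs are not defined in detail in these notes; as a minimal worked example, assuming hypothetical ticket and outage records, a ticket-resolution figure and an XRAS uptime percentage could be computed like this.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical ticket records: (opened, resolved) timestamps.
tickets = [
    (datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 17, 30)),
    (datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 4, 10, 0)),
    (datetime(2025, 3, 5, 8, 0), datetime(2025, 3, 5, 9, 15)),
]

# "Ticket resolution" KPI: median hours from open to resolution.
resolution_hours = [(resolved - opened).total_seconds() / 3600
                    for opened, resolved in tickets]
print(f"Median ticket resolution: {median(resolution_hours):.1f} h")

# "XRAS uptime" KPI: share of the reporting period the service was up,
# given a list of (start, end) outage windows.
period_start, period_end = datetime(2025, 3, 1), datetime(2025, 4, 1)
outages = [(datetime(2025, 3, 10, 2, 0), datetime(2025, 3, 10, 3, 30))]
downtime = sum((end - start for start, end in outages), timedelta())
uptime_pct = 100 * (1 - downtime / (period_end - period_start))
print(f"XRAS uptime: {uptime_pct:.2f}%")
```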
Annual Report
https://kilthub.cmu.edu/articles/report/ACCESS_Allocations_PY3_Annual_Report/28950569?file=54281699
General ecosystem activity
Users
Institution, academic status, # of allocations they’re on
Projects
Usage, documents, abstracts, FOS, etc.
Publications
RPs utilized
Flip side of this → per-RP data (not actively monitored/published)
Counts of requests
Response times
Approval/decline rates
Operations
Data we have now:
System & service Logs – security data we retain in case we need to look at it.
Monitoring that security does on different services
Qualys data (is this host vulnerable?)
Resource information data
Logging, APIs, software, hardware, RP people/contacts
Resource news (outages etc.)
Nagios monitoring data
Service index about our services
perfSONAR log info
ACCESS identity information / COManage info
TeraGrid.org identities
CILogon/ACCESS authentication logs
Ticketing statistics, time to resolution, tags
Web stats/hits to web pages (google analytics)
STEP application institution demographics
STEP survey data (sensitive/not to be shared)
Historical data on volumes of network traffic sent across the network
Data we need that requires input from across all ACCESS teams:
Time/effort required to guide a new resource through integration process
Time/effort required to support/maintain an existing RP
Data we receive from others:
Flow data from Internet2
Netsage data
Eval team: community, staff, RP survey data
ACO
ACO – What data do we collect
Meeting Notes / Attendance – RAC
Meeting Notes / Attendance – EAB
ideas generated
Meeting Notes / Attendance - EC
Software Working Groups and Standing Committees Notes
Comms
Stories (HPC Wire – articles & awards)
Newsletters - Internal and External
statistics
Reach / Social Media
Website stats – visits, click-through
Publications – some are not in website Publication list
Surveys
EAB Survey
QM Survey
Staff Survey
RP Survey
Community Surveys
SC booth, PEARC booth info – names of people who have visited the booth
Reports
Financial Reports
Meeting Reports – ACO
Tools we are using: UDO & VIVO
Risk Register
Jira used to track NSF, EC, and ACO meeting tasks
PEP milestones
Confluence
Adoption
Engagement Tracker
Quarterly Reports
EAB -
Number of new allocations
Turnaround time
Publications
New Publications
Turnaround time
EC decisions
Lessons learned
Financials
Tasks generated and completed
AI-generated reference list:
Collated Data Sets & Cross-Reference Schema
Below is a structured breakdown of the data sets your teams collect, organized into eight key categories, followed by the common identifiers that recur across those categories and a cross-reference matrix of the key data sets in each category. This should help you design joins, spot overlaps, and build a unified analytics warehouse.
Data Categories & Sources
User & Institution Info
Names; Institutions; ACCESS ID; Email; Carnegie classification; Geographic location; How they found us; Affinity group tracking.
Usage & Allocations
XDMoD jobs/queues/accounts; XRAS allocations & credit exchanges; SDS search logs; Software inventory from CiDeR/IPFTool.
Support Interactions
ACCESS ticketing; Support Digest (open rates, clicks); Office Hours attendance & concerns; Chatbot Q&A logs & ratings.
Surveys & Feedback
Focus Groups 2023 site review; User-experience surveys; NAIRR UEWG underutilization (Nov 2024, ongoing 2025); 1:1 meetings; Pre/post workshop surveys; STEP surveys; EAB/QM/Staff/RP/community surveys.
Events & Training
Webinar registrations (NAIRR/ACCESS); CSSN roles/interests; MATCH requests; CCEP award applications; SC/PEARC booth logs; Workshop attendance.
Communications & Outreach
Announcements (RPs); Internal/external newsletters; Website stats (visits, click-through); Social-media reach; Article mentions (HPC Wire).
Resource & Infrastructure Metrics
OnDemand logs; ACCESS Pegasus usage; NetSage flows; CloudBank billing; Nagios/perfSONAR/security logs; SDS search behavior.
Performance & KPIs
Democratization Index; Ecosystem access time; RP satisfaction; XRAS uptime; Feedback response times; Annual Report metrics.
Common Identifiers (Join Keys)
Person: Name, Email, ACCESS ID
Institution: University/Company, Carnegie classification, Geographic region
Resource: Resource name/ID (e.g., queue, system account), Software package
Interaction: Survey response ID, Ticket ID, Session or event ID
Session: Web session ID, click events
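As a minimal sketch of joining two data sets on the Person and Institution keys above, the example below merges a hypothetical usage export with a hypothetical survey export; column names and records are illustrative, not actual export formats.

```python
import pandas as pd

# Hypothetical exports; real column names depend on each source system.
usage = pd.DataFrame({
    "access_id": ["u001", "u002", "u003"],
    "institution": ["Univ A", "Univ B", "Univ C"],
    "resource": ["Expanse", "Bridges-2", "Delta"],
    "su_used": [1200, 300, 4500],
})
surveys = pd.DataFrame({
    "access_id": ["u001", "u003"],
    "institution": ["Univ A", "Univ C"],
    "satisfaction": [4, 2],
})

# Join on the shared Person + Institution keys; a left join keeps users
# who never answered a survey.
combined = usage.merge(surveys, on=["access_id", "institution"], how="left")
print(combined)
```

The same pattern extends to ticketing, event, and web-session exports once each row carries the ACCESS ID (or an email/name that can be resolved to one).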
Cross-Reference Matrix
| Category | Key Data Sets |
|---|---|
| User & Institution Info | Names; Institutions; ACCESS ID; Email; Carnegie classification; Geographic location; Affinity group tracking |
| Usage & Allocations | XDMoD; XRAS credits & allocations; SDS searches; CiDeR software |
| Support Interactions | Ticketing; Support Digest; Office Hours; Chatbot |
| Surveys & Feedback | Focus Groups 2023 site review; NAIRR UEWG underutilization; NAIRR underutilized allocations; 1:1 meetings; Pre/post workshop surveys; STEP application & survey; EAB/QM/Staff/RP/Community surveys |
| Events & Training | Webinars; Registrations; MATCH requests; CCEP awards; Booth logs |
| Communications & Outreach | Announcements; Newsletters; Website visits; Social media reach |
| Resource & Infrastructure | OnDemand logs; Pegasus usage; NetSage flows; CloudBank billing; Nagios/perfSONAR logs |
| Performance & KPIs | Democratization Index; Ecosystem access time; RP satisfaction; XRAS uptime; Feedback response times |
Next Steps
Design a star schema with a central User dimension (Person ID + Institution ID) and separate fact tables for Allocations/Usage, Support, Events, Surveys, Communications, and System Logs (a minimal schema sketch follows this list).
Align naming conventions so that “Institution” in surveys, ticketing, and allocations all reference the same institution dimension.
Implement a master registry of resource IDs (including queue names, software packages, OnDemand instances) to unify Usage and Infrastructure facts.
Develop ETL pipelines that tag each row with the proper Session ID and Survey ID to enable cross-analysis (e.g., linking support tickets to resource usage spikes).
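A minimal sketch of such a star schema, using SQLite purely for illustration; the table and column names are assumptions for this sketch rather than an agreed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative; any warehouse would do
conn.executescript("""
-- Dimensions shared by every fact table.
CREATE TABLE dim_institution (
    institution_id          INTEGER PRIMARY KEY,
    name                    TEXT,
    carnegie_classification TEXT,
    geographic_region       TEXT
);
CREATE TABLE dim_person (
    person_id      INTEGER PRIMARY KEY,
    access_id      TEXT UNIQUE,
    name           TEXT,
    email          TEXT,
    institution_id INTEGER REFERENCES dim_institution(institution_id)
);
CREATE TABLE dim_resource (
    resource_id   INTEGER PRIMARY KEY,
    resource_name TEXT,   -- queue, system, OnDemand instance, or software package
    resource_type TEXT
);

-- One fact table per activity stream; Support, Events, Surveys,
-- Communications, and System Logs would follow the same pattern.
CREATE TABLE fact_usage (
    usage_id     INTEGER PRIMARY KEY,
    person_id    INTEGER REFERENCES dim_person(person_id),
    resource_id  INTEGER REFERENCES dim_resource(resource_id),
    job_count    INTEGER,
    su_charged   REAL,
    period_start TEXT,
    period_end   TEXT
);
""")
conn.commit()
```

Surrogate keys in the dimensions keep the fact tables narrow and let each team's identifiers map onto the same person, institution, and resource rows.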
Beyond this schema, you might consider:
Building real-time dashboards for Alerts (e.g., sudden drop in XRAS uptime vs. surge in support tickets).
Implementing role-based data access so teams see only their slice yet can share global snapshots.
Establishing a data governance council to refine definitions (e.g., what counts as “active” in each domain).