...

The OSN serves two principal purposes: (1) enable the smooth flow of large data sets between resources such as instruments, campus data centers, national supercomputing centers, and cloud providers; and (2) facilitate access to long-tail data sets by the scientific community. Examples of data currently available on the OSN include synthetic data from ocean models; the widely used Extracted Features Set from the Hathi Trust Digital Library; open access earth sciences data from Pangeo; and geophysical data from BCO-DMO. Researchers are using these data sets to train machine learning models, validate simulations, and perform statistical analysis of live data.

System Overview

OSN data is housed in storage pods interconnected by national, high-performance networks, creating well-connected, cloud-like storage that is easily accessible at data transfer rates comparable to or exceeding those of public cloud storage providers. Users can temporarily park data for retrieval by a collaborator, or create a repository of active research data.

...

  • End Users who wish to view metadata and retrieve data.

  • Data Curators who maintain data sets

  • Data Managers who grant access to data sets for Curators and End Users

Configuration

Key characteristics of OSN storage are:

...

OSN storage pods are located in science DMZs at Big Data Hub sites, interconnected by national, high-performance networks. Five petabytes of storage are currently available for allocation.

...

File Systems

OSN storage is disk-based and primarily intended to house active data sets. Storage is allocated from the pod(s) closest to the requestor that have the capacity to fulfill the request. Allocations of between 10 and 50 terabytes can be requested through the XRAS process. If your project needs more than 50 terabytes, please contact the OSN team directly to discuss your needs before you submit your request.

...

An active research data set can remain in OSN storage up to five years and usage must comply with the OSN Acceptable Use Policy.

Allocations

Storage on the OSN is allocated in standalone buckets independent of HPC allocations. There is a one-to-one mapping between buckets and allocations. This User Guide uses "Allocation" when referring to outward-facing operations such as Allocation requests, and "Bucket" when referring to inward-facing operations such as Bucket creation.

...

An active research dataset can remain in OSN storage up to five years.

Accessing Datasets

OSN supports a RESTful API that is compatible with the basic data access model of the Amazon S3 API. Any software that complies with that API can access data stored on the OSN.
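Because the API follows S3 conventions, an object in an open access data set can in principle be addressed with a plain HTTPS URL. The sketch below is a minimal illustration using only the Python standard library. It assumes path-style addressing (`<endpoint>/<bucket>/<key>`) and that the pod accepts `list-type=2` listing requests; the bucket and key names are hypothetical, and the endpoint is the one used in the rclone example in this guide.

```python
import shutil
import urllib.request
from urllib.parse import quote, urlencode


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style URL: <endpoint>/<bucket>/<key>."""
    # quote() leaves "/" intact, so nested key prefixes survive.
    return f"{endpoint.rstrip('/')}/{quote(bucket)}/{quote(key)}"


def listing_url(endpoint: str, bucket: str, prefix: str = "") -> str:
    """Build a bucket listing URL (assumes the pod supports list-type=2)."""
    query = urlencode({"list-type": "2", "prefix": prefix})
    return f"{endpoint.rstrip('/')}/{quote(bucket)}?{query}"


def download_anonymous(url: str, destination: str) -> None:
    """Fetch an object without credentials; only suitable for open access data."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


if __name__ == "__main__":
    # Hypothetical bucket and key, shown only to illustrate the URL shape.
    print(object_url("https://mghp.osn.xsede.org", "example-bucket", "model-output/run 1.nc"))
```

Protected data sets require signed requests, which are better left to an S3 client library or one of the tools described in this guide.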

There are three common methods for connecting to and using OSN resources: OSN portal built-in web tools, third party desktop applications and third party data management server applications.

Third Party Desktop Applications

There are numerous commercial and open source software tools for moving files to and from S3 buckets. These tools provide more sophisticated capabilities than the built-in browser tool, including transfer management, multi-upload management, and configuration options that can help optimize data transfer for a given computer/network environment.

...

Note that the "Bucket" information displayed in the portal has two components (this will be important when you configure third party tools). The bucket information contains the OSN site/pod location and the specific allocation on that pod.
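As a rough illustration of the two components, the sketch below splits a bucket string into the pod location and the allocation name. It assumes the portal shows the information as a single URL-like string of the form `https://<pod-host>/<bucket>`, which this guide does not confirm; the bucket name is hypothetical.

```python
from urllib.parse import urlsplit


def split_bucket_info(bucket_info: str) -> tuple[str, str]:
    """Return (pod_location, bucket_name) from an assumed URL-like bucket string."""
    parts = urlsplit(bucket_info)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected a URL such as https://host/bucket, got {bucket_info!r}")
    bucket = parts.path.strip("/")
    if not bucket or "/" in bucket:
        raise ValueError(f"expected exactly one bucket name in {bucket_info!r}")
    # The pod location is what third party tools typically call the server or endpoint.
    return f"{parts.scheme}://{parts.netloc}", bucket


if __name__ == "__main__":
    print(split_bucket_info("https://mghp.osn.xsede.org/example-bucket"))
```

Most third party tools ask for these two values separately, for example as a server or endpoint and a path or bucket.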

Cyberduck

Cyberduck is a popular "cloud storage browser" for Mac and Windows that supports the S3 API along with multiple other storage providers and protocols. The software may be downloaded at: https://cyberduck.io/download/ The following describes how to configure Cyberduck to connect to an OSN resource.

...

When specifying "Port", use 443 if the location starts with "https://"; use 80 if the location starts with "http://".
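The port rule can be expressed as a small lookup. This is only an illustrative sketch; the function name is hypothetical and is not part of Cyberduck or OSN.

```python
from urllib.parse import urlsplit

# Ports named in this guide for each location scheme.
PORTS = {"https": 443, "http": 80}


def port_for_location(location: str) -> int:
    """Return the port to enter for a location that starts with https:// or http://."""
    scheme = urlsplit(location).scheme.lower()
    if scheme not in PORTS:
        raise ValueError(f"location must start with https:// or http://, got {location!r}")
    return PORTS[scheme]


if __name__ == "__main__":
    print(port_for_location("https://mghp.osn.xsede.org"))
```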

...

Anonymous Access Data Sets

Some datasets provide anonymous read access. To access a bucket anonymously, type "anonymous" into the Access ID field; Cyberduck will then select the grayed-out anonymous access box in the window.

...

Exit the window for the bookmark to save.

Browsing, Uploading and Downloading

Once a bookmark is created, you can use it to access data by double-clicking the bookmark. This logs you in and lists the contents of the dataset.

...

The tool supports multiple upload/download streams, chunking, pausing and restarting.

Rclone

Rclone is an open source command line utility that functions similarly to rsync and is able to communicate with numerous cloud-based storage providers. The application and documentation may be found at https://rclone.org. Download and install the application per the instructions on the rclone website.

Rclone Configuration

The most straightforward way to configure Rclone for OSN is to edit the rclone configuration file. This file may be found by typing the command "rclone config file". The command will return the path to the rclone config file. Open this file with a text editor and add the following stanza to the end of the file:

...

[ocean-data]
type = s3
provider = Ceph
access_key_id = ASasd8KJHDAKH**&asd
secret_access_key = asd(*&Adskj*(*(&868778
endpoint = https://mghp.osn.xsede.org
no_check_bucket = true
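
If you would rather generate the stanza than type it by hand, the following sketch renders the same fields with Python's standard `configparser` module. The function name and the placeholder credentials are hypothetical; substitute the keys and endpoint shown for your own allocation.

```python
import configparser
import io


def osn_remote_stanza(name: str, access_key: str, secret_key: str, endpoint: str) -> str:
    """Render an rclone remote stanza with the fields used in this guide."""
    # interpolation=None keeps any "%" characters in keys from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser[name] = {
        "type": "s3",
        "provider": "Ceph",
        "access_key_id": access_key,
        "secret_access_key": secret_key,
        "endpoint": endpoint,
        "no_check_bucket": "true",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder credentials; never commit real keys to a shared file.
    print(osn_remote_stanza("ocean-data", "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY",
                            "https://mghp.osn.xsede.org"))
```

The returned text can be appended to the file whose path is reported by "rclone config file".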

Rclone commands

Rclone commands are of the form:

...

Rclone offers a wide range of commands for performing typical unix file operations (ls, cp, rm, rsync, etc.). Details on these commands can be found in the rclone documentation at https://rclone.org.
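
When rclone is driven from a script, it can help to assemble each command as an argument list. The sketch below builds `rclone ls` and `rclone copy` invocations against the `ocean-data` remote defined earlier in this guide. The bucket name and local path are hypothetical, and the helpers only build commands; they do not run rclone.

```python
import shlex


def remote_path(remote: str, bucket: str, key: str = "") -> str:
    """Format an rclone remote path such as ocean-data:bucket/key."""
    path = f"{remote}:{bucket}"
    return f"{path}/{key}" if key else path


def rclone_command(subcommand: str, *paths: str) -> list[str]:
    """Return the argument list for an rclone invocation."""
    return ["rclone", subcommand, *paths]


if __name__ == "__main__":
    # List the contents of a hypothetical bucket on the configured remote.
    listing = rclone_command("ls", remote_path("ocean-data", "example-bucket"))
    # Copy a local directory into a prefix inside that bucket.
    upload = rclone_command("copy", "./results", remote_path("ocean-data", "example-bucket", "results"))
    for command in (listing, upload):
        print(shlex.join(command))
```

The resulting lists can be passed to `subprocess.run` once rclone is installed and the remote is configured.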

Third Party Data Management Applications

OSN users may also choose to layer more sophisticated data management applications on top of the S3 API services that OSN provides. Two applications that have been used with OSN are Globus (using the Globus S3 connector) and iRODS. Both packages have detailed descriptions of how to connect the service with an S3 storage provider.

Landing Pages

Coming Soon! The data set owner may also create a landing page that follows DOI landing page conventions, making it easy to visit from a browser or data catalog. The landing page contains metadata that describes the data set and links to preconfigured, downloadable tools for accessing the data set.

...

A completed template example is shown below.

...

Open Access & Protected Data Sets

OSN datasets can be either open access or protected. In the former case, keys are only needed to write new objects to the dataset; read access can be accomplished anonymously (e.g. as shown earlier for the anonymous Cyberduck configuration).
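
The rule can be summarized as a small decision function. This is an illustrative sketch only, based on the descriptions in this guide; the function and value names are hypothetical and do not correspond to any OSN API.

```python
def key_required(dataset_access: str, operation: str) -> bool:
    """Return True when an access key is needed for the given operation."""
    if dataset_access not in ("open", "protected"):
        raise ValueError(f"unknown dataset access type: {dataset_access!r}")
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation!r}")
    if operation == "write":
        # Writing new objects always requires keys.
        return True
    # Reads are anonymous for open access data sets and keyed for protected ones.
    return dataset_access == "protected"


if __name__ == "__main__":
    for access in ("open", "protected"):
        for op in ("read", "write"):
            print(access, op, key_required(access, op))
```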

...

In the image below, culbertj@mit.edu and jtgoodhue@mghpcc.org have access to all the keys in the project because they are both data managers for the project "JIMTEST"; dsimmel@psc.edu will only have access to the two keys shown with the "visible" checkbox checked.

...

Managing Files and Data

There are four roles associated with an OSN allocation:

  • Principal Investigator - Responsible for the allocation and serves as either the Data Manager or the Alternate Data Manager for the allocation.

  • Data Manager:

    • Adds/removes data curators and data managers

    • Adds/removes end users for protected data

    • Maintains data set landing page information

    • Monitors capacity vs utilization and requests allocation changes when needed

    The OSN Portal is used by PIs/Data Managers to manage their allocations. The Portal uses CILogon for authentication and provides bucket administration tools to the PI/Data Manager who requested the allocation. When requesting an allocation, the PI provides an identity that is recognized by CILogon. After the bucket is created, the PI can log in to the OSN Portal and administer access to the bucket.

  • Data Curator - Maintains the data set

  • End User:

    • Has read access to all of the data in the bucket. Public-access buckets allow access to anyone who has the name of the pod and bucket. Authenticated access buckets allow access to anyone who has the READ key.

    • Registers via any identity service that is trusted by the data manager (InCommon, ORCID, GitHub, Google, Amazon, etc.)

    • Logs in after receiving an invitation from a Data Manager or OSN Operations

Transferring Data to the OSN

OSN data sets consist of Ceph objects that are accessible from anywhere via a RESTful protocol that follows S3 conventions. All end user access is via S3 put and get requests, mediated by Bucket Policies.

...

  • Object-level access control is not supported

  • There is no audit trail that identifies the originator of any request

  • Keys are unique to a bucket, and buckets are unique to a pod

  • For example, if a data set has been replicated across two pods, each instance has a different set of keys and separately maintained access control

  • Since there is one READ key per bucket, the origin of a Get request may only be distinguished by source IP address

Help

Users may always contact /wiki/spaces/PreReleaseDocumentation/pages/67712578 for help.