Downloading datasets
All communication with SciCat is handled by a client object. Normally, you would construct one like this:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url="https://scicat.ess.eu/api/v3",
    token=...,
    file_transfer=SFTPFileTransfer(host="login.esss.dk"),
)
In this example, we use ESS's SciCat. If you want to use a different instance, you need to find out its URL. Note that this is not the same URL that you open in a browser; the API URL typically ends in a suffix like /api/v3.
Here, we authenticate using a token. You can find your token in the web interface by logging in and opening the settings. Alternatively, we could use a username and password via Client.from_credentials, as sketched after the warning below.
WARNING:
Do not hard-code secrets like tokens or passwords in notebooks or scripts! There is a high risk of exposing them when the code is put under version control or uploaded to SciCat.
Scitacean currently requires secrets to be passed as function arguments, so for now you will have to find your own way of keeping them out of your code.
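For example, here is a minimal sketch of a credentials-based login that keeps the password out of the source code. Client.from_credentials is part of Scitacean; prompting with input and getpass is just one possible approach:

import getpass

from scitacean import Client

# Prompt for the credentials at runtime instead of storing them in the script.
client = Client.from_credentials(
    url="https://scicat.ess.eu/api/v3",
    username=input("SciCat username: "),
    password=getpass.getpass("SciCat password: "),
)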
While the client itself is responsible for talking to SciCat, a file_transfer object is required to download data files. Here, we use SFTPFileTransfer, which downloads and uploads files via SFTP. The file transfer needs to authenticate separately from the SciCat connection. By default, it requires an SSH agent to be running and set up for the selected host.
For the purposes of this guide, we don't want to connect to a real SciCat server and deal with the complications that entails. So we set up a fake client that only pretends to connect to SciCat and file servers. Everything else in this guide works in the same way with a real client. See Developer Documentation/Testing if you are interested in the details.
[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()
Metadata
We need the ID (pid) of a dataset in order to download it. The fake client provides a dataset with PID 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3, which we can download using:
[2]:
dset = client.get_dataset("20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3")
Datasets can easily be inspected in Jupyter notebooks:
[3]:
dset
[3]:
(Fields marked * are required.)

  | Name | Type | Value | Description
---|---|---|---|---
* | creation_time | datetime | 2022-06-29 14:01:05+0000 | Time when dataset became fully available on disk, i.e. all containing files have been written, or the dataset was created in SciCat. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
* | input_datasets | list[PID] | None | Array of input dataset identifiers used in producing the derived dataset. Ideally these are the global identifiers of existing datasets inside this or federated data catalogs.
* | source_folder | RemotePath | RemotePath('/hex/ps/thaum') | Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single-file dataset, e.g. HDF5 data, it contains the path up to, but excluding, the filename. Trailing slashes are removed.
  | description | str | Measured the thaum flux | Free text explanation of contents of dataset.
  | name | str | Thaum flux | A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes, e.g. instead of displaying the pid. Will be autofilled if missing, using info from sourceFolder.
  | pid | PID | 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3 | Persistent identifier of the dataset.
  | proposal_id | str | None | The ID of the proposal to which the dataset belongs.
  | sample_id | str | None | ID of the sample used when collecting the data.

Advanced fields

  | Name | Type | Value | Description
---|---|---|---|---
* | contact_email | str | p.stibbons@uu.am | Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons.
* | creation_location | str | UnseenUniversity | Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset.
* | investigator | str | None | First name and last name of the person or people pursuing the data analysis. The string may contain a list of names, which should then be separated by semicolons.
* | owner | str | Ponder Stibbons | Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons.
* | owner_group | str | uu | Name of the group owning this item.
* | principal_investigator | str | p.stibbons@uu.am | First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon-separated list. This field is required if the dataset is a Raw dataset.
* | used_software | list[str] | None | A list of links to software repositories which uniquely identifies the pieces of software, including versions, used for yielding the derived data.
  | access_groups | list[str] | ['faculty'] | List of groups which have access to this item.
  | api_version | str | None | Version of the API used in creation of the dataset.
  | classification | str | None | ACIA information about AUthenticity, COnfidentiality, INtegrity and AVailability requirements of the dataset. E.g. AV(ailability)=medium could trigger the creation of two tape copies. Format: 'AV=medium,CO=low'.
  | comment | str | None | Comment the user has about a given dataset.
  | created_at | datetime | 2022-08-17 14:20:23+0000 | Date and time when this record was created. This field is managed by mongoose through the timestamp settings. The field should be a string containing a date in ISO 8601 format (2024-02-27T12:26:57.313Z).
  | created_by | str | Ponder Stibbons | Indicates the user who created this record. This property is added and maintained by the system.
  | data_format | str | None | Defines the format of the data files in this dataset, e.g. Nexus Version x.y.
  | data_quality_metrics | int | None | Data Quality Metrics is a number given by the user to rate the dataset.
  | end_time | datetime | None | End time of data acquisition for the current dataset. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
  | instrument_group | str | None | Group of the instrument on which this item was acquired.
  | instrument_id | str | None | ID of the instrument where the data was created.
  | is_published | bool | None | Flag is true when data are made publicly available.
  | job_log_data | str | None | The output job logfile. Keep the size of this log data well below 15 MB.
  | job_parameters | dict[str, typing.Any] | None | The creation process of the derived data will usually depend on input job parameters. The full structure of these input parameters is stored here.
  | keywords | list[str] | None | Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs.
  | license | str | None | Name of the license under which the data can be used.
  | lifecycle | Lifecycle | None | Describes the current status of the dataset during its lifetime with respect to the storage handling systems.
  | orcid_of_owner | str | None | ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons.
  | owner_email | str | None | Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons.
  | relationships | list[Relationship] | None | Stores the relationships with other datasets.
  | run_number | str | None | Run number assigned by the system to the data acquisition for the current dataset.
  | shared_with | list[str] | None | List of users that the dataset has been shared with.
  | source_folder_host | str | None | DNS host name of the file server hosting sourceFolder, optionally including a protocol, e.g. [protocol://]fileserver1.example.com.
  | start_time | datetime | None | Start time of data acquisition for the current dataset. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
  | techniques | list[Technique] | None | Stores the metadata information for techniques.
  | updated_at | datetime | 2022-11-01 13:22:08+0000 | Date and time when this record was updated last. This field is managed by mongoose through the timestamp settings. The field should be a string containing a date in ISO 8601 format (2024-02-27T12:26:57.313Z).
  | updated_by | str | anonymous | Indicates the user who updated this record last. This property is added and maintained by the system.
  | validation_status | str | None | Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.

Files: 2 (95 B)

Local | Remote | Size
---|---|---
None | RemotePath('flux.dat') | 20 B
None | RemotePath('logs/measurement.log') | 75 B
All attributes listed above can be accessed directly:
[4]:
dset.type
[4]:
<DatasetType.RAW: 'raw'>
[5]:
dset.name
[5]:
'Thaum flux'
[6]:
dset.owner
[6]:
'Ponder Stibbons'
See Dataset for a list of available fields.
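The same list can also be inspected programmatically. As a small sketch, assuming the Dataset.fields() iterator and the field attributes described in the reference documentation:

from scitacean import Dataset

# Print the name and type of every dataset field.
# (Sketch only; see the Dataset reference for the authoritative API.)
for field in Dataset.fields():
    print(field.name, field.type)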
In addition, datasets can have free-form scientific metadata, which can be accessed using
[7]:
dset.meta
[7]:
{'data_type': 'histogram', 'temperature': {'value': '123', 'unit': 'K'}}
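The metadata behaves like a regular Python dict, so individual entries can be read by key. For instance, with the metadata shown above:

# Read individual metadata entries by key.
temperature = dset.meta["temperature"]
print(f"T = {temperature['value']} {temperature['unit']}")  # prints: T = 123 K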
Files
The data files associated with this dataset can be accessed using
[8]:
for f in dset.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = None
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
Note that the local_path for both files is None. This indicates that the files have not been downloaded. Indeed, client.get_dataset downloads only the metadata from SciCat, not the files.
We can download the first file using
[9]:
dset_with_local_file = client.download_files(dset, target="download", select="flux.dat")
[10]:
for f in dset_with_local_file.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = PosixPath('download/flux.dat')
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
This populates the local_path:
[11]:
file = list(dset_with_local_file.files)[0]
[12]:
file.local_path
[12]:
PosixPath('download/flux.dat')
We can use it to read the file:
[13]:
with file.local_path.open("r") as f:
    print(f.read())
5 4 9 11 15 12 7 6 1
If we wanted to download all files, we could pass select=True (or nothing, since True is the default) to client.download_files. See Client.download_files for more options to select files.
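As a sketch of some other selectors (Client.download_files is real; the exact set of accepted selector types is listed in its reference, and the regular-expression variant below assumes compiled patterns are accepted as described there):

import re

# Download all files of the dataset (True is the default for `select`).
dset_all = client.download_files(dset, target="download", select=True)

# Download only files whose remote path matches a regular expression,
# assuming `select` accepts a compiled pattern per the reference docs.
dset_logs = client.download_files(dset, target="download", select=re.compile(r"\.log$"))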