Downloading datasets
All communication with SciCat is handled by a client object. Normally, you would construct one like this:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url="https://scicat.ess.eu/api/v3",
    token=...,
    file_transfer=SFTPFileTransfer(host="login.esss.dk"),
)
In this example, we use ESS's SciCat. If you want to use a different instance, you need to find out its URL. Note that this is not the same URL that you open in a browser; the API URL typically ends in a suffix like /api/v3.
Here, we authenticate using a token. You can find your token in the web interface by logging in and opening the settings. Alternatively, we could use a username and password via Client.from_credentials, as sketched after the warning below.
WARNING:
Do not hard-code secrets like tokens or passwords in notebooks or scripts! There is a high risk of exposing them when the code is put under version control or uploaded to SciCat.
Scitacean currently requires secrets to be passed as function arguments, so for now you will have to find your own way of keeping them out of your code.
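For example, here is a minimal sketch of a credentials-based login that keeps the password out of the source code. Client.from_credentials is part of Scitacean; prompting with input and getpass is just one possible approach:

import getpass

from scitacean import Client

# Prompt for the credentials at runtime instead of storing them in the script.
client = Client.from_credentials(
    url="https://scicat.ess.eu/api/v3",
    username=input("SciCat username: "),
    password=getpass.getpass("SciCat password: "),
)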
While the client itself is responsible for talking to SciCat, a file_transfer object is required to download data files. Here, we use SFTPFileTransfer, which downloads and uploads files via SFTP. The file transfer needs to authenticate separately from the SciCat connection. By default, it requires an SSH agent to be running and set up for the selected host.
For the purposes of this guide, we don't want to connect to a real SciCat server and deal with the complications that entails. So we set up a fake client that only pretends to connect to SciCat and file servers. Everything else in this guide works in the same way with a real client. See Developer Documentation/Testing if you are interested in the details.
[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()
Metadata
We need the ID (pid) of a dataset in order to download it. The fake client provides a dataset with PID 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3, which we can download using:
[2]:
dset = client.get_dataset("20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3")
Datasets can easily be inspected in Jupyter notebooks:
[3]:
dset
[3]:
(Fields marked * are required.)

  | Name | Type | Value | Description
---|---|---|---|---
* | creation_time | datetime | 2022-06-29 14:01:05+0000 | Time when dataset became fully available on disk, i.e. all containing files have been written, or the dataset was created in SciCat. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
* | input_datasets | list[PID] | None | Array of input dataset identifiers used in producing the derived dataset. Ideally these are the global identifiers of existing datasets inside this or federated data catalogs.
* | source_folder | RemotePath | RemotePath('/hex/ps/thaum') | Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single-file dataset, e.g. HDF5 data, it contains the path up to, but excluding, the filename. Trailing slashes are removed.
  | description | str | Measured the thaum flux | Free text explanation of contents of dataset.
  | name | str | Thaum flux | A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes, e.g. instead of displaying the pid. Will be autofilled if missing, using info from sourceFolder.
  | pid | PID | 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3 | Persistent identifier of the dataset.
  | proposal_id | str | None | The ID of the proposal to which the dataset belongs.
  | sample_id | str | None | ID of the sample used when collecting the data.

Advanced fields

  | Name | Type | Value | Description
---|---|---|---|---
* | contact_email | str | p.stibbons@uu.am | Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons.
* | creation_location | str | UnseenUniversity | Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset.
* | investigator | str | None | First name and last name of the person or people pursuing the data analysis. The string may contain a list of names, which should then be separated by semicolons.
* | owner | str | Ponder Stibbons | Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons.
* | owner_group | str | uu | Name of the group owning this item.
* | principal_investigator | str | p.stibbons@uu.am | First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon-separated list. This field is required if the dataset is a Raw dataset.
* | used_software | list[str] | None | A list of links to software repositories which uniquely identifies the pieces of software, including versions, used for yielding the derived data.
  | access_groups | list[str] | ['faculty'] | List of groups which have access to this item.
  | api_version | str | None | Version of the API used in creation of the dataset.
  | classification | str | None | ACIA information about AUthenticity, COnfidentiality, INtegrity and AVailability requirements of the dataset. E.g. AV(ailability)=medium could trigger the creation of two tape copies. Format: 'AV=medium,CO=low'.
  | comment | str | None | Comment the user has about a given dataset.
  | created_at | datetime | 2022-08-17 14:20:23+0000 | Date and time when this record was created. This field is managed by mongoose through the timestamp settings. The field should be a string containing a date in ISO 8601 format (2024-02-27T12:26:57.313Z).
  | created_by | str | Ponder Stibbons | Indicates the user who created this record. This property is added and maintained by the system.
  | data_format | str | None | Defines the format of the data files in this dataset, e.g. Nexus Version x.y.
  | data_quality_metrics | int | None | Data Quality Metrics is a number given by the user to rate the dataset.
  | end_time | datetime | None | End time of data acquisition for the current dataset. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
  | instrument_group | str | None | Group of the instrument on which this item was acquired.
  | instrument_id | str | None | ID of the instrument where the data was created.
  | is_published | bool | None | Flag is true when data are made publicly available.
  | job_log_data | str | None | The output job logfile. Keep the size of this log data well below 15 MB.
  | job_parameters | dict[str, typing.Any] | None | The creation process of the derived data will usually depend on input job parameters. The full structure of these input parameters is stored here.
  | keywords | list[str] | None | Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs.
  | license | str | None | Name of the license under which the data can be used.
  | lifecycle | Lifecycle | None | Describes the current status of the dataset during its lifetime with respect to the storage handling systems.
  | orcid_of_owner | str | None | ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons.
  | owner_email | str | None | Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons.
  | relationships | list[Relationship] | None | Stores the relationships with other datasets.
  | run_number | str | None | Run number assigned by the system to the data acquisition for the current dataset.
  | shared_with | list[str] | None | List of users that the dataset has been shared with.
  | source_folder_host | str | None | DNS host name of the file server hosting sourceFolder, optionally including a protocol, e.g. [protocol://]fileserver1.example.com.
  | start_time | datetime | None | Start time of data acquisition for the current dataset. It is expected to be in ISO8601 format according to specifications for internet date/time format in RFC 3339, chapter 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
  | techniques | list[Technique] | None | Stores the metadata information for techniques.
  | updated_at | datetime | 2022-11-01 13:22:08+0000 | Date and time when this record was updated last. This field is managed by mongoose through the timestamp settings. The field should be a string containing a date in ISO 8601 format (2024-02-27T12:26:57.313Z).
  | updated_by | str | anonymous | Indicates the user who updated this record last. This property is added and maintained by the system.
  | validation_status | str | None | Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.

Files: 2 (95 B)

Local | Remote | Size
---|---|---
None | RemotePath('flux.dat') | 20 B
None | RemotePath('logs/measurement.log') | 75 B
All attributes listed above can be accessed directly:
[4]:
dset.type
[4]:
<DatasetType.RAW: 'raw'>
[5]:
dset.name
[5]:
'Thaum flux'
[6]:
dset.owner
[6]:
'Ponder Stibbons'
See Dataset for a list of available fields.
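The same list can also be inspected programmatically. As a small sketch, assuming the Dataset.fields() iterator and the field attributes described in the reference documentation:

from scitacean import Dataset

# Print the name and type of every dataset field.
# (Sketch only; see the Dataset reference for the authoritative API.)
for field in Dataset.fields():
    print(field.name, field.type)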
In addition, datasets can have free-form scientific metadata, which can be accessed using
[7]:
dset.meta
[7]:
{'data_type': 'histogram', 'temperature': {'value': '123', 'unit': 'K'}}
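The metadata behaves like a regular Python dict, so individual entries can be read by key. For instance, with the metadata shown above:

# Read individual metadata entries by key.
temperature = dset.meta["temperature"]
print(f"T = {temperature['value']} {temperature['unit']}")  # prints: T = 123 K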
Files
The data files associated with this dataset can be accessed using
[8]:
for f in dset.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = None
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
Note that the local_path for both files is None. This indicates that the files have not been downloaded. Indeed, client.get_dataset downloads only the metadata from SciCat, not the files.
We can download the first file using
[9]:
dset_with_local_file = client.download_files(dset, target="download", select="flux.dat")
[10]:
for f in dset_with_local_file.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = PosixPath('download/flux.dat')
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
This populates the local_path:
[11]:
file = list(dset_with_local_file.files)[0]
[12]:
file.local_path
[12]:
PosixPath('download/flux.dat')
We can use it to read the file:
[13]:
with file.local_path.open("r") as f:
    print(f.read())
5 4 9 11 15 12 7 6 1
If we wanted to download all files, we could pass select=True (or nothing, since True is the default) to client.download_files. See Client.download_files for more options to select files.
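As a sketch of some other selectors (Client.download_files is real; the exact set of accepted selector types is listed in its reference, and the regular-expression variant below assumes compiled patterns are accepted as described there):

import re

# Download all files of the dataset (True is the default for `select`).
dset_all = client.download_files(dset, target="download", select=True)

# Download only files whose remote path matches a regular expression,
# assuming `select` accepts a compiled pattern per the reference docs.
dset_logs = client.download_files(dset, target="download", select=re.compile(r"\.log$"))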