scitacean.Dataset#

class scitacean.Dataset(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')[source]#

Metadata and linked data files for a measurement, simulation, or analysis.
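For illustration, a minimal sketch of constructing a dataset locally; all field values below are hypothetical:

from scitacean import Dataset

# 'type' is the only required argument; everything else is optional.
dset = Dataset(
    type="raw",
    name="Example measurement",
    owner="Jane Doe",
    owner_group="group1",
    contact_email="jane.doe@example.com",
    creation_location="/site/facility/instrument",
)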

Constructors

__init__(type[, access_groups, ...])

from_download_models(dataset_model, ...[, ...])

Construct a new dataset from SciCat download models.

Methods

add_files(*files[, datablock])

Add files to the dataset.

add_local_files(*paths[, datablock])

Add files on the local file system to the dataset.

add_orig_datablock(*, checksum_algorithm)

Append a new orig datablock to the list of orig datablocks.

as_new()

Return a new dataset with lifecycle-related fields erased.

derive(*[, keep])

Return a new dataset that is derived from self.

fields([dataset_type, read_only])

Iterate over dataset fields.

items()

Dict-like items method (name and value pairs of fields).

keys()

Dict-like keys method (names of fields).

make_attachment_upload_models()

Build models for all registered attachments.

make_datablock_upload_models()

Build models for all contained (orig) datablocks.

make_upload_model()

Construct a SciCat upload model from self.

replace(*[, _read_only, _orig_datablocks])

Return a new dataset with replaced fields.

replace_files(*files)

Return a new dataset with replaced files.

validate()

Validate the fields of the dataset.

values()

Dict-like values method (values of fields).

Attributes

access_groups

List of groups which have access to this item.

api_version

Version of the API used in creation of the dataset.

attachments

List of attachments for this dataset.

classification

ACIA information about the Authenticity, Confidentiality, Integrity, and Availability requirements of the dataset.

comment

Comment the user has about a given dataset.

contact_email

Email of the contact person for this dataset.

created_at

Date and time when this record was created.

created_by

Indicates the user who created this record.

creation_location

Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name.

creation_time

Time when the dataset became fully available on disk, i.e. all contained files have been written. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

data_format

Defines the format of the data files in this dataset, e.g. NeXus version x.y.

data_quality_metrics

A number given by the user to rate the quality of the dataset.

description

Free text explanation of contents of dataset.

end_time

End time of data acquisition for the current dataset. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

files

Files linked with the dataset.

input_datasets

Array of input dataset identifiers used in producing the derived dataset.

instrument_group

Group of the instrument which this item was acquired on.

instrument_id

ID of the instrument where the data was created.

investigator

First name and last name of the person or people pursuing the data analysis.

is_published

Flag is true when data are made publicly available.

job_log_data

The output job logfile.

job_parameters

The creation process of the derived data will usually depend on input job parameters; those parameters are stored here.

keywords

Array of tags associated with the meaning or contents of this dataset.

license

Name of the license under which the data can be used.

lifecycle

Describes the current status of the dataset during its lifetime with respect to the storage handling systems.

meta

Dict of scientific metadata.

name

A name for the dataset, given by the creator to carry some semantic meaning.

number_of_files

Number of files in directly accessible storage in the dataset.

number_of_files_archived

Total number of archived files in the dataset.

orcid_of_owner

ORCID of the owner or custodian.

owner

Owner or custodian of the dataset, usually first name + last name.

owner_email

Email of the owner or custodian of the dataset.

owner_group

Name of the group owning this item.

packed_size

Total size of all datablock package files created for this dataset.

pid

Persistent identifier of the dataset.

principal_investigator

First name and last name of principal investigator(s).

proposal_id

The ID of the proposal to which the dataset belongs.

relationships

Stores the relationships with other datasets.

run_number

Run number assigned by the system to the data acquisition for the current dataset.

sample_id

ID of the sample used when collecting the data.

shared_with

List of users that the dataset has been shared with.

size

Total size of files in directly accessible storage in the dataset.

source_folder

Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder.

source_folder_host

DNS host name of the file server hosting source_folder, optionally including a protocol, e.g. [protocol://]fileserver1.example.com.

start_time

Start time of data acquisition for the current dataset. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

techniques

Stores the metadata information for techniques.

type

Characterizes the type of the dataset, either 'raw' or 'derived'.

updated_at

Date and time when this record was last updated.

updated_by

Indicates the user who last updated this record.

used_software

A list of links to software repositories which uniquely identifies the pieces of software, including versions, used for yielding the derived data.

validation_status

Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.

__init__(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')#
add_files(*files, datablock=-1)[source]#

Add files to the dataset.

Return type:

None

add_local_files(*paths, datablock=-1)[source]#

Add files on the local file system to the dataset.

The files are set up to be uploaded to the dataset’s source folder without preserving the local directory structure. That is, given

dataset.source_folder = "remote/source"
dataset.add_local_files("/path/to/file1", "other_path/file2")

and uploading this dataset to SciCat, the files will be uploaded to:

remote/source/file1
remote/source/file2
Parameters:
  • paths (str | Path) – Local paths to the files.

  • datablock (int | str, default: -1) –

    Advanced feature, do not set unless you know what this is!

    Select the orig datablock to store the file in. If an int, use the datablock with that index. If a str or PID, use the datablock with that id; if there is none with matching id, raise KeyError.

Return type:

None

add_orig_datablock(*, checksum_algorithm)[source]#

Append a new orig datablock to the list of orig datablocks.

Parameters:

checksum_algorithm (str | None) – Use this algorithm to compute checksums of files associated with this datablock.

Returns:

OrigDatablock – The newly added datablock.

as_new()[source]#

Return a new dataset with lifecycle-related fields erased.

The returned dataset has the same fields as self, but fields that indicate when the dataset was created or by whom are set to None. These include, for example, created_at, history, and lifecycle.

Returns:

Dataset – A new dataset without lifecycle-related fields.
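For example, a sketch of preparing a modified copy for re-upload; downloaded is a hypothetical dataset obtained from SciCat:

# Erase created_at, lifecycle, etc. so the copy can be uploaded
# as a brand-new dataset.
fresh = downloaded.as_new()
fresh.comment = "Re-uploaded with corrected metadata"  # illustrative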

property attachments: list[Attachment] | None#

List of attachments for this dataset.

This property can be in two distinct ‘falsy’ states:

  • dset.attachments is None: It is unknown whether there are attachments. This happens when datasets are downloaded without downloading the attachments.

  • dset.attachments == []: It is known that there are no attachments. This happens either when downloading datasets or when initializing datasets locally without assigning attachments.
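A short sketch telling the two states described above apart; dset is a hypothetical dataset:

if dset.attachments is None:
    # Unknown: the attachments were not downloaded.
    print("attachments unknown")
elif not dset.attachments:
    # Known: there are no attachments.
    print("no attachments")
else:
    print(len(dset.attachments), "attachment(s)")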

derive(*, keep=('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques'))[source]#

Return a new dataset that is derived from self.

The returned dataset has most fields set to None, but a number of fields can be carried over from self. By default, this assumes that the owner of the derived dataset is the same as the owner of the original. This can be customized with the keep argument.

Parameters:

keep (Iterable[str], default: ('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques')) – Fields to copy over to the derived dataset.

Returns:

Dataset – A new derived dataset.

Raises:

ValueError – If self has no PID. The derived dataset requires a PID in order to link back to self.
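For illustration, a sketch assuming dset was downloaded from SciCat and therefore has a PID:

# The derived dataset links back to dset via input_datasets.
derived = dset.derive(keep=("owner", "owner_email", "contact_email"))
derived.name = "Processed data"  # illustrative value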

classmethod fields(dataset_type=None, read_only=None)[source]#

Iterate over dataset fields.

This is similar to dataclasses.fields().

Parameters:
  • dataset_type (Union[DatasetType, Literal['raw', 'derived'], None], default: None) – If set, return only the fields for this dataset type. If unset, do not filter fields.

  • read_only (bool | None, default: None) – If true or false, return only fields which are read-only or allow write-access, respectively. If unset, do not filter fields.

Returns:

Generator[Field, None, None] – Iterable over the fields of datasets.
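For example, a sketch listing the names of all writable fields of derived datasets:

from scitacean import Dataset

# Filter to fields that are defined for derived datasets and writable.
for field in Dataset.fields(dataset_type="derived", read_only=False):
    print(field.name)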

property files: tuple[File, ...]#

Files linked with the dataset.

classmethod from_download_models(dataset_model, orig_datablock_models, attachment_models=None)[source]#

Construct a new dataset from SciCat download models.

Parameters:
  • dataset_model (DownloadDataset) – Model of the dataset.

  • orig_datablock_models (list[DownloadOrigDatablock]) – List of all associated original datablock models for the dataset.

  • attachment_models (Iterable[DownloadAttachment] | None, default: None) – List of all associated attachment models for the dataset. Use None if the attachments were not downloaded. Use an empty list if the attachments were downloaded, but there aren’t any.

Returns:

Dataset – A new Dataset instance.

items()[source]#

Dict-like items method (name and value pairs of fields).

Returns:

Iterable[tuple[str, Any]] – Generator of (name, value) pairs of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.
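A short sketch of dict-like iteration over a hypothetical dataset dset:

# Print all fields that are set, as (name, value) pairs.
for name, value in dset.items():
    print(name, "=", value)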

keys()[source]#

Dict-like keys method (names of fields).

Returns:

Iterable[str] – Generator of names of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.

make_attachment_upload_models()[source]#

Build models for all registered attachments.

Raises:

ValueError – If self.attachments is None, i.e., the attachments are uninitialized.

Returns:

list[UploadAttachment] – List of attachment models.

make_datablock_upload_models()[source]#

Build models for all contained (orig) datablocks.

Returns:

DatablockUploadModels – Structure with datablock and orig datablock models.

make_upload_model()[source]#

Construct a SciCat upload model from self.

Return type:

UploadDerivedDataset | UploadRawDataset
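For example, a sketch inspecting the payload for a hypothetical dataset dset:

# Build the upload model, e.g. to see what would be sent to SciCat.
model = dset.make_upload_model()
print(model)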

property number_of_files: int#

Number of files in directly accessible storage in the dataset.

This includes files on both the local and remote filesystems.

Corresponds to OrigDatablocks.

property number_of_files_archived: int#

Total number of archived files in the dataset.

Corresponds to Datablocks.

property packed_size: int#

Total size of all datablock package files created for this dataset.

replace(*, _read_only=None, _orig_datablocks=None, **replacements)[source]#

Return a new dataset with replaced fields.

Parameters starting with an underscore are for internal use. Using them may result in a broken dataset.

Parameters:

replacements (Any) – New field values.

Returns:

Dataset – The new dataset has the same fields as the input but all fields given as keyword arguments are replaced by the given values.
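For example, a sketch on a hypothetical dataset dset:

# replace does not modify dset; it returns an updated copy.
updated = dset.replace(owner="Jane Doe", comment="reprocessed")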

replace_files(*files)[source]#

Return a new dataset with replaced files.

For each argument, if the input dataset has a file with the same remote path, that file is replaced. Otherwise, a new file is added. Other existing files are kept in the returned dataset.

Parameters:

files (File) – New files for the dataset.

Returns:

Dataset – A new dataset with given files.
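A hedged sketch, assuming scitacean's File.from_local helper for constructing File objects:

from scitacean import File

# A file with the same remote path replaces the old one;
# files with new remote paths are added.
updated = dset.replace_files(File.from_local("/path/to/file1"))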

property size: int#

Total size of files in directly accessible storage in the dataset.

This includes files on both the local and remote filesystems.

Corresponds to OrigDatablocks.

validate()[source]#

Validate the fields of the dataset.

Raises pydantic.ValidationError if validation fails. Returns normally if it passes.

Return type:

None
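For example, validating a hypothetical dataset dset before an upload:

import pydantic

try:
    dset.validate()  # returns None on success
except pydantic.ValidationError as err:
    print("dataset is invalid:", err)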

values()[source]#

Dict-like values method (values of fields).

Returns:

Iterable[Any] – Generator of values of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.