scitacean.Dataset#

class scitacean.Dataset(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')[source]#

Metadata and linked data files for a measurement, simulation, or analysis.
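For illustration, a minimal sketch of constructing a dataset locally; all field values below are hypothetical:

from scitacean import Dataset

# 'type' is the only required argument; everything else is optional.
dset = Dataset(
    type="raw",
    name="Example measurement",
    owner="Jane Doe",
    owner_group="group1",
    contact_email="jane.doe@example.com",
    creation_location="/site/facility/instrument",
)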

Constructors

__init__(type[, access_groups, ...])

from_download_models(dataset_model, ...[, ...])

Construct a new dataset from SciCat download models.

Methods

add_files(*files[, datablock])

Add files to the dataset.

add_local_files(*paths[, datablock])

Add files on the local file system to the dataset.

add_orig_datablock(*, checksum_algorithm)

Append a new orig datablock to the list of orig datablocks.

as_new()

Return a new dataset with lifecycle-related fields erased.

derive(*[, keep])

Return a new dataset that is derived from self.

fields([dataset_type, read_only])

Iterate over dataset fields.

items()

Dict-like items method (name and value pairs of fields).

keys()

Dict-like keys method (names of fields).

make_attachment_upload_models()

Build models for all registered attachments.

make_datablock_upload_models()

Build models for all contained (orig) datablocks.

make_upload_model()

Construct a SciCat upload model from self.

replace(*[, _read_only, _orig_datablocks])

Return a new dataset with replaced fields.

replace_files(*files)

Return a new dataset with replaced files.

validate()

Validate the fields of the dataset.

values()

Dict-like values method (values of fields).

Attributes

access_groups

List of groups which have access to this item.

api_version

Version of the API used in creation of the dataset.

attachments

List of attachments for this dataset.

classification

ACIA information about the Authenticity, Confidentiality, Integrity, and Availability requirements of the dataset.

comment

Comment the user has about a given dataset.

contact_email

Email of the contact person for this dataset.

created_at

Date and time when this record was created.

created_by

Indicates the user who created this record.

creation_location

Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name.

creation_time

Time when the dataset became fully available on disk, i.e. all contained files have been written. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

data_format

Defines the format of the data files in this dataset, e.g. NeXus version x.y.

data_quality_metrics

A number given by the user to rate the quality of the dataset.

description

Free text explanation of contents of dataset.

end_time

End time of data acquisition for the current dataset. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

files

Files linked with the dataset.

input_datasets

Array of input dataset identifiers used in producing the derived dataset.

instrument_group

Group of the instrument which this item was acquired on.

instrument_id

ID of the instrument where the data was created.

investigator

First name and last name of the person or people pursuing the data analysis.

is_published

Flag is true when data are made publicly available.

job_log_data

The output job logfile.

job_parameters

The creation process of the derived data will usually depend on input job parameters; those parameters are stored here.

keywords

Array of tags associated with the meaning or contents of this dataset.

license

Name of the license under which the data can be used.

lifecycle

Describes the current status of the dataset during its lifetime with respect to the storage handling systems.

meta

Dict of scientific metadata.

name

A name for the dataset, given by the creator to carry some semantic meaning.

number_of_files

Number of files in directly accessible storage in the dataset.

number_of_files_archived

Total number of archived files in the dataset.

orcid_of_owner

ORCID of the owner or custodian.

owner

Owner or custodian of the dataset, usually first name + last name.

owner_email

Email of the owner or custodian of the dataset.

owner_group

Name of the group owning this item.

packed_size

Total size of all datablock package files created for this dataset.

pid

Persistent identifier of the dataset.

principal_investigator

First name and last name of principal investigator(s).

proposal_id

The ID of the proposal to which the dataset belongs.

relationships

Stores the relationships with other datasets.

run_number

Run number assigned by the system to the data acquisition for the current dataset.

sample_id

ID of the sample used when collecting the data.

shared_with

List of users that the dataset has been shared with.

size

Total size of files in directly accessible storage in the dataset.

source_folder

Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder.

source_folder_host

DNS host name of the file server hosting source_folder, optionally including a protocol, e.g. [protocol://]fileserver1.example.com.

start_time

Start time of data acquisition for the current dataset. Expected to be in ISO 8601 format per RFC 3339 section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.

techniques

Stores the metadata information for techniques.

type

Characterizes the type of the dataset, either 'raw' or 'derived'.

updated_at

Date and time when this record was last updated.

updated_by

Indicates the user who last updated this record.

used_software

A list of links to software repositories which uniquely identifies the pieces of software, including versions, used for yielding the derived data.

validation_status

Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.

__init__(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')#
add_files(*files, datablock=-1)[source]#

Add files to the dataset.

Return type:

None

add_local_files(*paths, datablock=-1)[source]#

Add files on the local file system to the dataset.

The files are set up to be uploaded to the dataset’s source folder without preserving the local directory structure. That is, given

dataset.source_folder = "remote/source"
dataset.add_local_files("/path/to/file1", "other_path/file2")

and uploading this dataset to SciCat, the files will be uploaded to:

remote/source/file1
remote/source/file2
Parameters:
  • paths (str | Path) – Local paths to the files.

  • datablock (int | str, default: -1) –

    Advanced feature, do not set unless you know what this is!

    Select the orig datablock to store the file in. If an int, use the datablock with that index. If a str or PID, use the datablock with that id; if there is none with matching id, raise KeyError.

Return type:

None

add_orig_datablock(*, checksum_algorithm)[source]#

Append a new orig datablock to the list of orig datablocks.

Parameters:

checksum_algorithm (str | None) – Use this algorithm to compute checksums of files associated with this datablock.

Returns:

OrigDatablock – The newly added datablock.

as_new()[source]#

Return a new dataset with lifecycle-related fields erased.

The returned dataset has the same fields as self, but fields that indicate when the dataset was created or by whom are set to None. These include, for example, created_at, history, and lifecycle.

Returns:

Dataset – A new dataset without lifecycle-related fields.
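For example, a sketch of preparing a modified copy for re-upload; downloaded is a hypothetical dataset obtained from SciCat:

# Erase created_at, lifecycle, etc. so the copy can be uploaded
# as a brand-new dataset.
fresh = downloaded.as_new()
fresh.comment = "Re-uploaded with corrected metadata"  # illustrative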

property attachments: list[Attachment] | None#

List of attachments for this dataset.

This property can be in two distinct ‘falsy’ states:

  • dset.attachments is None: It is unknown whether there are attachments. This happens when datasets are downloaded without downloading the attachments.

  • dset.attachments == []: It is known that there are no attachments. This happens either when downloading datasets or when initializing datasets locally without assigning attachments.
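A short sketch telling the two states described above apart; dset is a hypothetical dataset:

if dset.attachments is None:
    # Unknown: the attachments were not downloaded.
    print("attachments unknown")
elif not dset.attachments:
    # Known: there are no attachments.
    print("no attachments")
else:
    print(len(dset.attachments), "attachment(s)")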

derive(*, keep=('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques'))[source]#

Return a new dataset that is derived from self.

The returned dataset has most fields set to None, but a number of fields can be carried over from self. By default, this assumes that the owner of the derived dataset is the same as the owner of the original. This can be customized with the keep argument.

Parameters:

keep (Iterable[str], default: ('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques')) – Fields to copy over to the derived dataset.

Returns:

Dataset – A new derived dataset.

Raises:

ValueError – If self has no PID. The derived dataset requires a PID in order to link back to self.
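For illustration, a sketch assuming dset was downloaded from SciCat and therefore has a PID:

# The derived dataset links back to dset via input_datasets.
derived = dset.derive(keep=("owner", "owner_email", "contact_email"))
derived.name = "Processed data"  # illustrative value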

classmethod fields(dataset_type=None, read_only=None)[source]#

Iterate over dataset fields.

This is similar to dataclasses.fields().

Parameters:
  • dataset_type (Union[DatasetType, Literal['raw', 'derived'], None], default: None) – If set, return only the fields for this dataset type. If unset, do not filter fields.

  • read_only (bool | None, default: None) – If true or false, return only fields which are read-only or allow write-access, respectively. If unset, do not filter fields.

Returns:

Generator[Field, None, None] – Iterable over the fields of datasets.
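For example, a sketch listing the names of all writable fields of derived datasets:

from scitacean import Dataset

# Filter to fields that are defined for derived datasets and writable.
for field in Dataset.fields(dataset_type="derived", read_only=False):
    print(field.name)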

property files: tuple[File, ...]#

Files linked with the dataset.

classmethod from_download_models(dataset_model, orig_datablock_models, attachment_models=None)[source]#

Construct a new dataset from SciCat download models.

Parameters:
  • dataset_model (DownloadDataset) – Model of the dataset.

  • orig_datablock_models (list[DownloadOrigDatablock]) – List of all associated original datablock models for the dataset.

  • attachment_models (Iterable[DownloadAttachment] | None, default: None) – List of all associated attachment models for the dataset. Use None if the attachments were not downloaded. Use an empty list if the attachments were downloaded, but there aren’t any.

Returns:

Dataset – A new Dataset instance.

items()[source]#

Dict-like items method (name and value pairs of fields).

Returns:

Iterable[tuple[str, Any]] – Generator of (name, value) pairs of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.
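A short sketch of dict-like iteration over a hypothetical dataset dset:

# Print all fields that are set, as (name, value) pairs.
for name, value in dset.items():
    print(name, "=", value)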

keys()[source]#

Dict-like keys method (names of fields).

Returns:

Iterable[str] – Generator of names of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.

make_attachment_upload_models()[source]#

Build models for all registered attachments.

Raises:

ValueError – If self.attachments is None, i.e., the attachments are uninitialized.

Returns:

list[UploadAttachment] – List of attachment models.

make_datablock_upload_models()[source]#

Build models for all contained (orig) datablocks.

Returns:

DatablockUploadModels – Structure with datablock and orig datablock models.

make_upload_model()[source]#

Construct a SciCat upload model from self.

Return type:

UploadDerivedDataset | UploadRawDataset
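For example, a sketch inspecting the payload for a hypothetical dataset dset:

# Build the upload model, e.g. to see what would be sent to SciCat.
model = dset.make_upload_model()
print(model)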

property number_of_files: int#

Number of files in directly accessible storage in the dataset.

This includes files on both the local and remote filesystems.

Corresponds to OrigDatablocks.

property number_of_files_archived: int#

Total number of archived files in the dataset.

Corresponds to Datablocks.

property packed_size: int#

Total size of all datablock package files created for this dataset.

replace(*, _read_only=None, _orig_datablocks=None, **replacements)[source]#

Return a new dataset with replaced fields.

Parameters starting with an underscore are for internal use. Using them may result in a broken dataset.

Parameters:

replacements (Any) – New field values.

Returns:

Dataset – The new dataset has the same fields as the input but all fields given as keyword arguments are replaced by the given values.
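For example, a sketch on a hypothetical dataset dset:

# replace does not modify dset; it returns an updated copy.
updated = dset.replace(owner="Jane Doe", comment="reprocessed")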

replace_files(*files)[source]#

Return a new dataset with replaced files.

For each argument, if the input dataset has a file with the same remote path, that file is replaced. Otherwise, a new file is added. Other existing files are kept in the returned dataset.

Parameters:

files (File) – New files for the dataset.

Returns:

Dataset – A new dataset with given files.
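A hedged sketch, assuming scitacean's File.from_local helper for constructing File objects:

from scitacean import File

# A file with the same remote path replaces the old one;
# files with new remote paths are added.
updated = dset.replace_files(File.from_local("/path/to/file1"))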

property size: int#

Total size of files in directly accessible storage in the dataset.

This includes files on both the local and remote filesystems.

Corresponds to OrigDatablocks.

validate()[source]#

Validate the fields of the dataset.

Raises pydantic.ValidationError if validation fails. Returns normally if it passes.

Return type:

None
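For example, validating a hypothetical dataset dset before an upload:

import pydantic

try:
    dset.validate()  # returns None on success
except pydantic.ValidationError as err:
    print("dataset is invalid:", err)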

values()[source]#

Dict-like values method (values of fields).

Returns:

Iterable[Any] – Generator of values of all fields corresponding to self.type and other fields that are not None.

Added in version 23.10.0.