scitacean.Dataset#
- class scitacean.Dataset(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')[source]#
Metadata and linked data files for a measurement, simulation, or analysis.
Constructors
__init__
(type[, access_groups, ...])
from_download_models
(dataset_model, ...[, ...])Construct a new dataset from SciCat download models.
Methods
add_files
(*files[, datablock])Add files to the dataset.
add_local_files
(*paths[, datablock])Add files on the local file system to the dataset.
add_orig_datablock
(*, checksum_algorithm)Append a new orig datablock to the list of orig datablocks.
as_new
()Return a new dataset with lifecycle-related fields erased.
derive
(*[, keep])Return a new dataset that is derived from self.
fields
([dataset_type, read_only])Iterate over dataset fields.
items
Dict-like items (name and value pairs of fields) method.
keys
Dict-like keys (names of fields) method.
make_attachment_upload_models
()Build models for all registered attachments.
make_datablock_upload_models
()Build models for all contained (orig) datablocks.
make_upload_model
()Construct a SciCat upload model from self.
replace
(*[, _read_only, _orig_datablocks])Return a new dataset with replaced fields.
replace_files
(*files)Return a new dataset with replaced files.
validate
()Validate the fields of the dataset.
values
Dict-like values (values of fields) method.
Attributes
access_groups
List of groups which have access to this item.
api_version
Version of the API used in creation of the dataset.
attachments
List of attachments for this dataset.
classification
ACIA information about Authenticity, Confidentiality, Integrity, and Availability requirements of the dataset.
comment
Comment the user has about a given dataset.
contact_email
Email of the contact person for this dataset.
created_at
Date and time when this record was created. This property is added and maintained by the system.
created_by
Indicates the user who created this record.
creation_location
Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name.
creation_time
Time when the dataset became fully available on disk, i.e. all containing files have been written, or the dataset was created in SciCat. It is expected to be in ISO 8601 format as specified in RFC 3339, section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
data_format
Defines the format of the data files in this dataset, e.g. NeXus version x.y.
data_quality_metrics
Data Quality Metrics is a number given by the user to rate the dataset.
description
Free text explanation of contents of dataset.
end_time
End time of data acquisition for the current dataset. It is expected to be in ISO 8601 format as specified in RFC 3339, section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
files
Files linked with the dataset.
input_datasets
Array of input dataset identifiers used in producing the derived dataset.
instrument_group
Group of the instrument which this item was acquired on.
instrument_id
ID of the instrument where the data was created.
investigator
First name and last name of the person or people pursuing the data analysis.
is_published
Flag is true when data are made publicly available.
job_log_data
The output job logfile.
job_parameters
The creation process of the derived data will usually depend on input job parameters.
keywords
Array of tags associated with the meaning or contents of this dataset.
license
Name of the license under which the data can be used.
lifecycle
Describes the current status of the dataset during its lifetime with respect to the storage handling systems.
meta
Dict of scientific metadata.
name
A name for the dataset, given by the creator to carry some semantic meaning.
number_of_files
Number of files in directly accessible storage in the dataset.
number_of_files_archived
Total number of archived files in the dataset.
orcid_of_owner
ORCID of the owner or custodian.
owner
Owner or custodian of the dataset, usually first name + last name.
owner_email
Email of the owner or custodian of the dataset.
owner_group
Name of the group owning this item.
packed_size
Total size of all datablock package files created for this dataset.
pid
Persistent identifier of the dataset.
principal_investigator
First name and last name of principal investigator(s).
proposal_id
The ID of the proposal to which the dataset belongs.
relationships
Stores the relationships with other datasets.
run_number
Run number assigned by the system to the data acquisition for the current dataset.
sample_id
ID of the sample used when collecting the data.
shared_with
List of users that the dataset has been shared with.
size
Total size of files in directly accessible storage in the dataset.
source_folder
Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder.
source_folder_host
DNS host name of the file server hosting source_folder, optionally including a protocol, e.g. [protocol://]fileserver1.example.com.
start_time
Start time of data acquisition for the current dataset. It is expected to be in ISO 8601 format as specified in RFC 3339, section 5.6 (https://www.rfc-editor.org/rfc/rfc3339#section-5). Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
techniques
Stores the metadata information for techniques.
type
Characterize type of dataset, either 'raw' or 'derived'.
updated_at
Date and time when this record was updated last. This property is added and maintained by the system.
updated_by
Indicates the user who last updated this record.
used_software
A list of links to software repositories which uniquely identifies the pieces of software, including versions, used for yielding the derived data.
validation_status
Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.
- __init__(type, access_groups=None, classification=None, comment=None, contact_email=None, creation_location=None, creation_time='now', data_format=None, data_quality_metrics=None, description=None, end_time=None, input_datasets=None, instrument_group=None, instrument_id=None, investigator=None, is_published=None, job_log_data=None, job_parameters=None, keywords=None, license=None, lifecycle=None, name=None, orcid_of_owner=None, owner=None, owner_email=None, owner_group=None, principal_investigator=None, proposal_id=None, relationships=None, run_number=None, sample_id=None, shared_with=None, source_folder=None, source_folder_host=None, start_time=None, techniques=None, used_software=None, validation_status=None, meta=None, checksum_algorithm='blake2b')#
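As an illustration, a minimal construction sketch; the field values below are hypothetical and nothing is uploaded here:
from scitacean import Dataset

# Hypothetical example values for a raw dataset.
dset = Dataset(
    type="raw",
    name="Example measurement",
    owner="Jane Doe",
    owner_group="group1",
    contact_email="jane.doe@example.com",
    creation_location="/EXAMPLE/site/instrument",
    source_folder="/data/example",
    meta={"temperature": {"value": 4.2, "unit": "K"}},
)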
- add_local_files(*paths, datablock=-1)[source]#
Add files on the local file system to the dataset.
The files are set up to be uploaded to the dataset’s source folder without preserving the local directory structure. That is, given
dataset.source_folder = "remote/source"
dataset.add_local_files("/path/to/file1", "other_path/file2")
and uploading this dataset to SciCat, the files will be uploaded to:
remote/source/file1
remote/source/file2
- Parameters:
paths (str | Path) – Local paths of the files to add.
datablock (int, default: -1) – Index of the orig datablock to add the files to.
- Return type:
None
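A short sketch of the example above; the local paths are hypothetical and the files are only registered with the dataset, not uploaded:
from scitacean import Dataset

# Register two local files; their remote locations are determined by source_folder.
dset = Dataset(type="raw", source_folder="remote/source")
dset.add_local_files("/path/to/file1", "other_path/file2")
for f in dset.files:
    print(f.local_path)  # /path/to/file1, other_path/file2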
- add_orig_datablock(*, checksum_algorithm)[source]#
Append a new orig datablock to the list of orig datablocks.
- Parameters:
checksum_algorithm (str | None) – Use this algorithm to compute checksums of files associated with this datablock.
- Returns:
OrigDatablock – The newly added datablock.
- as_new()[source]#
Return a new dataset with lifecycle-related fields erased.
The returned dataset has the same fields as self, but fields that indicate when the dataset was created or by whom are set to None. This includes, for example, created_at, history, and lifecycle.
- Returns:
Dataset – A new dataset without lifecycle-related fields.
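For example, a short sketch, assuming dset is a dataset previously downloaded from SciCat:
# Strip server-managed fields from the downloaded dataset.
fresh = dset.as_new()
assert fresh.created_at is None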
- property attachments: list[Attachment] | None#
List of attachments for this dataset.
This property can be in two distinct ‘falsy’ states:
- dset.attachments is None: It is unknown whether there are attachments. This happens when datasets are downloaded without downloading the attachments.
- dset.attachments == []: It is known that there are no attachments. This happens either when downloading datasets or when initializing datasets locally without assigning attachments.
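A sketch of distinguishing the two states, assuming dset is an existing Dataset instance:
if dset.attachments is None:
    print("Attachments were not downloaded for this dataset.")
elif not dset.attachments:
    print("The dataset has no attachments.")
else:
    for attachment in dset.attachments:
        print(attachment.caption)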
- derive(*, keep=('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques'))[source]#
Return a new dataset that is derived from self.
The returned dataset has most fields set to None, but a number of fields can be carried over from self. By default, this assumes that the owner of the derived dataset is the same as the owner of the original. This can be customized with the keep argument.
- Parameters:
keep (Iterable[str], default: ('contact_email', 'investigator', 'orcid_of_owner', 'owner', 'owner_email', 'techniques')) – Fields to copy over to the derived dataset.
- Returns:
Dataset – A new derived dataset.
- Raises:
ValueError – If self has no PID. The derived dataset requires a PID in order to link back to self.
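A sketch, assuming raw_dset has already been uploaded and therefore has a PID; the names and file below are hypothetical:
# derive() links the new dataset back to raw_dset via its PID.
derived = raw_dset.derive(keep=("owner", "owner_email", "contact_email"))
derived.name = "Reduced example data"
derived.used_software = ["https://example.com/reduction-software"]
derived.add_local_files("reduced.h5")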
- classmethod fields(dataset_type=None, read_only=None)[source]#
Iterate over dataset fields.
This is similar to dataclasses.fields().
- Parameters:
dataset_type (Union[DatasetType, Literal['raw', 'derived'], None], default: None) – If set, return only the fields for this dataset type. If unset, do not filter fields.
read_only (bool | None, default: None) – If true or false, return only fields which are read-only or allow write-access, respectively. If unset, do not filter fields.
- Returns:
Generator[Field, None, None] – Iterable over the fields of datasets.
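For example, a sketch that lists the writable fields of raw datasets:
from scitacean import Dataset, DatasetType

# Print the name and type of every field that raw datasets allow write access to.
for field in Dataset.fields(dataset_type=DatasetType.RAW, read_only=False):
    print(field.name, field.type)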
- classmethod from_download_models(dataset_model, orig_datablock_models, attachment_models=None)[source]#
Construct a new dataset from SciCat download models.
- Parameters:
dataset_model (DownloadDataset) – Model of the dataset.
orig_datablock_models (list[DownloadOrigDatablock]) – List of all associated original datablock models for the dataset.
attachment_models (Iterable[DownloadAttachment] | None, default: None) – List of all associated attachment models for the dataset. Use None if the attachments were not downloaded. Use an empty list if the attachments were downloaded, but there aren’t any.
- Returns:
Dataset – A new Dataset instance.
- items()[source]#
Dict-like items (name and value pairs of fields) method.
- Returns:
Iterable[tuple[str, Any]] – Generator of (name, value) pairs of all fields corresponding to self.type and other fields that are not None.
Added in version 23.10.0.
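A sketch printing all fields that are currently set, assuming dset is an existing Dataset instance:
# Iterate over (name, value) pairs of set fields.
for name, value in dset.items():
    print(f"{name} = {value!r}")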
- keys()[source]#
Dict-like keys (names of fields) method.
- Returns:
Iterable[str] – Generator of names of all fields corresponding to self.type and other fields that are not None.
Added in version 23.10.0.
- make_attachment_upload_models()[source]#
Build models for all registered attachments.
- Raises:
ValueError – If self.attachments is None, i.e., the attachments are uninitialized.
- Returns:
list[UploadAttachment] – List of attachment models.
- make_datablock_upload_models()[source]#
Build models for all contained (orig) datablocks.
- Returns:
DatablockUploadModels – Structure with datablock and orig datablock models.
- property number_of_files: int#
Number of files in directly accessible storage in the dataset.
This includes files on both the local and remote filesystems.
Corresponds to OrigDatablocks.
- property number_of_files_archived: int#
Total number of archived files in the dataset.
Corresponds to Datablocks.
- replace(*, _read_only=None, _orig_datablocks=None, **replacements)[source]#
Return a new dataset with replaced fields.
Parameters starting with an underscore are for internal use. Using them may result in a broken dataset.
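A sketch of creating a modified copy; the values below are hypothetical:
# The original dset is left unchanged; only the returned copy carries the new values.
updated = dset.replace(owner="Jane Doe", comment="Reprocessed with new calibration")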
- replace_files(*files)[source]#
Return a new dataset with replaced files.
For each argument, if the input dataset has a file with the same remote path, that file is replaced. Otherwise, a new file is added. Other existing files are kept in the returned dataset.
- property size: int#
Total size of files in directly accessible storage in the dataset.
This includes files on both the local and remote filesystems.
Corresponds to OrigDatablocks.