lamindb.Collection¶
- class lamindb.Collection(artifacts: list[Artifact], key: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None)¶
Bases: SQLRecord, IsVersioned, TracksRun, TracksUpdates
Collections of artifacts.
Collections provide a simple way of versioning collections of artifacts.
- Parameters:
  - artifacts – list[Artifact] – A list of artifacts.
  - key – str – A file-path-like key, analogous to the key parameter of Artifact and Transform.
  - description – str | None = None – A description.
  - revises – Collection | None = None – An old version of the collection.
  - run – Run | None = None – The run that creates the collection.
  - meta – Artifact | None = None – An artifact that defines metadata for the collection.
  - reference – str | None = None – A simple reference, e.g., an external ID or a URL.
  - reference_type – str | None = None – A way to indicate the type of the simple reference, e.g., "url".
See also
Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")
Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):
>>> collection = ln.Collection(data_artifact, key="my_project/my_collection", meta=metadata_artifact)
Attributes¶
- DoesNotExist = <class 'lamindb.models.collection.Collection.DoesNotExist'>¶
- Meta = <class 'lamindb.models.sqlrecord.SQLRecord.Meta'>¶
- MultipleObjectsReturned = <class 'lamindb.models.collection.Collection.MultipleObjectsReturned'>¶
- artifacts: Artifact¶
Artifacts in collection.
- branch: int¶
Whether record is on a branch or in another “special state”.
This dictates where a record appears in exploration, queries & searches, whether a record can be edited, and whether a record acts as a template.
Branch name coding is handled through LaminHub. “Special state” coding is as defined below.
Note that there is no “main” branch as in git; rather, all five special codes (-1, 0, 1, 2, 3) act as sub-specifications of what git would call the main branch. This also means that for records that live on a branch, only the “default state” exists. E.g., one can only turn a record into a template, lock it, archive it, or trash it once it’s merged onto the main branch.
3: template (hidden in queries & searches)
2: locked (same as default, but locked for edits except for space admins)
1: default (visible in queries & searches)
0: archive (hidden, meant to be kept, locked for edits for everyone)
-1: trash (hidden, scheduled for deletion)
An integer greater than 3 codes a branch that collaborators can use to create drafts, which can then be merged onto the main branch in an experience akin to a pull request. The mapping onto a semantic branch name is handled through LaminHub.
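The special-state codes above can be summarized as a plain mapping. This is an illustrative sketch, not part of the lamindb API; lamindb simply stores the integer on the branch field:

```python
# Illustrative only: lamindb stores these as integers on the `branch` field.
BRANCH_STATES = {
    3: "template (hidden in queries & searches)",
    2: "locked (edits restricted to space admins)",
    1: "default (visible in queries & searches)",
    0: "archive (hidden, meant to be kept, locked for edits)",
    -1: "trash (hidden, scheduled for deletion)",
}


def is_visible(branch: int) -> bool:
    """Only the default state shows up in default queries & searches."""
    return branch == 1


def is_draft_branch(branch: int) -> bool:
    """Integers greater than 3 code collaborative draft branches."""
    return branch > 3
```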
- branch_id¶
- clinical_trials¶
Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.
In the example:

class Pizza(Model):
    toppings = ManyToManyField(Topping, related_name='pizzas')

Pizza.toppings and Topping.pizzas are ManyToManyDescriptor instances.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- created_by: User¶
Creator of record.
- created_by_id¶
- property data_artifact: Artifact | None¶
Access to a single data artifact.
If the collection has a single data & metadata artifact, this allows access via:

collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata
- input_of_runs: Run¶
Runs that use this collection as an input.
- links_artifact¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- links_project¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- links_reference¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- links_ulabel¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- meta_artifact: Artifact | None¶
An artifact that stores metadata that indexes a collection.
It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.
- meta_artifact_id¶
- property name: str¶
Name of the collection.
Splits key on / and returns the last element.
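A minimal sketch of that behavior in plain Python (illustrative; the real property reads the record's key field):

```python
def collection_name(key: str) -> str:
    # The name is the last path-like segment of the key.
    return key.split("/")[-1]


collection_name("my_project/my_collection")  # → "my_collection"
```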
- objects = <lamindb.models.query_manager.QueryManager object>¶
- property ordered_artifacts: QuerySet¶
Ordered QuerySet of .artifacts.
Accessing the many-to-many field collection.artifacts directly gives you non-deterministic order. Using the property .ordered_artifacts allows you to iterate through a set that’s ordered in the order of creation.
- property pk¶
- projects: Project¶
Linked projects.
- references: Reference¶
Linked references.
- run_id¶
- space: Space¶
The space in which the record lives.
- space_id¶
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained by concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"              # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
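A runnable sketch of this uid scheme. random_base62 here is a hypothetical stand-in written for illustration; lamindb's actual generator is internal:

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62 characters: 0-9, a-z, A-Z


def random_base62(n_char: int) -> str:
    # Hypothetical stand-in for lamindb's internal base62 generator.
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


stem_uid = random_base62(16)  # 16 characters for artifacts & collections
version_uid = "0000"          # first entry of the auto-incrementing version scheme
uid = f"{stem_uid}{version_uid}"
```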
Class methods¶
- classmethod get(idlike=None, *, is_run_input=False, **expressions)¶
Get a single collection.
- Parameters:
  - idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
  - is_run_input (bool | Run, default: False) – Whether to track this collection as run input.
  - expressions – Fields and values passed as Django query expressions.
- Raises:
  lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
  Collection
See also
Method in SQLRecord base class: get()
Examples

collection = ln.Collection.get("okxPW6GIKBfRBE3B0000")
collection = ln.Collection.get(key="scrna/collection1")
Methods¶
- async adelete(using=None, keep_parents=False)¶
- append(artifact, run=None)¶
Append an artifact to the collection.
This does not modify the original collection in-place, but returns a new version of the original collection with the appended artifact.
- Parameters:
- Return type:
Examples

collection_v1 = ln.Collection(artifact, key="My collection").save()
collection_v2 = collection_v1.append(another_artifact)  # returns a new version of the collection
collection_v2.save()  # save the new version
- async arefresh_from_db(using=None, fields=None, from_queryset=None)¶
- async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)¶
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows synching logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
  is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
  list[UPath]
- clean()¶
Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- date_error_message(lookup_type, field_name, unique_for)¶
- delete(permanent=None)¶
Delete collection.
- Parameters:
  permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).
- Return type:
  None
Examples

For any Collection object collection, call:

>>> collection.delete()
- describe()¶
Describe relations of record.
- Return type:
None
Examples
>>> collection.describe()
- get_constraints()¶
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- load(join='outer', is_run_input=None, **kwargs)¶
Cache and load to memory.
Returns an in-memory concatenated DataFrame or AnnData object.
- Return type:
  DataFrame | AnnData
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.
By default (stream=False), AnnData arrays are moved into a local cache first.
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections or query sets of AnnData artifacts.
- Parameters:
  - layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
  - obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
  - obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
  - obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
  - join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
  - encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
  - unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
  - cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
  - parallel (bool, default: False) – Enable sampling with multiple processes.
  - dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
  - stream (bool, default: False) – Whether to stream data from the array backend.
  - is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
>>> # also works for query sets of artifacts, '...' represents some filtering condition
>>> # additional filtering on artifacts of the collection
>>> mapped = collection.artifacts.all().filter(...).order_by("-created_at").mapped()
>>> # or directly from a query set of artifacts
>>> mapped = ln.Artifact.filter(..., otype="AnnData").order_by("-created_at").mapped()
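The "map-style dataset" contract described above can be sketched in plain Python: an object with __len__ and __getitem__, where each item is a dict of observation data plus a "_store_idx". This toy class is illustrative only; MappedCollection is lamindb's actual implementation over AnnData arrays:

```python
class ToyMappedDataset:
    """Minimal map-style dataset virtually concatenating several 'stores'."""

    def __init__(self, stores: list):
        # Each store stands in for one AnnData object's list of observations.
        self.samples = [
            {**obs, "_store_idx": store_idx}
            for store_idx, store in enumerate(stores)
            for obs in store
        ]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        # A single integer index returns one observation sample as a dict.
        return self.samples[idx]


ds = ToyMappedDataset([[{"X": [0.1, 0.2]}], [{"X": [0.3, 0.4]}]])
len(ds)  # 2 observations, virtually concatenated across two stores
ds[1]    # {"X": [0.3, 0.4], "_store_idx": 1}
```

Because it satisfies this contract, such an object can be passed directly to torch.utils.data.DataLoader, as in the examples above.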
- open(engine='pyarrow', is_run_input=None, **kwargs)¶
Open a dataset for streaming.
Works for pyarrow and polars compatible formats (.parquet, .csv, .ipc etc. files or directories with such files).
- Parameters:
  - engine (Literal['pyarrow', 'polars'], default: 'pyarrow') – Which module to use for lazy loading of a dataframe from pyarrow or polars compatible formats.
  - is_run_input (bool | None, default: None) – Whether to track this collection as run input.
  - **kwargs – Keyword arguments for pyarrow.dataset.dataset or polars.scan_* functions.
- Return type:
  Dataset | Iterator[LazyFrame]
Notes
For more info, see guide: Slice arrays.
- prepare_database_save(field)¶
- refresh_from_db(using=None, fields=None, from_queryset=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- restore()¶
Restore collection record from trash.
- Return type:
  None
Examples

For any Collection object collection, call:

>>> collection.restore()
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
  using (str | None, default: None) – The database to which you want to save.
- Return type:
Examples

>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection").save()
- save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)¶
Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.
The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.
- serializable_value(field_name)¶
Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.
Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.
- unique_error_message(model_class, unique_check)¶
- validate_constraints(exclude=None)¶
- validate_unique(exclude=None)¶
Check unique constraints on the model and raise ValidationError if any failed.
- view_lineage(with_children=True, return_graph=False)¶
Graph of data flow.
- Return type:
  Digraph | None
Notes
For more info, see use cases: Data lineage.
Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()