🚚 Migrations for lamindb 1.0 #2323

falexwolf · 2025-01-05T12:13:33Z

Other changes

upgraded Django version to >=5 because we're using db_default, which was added in Django 5.0; Django 5 in turn requires Python >= 3.10 -- discussion here on GitHub

Schema changes

remove _source_code_artifact from Transform, it's been deprecated since 0.75 222931a
- data migration: for all transforms that have _source_code_artifact populated, populate source_code
rename Transform.name to Transform.description: the field behaves like Artifact.description, and calling it name is confusing because people expect versioning to be based on name 5d68c5e Claude chat
- backward compat:
  - in the Transform constructor use name to populate key in all cases in which only name is passed
  - return the same transform based on key in case source_code is None via ._name_field = "key"
- data migrations:
  - there already was a legacy description field that was never exposed on the constructor; to be safe, we appended any data it held to the new description field Claude chat
  - for all transforms that have key=None and name!=None, use name to pre-populate key
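The backward-compat rule for `name` can be sketched roughly as follows. This is a minimal sketch and not the actual constructor: `resolve_transform_fields` is a hypothetical helper, the branching is one reading of the bullets above combined with the diff fragments quoted later in the review, and `warnings.warn` with `FutureWarning` is an assumption (the diff itself uses `logger.warning`).

```python
import warnings


def resolve_transform_fields(key=None, description=None, **kwargs):
    """Hypothetical helper: map a deprecated `name` argument to `key` or `description`."""
    if "name" in kwargs:
        name = kwargs.pop("name")
        if key is None:
            # no key given: `name` populates `key`
            key = name
            target = "key"
        elif description is None:
            # key given explicitly: treat `name` as the description
            description = name
            target = "description"
        else:
            # both new fields are already set, so `name` has nowhere to go
            raise ValueError(
                "`name` no longer exists, please pass `key` and `description` only"
            )
        warnings.warn(
            f"`name` will be removed soon, please pass '{name}' to `{target}` instead",
            FutureWarning,
            stacklevel=2,
        )
    return {"key": key, "description": description}
```

With this reading, `resolve_transform_fields(name="my-notebook")` populates `key`, while `resolve_transform_fields(key="a.py", name="My analysis")` populates `description`.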
rename Collection.name to Collection.key for consistency with Artifact & Transform and the high likelihood of users wanting to organize them hierarchically 2bef93a
a _branch_code integer on every record to model pull requests 483fb37
- include visibility within that code
- repurpose visibility=0 as _branch_code=0 as "archive"
- put an index on it: see Claude chat
- code a "draft" as _branch_code = 2, and "draft prs" as negative branch codes Claude thread -- this is better than adding another field as originally planned (internal Slack thread)
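To make the coding concrete, here is a small sketch that only uses the values listed above (0 for "archive", 2 for "draft", negative values for "draft prs"). The constant names and `describe_branch_code` are hypothetical, and every value the description does not spell out is deliberately left unspecified.

```python
ARCHIVE = 0  # stated above: the former visibility=0
DRAFT = 2    # stated above


def describe_branch_code(code: int) -> str:
    """Hypothetical helper: name the state a _branch_code value stands for."""
    if code < 0:
        # negative codes model "draft prs" according to the description
        return "draft pr"
    if code == ARCHIVE:
        return "archive"
    if code == DRAFT:
        return "draft"
    # remaining values are not spelled out in the description
    return "unspecified"
```

Keeping all states in one indexed integer means a single filter condition can select them, which is the stated advantage over adding another field.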
rename values "number" to "num" in dtype
an .aux json field on Record: Claude thread
a SmallInteger run._status_code that allows writing finished_at in clean-up operations so that aborted runs also have a run time
rename Run.is_consecutive to Run._is_consecutive
a _template_id FK to store the information of the generating template (whether a record is a template is coded via _branch_code)
rename _accessor to otype to publicly declare the data format as suffix, accessor
rename Artifact.type to Artifact.kind (internal Slack thread) Claude chat
a FK run._logfile to an artifact that holds the logs
a hash field on ParamValue and FeatureValue to enforce uniqueness without running the danger of failure for large dictionaries 9e4b503 Claude chat
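The idea behind the hash field can be sketched like this: rather than putting a uniqueness constraint on the JSON value itself, hash a canonical serialization and constrain the short, fixed-length digest. `hash_value` is a hypothetical helper, and the choice of SHA-256 and of `json.dumps` with sorted keys is an assumption, not necessarily what 9e4b503 does.

```python
import hashlib
import json


def hash_value(value) -> str:
    """Hypothetical helper: stable digest for a JSON-serializable value."""
    # sort keys and fix separators so that equal dictionaries serialize identically
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# key order does not matter, so logically equal dictionaries collide on purpose
same = hash_value({"a": 1, "b": [1, 2]}) == hash_value({"b": [1, 2], "a": 1})
```

The digest has the same length no matter how large the dictionary is, so a unique constraint on it does not grow with the value.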
an Artifact._curator: dict field 068bcf7
add a boolean field ._expect_many to Feature/Param that defaults to True for Feature and False for Param and indicates whether values for this feature/param are expected to occur once or multiple times for every single artifact/run 1cc1ca8
- for feature
  - if it's True (default), the values come from an observation-level aggregation and a dtype of datetime on the observation-level means set[datetime] on the artifact-level
  - if it's False it's an artifact-level value and datetime means datetime; this is an edge case because an arbitrary artifact would always be a set of arbitrary measurements that would need to be aggregated ("one just happens to measure a single cell line in that artifact")
- for param
  - if it's False (default), the values mean artifact/run-level values and datetime means datetime
  - if it's True, the values come from an aggregation; this seems like an edge case, but it could be relevant when, say, characterizing a model ensemble trained with different parameters
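The dtype semantics described for ._expect_many can be written down as a tiny sketch. `artifact_level_dtype` is a hypothetical helper for illustration only; it is not part of the lamindb API.

```python
def artifact_level_dtype(dtype: str, expect_many: bool) -> str:
    """Hypothetical helper: dtype as seen on the artifact/run level."""
    # an aggregation over many observations yields a set of values
    return f"set[{dtype}]" if expect_many else dtype


feature_default = artifact_level_dtype("datetime", expect_many=True)   # Feature default
param_default = artifact_level_dtype("datetime", expect_many=False)    # Param default
```

Here `feature_default` is `"set[datetime]"` and `param_default` is `"datetime"`, matching the two default cases above.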
remove the .transform foreign key from artifact and collection for consistency with all other records; introduce a property and a simple filter statement instead (context: https://github.com/laminlabs/laminhub/issues/1498) a4e0fb2
store provenance metadata for TransformULabel, RunParamValue, ArtifactParamValue 576d858
enable linking projects & references to transforms & collections: laminlabs/ourprojects@370a869
rename Run.parent to Run.initiated_by_run 9fbdd6f
introduce a boolean flag on artifact that's called _overwrite_versions, which indicates whether versions are overwritten or stored separately; it defaults to False for file-like artifacts and to True for folder-like artifacts - see Slack thread
Rename n_objects to n_files for more clarity laminlabs/lamindb-setup@eba36ba Claude chat

TODO:

a boolean on ULabel called is_concept to distinguish ontological concepts from instances of the concepts: labels meant for labeling; for instance, you would never want to label an artifact with a ulabel Project, you'll only want to label with actual project values Project 1, Project 2, etc. 6bd7f62
attach curator to FeatureSet or re-purpose FeatureSet as curator in some way Slack thread
think through naming conventions of lamindb-related records: transfer, save_vitessce_config, run reports, etc. and migrate them
rename "glue" transform type

Materials

Needs:

Affects:

Triggered these issues:

Did not move forward with these changes

a Record._page_md textfield so that every entity can hold markdown content
- Versioning would go through the Event model with fields
- Reverted (58e46ef) during this PR in favor of: https://github.com/laminlabs/laminhub/issues/1784
a Record._public boolean that indicates whether a record is public Claude chat
- Reverted (483fb37) because of this internal Slack discussion
rename Feature to Dimension and FeatureSet to DimGroup
check that there is an index on paramvalue.value and featurevalue.value; there is none, but indexes on JSON fields are complicated and these tables won't be big, so we'd rather hold off
consider whether we want to eliminate the link tables associated with "_previous_runs" runs; decided against it Claude thread
given that feature_sets is a public attribute and that some people might want to query for feature_values or param_values outside the .features and .params managers, we considered making _feature_values and _param_values public. This would likely also help people better understand the underlying mechanics if they're ever interested. Decided against it because there would then be two ways to query by a feature or param. Everything is already supported through the lamindb-native API, which is considerably simpler than doing this with Django, in particular since ✨ Enable to easily join features onto artifacts via Artifact.df() #2238. Also see the run docs, where nested queries are exemplified.

Koncopd · 2025-01-05T15:17:54Z

automatically tracking artifact folder: https://claude.ai/chat/85b7a953-36d5-4afa-8efc-1bd6886eaefe

The conversation you were looking for could not be found.
what is it actually?

Koncopd · 2025-01-05T15:20:34Z

a Folder model with registry | key (see internal Claude discussion)

I don't really understand the point, as we have Storage already. Also, we have different storage types, like http, hf, s3, etc., and it's not clear what a folder is for http or hf. If a meaningful grouping is needed, we have Collection for that. This will be super confusing if we have Storage, Artifact, Collection and then also Folder.
Do we have a discussion for this?

falexwolf · 2025-01-05T17:57:00Z

The conversation you were looking for could not be found. what is it actually?

This is something for the frontend; I shared the Claude discussion with you. After the whole exploration around the block model this will likely make more sense in laminhub_client (https://github.com/laminlabs/laminhub/issues/1784).

Koncopd · 2025-01-05T19:08:06Z

I still don't see it.

falexwolf · 2025-01-05T19:09:51Z

Sorry: https://claude.ai/share/c554a663-75de-4dda-884e-a910b1b261bd

Koncopd · 2025-01-05T19:15:48Z

Ok, that is clearer to me now, but I am still not sure. If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

falexwolf · 2025-01-05T21:28:22Z

If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

ArtifactFolder and TransformFolder and CollectionFolder are merely used in the UI and merely for virtual keys. Let's discuss this here when we get to it: https://github.com/laminlabs/laminhub/issues/1787

falexwolf · 2025-01-08T22:22:52Z

This is ready.

After another review by at least @sunnyosun, we can merge this to main and then deploy the migrations against our own instances to test for a day or two.

On the weekend, we can complete the entire migration.

Please be aware that this is the "final core schema". We'll not change it for the year to come unless a mistake happened here. So, it's good to spot anything that looks fishy (@Koncopd already spotted some that I resolved).

falexwolf · 2025-01-08T22:24:49Z

FYI: the tests that hit our remote instances are commented out because I'll need to deploy the migrations first; I will only do this once it's reviewed.

sunnyosun · 2025-01-09T09:08:39Z

Could we rename FeatureManager._add_from to FeatureManager.add_from? (backward compatible)

lamindb/lamindb/core/_feature_manager.py

Line 1184 in 519b1d7

def _add_from(self, data: Artifact | Collection, transfer_logs: dict = None):

Koncopd

Note that the docs fail, otherwise looks good to me.

lamindb/_record.py

Koncopd · 2025-01-09T10:27:46Z

lamindb/_transform.py

+                    f"`name` will be removed soon, please pass '{description}' to `description` instead"
+                )
+            else:
+                raise ValueError("name doesn't exist anymore `description`")


It is unclear to me what this tries to convey.

tests/core/test_describe_and_df_calls.py

Koncopd · 2025-01-09T10:31:35Z

I have added a few comments about minor things.

Zethson

Awesome!

ln.Artifact.filter(visibility=None).df() gets changed to ln.Artifact.filter(_branch_code=None).df() in the deletion FAQ. I don't like that we're suggesting to filter by an underscore attribute. Should we consider not making this an underscore attribute given that visibility wasn't either?
Generally, for all things visibility, I found visibility to be much easier to understand than _branch_code although I of course see how much power we gain with it. It reads hacky though for users in my opinion.
WDYT about using obj_type instead of otype? It took me a few seconds too much to understand what you meant.
I don't get warm with kind at all but well type is a very overloaded term...

.pre-commit-config.yaml

lamindb/_collection.py

lamindb/_query_set.py

Zethson · 2025-01-09T10:38:54Z

lamindb/_query_set.py

+        parts = field.split("__")
+        if parts[0] in name_mappings:
+            if parts[0] != "transform":
+                logger.warning(


DeprecationWarning?
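For context, a rough sketch of what emitting a `DeprecationWarning` for renamed filter fields could look like. `NAME_MAPPINGS` and `rename_lookup` are hypothetical; the two mapping entries are taken from renames listed in the PR description and are not the actual contents of `name_mappings` in lamindb/_query_set.py, which logs via `logger.warning` in the diff above.

```python
import warnings

# hypothetical mapping, built from renames mentioned in the PR description
NAME_MAPPINGS = {"visibility": "_branch_code", "type": "kind"}


def rename_lookup(field: str) -> str:
    """Hypothetical helper: translate a deprecated Django-style lookup."""
    parts = field.split("__")
    if parts[0] in NAME_MAPPINGS:
        new_name = NAME_MAPPINGS[parts[0]]
        warnings.warn(
            f"`{parts[0]}` is deprecated, please query for `{new_name}` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        parts[0] = new_name
    # lookup suffixes such as __in or __startswith are kept untouched
    return "__".join(parts)
```

One thing to weigh: Python's default filters hide `DeprecationWarning` unless it is triggered by code in `__main__`, whereas `FutureWarning` is shown by default and is meant for end users of an application.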

lamindb/_transform.py

Zethson · 2025-01-09T10:39:36Z

lamindb/_transform.py

+        else:
+            if description is None:
+                description = kwargs.pop("name")
+                logger.warning(


FutureWarning?

tests/core/test_describe_and_df_calls.py

falexwolf · 2025-01-09T11:40:44Z

Thanks, @Zethson. Responses below.

Should we consider not making this an underscore attribute given that visibility wasn't either?

In fact this is backward compatible and people can keep doing filter(visibility=0). I fixed this in the FAQ.

Generally, for all things visibility, I found visibility to be much easier to understand than _branch_code although I of course see how much power we gain with it. It reads hacky though for users in my opinion.

💯 visibility will stay around at least for a while for the user (this PR doesn't break the calls). _branch_code is a private/internal concept that is exposed through "Archive", "Trash", "PR", etc. and until lamindb 2 also with "visibility".

WDYT about using obj_type instead of otype? It took me a few seconds too much to understand what you meant.

The main reason for otype is dtype. You bringing up obj_type makes me realize a problem I hadn't thought of. We also have n_objects and refer to the things in storage as "objects". I think we might want to rename n_objects to n_files. The latter name again isn't so great because these files can be fragments in an array representation. But ultimately they're blobs like files. Which means we could also use n_blobs. I will ask Claude.

But either way I don't like obj_type because it doesn't look pretty and isn't a known convention. If anything I'd say object_type. But then I guess otype is OK because it parallels dtype.

I don't get warm with kind at all but well type is a very overloaded term...

Yes, I agree. But it's still better than type. I linked the Claude discussions for everything at the top. It's hard to find something better than kind.

github-actions · 2025-01-09T12:42:59Z

🚀 Deployed on https://677fd73702de6ec05f2783f7--lamindb-qnwk.netlify.app

falexwolf added 3 commits January 5, 2025 13:12

🚚 Remove _source_code_artifact M2M from Transform

222931a

💚 Fix

8e9be04

🚚 Rename transform.name to transform.description

5d68c5e

falexwolf changed the title ~~🚚 Remove _source_code_artifact M2M from Transform~~ 🚚 Migrations for lamindb 1.0 Jan 5, 2025

falexwolf added 7 commits January 5, 2025 14:43

♻️ Add _page_md to all registries, remove the existing Transform.description field, re-purpose Transform.key as the central semantic identifier, and rename Transform.name to Transform.description

3969869

♻️ Fix code

6d25842

♻️ No longer auto-lookup record upon Transform.name & version match

c549a50

♻️ Attempt fix behavior

d2b03c6

💚 Fixes

eac670c

💚 Fix

ab5524c

♻️ This one loads fine

1fe973d

falexwolf force-pushed the migrate branch from 3de5795 to 1fe973d Compare January 5, 2025 15:07

♻️ Refactor still loads fine

bb0fe2f

falexwolf linked an issue Jan 5, 2025 that may be closed by this pull request

🏗️ Migrations for LaminDB 1.0 #2316

Open

falexwolf changed the title ~~🚚 Migrations for lamindb 1.0~~ 🚚 Migrations for LaminDB 1.0 Jan 5, 2025

falexwolf added 2 commits January 5, 2025 20:00

⏪️ Remove the _page_md field

58e46ef

💚 Fix

4fe7a7d

🚚 Add a boolean _public field to Record

2ce631e

falexwolf force-pushed the migrate branch from 4a03222 to 2ce631e Compare January 7, 2025 14:09

falexwolf added 4 commits January 7, 2025 15:09

🔀 Resolve conflict on submodule

d1496a3

Merge branch 'main' into migrate

5ff9bee

🔀 Point back to correct lamin-cli branch

442472c

🚚 Remove the _public flag and introduce the _branch_code flag that encompasses visibility

483fb37

falexwolf added 7 commits January 8, 2025 22:51

💚 Fix

bf6a7f3

💚 Fix

05c734f

💚 More fixes

652e8ec

🚚 Create _overwrite_versions boolean on Artifact

9c4559b

💚 Fix

f9d10cb

♻️ Refactor

7c19f81

💚 Make it work

0f3b2e5

falexwolf requested review from Koncopd, Zethson, chaichontat and sunnyosun January 8, 2025 22:21

sunnyosun approved these changes Jan 9, 2025

Koncopd approved these changes Jan 9, 2025

Koncopd reviewed Jan 9, 2025

lamindb/_record.py (resolved)

Koncopd reviewed Jan 9, 2025

tests/core/test_describe_and_df_calls.py (resolved)

Zethson reviewed Jan 9, 2025

falexwolf added 4 commits January 9, 2025 13:00

📝 Fix visibility FAQ

2173332

📝 Try to fix docs

3b3d880

📝 Fix docs

1b182f4

♻️ Fix import statements

857090f

github-actions bot temporarily deployed to pull request January 9, 2025 12:43 Inactive

🚚 Rename n_files to n_objects

830ed37

github-actions bot temporarily deployed to pull request January 9, 2025 14:03 Inactive
