Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚚 Migrations for lamindb 1.0 #2323

Open
wants to merge 101 commits into
base: main
Choose a base branch
from
Open

🚚 Migrations for lamindb 1.0 #2323

wants to merge 101 commits into from

Conversation

falexwolf
Copy link
Member

@falexwolf falexwolf commented Jan 5, 2025

Other changes

  • upgraded Django version to >=5 because we're using db_default with Python >= 3.10 -- discussion here on GitHub

Schema changes

  • remove _source_code_artifact from Transform, it's been deprecated since 0.75 222931a
    • data migration: for all transforms that have _source_code_artifact populated, populate source_code
  • rename Transform.name to Transform.description because it behaves like Artifact.description and it's confusing because people expect versioning to happen based on name 5d68c5e Claude chat
    • backward compat:
      • in the Transform constructor use name to populate key in all cases in which only name is passed
      • return the same transform based on key in case source_code is None via ._name_field = "key"
    • data migrations:
      • there already was a legacy description field that was never exposed on the constructor; to be safe, we concatenated potential data in it on the new description field Claude chat
      • for all transforms that have key=None and name!=None, use name to pre-populate key
  • rename Collection.name to Collection.key for consistency with Artifact & Transform and the high likelihood of users wanting to organize them hierarchically 2bef93a
  • a _branch_code integer on every record to model pull requests 483fb37
    • include visibility within that code
    • repurpose visibility=0 as _branch_code=0 as "archive"
    • put an index on it: see Claude chat
    • code a "draft" as _branch_code = 2, and "draft prs" as negative branch codes Claude thread -- this is better than adding another field as originally planned (internal Slack thread)
  • rename values "number" to "num" in dtype
  • an .aux json field on Record: Claude thread
  • a SmallInteger run._status_code that allows to write finished_at in clean up operations so that there is a run time also for aborted runs
  • rename Run.is_consecutive to Run._is_consecutive
  • a _template_id FK to store the information of the generating template (whether a record is a template is coded via _branch_code)
  • rename _accessor to otype to publicly declare the data format as suffix, accessor
  • rename Artifact.type to Artifact.kind (internal Slack thread) Claude chat
  • a FK to artifact run._logfile which holds logs
  • a hash field on ParamValue and FeatureValue to enforce uniqueness without running the danger of failure for large dictionaries 9e4b503 Claude chat
  • an Artifact._curator: dict field 068bcf7
  • add a boolean field ._expect_many to Feature/Param that defaults to True/False and indicates whether values for this feature/param are expected to occur a single or multiple times for every single artifact/run 1cc1ca8
    • for feature
      • if it's True (default), the values come from an observation-level aggregation and a dtype of datetime on the observation-level mean set[datetime] on the artifact-level
      • if it's False it's an artifact-level value and datetime means datetime; this is an edge case because an arbitrary artifact would always be a set of arbitrary measurements that would need to be aggregated ("one just happens to measure a single cell line in that artifact")
    • for param
      • if it's False (default), the values mean artifact/run-level values and datetime means datetime
      • if it's True, the values would be from an aggregation, this seems like an edge case but say when characterizing a model ensemble trained with different parameters it could be relevant
  • remove the .transform foreign key from artifact and collection for consistency with all other records; introduce a property and a simple filter statement instead (context: https://github.com/laminlabs/laminhub/issues/1498) a4e0fb2
  • store provenance metadata for TransformULabel, RunParamValue, ArtifactParamValue 576d858
  • enable linking projects & references to transforms & collections: laminlabs/ourprojects@370a869
  • rename Run.parent to Run.initiated_by_run 9fbdd6f
  • introduce a boolean flag on artifact that's called _overwrite_versions, which indicates whether versions are overwritten or stored separately; it defaults to False for file-like artifacts and to True for folder-like artifacts - see Slack thread
  • Rename n_objects to n_files for more clarity laminlabs/lamindb-setup@eba36ba Claude chat

TODO:

  • a boolean on ULabel called is_concept to distinguish ontological concepts from instances of the concepts: labels meant for labeling; for instance, you would never want to label an artifact with a ulabel Project, you'll only want to label with actual project values Project 1, Project 2, etc. 6bd7f62
  • attach curator to FeatureSet or re-purpose FeatureSet as curator in some way Slack thread
  • think through naming conventions of lamindb-related records: transfer, save_vitessce_config, run reports, etc. and migrate them
  • rename "glue" transform type

Materials

Needs:

Affects:

Triggered these issues:

Did not move forward with these changes

  • a Record._page_md textfield so that every entity can hold markdown content
  • a Record._public boolean that indicates whether a record is public Claude chat
  • rename Feature to Dimension and FeatureSet to DimGroup
  • check that there is an index on paramvalue.value and featurevalue.value, there is None but indexes on jsons are complicated and these tables won't be big so we rather hold off
  • consider whether we want to eliminate the link tables associated with "_previous_runs" runs; decided against it Claude thread
  • given that feature_sets is a public attribute and given that some people might want to query for feature_values or param_values outside the .features and .params managers; we should consider making _feature_values and _param_values public. This will likely also help people better understand the underlying mechanics if they're ever interested. Decided against because then there would be two ways to query by a feature or param. Everything is already supported through the lamindb-native API and by that considerably simpler than doing this with Django, in particular since ✨ Enable to easily join features onto artifacts via Artifact.df() #2238. Also see the run docs where nested queries are exemplified: image

@falexwolf falexwolf changed the title 🚚 Remove _source_code_artifact M2M from Transform 🚚 Migrations for lamindb 1.0 Jan 5, 2025
@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

automatically tracking artifact folder: https://claude.ai/chat/85b7a953-36d5-4afa-8efc-1bd6886eaefe

The conversation you were looking for could not be found.
what is it actually?

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

a Folder model with registry | key (see internal Claude discussion)

I don't really understand the point, as we have Storage already. Also we have different storage types, like http, hf, s3 etc, not clear what is a folder for http or hf. If a meaningful grouping is needed, we have Collection for that. This will b super confusing if we have Storage, Artifact, Collection and then also Folder.
Do we have a discussion for this?

@falexwolf falexwolf linked an issue Jan 5, 2025 that may be closed by this pull request
@falexwolf falexwolf changed the title 🚚 Migrations for lamindb 1.0 🚚 Migrations for LaminDB 1.0 Jan 5, 2025
@falexwolf
Copy link
Member Author

The conversation you were looking for could not be found. what is it actually?

This is something for the frontend; I shared the Claude discussion with you. After the whole exploration around the block model this will likely make more sense in laminhub_client (https://github.com/laminlabs/laminhub/issues/1784).

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

I still don't see it.

@falexwolf
Copy link
Member Author

falexwolf commented Jan 5, 2025

@Koncopd
Copy link
Member

Koncopd commented Jan 5, 2025

Ok, that is clearer to me now, but i am still not sure. If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

@falexwolf
Copy link
Member Author

If we have a path in s3, gs, and http with the same "folder" in the key, is it really the same folder and do we want to see it as a unified folder really?

ArtifactFolder and TransformFolder and CollectionFolder are merely used in the UI and merely for virtual keys. Let's discuss this here when we get to it: https://github.com/laminlabs/laminhub/issues/1787

@falexwolf
Copy link
Member Author

falexwolf commented Jan 8, 2025

This is ready.

After another review of at least @sunnyosun we can merge this to main and then deploy the migrations against our own instances to test for a day or two.

On the weekend, we can complete the entire migration.

Please be aware that this is the "final core schema". We'll not change it for the year to come unless a mistake happened here. So, it's good to spot anything that looks fishy (@Koncopd already spotted some that I resolved).

@falexwolf
Copy link
Member Author

FYI: the tests that hit our remote instances are commented out because I'll need to deploy the migrations first; I will only do this once it's reviewed.

@sunnyosun
Copy link
Member

Could we rename FeatureManager._add_from to FeatureManager.add_from? (backward compatible)

def _add_from(self, data: Artifact | Collection, transfer_logs: dict = None):

Copy link
Member

@Koncopd Koncopd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the docs fail, otherwise looks good to me.

f"`name` will be removed soon, please pass '{description}' to `description` instead"
)
else:
raise ValueError("name doesn't exist anymore `description`")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear to me what this tries to convey.

@Koncopd
Copy link
Member

Koncopd commented Jan 9, 2025

I have added a few comments about minor things.

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

  1. ln.Artifact.filter(visibility=None).df() gets changed to ln.Artifact.filter(_branch_code=None).df() in the deletion FAQ. I don't like that we're suggesting to filter by an underscore attribute. Should we consider not making this an underscore attribute given that visibility wasn't either?
  2. Generally, for all things visibility, I found visibility to be much easier to understand than _branch_code although I of course see how much power we gain with it. It reads hacky though for users in my opinion.
  3. WDYT about using obj_type instead of otype? It took me a few seconds too much to understand what you meant.
  4. I don't get warm with kind at all but well type is a very overloaded term...

.pre-commit-config.yaml Outdated Show resolved Hide resolved
lamindb/_collection.py Show resolved Hide resolved
lamindb/_query_set.py Show resolved Hide resolved
parts = field.split("__")
if parts[0] in name_mappings:
if parts[0] != "transform":
logger.warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeprecationWarning?

lamindb/_transform.py Show resolved Hide resolved
else:
if description is None:
description = kwargs.pop("name")
logger.warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FutureWarning?

tests/core/test_describe_and_df_calls.py Show resolved Hide resolved
@falexwolf
Copy link
Member Author

Thanks, @Zethson. Responses below.

Should we consider not making this an underscore attribute given that visibility wasn't either?

In fact this is backward compatible and people can keep doing filter(visibility=0). I fixed this in the FAQ.

Generally, for all things visibility, I found visibility to be much easier to understand than _branch_code although I of course see how much power we gain with it. It reads hacky though for users in my opinion.

💯 visibility will stay around at least for a while for the user (this PR doesn't break the calls). _branch_code is a private/internal concept that is exposed through "Archive", "Trash", "PR", etc. and until lamindb 2 also with "visibility".

WDYT about using obj_type instead of otype? It took me a few seconds too much to understand what you meant.

The main reason for otype is dtype. You bringing up obj_type makes me realize a problem I hadn't thought of. We also have n_objects and refer to the things in storage as "objects". I think we might want to rename n_objects to n_files. The latter name again isn't so great because these files can be fragments in an array representation. But ultimately they're blobs like files. Which means we could also use n_blobs. I will ask Claude.

But either way I don't like obj_type because it doesn't look pretty and isn't a known convention. If anything I'd say object_type. But then I guess otype is OK because it parallels dtype.

I don't get warm with kind at all but well type is a very overloaded term...

Yes, I agree. But it's still better than type. I linked the Claude discussions for everything at the top. It's hard to find something better than kind.

Copy link

github-actions bot commented Jan 9, 2025

@github-actions github-actions bot temporarily deployed to pull request January 9, 2025 12:43 Inactive
@github-actions github-actions bot temporarily deployed to pull request January 9, 2025 14:03 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🏗️ Migrations for LaminDB 1.0
4 participants