
Enhance Kedro Namespaces adoption #4343

Open
6 tasks
DimedS opened this issue Nov 21, 2024 · 13 comments

Comments

@DimedS
Member

DimedS commented Nov 21, 2024

Kedro namespaces are currently not widely used. The team is divided on the reasons for this:

  1. Local issues such as incomplete docs, unresolved technical challenges, and potential user concerns about the interface.
  2. The namespace feature has been primarily suited for pipeline reusability. However, due to its complexity and lack of successful adoption over the past five years, it may require a significant redesign.

This parent issue aims to facilitate an agreed-upon decision regarding the points above and address these concerns. It is also tied to the goal of improving deployment functionality, where namespaces should play a pivotal role in node grouping.

History


Improving docs

  • The current documentation focuses primarily on how namespaces enhance pipeline reusability (see docs). However, this ticket proposes updating the docs to include a clear definition of namespaces, highlighting that they are similar to node tagging but do not allow overlaps. This makes namespaces an excellent choice for creating groups of nodes that can be executed together without conflicts.
    Suggested docs example:
    - Create pipelines without namespaces: Show how to build basic pipelines.
    - Create namespaced pipelines: Use the initial pipelines to create namespaced versions.
    - Combine pipelines: Build a final pipeline by combining the namespaced ones.
    - Visualise: Include a visualisation using Kedro-Viz (link to ticket in progress).

  • Decide new name for "reused pipelines with namespaces" #4016

  • Clarifying Modularity. The term "modularity" currently appears to relate to creating pipelines in separate folders, not namespaces. If this interpretation is correct, we should explicitly clarify this distinction in the docs.
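The "similar to tags, but without overlaps" definition above can be illustrated with a minimal sketch in plain Python. This is not Kedro code: the `overlapping_nodes` helper, the group names, and the node names are all hypothetical, and nested namespaces are ignored for simplicity.

```python
from collections import defaultdict


def overlapping_nodes(groups: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return every node that belongs to more than one group, with its groups."""
    membership = defaultdict(set)
    for group, nodes in groups.items():
        for node in nodes:
            membership[node].add(group)
    return {
        node: sorted(names)
        for node, names in membership.items()
        if len(names) > 1
    }


# Tag-like grouping: a node may carry several tags, so groups can overlap.
tag_groups = {
    "preprocessing": {"clean", "split"},
    "training": {"split", "fit"},
}

# Namespace-like grouping: each node sits in exactly one group.
namespace_groups = {
    "preprocessing": {"clean", "split"},
    "training": {"fit"},
}

print(overlapping_nodes(tag_groups))        # {'split': ['preprocessing', 'training']}
print(overlapping_nodes(namespace_groups))  # {}
```

An empty result means the groups can be executed independently without any node running twice.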


Technical issues

Several technical issues were highlighted by @idanov during the last TD. These will be moved here for tracking (details in progress).


User interface

There is a potential user interface concern affecting namespace adoption, which might benefit from design attention (@stephkaiser, @iamelijahko).

  • Namespaces are conceptually similar to tags (but without node overlaps), yet tagging adoption is strong, especially for deployment purposes.
  • The UI for tags and namespaces differs significantly:

Tagging Example: Tags are added directly during node or pipeline creation:

node(func=add, inputs=["a", "b"], outputs="sum", name="adding_a_and_b", tags="node_tag")

Alternatively, for pipelines:

my_pipeline = pipeline([...], tags="pipeline_tag")

Namespace Example: Namespaces are applied at the pipeline creation level and involve multiple steps:

  1. Create a pipeline without namespaces:
    part1 = pipeline([
        node(func=split_data, inputs=["a", "b"], outputs=["c", "d"], name="node1"),
        node(func=split_data, inputs=["c", "d"], outputs=["e", "f"], name="node2"),
    ])
  2. Add a namespace:
    part1_ns = pipeline(part1, namespace="part1_ns") # pipeline name most likely repeats namespace name
    This prefixes all inputs, outputs, and parameters with the "part1_ns." prefix, which is most likely not desired. To preserve naming:
    part1_ns = pipeline(part1, namespace="part1_ns", inputs={"a", "b"}, outputs={"e", "f"})
    # I need to specify my inputs and outputs twice
  3. Combine pipelines in the registry:
    my_pipeline = part1_ns + part2_ns + ...
    # Each node group is actually a pipeline here, whereas with tags a group is just a tag on each node

Tags are applied directly to nodes, whereas namespaces require changes at the pipeline level. Simplifying the namespace UI or aligning it more closely with tagging might also improve adoption.
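The renaming behaviour described in step 2 can be modelled with a minimal sketch in plain Python. It is an illustration of the prefixing rule as described in this issue, not Kedro's implementation; the `apply_namespace` function and the dictionary-based node representation are assumptions made for the example.

```python
def apply_namespace(nodes, namespace, keep=frozenset()):
    """Prefix dataset names with `<namespace>.` unless they are listed in `keep`."""

    def rename(name):
        return name if name in keep else f"{namespace}.{name}"

    return [
        {
            "name": node["name"],
            "inputs": [rename(i) for i in node["inputs"]],
            "outputs": [rename(o) for o in node["outputs"]],
        }
        for node in nodes
    ]


part1 = [
    {"name": "node1", "inputs": ["a", "b"], "outputs": ["c", "d"]},
    {"name": "node2", "inputs": ["c", "d"], "outputs": ["e", "f"]},
]

# Default: every dataset is prefixed, including the external inputs and outputs.
prefixed = apply_namespace(part1, "part1_ns")
print(prefixed[0]["inputs"])   # ['part1_ns.a', 'part1_ns.b']

# Preserving external names means listing them again, as noted in step 2.
preserved = apply_namespace(part1, "part1_ns", keep={"a", "b", "e", "f"})
print(preserved[0]["inputs"])   # ['a', 'b']
print(preserved[0]["outputs"])  # ['part1_ns.c', 'part1_ns.d']
```

Only the intermediate datasets `c` and `d` end up prefixed in the second call, which is the outcome the repeated `inputs` and `outputs` arguments are meant to achieve.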

A few other UI gaps reported by users:

Hackathon

Namespaces in deployment

We aim to unify and implement node grouping functionality for deployment purposes in #4319. Namespaces appear to be a great fit for this purpose. However, the ongoing work in this ticket to increase namespace adoption must be completed at the same time.

@datajoely
Contributor

Adding a bit of context - deep integration with Kedro-Viz was the first attempt to drive adoption and improve explainability:

We have spent nearly 5 years trying to explain this to users in various ways - We must pivot strategy.

@astrojuanlu
Member

astrojuanlu commented Nov 22, 2024

Thanks @DimedS for opening this issue.

First, I would like to agree that tags do not guarantee non-overlapping pipeline partitioning. This has been said time and time again.

But I am going to push back against the idea that namespaces are the right solution for that problem. The main reason is that they were probably never designed to solve it in the first place!

Namespaces were born as "prefixes" and were introduced in Kedro 0.15.4 in October, 2019:

3c0f097

(https://github.com/McK-Private/private-kedro/pull/286, private link)

And then in 0.16.0 the modern concept of "modular pipelines" with namespace was introduced in March 2020:

af046ca

Therefore a bit less or a bit more than 5 years have passed, depending on how you look at it.

The original context and discussion have forever been lost in time https://jira.quantumblack.com/browse/KED-1105 (broken internal link) but we can get a glimpse of what the intent of the feature was from this comment:

Nikos pointed me to this and having thought a bunch about vertical pipeline development and pipeline re-use, recently

(https://github.com/McK-Private/private-kedro/pull/286#issuecomment-542717548, private link)

In addition, this is what the documentation of prefixes, and later namespaces, looked like:

The docs have always described namespaces (prefixes) as a way to reuse pipelines. There were zero review comments in those two PRs raising concerns about that.

To note, nobody from the current team participated in the original 0.15 discussion.

Therefore, I can only conclude that namespaces were always designed with pipeline reuse in mind.

Implying that namespaces have always been the solution for pipeline non-overlapping partitioning is, in my view, a big unqualified opinion that has no backing in historical written evidence. And as such, saying that "the docs are wrong" is a misrepresentation of what those docs were supposed to describe.

If anything, we're now retrofitting namespaces to solve a problem they weren't intended to solve in the first place.


I am going to push back against doing incremental improvements on a feature that nobody has dared to touch in 5 years, that's difficult to understand even for Kedro engineers, let alone for our users (regardless of their intended use case), and that we're probably retrofitting to solve a problem they weren't designed for.

My recommendation is that we look at the problem of non-overlapping pipeline partitioning with fresh eyes, go back to the drawing board, and prototype.

@datajoely
Contributor

I would also say from users

  • Pipeline reuse is either a solved or minimal problem these days
  • Deployment is the much more acute sore spot. Dependency isolation, container granularity all fall into this space and we're not doing a good job of any.

@DimedS
Member Author

DimedS commented Nov 22, 2024

Thank you for your comments, @datajoely and @astrojuanlu. I see that there isn’t a consensus within the team about the future of namespaces, so I’ve updated the header of this issue to reflect your perspectives.

I propose that we continue the discussion about deployment node grouping in the next Tech Design meeting with an open mind to all grouping possibilities - not limited to namespaces. If, during that discussion, we determine that namespaces are essential for deployment, we can revisit this conversation and make a decision on their future.

@datajoely
Contributor

Great - I'll also link to this write up from last year:
https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms

@deepyaman
Member

Implying that namespaces have always been the solution for pipeline non-overlapping partitioning is, in my view, a big unqualified opinion that has no backing in historical written evidence.

Modular pipelines have long been the solution for non-overlapping pipeline composition. They are the original subunit that you can create with a simple command, and they came with provisions for defining their own set of requirements, packaging, documentation, etc.

Namespaces simply provided a way to enable reuse of modular pipelines without overlapping.

Between modular pipelines and namespaces, you have sufficient power for deployment purposes. You can deploy a full pipeline, or a set of subpipelines.

Tags are OK for running subsets of pipelines, but provide no guarantees around overlapping; it's probably fine if they exist. They don't necessarily have to be a recommended deployment solution.

Deployment is the much more acute sore spot. Dependency isolation, container granularity all fall into this space and we're not doing a good job of any.

IMO more effectively automatically mapping to containers for orchestration was a minimum bar a couple years ago. This one-way translation was perhaps a good start then. Realistically, modern solutions now provide more aspects of a data platform than just a way to deploy pieces of logic in isolation--they provide lineage, data quality, partial/incremental materialization, etc. These desires are all repeatedly echoed by users, especially data engineers; namespaces are useful, but will go 2% of the way towards providing the full value users expect from data pipelines today.

@astrojuanlu
Member

Modular pipelines have long been the solution for non-overlapping pipeline composition.

This! Why emulate the concept of hierarchical directories with namespaces when modular pipelines are laid out in actual directories?

@idanov
Member

idanov commented Nov 26, 2024

Hard truths

If we'd like to group parts of a bigger pipeline into separate units of execution, there's no alternative to exclusive grouping, hence namespaces. You may change the name however you like, but it's the hard truth of mathematics and we cannot pretend that this is not the case. A point to discuss could be whether we need deep (nested) namespaces or not, but not the hard truth that exclusive grouping is necessary and different from inclusive grouping (tags).
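To make the "separate units of execution" point concrete, here is a minimal sketch in plain Python of collapsing a node graph into a group graph. It assumes each node belongs to exactly one group; `group_execution_order`, the node tuples, and the group names are hypothetical and do not come from Kedro.

```python
from graphlib import TopologicalSorter


def group_execution_order(nodes, group_of):
    """Collapse a node graph into a group graph and return a valid group run order.

    nodes: {node_name: (inputs, outputs)}
    group_of: {node_name: group_name}, exactly one group per node
    """
    # Map each dataset to the node that produces it.
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}

    # A group depends on another group when one of its nodes consumes
    # a dataset produced by a node in that other group.
    deps = {group: set() for group in group_of.values()}
    for name, (inputs, _) in nodes.items():
        for dataset in inputs:
            upstream = producer.get(dataset)
            if upstream is not None and group_of[upstream] != group_of[name]:
                deps[group_of[name]].add(group_of[upstream])

    # static_order raises graphlib.CycleError if groups depend on each other cyclically.
    return list(TopologicalSorter(deps).static_order())


nodes = {
    "node1": (["a", "b"], ["c", "d"]),
    "node2": (["c", "d"], ["e", "f"]),
    "node3": (["e", "f"], ["g", "h"]),
}
group_of = {"node1": "part1", "node2": "part1", "node3": "part2"}

print(group_execution_order(nodes, group_of))  # ['part1', 'part2']
```

Because `group_of` maps every node to a single group, each node runs in exactly one unit. With an overlapping (tag-style) assignment there would be no single group to place a shared node in.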

History

As probably the only person in the team who has been around since the beginning of the discussions about the feature called namespaces, I can confirm that the initial introduction of namespaces was indeed mostly related to reusability, and the automatic prefixing was crucial because of the requirement to reuse the same pipeline twice in a bigger pipeline.

However I can also assure you that the discussions about exclusive / inclusive grouping have started very shortly after, namely with the conceiving of the Kedro Viz feature of visualising "modular pipelines".

Meanwhile, at that time there was similar confusion and very strong opinions on how difficult the concept of registered pipelines is, and how we shouldn't have such a thing, so and so. A few years later, people seem to understand it quite well, once we managed to explain it well and provide a good example and good entry point for it.

So it's not going to be a precedent that we failed to explain something for a long time and then we got it right and it turned out to be actually a very simple and easy concept.

False premises

Some digging is needed for the data and evidence thrown around, namely:

yet tagging adoption is strong, especially for deployment purposes.

I haven't seen any hard data on that yet, only a few anecdotal quotes amplified by a couple of people who use it to support their opinion mainly. And even if this is right, I haven't seen a double-click on it - why is tagging used for deployment? Why are namespaces not used?

As for this:

spent nearly 5 years trying to explain this to users in various ways

That is not an accurate statement: we made exactly two (at most three, if I am generous) attempts to explain it in our docs, each with less than one week of effort by a single person. Also, every time the explanation was tightly coupled with the ambiguous concept of reusability, and not the higher concept of grouping.

I am not sure how we measure adoption to state that this isn't an adopted feature, but even more so, you only need to adopt namespaces when you need namespaces. So pure adoption metrics are not a great measurement. I am pretty sure that if we start counting the nodes in pipelines, we'll find out that only a single-digit percentage of users have more than 100 nodes, but I would never use that as an argument that we need to restrict the number of nodes to 100 because adoption of bigger pipelines is low.

Deployment is the much more acute sore spot. Dependency isolation, container granularity all fall into this space and we're not doing a good job of any.

These are irrelevant concerns for this discussion, we are talking about grouping. Dependency isolation is a choice of how you structure your project. Is it one Python package or several Python packages? Kedro focuses on making one Python package, thus assuming all nodes share the same set of dependencies. If you want a few packages, you can create a few Kedro projects.

Focus the discussion

So in order to limit digressions, could we narrow down the discussion and start from first principles:

  • How to group your nodes exclusively in the easiest way possible?
    • Applications of this could be:
      • pipeline reuse
      • grouping for deployments,
      • simplify graph visualisations
      • ...feel free to add more
  • How to explain this to users with a realistic example, which doesn't involve cooking breakfast or focus on one specific application only?
  • How to differentiate it from non-exclusive grouping (tagging)?
  • Is there a better and more intuitive way of achieving exclusive grouping than what we have currently?
  • Can we provide guidance on when to use exclusive grouping and when to use non-exclusive grouping?
  • What other parts of the framework make the use of exclusive grouping difficult?
    • Adding config for all your datasets?
    • Adding parameters?
    • Running a group of nodes?
    • Intermediate datasets?

Compare tags with namespaces

I will be repurposing @DimedS's example to have a fair comparison and accurately highlight the difference between namespaces and tags. Single node tagging is not relevant for the discussion, as grouping one node is a meaningless operation (you need at least 2 entities to call something a group).

  1. Create a pipeline without namespaces / tags (same for namespaces and tags):
    part1 = pipeline([
        node(func=split_data, inputs=["a", "b"], outputs=["c", "d"], name="node1"),
        node(func=split_data, inputs=["c", "d"], outputs=["e", "f"], name="node2"),
        node(func=split_data, inputs=["e", "f"], outputs=["g", "h"], name="node3"),
        ...
        node(func=split_data, inputs=["w", "x"], outputs=["y", "z"], name="nodeN"),
    ])
  2. Namespace / tag a pipeline
    a. Add a tag:
    part1_ns = pipeline(part1, tags="part1_ns") # pipeline name most likely repeats tag name
    No change in input / output / node names.

    b. Add a namespace:
    part1_ns = pipeline(part1, namespace="part1_ns") # pipeline name most likely repeats namespace name
    This prefixes all inputs, outputs, and parameters with the "part1_ns." prefix, which is most likely not desired. To preserve naming:
    part1_ns = pipeline(part1, namespace="part1_ns", inputs={"a", "b"}, outputs={"e", "f"})
    # I need to specify my inputs and outputs twice
  3. Combine pipelines in the registry (same for namespaces and tags):
    my_pipeline = part1_ns + part2_ns + ...
    # Each of my nodes has either a namespace or a tag

As you can see, if we look at the problem dispassionately, we'll see that there's minimal difference in the API for both, so it's very unlikely that the "problem of adoption" hides there. The only difference is semantics, which is the automatic renaming of inputs / outputs. This can lead us to a way more productive discussion on how to go forward and what tradeoffs to make.

Questions

  • Do we care about encapsulating dataset naming or not?
    • Should we prefix inputs / outputs of a pipeline automatically (currently yes, original proposal by me was no, voted down by the team back then supported by user research)?
    • Should we prefix intermediate datasets automatically?
    • Should we prefix parameters automatically?
  • If we drop automatic renaming for datasets, how to avoid naming clashes?
  • If we keep automatic renaming, how to have a more meaningful default?
  • If we keep automatic renaming for intermediates only, how to expose an intermediate dataset as an output?
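The naming-clash question can be illustrated with a minimal sketch in plain Python. It shows why reusing the same pipeline twice without any renaming produces duplicate outputs, and how per-copy prefixes avoid that. The `duplicate_outputs` and `with_prefix` helpers, the dataset names, and the "model_a" / "model_b" prefixes are hypothetical and not part of Kedro.

```python
from collections import Counter


def duplicate_outputs(nodes):
    """Return the dataset names that more than one node produces."""
    counts = Counter(out for node in nodes for out in node["outputs"])
    return sorted(name for name, count in counts.items() if count > 1)


def with_prefix(nodes, prefix, shared=frozenset()):
    """Copy `nodes`, prefixing every dataset name that is not in `shared`."""

    def rename(name):
        return name if name in shared else f"{prefix}.{name}"

    return [
        {
            "inputs": [rename(i) for i in node["inputs"]],
            "outputs": [rename(o) for o in node["outputs"]],
        }
        for node in nodes
    ]


template = [
    {"inputs": ["raw"], "outputs": ["features"]},
    {"inputs": ["features"], "outputs": ["model"]},
]

# Reusing the template twice with no renaming: both copies write the same datasets.
naive = template + template
print(duplicate_outputs(naive))  # ['features', 'model']

# Prefixing each copy keeps outputs unique while both copies still read the shared "raw" input.
reused = (
    with_prefix(template, "model_a", shared={"raw"})
    + with_prefix(template, "model_b", shared={"raw"})
)
print(duplicate_outputs(reused))  # []
```

Any scheme that drops automatic renaming would need some other mechanism to keep the second case free of clashes.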

Bear in mind that the discussion about automated renaming happened a couple of times a long time ago, was informed by "user research", and the behaviour changed twice to reflect the findings. Then again, we followed the opinion of our users and not logic when making the decisions, and we ended up here. This should serve as a cautionary tale that feature development is not a popularity contest, but an exercise in logic, vision and foreseeing problems before they arise.

@astrojuanlu
Member

Another issue we spotted: #4039

@astrojuanlu
Member

Digging on a recent comment by @noklam :

[An asset leader] wants to show pipeline as "collapsable" pipeline but he is not using namespace pipeline. In his mind, he thought the keys of pipeline_registry is the "collapsable" part

@datajoely
Contributor

^ On this I would love to proxy tags / registered pipelines which do meet the requirements of a namespace to be thought of as one and collapsed accordingly

@idanov
Member

idanov commented Dec 12, 2024

^ On this I would love to proxy tags / registered pipelines which do meet the requirements of a namespace to be thought of as one and collapsed accordingly

I don't think this is a good idea. It's a recipe for more confusion - it's super weird that something works only sometimes in a non-obvious way.

Again, let's refocus the discussion to answer the question, "what exactly makes namespaces less intuitive than tags, for example?". Without creating a list of challenges, there's no point in making any decisions, because it will all basically be opinion against opinion.

I've shown in my earlier comment that the difference in API is minimal and clearly cannot be the culprit of the issues. There's only one difference there, namely auto-renaming of datasets behind the scenes. Shall we not dig in there first to see if we can improve that before actually suggesting we go ahead with some other random ideas like repurposing tags, registered pipelines and what-not?

@MatthiasRoels

MatthiasRoels commented Jan 7, 2025

In this whole discussion about tags and namespaces, I never really understood why we need 3 concepts to group nodes (tags, pipelines and namespaces). To me, it feels weird to have to "force" a user to create a bunch of pipelines. But when it comes to deployment, you completely forget about this concept and you would need another concept (a namespace or a tag) to make it work.

Wouldn't a better alternative be to think of a pipeline as a "main" function calling several other functions (nodes in this case)? That pipeline could then have inputs and outputs defined that are used in the nodes. When designed right, it should be enough to define only those inputs and outputs in the catalog. Again, following the analogy with coding, the objects you return from your main function are the only variables you have access to later on (references to all other objects are lost).

So when you then want to add two pipelines, you have the following scenarios depending on what the graphs look like:

  • option 1: you have pipeline_1 with inputs A, B and outputs C, D and a pipeline_2 with inputs E, F and outputs G, H so the sum is a pipeline pipeline_1 + pipeline_2 with inputs A, B, E, F and outputs C, D, G, H
  • option 2: you have pipeline_1 with inputs A, B and outputs C, D and a pipeline_2 with inputs A, E and outputs G, H so the sum is a pipeline pipeline_1 + pipeline_2 with inputs A, B, E and outputs C, D, G, H
  • option 3: you have pipeline_1 with inputs A, B and outputs C, D and a pipeline_2 with inputs B, C and outputs G, H so the sum is a pipeline pipeline_1 + pipeline_2 with inputs A, B and outputs C, D, G, H
  • unless I forgot another valid option, all other options should result in a descriptive error that summation is impossible.
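The three options above can be expressed as one rule, sketched here in plain Python. The `Pipe` class and `pipe` helper are hypothetical, and the two error conditions (duplicate outputs and mutual dependency) are assumptions about what "summation is impossible" would mean, since the proposal does not spell them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipe:
    """A pipeline seen as a function: only its declared inputs and outputs matter."""

    inputs: frozenset
    outputs: frozenset

    def __add__(self, other):
        clash = self.outputs & other.outputs
        if clash:
            raise ValueError(f"both pipelines produce {sorted(clash)}")
        if (self.inputs & other.outputs) and (other.inputs & self.outputs):
            raise ValueError("pipelines consume each other's outputs")
        outputs = self.outputs | other.outputs
        # Inputs produced by either pipeline are no longer external inputs of the sum.
        inputs = (self.inputs | other.inputs) - outputs
        return Pipe(inputs, outputs)


def pipe(inputs, outputs):
    return Pipe(frozenset(inputs), frozenset(outputs))


pipeline_1 = pipe(["A", "B"], ["C", "D"])

option_1 = pipeline_1 + pipe(["E", "F"], ["G", "H"])  # independent pipelines
option_2 = pipeline_1 + pipe(["A", "E"], ["G", "H"])  # shared input A
option_3 = pipeline_1 + pipe(["B", "C"], ["G", "H"])  # pipeline_2 consumes output C

print(sorted(option_1.inputs))  # ['A', 'B', 'E', 'F']
print(sorted(option_2.inputs))  # ['A', 'B', 'E']
print(sorted(option_3.inputs))  # ['A', 'B']
```

In all three cases the outputs of the sum are C, D, G and H, matching the options listed above.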

In this scenario, as @astrojuanlu and @noklam mentioned, the keys of the pipelines in the registry are then the ones suitable to collapse (and also used in deployment). As a result, in the deployments, you will see very simple commands like kedro run --env=... --pipeline=... instead of complex ones with kedro run --from-nodes=... --to-nodes=...

And while I understand this is a big and breaking change to what we currently have, I think it simplifies a whole lot.


8 participants