Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hint about extra fields in conda-forge.yml #1920

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ytausch
Copy link
Contributor

@ytausch ytausch commented May 3, 2024

Checklist

  • Added a news entry

This PR contains the functional changes of #1900, to be merged separately. See also parts of the discussion there.

TLDR: recipe-lint now hints about additional fields as compared with the Pydantic schema.

@@ -1,6 +1,11 @@
# -*- coding: utf-8 -*-

from collections.abc import Sequence, Mapping
from typing import List

from pydantic import BaseModel
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isuruf Let's continue the discussion about adding pydantic here.

Copy link
Contributor Author

@ytausch ytausch May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way: pydantic is already listed as a project dependency in environment.yml
Edit: This comment does not contribute to the discussion, my mistake.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity pydantic is a development-only dependency. So that only affects development, not installation. We generate a JSON schema for runtime use. For context, it was added in PR: #1756

Is it possible to use jsonschema for this case too?

Copy link
Contributor Author

@ytausch ytausch May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that pydantic is not currently a part of https://github.com/conda-forge/conda-smithy-feedstock/blob/main/recipe/meta.yaml dependencies.

I am not aware of a good way of doing this with jsonschema.

Also, I want to repeat my argument:

I would like and strongly propose to keep pydantic here. Pydantic provides an elegant way to find additional fields that are not part of the schema (see my implementation) - and also, I think the pydantic model should be used in more parts of the smithy code anyway since it provides a type-safe way to access fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is adding pydantic as a new dependency so bad?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is adding pydantic as a new dependency so bad?

Please see my comment on the other thread.

I am not aware of a good way of doing this with jsonschema.

This is easy. Load the json schema and see if "additionalProperties": false is there for the attribute.

Copy link
Contributor Author

@ytausch ytausch May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment on the other thread.

@isuruf I just scrolled through every comment you left at my conda-smithy PRs but could not find an answer to my question. You said:

This adds a new dependency pydantic on users. Previously it was only a dev dependency. Is it possible to remove this?

and

I strongly disagree. This PR use pydantic to check for which fields to hint about. If we can't do that using the json schema we are essentially making a distinction between pydantic model and json schema.

I do not see how this explains why we want to avoid pydantic as a dependency. To your latter comment, I replied:

What do you mean with "distinction"? The pydantic model should be equivalent to the JSON schema anyway - if it's not, we have a bug.


This is easy. Load the json schema and see if "additionalProperties": false is there for the attribute.

This requires us to have a separate JSON schema that actually has additionalProperties set to false for the fields we want to warn about since we have agreed to allow additional properties in the main schema. Is this what you propose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isuruf I would really really like to solve this debate as I want this PR to get merged.

@ytausch
Copy link
Contributor Author

ytausch commented May 5, 2024

Note: The signature of lintify_forge_yaml changed but makes a lot more sense now since not all lints or hints must be JSON validation-related. Formally, this would require a deprecation process but I want to raise the question of who actually uses this method.

We had a discussion about deprecation in #1906

@ytausch
Copy link
Contributor Author

ytausch commented May 10, 2024

What is the status of this PR? @conda-forge/core @isuruf

@ytausch ytausch force-pushed the hint-extra-fields branch from 036d3df to 155b56e Compare May 13, 2024 14:44
@ytausch
Copy link
Contributor Author

ytausch commented May 27, 2024

@isuruf

@ytausch
Copy link
Contributor Author

ytausch commented Jun 7, 2024

@isuruf There are still some open discussions about pydantic

@ytausch ytausch requested a review from isuruf June 7, 2024 12:43
@ytausch ytausch force-pushed the hint-extra-fields branch from 155b56e to 8a429ed Compare July 10, 2024 10:03
@ytausch
Copy link
Contributor Author

ytausch commented Jul 10, 2024

@isuruf You did not reply here since almost 2 months although I pinged you multiple times. Is there anything I need to do? Fine of course if you're currently busy, I'll wait here.

@ytausch ytausch mentioned this pull request Jul 26, 2024
@h-vetinari
Copy link
Member

Sorry for the very long delay here @ytausch! I stumbled over this due to #2152 and would be very interested to help you unblock this work, as it's necessary for a lint I'd like to add (#2155).

CC @beckermr

Copy link
Member

@beckermr beckermr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is requiring pydantic at run time IIRC. We need to resolve this sticking point before we can merge.

@ytausch
Copy link
Contributor Author

ytausch commented Nov 21, 2024

What exactly is the reason that we don't want to have pydantic as a runtime dependency?

@isuruf
Copy link
Member

isuruf commented Nov 21, 2024

I thought I made my reasons clear in #1900. Let me know if you need more clarification.

@h-vetinari
Copy link
Member

@isuruf: I thought I made my reasons clear in #1900. Let me know if you need more clarification.

Looking at that PR, the only relevant-looking thread I see is

@isuruf: This adds a new dependency pydantic on users. Previously it was only a dev dependency. Is it possible to remove this?

@ytausch: I would like and strongly propose to keep pydantic here. Pydantic provides an elegant way to find additional fields that are not part of the schema (see my implementation) - and also, I think the pydantic model should be used in more parts of the smithy code anyway since it provides a type-safe way to access fields.

@isuruf: I strongly disagree. This PR use pydantic to check for which fields to hint about. If we can't do that using the json schema we are essentially making a distinction between pydantic model and json schema.

@ytausch: What do you mean with "distinction"? The pydantic model should be equivalent to the JSON schema anyway - if it's not, we have a bug. I am not aware of a good way of doing this with JSON schema.

@ytausch: @isuruf Let's continue this in #1920.

So the sticking point is not having a runtime dependence on pydantic? Why is that an issue? I don't care much, but I've found that #2155 without this PR doesn't actually work. I definitely think fixing linting on wrong distro-values is more important than whether we depend on pydantic, but I'd also be happy if there's a fix for #2152/#2155 that works without the runtime dependence.

@isuruf
Copy link
Member

isuruf commented Nov 21, 2024

So the sticking point is not having a runtime dependence on pydantic?

No, the main point is that I want the json schema to be complete.

If we can't do that using the json schema we are essentially making a distinction between pydantic model and json schema.

@h-vetinari
Copy link
Member

h-vetinari commented Nov 21, 2024

No, the main point is that I want the json schema to be complete.

How do you imagine we encode the fact that additional fields are forbidden in the json schema? AFAICT json has no way to specify that, so whatever we do, we'd have to introduce a marker-field that says "this JSON schema cannot be extended". Which is what this PR is doing with "additionalProperties" AFAIU (naming can be bikeshod of course, e.g. __no_extend / __closed, etc.).

To me saying "json schema needs to be complete" thus becomes equivalent to "we shouldn't lint on unknown fields", which I strongly disagree with.

@isuruf
Copy link
Member

isuruf commented Nov 21, 2024

How do you imagine we encode the fact that additional fields are forbidden in the json schema?

additionalProperties: false

@h-vetinari
Copy link
Member

So the issue is just the choice of default? I.e. whether we need to add "additionalProperties": false to non-extensible schemas, or whether - as this PR seems to be doing - set "additionalProperties": true (?!) and build on that?

It'd be completely fine for me to opt into extra="forbid" with "additionalProperties": false; that would be a natural equivalence IMO. Is this what you had in mind?

@isuruf
Copy link
Member

isuruf commented Nov 21, 2024

It'd be completely fine for me to opt into extra="forbid" with "additionalProperties": false; that would be a natural equivalence IMO. Is this what you had in mind?

Yes, but with a hint in the linter instead of a lint.

@h-vetinari
Copy link
Member

@ytausch, now that we've elucidated Isuru's objection, do you think you'd be able to implement this request?

Also, @beckermr, in case you're still opposed to the runtime dependence on pydantic, could you explain the problem with that please?

@beckermr
Copy link
Member

I don't care personally. I was trying to make sure more discussion was had since others had objected.

@ytausch
Copy link
Contributor Author

ytausch commented Nov 22, 2024

I appreciate this PR is going forward!

It'd be completely fine for me to opt into extra="forbid" with "additionalProperties": false;

Yes, but with a hint in the linter instead of a lint.

The problem with combining these 3 requirements is that we cannot partially validate a JSON schema with jsonschema, ignoring additionalProperties: false attributes. Similarly, if a Pydantic model has extra=forbid set, we cannot validate data with additional attributes. I think there is also a good reason: A schema or model is built to either validate or fail to validate; there should be nothing in between.

However, if we don't want to emit a hint and not a lint on extra fields (this PR already does that!), we need to differentiate between extra fields and other schema violations. The way I did it here was to allow extra fields in the Pydantic model (which is also reflected in the JSON schema) and then manually traverse the config to identify additional fields in a second hint step.

Doing it like this also preserves the following invariant: "All conda-forge.yml files in feedstocks can be validated with the Pydantic model of conda-smithy". This is very useful because it allows us to parse all feedstocks with the Pydantic model, which I see as a potential future use case for the autotickbot. Hints can be overridden and do not preserve this invariant.

If I understand correctly, the reason why you are not happy with the current solution is that you want to forbid extra fields in the JSON schema - which I also understand because, for example, IDE support.

Because of the two reasons I mentioned above (we need extra=allow to separate the check for extra fields from the rest of the schema check; Pydantic model invariant), I propose solving this with the following approach:

  1. The current Pydantic model stays as-is.
  2. We add a method that returns a new, "strict", Pydantic model, dynamically modifying all relevant instances of extra=allow to extra=forbid. Never did this before but it should be similar to adding new fields dynamically.
  3. The JSON schema is generated from the strict Pydantic model, thus it has additionalProperties: false in all relevant cases.
  4. The non-strict Pydantic model is used for generating lints, and we generate the hints for extra fields like this PR currently already does.

The strict Pydantic model (which is dynamically accessible via the added method) exactly reflects what is present in the JSON schema. And the non-strict (original) Pydantic model poses the minimum schema every feedstock satisfies.

What do you think?

@isuruf
Copy link
Member

isuruf commented Nov 22, 2024

That sounds good. It wouldn't require a run time dependence on pydantic right?

@h-vetinari
Copy link
Member

The problem with combining these 3 requirements is that we cannot partially validate a JSON schema with jsonschema, ignoring additionalProperties: false attributes.

I don't understand that problem yet. How does jsonschema come into it, if the models between JSON and pydantic are equivalent, and we can just use pydantic to enforce extra="forbid"?

@ytausch
Copy link
Contributor Author

ytausch commented Nov 22, 2024

That sounds good. It wouldn't require a run time dependence on pydantic right?

If absolutely necessary, we could avoid the runtime dependency by generating a strict and a non-strict JSON schema. The non-strict schema would allow extra fields while the strict schema wouldn't. If the non-strict schema passes and the strict schema fails, we know the issue lies in the extra fields.

However, we probably couldn't format the lint messages as nicely as we currently can. And there will be compromises if a file violates the non-strict JSON schema, and on top of that, has extra fields since we cannot really detect that with a JSON-schema-only approach. This is why I propose keeping pydantic as a runtime dependency to have a more elegant way of extracting extra fields.

I don't understand that problem yet. How does jsonschema come into it, if the models between JSON and pydantic are equivalent, and we can just use pydantic to enforce extra="forbid"?

  • If the models between JSON and pydantic are equivalent and extra=forbid is set in pydantic, additionalProperties: false is set in the JSON schema.
  • This makes the JSON schema useless for generating schema lints (and hints) because jsonschema cannot ignore the additionalProperties JSON schema definitions. (Exception: There are separate strict and non-strict JSON schemas, see above to why I oppose that).
  • Thus, we need to use pydantic to 1) enforce the non-strict model to generate lints; and to 2) enforce the strict model to generate hints for extra fields.
  • We could just validate the strict model on the input yaml and process the ValidationError that might occur. Depending on the type of errors, we create lints or hints. While this would work for sole schema checking, it makes the Pydantic model useless for the future use case of parsing conda-forge.yml with it. This is because we do not get a Pydantic object if ValidationErrors are raised on validation.
  • To ensure that the model is good for parsing too, we have to set extra=allow or extra=ignore.
  • Because we wanted to have additionalProperties: true in the JSON schema, this means we need to dynamically modify the model to generate it. This is what I proposed above.

Hopefully, this clarified a bit? :)

@h-vetinari
Copy link
Member

Hopefully, this clarified a bit? :)

Thank you for the elaboration, though it's the second time where you build a longer argument and I already stumble on one of the first assumptions/statements.

This makes the JSON schema useless for generating schema lints (and hints) because jsonschema cannot ignore the additionalProperties JSON schema definitions.

How does it become useless? The fact that it cannot seems irrelevant to me because it shouldn't ignore additionalProperties / extra fields (and we turn them into lints or hints).

@ytausch
Copy link
Contributor Author

ytausch commented Nov 22, 2024

Thank you for the elaboration, though it's the second time where you build a longer argument and I already stumble on one of the first assumptions/statements.

Good that we are discussing this here then!

How does it become useless? The fact that it cannot seems irrelevant to me because it shouldn't ignore additionalProperties / extra fields (and we turn them into lints or hints).

Counter-question: How do you want to differentiate between schema violations that turn into a lint vs. violations that turn into a hint?

@h-vetinari
Copy link
Member

h-vetinari commented Nov 22, 2024

How do you want to differentiate between schema violations that turn into a lint vs. violations that turn into a hint?

Counter-counter-question: Is the extra complexity that's necessary to make that distinction actually worth it, or should we just lint on any violation, including otherwise benign extra fields?

I haven't been involved in this discussion as long as you and Isuru, but so far my impression is that a simpler approach would be clearly preferable, and so what if wrong fields get lints instead if hints? 🤷‍♂️

@ytausch
Copy link
Contributor Author

ytausch commented Nov 22, 2024

How do you want to differentiate between schema violations that turn into a lint vs. violations that turn into a hint?

Counter-counter-question: Is the extra complexity that's necessary to make that distinction actually worth it, or should we just lint on any violation, including otherwise benign extra fields?

I haven't been involved in this discussion as long as you and Isuru, but so far my impression is that a simpler approach would be clearly preferable, and so what if wrong fields get lints instead if hints? 🤷‍♂️

I'm totally in for classifying extra fields as lints instead of hints. This is also what I initially proposed a few months ago in #1865 :D
It has to do with Isuruf wanting to add new platforms to conda-forge "without touching conda-smithy", see the discussion in #1865.

@h-vetinari
Copy link
Member

It has to do with Isuru wanting to add new platforms to conda-forge "without touching conda-smithy".

I understand the sentiment, but given all the overengineered complexity that flows from it, this requirement must be put in question.

New platforms get added extremely rarely, while we can do smithy releases as often as we want, so it's really not an unreasonable ask to make a smithy update for rolling out a new platform.

@ytausch
Copy link
Contributor Author

ytausch commented Nov 22, 2024

I absolutely agree.

@h-vetinari
Copy link
Member

New platforms get added extremely rarely, while we can do smithy releases as often as we want, so it's really not an unreasonable ask to make a smithy update for rolling out a new platform.

Even more than that: all the platform-related values get populated with

>>> from conda.base.constants import KNOWN_SUBDIRS
>>> [subdir.replace("-", "_") for subdir in KNOWN_SUBDIRS if "-" in subdir]
['emscripten_wasm32', 'wasi_wasm32', 'freebsd_64', 'linux_32', 'linux_64', 'linux_aarch64', 'linux_armv6l', 'linux_armv7l', 'linux_ppc64', 'linux_ppc64le', 'linux_riscv64', 'linux_s390x', 'osx_64', 'osx_arm64', 'win_32', 'win_64', 'win_arm64', 'zos_z']

so all conda-build-supported platforms are already part of the schema information, and won't need further changes to smithy even if we start rolling out a new platform in conda-forge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants