Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context manager functionality using hooks #121

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

vnmabus
Copy link

@vnmabus vnmabus commented Jul 3, 2024

This is an alternative (simpler and safer) implementation of #70 using a meta hook. By using this approach, typing should work without the need of repeating code or using typing stubs (it would allow to remove most of the code in this package, unless there is some problem that I did not notice).

The syntax used to define the lazy imports would be the same as #70:

import lazy_loader as lazy

with lazy.lazy_imports():
    from .some_func import some_func
    from . import some_mod, nested_pkg

This is currently a draft PR because I wanted to receive feedback and do more tests. It should work with any kind of import.

The code should be quite easy to understand and extend if necessary.

One think to take into account: I did NOT implement error_on_import=False with this syntax. I think that it would complicate the implementation and I see no point in delaying to report the ImportError to the user.

@tlambert03, @stefanv what are your thoughts on this?

vnmabus added 2 commits July 3, 2024 09:07
A different (easier/safer) approach to lazy importing using a meta
import hook.
@stefanv
Copy link
Member

stefanv commented Jul 3, 2024

Wow, this almost looks too good to be true 👀 What's the catch!

Is https://github.com/scientific-python/lazy-loader/pull/121/files#diff-0f92aed2087ad50adf43e527efcd07237adacc5031b7296ae1ece2693462986bR320 a recommended practice for taking over imports?

@tlambert03
Copy link
Contributor

certainly does look elegant! I don't have much experience with the practical details of using importlib.util.LazyLoader, so I can't provide much experience here. The official docs do make it sound like A) it might indeed be a good choice for very large libraries that are trying to avoid up-front greedy import times... but also that B) it's not without caveats. From the docs:

This class only works with loaders that define exec_module() as control over what module type is used for the module is required. For those same reasons, the loader’s create_module() method must return None or a type for which its __class__ attribute can be mutated along with not using slots. Finally, modules which substitute the object placed into sys.modules will not work as there is no way to properly replace the module references throughout the interpreter safely; ValueError is raised if such a substitution is detected.

Note

Note For projects where startup time is critical, this class allows for potentially minimizing the cost of loading a module if it is never used. For projects where startup time is not essential then use of this class is heavily discouraged due to error messages created during loading being postponed and thus occurring out of context.

Certainly, none of that is a deal-breaker (and indeed, we're already deep down in the magic here simply by trying to support lazy imports in a type safe way), but perhaps those warnings should be passed along in the readme here along with docs on how to use this feature

@vnmabus
Copy link
Author

vnmabus commented Jul 4, 2024

Is https://github.com/scientific-python/lazy-loader/pull/121/files#diff-0f92aed2087ad50adf43e527efcd07237adacc5031b7296ae1ece2693462986bR320 a recommended practice for taking over imports?

So, adding the finder to sys.meta_path seems to be the way. AFAIK, the finders are invoked in order, so it make sense to put ours at the beginning, in order to be called before the others. Of course, nothing prevents other finder to do the same afterwards and run before ours, but then it should have a good reason to do so.

certainly does look elegant! I don't have much experience with the practical details of using importlib.util.LazyLoader, so I can't provide much experience here. The official docs do make it sound like A) it might indeed be a good choice for very large libraries that are trying to avoid up-front greedy import times... but also that B) it's not without caveats.

I didn't either. I just saw it inside the load function in this same package, and my solution was inspired by that. However, I am not proposing here the one and only way to implement this functionality, but just a way in which lazy imports can be implemented in a safe way and with a nice syntax (and one that tricks type-checkers into thinking that they are normal imports). What I want is some discussion around this functionality to see the best way to proceed forward. Should we use LazyLoader or a different approach? For example, it would be possible, albeit a bit more cumbersome, to record each import in the finder, remove the loaded modules after exiting the context manager, and use a mechanism similar to what attach does, if we feel that approach is better.

@vyasr
Copy link

vyasr commented Jul 13, 2024

@stefanv asked me to take a look at this. Planning to review soon, but messaging in advance so you can ping me if I drop the ball.

Copy link

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this approach seems pretty sound. I've done a decent bit of hacking around importlib, but I haven't used the LazyLoader specifically before so I did a bit of testing. I was mainly concerned about whether extension modules or built-ins might use custom loaders, but built-ins do not and the extension modules I tested (built either directly with the Python C API or using pybind11) seemed to be fine. On the restrictions:

This class only works with loaders that define exec_module() as control over what module type is used for the module is required.

The restriction that finders must use exec_module is basically a statement that this doesn't work with finders written prior to Python 3.4 because exec_module and create_module were the replacements introduced then and by default load_module simply calls those two functions in modern versions of Python. That seems fine.

For those same reasons, the loader’s create_module() method must return None or a type for which its class attribute can be mutated along with not using slots.

Similarly, I don't think any modern loaders will return something with an immutable __class__.

Finally, modules which substitute the object placed into sys.modules will not work as there is no way to properly replace the module references throughout the interpreter safely; ValueError is raised if such a substitution is detected.

I could see this restriction impacting one of the more common use cases of custom loaders, which is to intercept imports of a specific package. In some of those cases it is reasonable to imagine sys.modules being rewritten as a way to alias or mask a package's import. However, I still think that is a special enough case that expecting lazy_imports to work seamlessly with those custom loaders is out of scope.

return spec


sys.meta_path.insert(0, LazyFinder())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most meta path finders will be targeted at specific set of modules. In those cases it is probably fine to insert the finder like this because in general if there are multiple finders/loaders affecting the same set of modules and injecting specialized behavior there is no way to guarantee that they can both do so safely anyway. In this case, however, we need the LazyFinder to respect other finders. If a finder is inserted into the meta path after this finder then it could preempt the lazy find. To address that, instead of inserting the finder here we should insert it inside the __enter__ of the context manager and then remove it in __exit__ with something like

with threading.Lock():
    sys.meta_path = [f for f in sys.meta_path if not isinstance(finder, LazyFinder)]

We will need to lock meta_path modification (both insertion and deletion) for thread safety. Since the lazy_imports context is by construction not reentrant it should be fine to not include any logic to handle the object already being in the list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not answering earlier, as I was on vacation.

I know that it is possible for another hook to be inserted before this one after this is imported. However, that would only happen if it is explicitly inserted at the beginning of the list, which is not very common.

Although ensuring that we are at the first position on __enter__ seems sensible at first glance, in a multithread context this simply cannot be done safely, AFAIK, because hooks are global and not per-thread, so you would be messing with the meta path for other threads too (and other threads could change it too!).

Your code seems broken. Not only it assumes that all threads would synchronize the meta path using the same lock (impracticable), but if I am not mistaken you are creating a new lock every time, so you are not even protecting anything (see pylint-dev/pylint#5208).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem!

Apologies, my example wasn't intended to be taken verbatim. Concretely, I would suggest storing a lock either at module scope or as a class member of the LazyFinder (e.g. upon first import of the lazy_loader itself). Then you would have a single instance that is shared across all threads and all import calls. I am confused as to why you think having all threads synchronizing the meta path via the same lock would be impractical. In general, thread-safe code requires that all threads would use the same lock, otherwise the lock serves no purpose. Perhaps I am misunderstanding what you mean? Are you concerned that imports might be occurring in a performance-sensitive context? My assumption is that even if you have dozens/hundreds/more threads, the import shouldn't be on the critical path for performance.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what I mean is that you cannot control what other threads do. In a multithreading context some threads may be created by other libraries or the application. These could in principle access the list while you are checking/changing it, and they won't use your lock. You can only protect by locking the APIs that you define, not the APIs defined elsewhere, unless you are in complete control of all threads of the process.

Copy link

@vyasr vyasr Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, if any other library includes modification of sys.meta_path beyond simply usage of this context manager in different threads then I agree that we have no control over that. There is nothing that we can do in that case. The question to me is really which case has more risk; allowing the lazy finder to be preempted by different insertion order, or allowing the lazy finder to have possible nondeterministic behavior in a multithreaded scenario where multiple distinct code paths modify sys.meta_path. I was assuming the latter would be far less likely than the former (granting that the former is also quite unlikely) and was largely operating under the assumption that the lazy loader would be the only code modifying sys.meta_path in a multithreaded context. OTOH nondeterministic code is a far greater evil and I wasn't really thinking much about it. I'm fine with leaving the implementation as is and just accepting the limitation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I think we could do mostly safely is to warn if we are not in first position, possibly even listing which other hooks are before ours, so that users are aware of that and do not spend hours to debug it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, worst case with the warning is that we report a false positive if some other libraries is inserted and then removed with bad timing in another thread.

if spec is None:
raise ModuleNotFoundError(f"No module named '{fullname}'")

spec.loader = importlib.util.LazyLoader(spec.loader)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the lazy loader produces a delayed evaluation such that any import errors will show up later than expected, and since the user of the lazy_imports context may not realize what is happening under the hood, it might make sense to subclass the LazyLoader and do something like

class MyLazyLoader(LazyLoader):
    def exec_module(self, module):
        try:
            super().exec_module(module)
        except BaseException as e:
            raise ImportError(f"Lazy load of module {module} failed") from e

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I in principle agree with improving the user reporting of errors, so this sounds fine to me.

@stefanv
Copy link
Member

stefanv commented Oct 30, 2024

After @vyasr's review (thank you!) that this approach is viable, so should we try to get this out there for people to try?

@DavidCEllis
Copy link

DavidCEllis commented Nov 5, 2024

Hi, @vnmabus mentioned this on a python discuss thread about deferred (originally async) imports so I thought I'd comment here as I ran into some issues trying it out.

It seems like you need to import threading somewhere before the context manager is used. If threading isn't already imported, importlib.util.LazyLoader will try to import it within its own exec_module method. As you're in the with block, inside_context_manager is True and so LazyFinder tries to lazily load threading and fails.

Code:

import lazy_loader as lazy

with lazy.lazy_imports():
    import numpy

Error:

Traceback (most recent call last):
  File "C:\Users\ducks\Source\scratch\lazy_demo.py", line 4, in <module>
    import numpy
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib.util>", line 257, in exec_module
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib.util>", line 267, in exec_module
AttributeError: partially initialized module 'threading' has no attribute 'RLock' (most likely due to a circular import)

Code for LazyLoader.exec_module

https://github.com/python/cpython/blob/ffb44cd08c625dc4eda7eea058e1049d1f630491/Lib/importlib/util.py#L257-L274


I also see some examples that use from imports, but as far as I can tell these are not lazily loaded with this method, however any modules imported by the module you are importing from are turned into _LazyModules. I'm not sure this is the desired behaviour.

(This is with the threading import added to avoid the error)

my_module.py

import numpy

print(type(numpy))

def my_function():
    pass

lazy_demo.py

import lazy_loader as lazy

with lazy.lazy_imports():
    import my_module

python -Ximporttime lazy_demo.py

import time: self [us] | cumulative | imported package
...
import time:      1145 |      38170 | lazy_loader
import time:       113 |        113 | my_module

lazy_demo2.py

import lazy_loader as lazy

with lazy.lazy_imports():
    from my_module import my_function

python -Ximporttime lazy_demo2.py

import time: self [us] | cumulative | imported package
...
import time:      1511 |      39843 | lazy_loader
import time:       193 |        193 | my_module
import time:       470 |        470 | numpy
<class 'importlib.util._LazyModule'>

Edit: Noticed the initial error was from a different run than the script from the traceback, must have copied from the wrong run.

@stefanv
Copy link
Member

stefanv commented Nov 5, 2024

@DavidCEllis Thanks very much for taking a look.

More generally speaking, given your depth of experience, if you run into any issues with lazy_loader, or have recommendations on how to make it safer or easier to use, your input would be much appreciated.

@mikeshardmind
Copy link

mikeshardmind commented Nov 5, 2024

There's some other existing prior/parallel art here with the context manager approach. Something that's done in prior/parallel art is declaring which namespaces should be affected.

Taken from the readme of another project working on this problem:

with defer_imports.install_import_hook(module_names=("discord",), recursive=True):
    ...

This avoids your above issue with needing to be careful about importing something (threading) merely for effect before using this context manager.

@vnmabus
Copy link
Author

vnmabus commented Nov 6, 2024

Thanks @DavidCEllis and @mikeshardmind for giving feedback!

Apparently the issue with threading is new, due to a recent commit (python/cpython#117983). For solving that, and the "from" issue maybe we should use our own class, to have more control of what is loaded and when.

In #70 I had another draft implementation that relied on replacing temporarily the __import__ function. However, at the moment that felt to me more brittle than using hooks, and I don't know about other options. However if I am not mistaken that took into account also "from" statements, and it could be possible to do the same here by using our own loader.

To be honest, I am not an expert on importing internals, so I hoped that one of these PRs was inspirational enough to "nerd-snip" someone with more knowledge that can help finish the implementation successfully.

@DavidCEllis
Copy link

I'm not sure you can cleanly handle from imports this way. You might be able to get close but I don't know if you wouldn't run into a great deal of edge cases and headaches.

The _LazyModule objects returned by LazyLoader imports do some __class__ switching as part of their mechanics.

When a module is imported the partially initialised module has its __class__ switched for _LazyModule in LazyLoader.exec_module and then when __getattribute__ is called it does the exec_module call from the original loader and switches the __class__ back.

(Note that even with __getattribute__ overridden you can still call type(module) and see that it's a _LazyModule if you do so before accessing anything.)

By contrast the objects retrieved with a from import can be anything and there's no loading knowledge you can have before the object is accessed to swap out its class, and even if you did the object that existed will still be observable and may cause issues (for example if type(obj) is <type> is used before obj is accessed some other way). PEP 690 indicates there's not really a good way to do this that can be completely transparent without having to modify how module __dict__ works.


The solution I use myself for lazy imports uses roughly the same mechanics as your attach function but covers what I need for the equivalent of attach and load.

Instead of building a function that returns a __getattr__ function for the module I create an instance of a LazyImporter class with a similar __getattr__ method1. This means I can use the attributes of the instance within the module, or export them via getattr(importer, name) to the module __getattr__ (with a helper function).

Does make a noticeable performance improvement in the CLI tools I use it for. Still doesn't play nice with static analysis sadly.

Footnotes

  1. In my case this class doesn't actually do any importing and delegates that to another class which handles the differences in module, from and try/except imports.

@vnmabus
Copy link
Author

vnmabus commented Nov 6, 2024

Wouldn't an hybrid solution be possible, in which the with statement does not import anything (but records the things that should be imported, and fools the type checkers) and then an explicit attach function is called to modify __getattr__?

@DavidCEllis
Copy link

I think if you want to prevent any import machinery from happening you might be better served by replacing __import__. I believe the import hooks are really for if you want to modify how some part of the process happens, rather than preventing it all together.

Trying to match things up does get a bit hairy when you're trying to handle from x import y as z as 'z' isn't captured in the import call. I have half a solution sketched out but I'm not happy with how I'm matching things up and I'm not sure it won't fall apart with some import type I haven't anticipated.

@DavidCEllis
Copy link

Something along these lines might work the way you're suggesting? I'm not sure I like it though.

https://gist.github.com/DavidCEllis/29c5e77262f522c26b9f810fcb031a43

@vnmabus
Copy link
Author

vnmabus commented Nov 6, 2024

What I worried with replacing __import__ (and the reason I tried with hooks and ContextVar) is about what happens with concurrency (either multithreading or coroutines). I do not think it is safe to change __import__ and change it back in the presence of multiple threads (I think coroutines are safe, as they need to be yield explicitly, if I am not mistaken, and you won't do that inside the with block).

@DavidCEllis
Copy link

DavidCEllis commented Nov 6, 2024

Yes, that's one of the reasons I didn't really like it. In the context of multiple threads trying to import simultaneously you'll run into issues. Even with path hooks, something else could insert a path hook and break the process though.

Defining the imports separately either in a stub or in a dict for attach is probably still the safest way, but I'll have another look to see if there's a way to use path hooks.

Edit: I think it takes some of the tools you'd need to do this out of your hands, but I'm not certain yet.

@DavidCEllis
Copy link

DavidCEllis commented Nov 6, 2024

The docs for the import system state (emphasis mine):

If it is acceptable to only alter the behaviour of import statements without affecting other APIs that access the import system, then replacing the builtin __import__() function may be sufficient. This technique may also be employed at the module level to only alter the behaviour of import statements within that module.

However I haven't actually found a way to make it so the import statement is only replaced in one module, maybe I'm missing something or that's not what this sentence is supposed to imply.


I did make a handful of changes to mitigate some of the issues anyway.

Imports that occur in other modules while the replacement __import__ is in place are now passed through to whatever __import__ was before replacement (__name__ must be explicitly provided). I've also put a threading lock around the replacement, so at least this context manager shouldn't be able to replace itself.

If builtins.__import__ is replaced with something else while within the context manager this will now error, I don't see this as significantly different to finding that something has put a meta path hook earlier in sys.meta_path while within a block that used that mechanism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants