-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Google Maps filter #473
base: main
Are you sure you want to change the base?
Google Maps filter #473
Conversation
Make GMapsDurationProcessor export unformatted data as well, to enable later steps in the pipeline (such as filters) to access and process this data.
You can't re-open a NamedTemporaryFile on Windows while it's still open.
This PR involves some refactoring. The problem is that previously, filters ran before making any calls to external APIs. Therefore, just adding another filter for the distance doesn't actually work: the distance information is not yet available when we apply the filters. We can't just run the filters later, because then we would run the Google Maps API calls before we filtered out any properties, meaning that we would incur Google Maps API calls for all properties that we find in our search, including those that we later filter out anyway based on price, size, etc. - and we actually have to pay for those requests! My solution is to group the filters in two chains, and then run one chain before and one after external API calls have been made. This way, we can run the distance filter after the API calls are made, but keep everything else the same.
I swear it was green just now. :( brb |
Amazing! I love the refactor and the introduction of pre- and post- filters. And it's great that someone is using the Google Maps Distance feature - I was tempted to remove that a while back because I don't know if it was active. The linter and type-checker seem pretty unhappy, but assuming you can resolve that I'm happy to merge this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to merge this, pending passing tests, but if you want to look at the stylistic things, that would be most welcome.
@@ -354,6 +349,23 @@ def max_price_per_square(self): | |||
"""Return the configured maximum price per square meter""" | |||
return self._get_filter_config("max_price_per_square") | |||
|
|||
def max_distance(self) -> List[DistanceConfig] | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be more natural here just to return the empty list of there's nothing configured. Makes the typing judgement simpler.
self._append_filter_if_not_empty(MaxRoomsFilter, config.max_rooms()) | ||
self._append_filter_if_not_empty( | ||
PPSFilter, config.max_price_per_square()) | ||
if filter_chain == FilterChainName.preprocess: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit "meh" on if-else chains. Do you think we could do this with Enum and Match?
https://realpython.com/python-enum/#using-enumerations-in-if-and-match-statements
I agree and am happy to do it that way, but be advised that this is inconsistent with some things already in the code: in particular, max_rooms() returns None if the filter is not configured, and so do most of the other "trivial" filters -- whereas excluded_titles(), for example, returns an empty list if it's unconfigured. I chose to go with your recommendation, but somebody should probably clean this up.
We could, but the tests still run for Python 3.8 and Python 3.9, neither of which support the I have the tests almost green now. Most of them were broken on Windows anyway, so I had quite some cleaning up to do around them. By the way, the test case for ebay Kleinanzeigen doesn't use a mock for What I did fix is a lot of code in the tests that looked like this:
where the assertion obviously does nothing. |
I also just plain broke some of the tests, because the Edit: I also just realize that this has certain implications for multi-user instances. It does mean that if at the time an expose is crawled it passes no filters, it will be marked as processed, but will not be saved. This is probably fine for a single-user instance, but in a multi-user instance, if an expose gets crawled, and then later a user registers with filter settings that would include that expose, they won't see it (because it hasn't been saved, and it will not crawled again). I am uncertain what the correct solution is. Edit 2: The correct solution is somewhat more complicated - we would have to differentiate between exposes that have been saved with extra data (like GMaps data), and exposes that have been saved without that extra data. Then later, if a filter happens to include an expose that has been saved without extra data, we need to run the processor (like the GMaps distance calculation), and then filter again in the postprocessing chain. This might be possible to fit in the current "pipe" model somewhere, but it requires some thought. (Just braindumping here.) |
Also, I introduced another real bug: because I used I chose to go for The correct solution for when one wants to have type-safe data being passed around and serialized is obviously to use a proper serialization library, like I am not sure if you are alright with me introducing type-safe objects at all, given those constraints. If you would rather have me give up on type safety and just use stringly-typed |
Thanks for the feedback and the comments - looks like you're pretty deep into the weeds on this one, and I appreciate you looking into the feedback so deeply. I think I will need to take some more time to also look at this in more detail, and I won't get that done today (or probably this week). For Python version, I would be happy to move to supporting 3.11 and 3.10 - 3.8 and 3.9 have been out of active support for a while now. I don't think it's a big deal for new users of the code to use the current version of python: Debian stable has 3.11 and Ubuntu 22.04 LTS has 3.10. From my side, it's up to you how much effort you put into type safety on dicts. I'm happy if most of the functions have a meaningful type restriction, and more happy the better the constraint is, but already a generic |
I've updated |
Hi again @colinemonds , I've prepared another PR based on your code - #477 - that passes the linter and type-checker. I also made most of the tests pass, but the web interface hunting tests are still failing, probably because of the issue you described. If you have time to check it out, or you want to copy those changes over to your branch, please do. Otherwise I will look again some time in the coming weeks at how to make the remaining tests pass. |
Oh, thanks @codders ! Sorry for not getting back to you earlier, it seems we have done some duplicate work now; I already created a commit that likewise passes the type checks, the linter, and allmost all the tests, but I didn't push it yet. The reason is that, like you, I can't make the web hunter tests pass. I think the web hunter needs to be thoroughly refactored for this to work. It's quite obvious from the way that a lot of the storage mechanism is set up that the database was never really designed to support a multi-user instance. (The class name "IDMaintainer" is a tell on its own - it was clearly designed to remember which expose IDs had already been scraped, and to do nothing else. It does a lot more now.) A possible hack might be to give up on scraping the exposes only once, and instead try and scrape them only once per user. The advantage would be that later steps in the pipeline (like the Google Maps distance calculation) would run per user, and hence also run with the configuration for that user (which is obviously important in the Google Maps case, as each user might have different locations set in their configuration). I have a half-baked implementation of this. The downside is that on large instances, there is a lot more scraping to be done. In the case of immobilienscout24 and other sites with strong anti-botting protection, that's a real pain. A proper solution would entail splitting not only the filters in two steps, but also the processors: we would need one dedicated scraping chain that has to run only once per expose (e.g. to do the fetching, anti-anti-botting, and the actual magic on the HTML), and a dedicated processor chain that has to run once per user (currently, for the Google Maps stuff, but potentially also for other things in the future). Then we could store the interim results after the scraping chain in the database (same as we currently do), and for each user who has configured pre-processor filters such that the stored expose is interesting to them purely based on that, run the processor chain on that expose for them and store the result of that in a new table indexed by a composite key (expose_id, user_id), and then run the post-processor filters on that. This will ensure that minimal work is done. Unfortunately, the hunter implementations aren't very DRY, so this is going to take a little work just making changes all over the place. It's still probably the best thing to do. |
Hi @colinemonds , Thanks for the update. I'm supportive of larger refactors, but I don't have the capacity myself to do that right now. So for me the question whether to hack it or not is a question of whether you have the enthusiasm to do a larger refactor of the functionality. I would prefer to have the functionality integrated with a bit of a hack than to lose it on a feature branch that only you have. What would be your preference? |
The problem is that I can't say what will break if I actually finish this hack. I can happily push the commit later (tomorrow, probably) for you to look at, then at least it won't get lost on a feature branch that only I have. But still. I think the refactor is needed before my work can get properly pulled. The code base has enough technical debt as it is, and this might also introduce bugs on top of that. And I think that might matter a lot for the (your?) public web-hosted instance of flathunter, where you'll just end up with more support tickets. Might as well put in the work now and not have to worry later. I can't promise when I'll actually have the time, though, but I might be able to fit this in on the weekend. |
Yeah. Scraping once per user isn't an option - Kleinanzeigen has IP blocking for bots that scrape too fast and Immoscout as you say will probably get annoyed. I don't know the exact user numbers for my instance because I don't keep any analytics, but I think it's in the hundreds of active users. So at least my use case for WebHunter requires that we only scrape once. I might try and take a look at this in the coming weeks. Will have to see how my workload is. |
feature #472: filter based on GMaps distance
This PR involves some refactoring. The problem is that previously,
filters ran before making any calls to external APIs. Therefore, just
adding another filter for the distance doesn't actually work: the
distance information is not yet available when we apply the filters. We
can't just run the filters later, because then we would run the Google
Maps API calls before we filtered out any properties, meaning that we
would incur Google Maps API calls for all properties that we find in our
search, including those that we later filter out anyway based on price,
size, etc. - and we actually have to pay for those requests! My solution
is to group the filters in two chains, and then run one chain before and
one after external API calls have been made. This way, we can run the
distance filter after the API calls are made, but keep everything else
the same.