Redirect URLs #845

jkumz · 2024-12-28T13:24:13Z

Hi,

Is there a way to stop the crawler from crawling another domain when a URL redirects to it?

Right now, if the crawler hits a URL that redirects to a different domain, it goes on to crawl that domain as well, even though I pass a strategy to `enqueue_links`.

For example, www.somelink.com/github redirects to their GitHub profile. The crawler then enqueues every URL on that page, which leads to endless crawling of GitHub.

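My code:
```python
import asyncio

# Assumed import paths (crawlee for Python around v0.4); newer releases may differ.
from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    urls_found = []

    # Define a request handler and attach it to the crawler using the decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Record every URL the crawler visits.
        url = context.request.url
        context.log.info(f"On URL: {url}")
        urls_found.append(url)

        # Enqueue the links found on the page using the same-origin strategy.
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)

    await crawler.run(["some url here"])

    print(f"Found {len(urls_found)} URLs")


if __name__ == "__main__":
    asyncio.run(main())
```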

Thanks in advance

Answered by jkumz

jkumz · 2024-12-28T15:57:22Z

As a workaround, I put the `enqueue_links` call inside a conditional block so that links are only enqueued when the page's domain matches the original one.
The crawler still follows redirects to URLs outside the original domain, but it doesn't crawl any further URLs under those domains, which is sufficient for me.
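
The thread doesn't include the conditional itself, so here is a minimal sketch of what such a check might look like. It is not the author's actual code: the helper name `is_same_host` and the example URLs (including `someuser`) are hypothetical, and it compares hostnames only, using the standard library.

```python
from urllib.parse import urlsplit


def is_same_host(start_url: str, loaded_url: str) -> bool:
    """Return True when both URLs have the same hostname.

    Hypothetical helper for illustration. Both URLs must include a
    scheme (e.g. https://), otherwise no hostname can be parsed.
    """
    start_host = urlsplit(start_url).hostname
    loaded_host = urlsplit(loaded_url).hostname
    # urlsplit lowercases hostnames, so the comparison is case-insensitive.
    return start_host is not None and start_host == loaded_host


# A page that stayed on the original site: links may be enqueued.
print(is_same_host("https://www.somelink.com/", "https://www.somelink.com/about"))

# A page that was redirected to GitHub: skip enqueueing.
print(is_same_host("https://www.somelink.com/github", "https://github.com/someuser"))
```

Inside the request handler, the `enqueue_links` call would then be wrapped in something like `if is_same_host(start_url, context.page.url):`, where `start_url` is an assumed variable holding the URL passed to `crawler.run` and `context.page.url` is the URL Playwright ended up on after any redirects. Note that an exact hostname match treats subdomains such as `blog.somelink.com` as a different site; loosen the comparison if that is not what you want.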
