Replies: 3 comments 4 replies
-
It might be something in the query logic. The scraper claims work with a query that updates the status of jobs in the `gmaps_jobs` table from `new` to `queued`. It selects a single job, ordering by priority and creation time, and locks the job row with `FOR UPDATE SKIP LOCKED`, so only one scraper processes each job and other workers skip the locked row instead of blocking on it.
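Roughly this shape, I believe (a sketch of the usual `SKIP LOCKED` claim; the table and column names are a guess at the schema):

```go
package queue

// claimOneJob sketches the single-row claim described above. Table and
// column names (gmaps_jobs, id, status, priority, created_at) are
// assumptions, not necessarily the project's actual schema.
const claimOneJob = `
	UPDATE gmaps_jobs
	SET status = 'queued'
	WHERE id = (
		SELECT id FROM gmaps_jobs
		WHERE status = 'new'
		ORDER BY priority, created_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED -- skip rows another worker has already locked
	)
	RETURNING id`
```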
I would start by allowing workers to fetch multiple jobs at once. Then workers can process multiple jobs in parallel, which reduces the wait time between db queries. You could also consider pre-fetching and assigning the jobs to each node and writing the results back to a central db, but that might be more complicated. Also, the default connection limit in PostgreSQL is 100. I saw it immediately go to 200, so I set it to 300. Like:
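A sketch of that change in `postgresql.conf` (reconstructed; the relevant setting is `max_connections`):

```
# postgresql.conf
max_connections = 300   # PostgreSQL's default is 100
```

Raising the cap is a stopgap, though; batching the fetches, as suggested above, is the more durable fix.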
-
I need to check a bit and I will come back to you. @devdovdav the connection count goes up because each worker (the `-c` option) uses the connection pool to fetch ONE job. It makes sense to have a separate goroutine fetch N jobs in a batch for all workers to share; see the sketch below. Will get back to you once I find and correct the issue.
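Something like this, perhaps (a sketch, not the actual project code; the `gmaps_jobs` schema, the `dispatch` name, and the channel of job IDs are all assumptions):

```go
package queue

import (
	"context"
	"database/sql"
	"time"
)

// dispatch claims up to batchSize jobs per query and feeds them to the
// worker pool over a channel, so workers stop issuing one claim query
// (and holding one pool connection) each.
func dispatch(ctx context.Context, db *sql.DB, jobs chan<- int64, batchSize int) error {
	defer close(jobs)
	const claimBatch = `
		UPDATE gmaps_jobs
		SET status = 'queued'
		WHERE id IN (
			SELECT id FROM gmaps_jobs
			WHERE status = 'new'
			ORDER BY priority, created_at
			LIMIT $1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id`
	for ctx.Err() == nil {
		rows, err := db.QueryContext(ctx, claimBatch, batchSize)
		if err != nil {
			return err
		}
		claimed := 0
		for rows.Next() {
			var id int64
			if err := rows.Scan(&id); err != nil {
				rows.Close()
				return err
			}
			jobs <- id // hand the job to an idle worker
			claimed++
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}
		if claimed == 0 {
			time.Sleep(time.Second) // queue empty; back off before polling again
		}
	}
	return ctx.Err()
}
```

Workers then just `range` over the `jobs` channel, so each batch costs one query and one connection instead of one per worker.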
-
@Wentil @devdovdav in the latest version I have made the following changes: (1) we now expect 1 db connection per worker, but the pool is capped at 10 connections overall; see the sketch below. (2) @Wentil if you try it out, can you let me know how it works? Thank you.
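In `database/sql` terms the cap presumably looks something like this (a sketch; the driver and the function name are assumptions, and the project may configure its pool differently):

```go
package queue

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; the project may use a different one
)

// openPool opens a connection pool capped at 10, matching the limit
// described above.
func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(10) // hard cap: at most 10 open connections in total
	db.SetMaxIdleConns(10) // keep them warm instead of reconnecting
	return db, nil
}
```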
-
I set up a central Postgres server and ten (10) scraping servers, each of which connects to the Postgres server and rotates its outgoing requests randomly through that server's assigned /26 CIDR pool of public IPs (ten such IP pools in total). While the IP side of it works well enough, it seems like only one scraping server at a time is being serviced by the Postgres server on the internal network when it hands out new assignments, and the other scraping servers have to queue up and wait for their next tasks -- which brings the jobs/minute below 2 in some cases... pretty abysmal, I'm sure you'll agree.
Is there a way to configure the central Postgres server to better handle I/O from/to the ten scraping servers, so each one can hit 100+ scrapes per minute?