Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add indexes to submissions.created_at column #23

Open
tekin opened this issue Jul 31, 2018 · 0 comments
Open

Add indexes to submissions.created_at column #23

tekin opened this issue Jul 31, 2018 · 0 comments

Comments

@tekin
Copy link
Contributor

tekin commented Jul 31, 2018

We are now querying based on this column so we should add an index on it.

edavey added a commit that referenced this issue Dec 4, 2019
In July 2018 it was noted that the `submissions.created_at`
field ought to be indexed as it was now used in a query:

#23

This is `Task#latest_submission`
(f46dd46)
which is used by the 'no_business' endpoint
(`/v1/tasks/#{task.id}/no_business`)

As the query selecting the most recent submission uses:

`has_one :latest_submission, -> { order(created_at: :desc) }`

I've opted for an index which is 'sorted' to suit.

This work was requested in order to improve the problematic
daily export procedure invoked via
`DataWarehouseExport.generate!` which runs out of memory
and times out regularly.

This unindexed field (`submissions.created_at`) isn't
actually involved in any of the 4 "extractions" which run.
We believe that the 2 problem extractions are

- `Export::Invoices::Extract`
- `Export::Contracts::Extract`

and that improvements to these may be brought about by
adding the following indexes:

- `tasks.updated_at`
- `submissions.updated_at`
- `submission_entries.updated_at`

The following PR will add these indexes.

But we actually believe that the main problem is Postgres
memory:

- the index on SubmissionEntry no longer fits in memory: we
  are going to look into increasing the memory allocation in
  Postgres

- the `submission_entries.*` glob in the SELECT statement
  may include more data than is required: we will
  explore whether a sub-set of the SubmissionEntry fields can
  be returned by the DB

Zendesk
: https://dxw.zendesk.com/agent/tickets/10463

Trello
: https://trello.com/c/GeaLyAWy/1162
edavey added a commit that referenced this issue Dec 6, 2019
Background

It was proposed that the very slow daily `DataWarehouseExport.generate!`
could be improved by adding indexes. Back in July 2018 Tekin identified
that `Submissions#created_at` should be indexed
(#23) as it
started to be used in `Task#latest_submission`
(f46dd46).

However `submissions.created_at` is not used in any of the four
'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:

`Export::Tasks::Extract` -> uses `Task#updated_at` (unindexed)
`Export::Submissions::Extract` -> uses `Submission#updated_at` (unindexed)
`Export::Invoices::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)
`Export::Contracts::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)

Accordingly there are some PRs coming through which:

- 559: Add index to `submissions.created_at`
- 560: Add indexes to `tasks.updated_at`, `submissions.updated_at`,
       `submission_entries.updated_at`
- 561: Increase `maintenance_work_mem` setting for Postgres to
       allow these indexes to be added

This commit improves the speed of one of the 4 extractions:
`Export::Submissions::Extract`.

It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:

- it has been recognised that `entry_count` (the count of submission
entries of the type 'invoice') is not needed as users derive this
information using other tooling in the 'data warehouse'

- the `total_management_charge` projection could be removed,
as since
d7db363
this value has been precomputed and stored on ingestion as
`submissions.management_charge_total`. This field was already
being returned as part of the top level selection
`SELECT submissions.*`.

- the `invoice_value` projection (`invoices.total_value`) is
not required as again this information is available in the
'data warehouse'.

- now that `invoice_entry_count` is no longer available it is
now necessary find a new way to determine the `submission_type`
('no_business' or 'file'). It's been agreed that the presence
(or not) of `submission_file_type`  is sufficient to know if
there’s a ‘file’ to be had or whether on the other hand it's
a case of ‘no business’.

As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:

- the expected CSV headers more clearly as a vertical list,
 and

- the values of the example row more closely by ensuring
  that the expected values as well as being present are
  also in the expected column.
edavey added a commit that referenced this issue Dec 9, 2019
In July 2018 it was noted that the `submissions.created_at`
field ought to be indexed as it was now used in a query:

#23

This is `Task#latest_submission`
(f46dd46)
which is used by the 'no_business' endpoint
(`/v1/tasks/#{task.id}/no_business`)

As the query selecting the most recent submission uses:

`has_one :latest_submission, -> { order(created_at: :desc) }`

I've opted for an index which is 'sorted' to suit.

This work was requested in order to improve the problematic
daily export procedure invoked via
`DataWarehouseExport.generate!` which runs out of memory
and times out regularly.

This unindexed field (`submissions.created_at`) isn't
actually involved in any of the 4 "extractions" which run.
We believe that the 2 problem extractions are

- `Export::Invoices::Extract`
- `Export::Contracts::Extract`

and that improvements to these may be brought about by
adding the following indexes:

- `tasks.updated_at`
- `submissions.updated_at`
- `submission_entries.updated_at`

The following PR will add these indexes.

But we actually believe that the main problem is Postgres
memory:

- the index on SubmissionEntry no longer fits in memory: we
  are going to look into increasing the memory allocation in
  Postgres

- the `submission_entries.*` glob in the SELECT statement
  may include more data than is required: we will
  explore whether a sub-set of the SubmissionEntry fields can
  be returned by the DB

Zendesk
: https://dxw.zendesk.com/agent/tickets/10463

Trello
: https://trello.com/c/GeaLyAWy/1162
lozette pushed a commit that referenced this issue Jan 3, 2020
Background

It was proposed that the very slow daily `DataWarehouseExport.generate!`
could be improved by adding indexes. Back in July 2018 Tekin identified
that `Submissions#created_at` should be indexed
(#23) as it
started to be used in `Task#latest_submission`
(f46dd46).

However `submissions.created_at` is not used in any of the four
'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:

`Export::Tasks::Extract` -> uses `Task#updated_at` (unindexed)
`Export::Submissions::Extract` -> uses `Submission#updated_at` (unindexed)
`Export::Invoices::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)
`Export::Contracts::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)

Accordingly there are some PRs coming through which:

- 559: Add index to `submissions.created_at`
- 560: Add indexes to `tasks.updated_at`, `submissions.updated_at`,
       `submission_entries.updated_at`
- 561: Increase `maintenance_work_mem` setting for Postgres to
       allow these indexes to be added

This commit improves the speed of one of the 4 extractions:
`Export::Submissions::Extract`.

It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:

- it has been recognised that `entry_count` (the count of submission
entries of the type 'invoice') is not needed as users derive this
information using other tooling in the 'data warehouse'

- the `total_management_charge` projection could be removed,
as since
d7db363
this value has been precomputed and stored on ingestion as
`submissions.management_charge_total`. This field was already
being returned as part of the top level selection
`SELECT submissions.*`.

- the `invoice_value` projection (`invoices.total_value`) is
not required as again this information is available in the
'data warehouse'.

- now that `invoice_entry_count` is no longer available it is
now necessary find a new way to determine the `submission_type`
('no_business' or 'file'). It's been agreed that the presence
(or not) of `submission_file_type`  is sufficient to know if
there’s a ‘file’ to be had or whether on the other hand it's
a case of ‘no business’.

As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:

- the expected CSV headers more clearly as a vertical list,
 and

- the values of the example row more closely by ensuring
  that the expected values as well as being present are
  also in the expected column.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant