Add public doc for scheduler #8825
Merged: kolchfa-aws merged 14 commits into opensearch-project:main from noCharger:scheduled-query-acceleration on Dec 19, 2024
Changes from 2 commits
Commits (14):
fdb032c Add public doc for scheduler (noCharger)
46f6bdc Merge branch 'main' into scheduled-query-acceleration (kolchfa-aws)
5a61ff5 Doc review (kolchfa-aws)
ab809d6 Add time units (kolchfa-aws)
cfb56f9 Clarify checkpoint location (kolchfa-aws)
d5d8758 Add description (kolchfa-aws)
e84acb4 Added more links and command (kolchfa-aws)
08951cb Convert settings back to list (kolchfa-aws)
e9e79c8 More links (kolchfa-aws)
3ca1ea5 Formatting fix (kolchfa-aws)
8f4c8c5 Review comments (kolchfa-aws)
214b7da Tech and editorial review (kolchfa-aws)
3783d96 One more comments (kolchfa-aws)
f7b401d More editorial comments (kolchfa-aws)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,228 @@ | ||
--- | ||
layout: default | ||
title: Optimizing direct queries from OpenSearch Service to Amazon S3 with Scheduled Query Acceleration | ||
parent: Data sources | ||
nav_order: 18 | ||
has_children: true | ||
--- | ||
|
||
# Optimizing direct queries from OpenSearch Service to Amazon S3 with Scheduled Query Acceleration | ||
Introduced 2.17 | ||
{: .label .label-purple } | ||
|
||
## Overview | ||
|
||
Scheduled Query Acceleration (SQA) automates the management and refresh of indexes, views, and data, addressing the challenges commonly encountered with those tasks. This guide explains how SQA can improve workflows, reduce costs, and make refresh operations more transparent. | ||
|
||
## Key Benefits | ||
|
||
### Cost Reduction Through Optimized Resource Usage | ||
|
||
By optimizing driver node utilization, SQA reduces the cost of maintaining auto-refresh for indexes and views. Because fewer continuous driver operations are needed, resources are used more efficiently and the impact on your budget is lower. | ||
|
||
For example, a comparison of the total cost of an internal scheduler and an external scheduler at different refresh intervals (such as 5, 15, and 30 minutes) shows a cost advantage for SQA. | ||
|
||
### Improved Observability of Refresh Operations | ||
|
||
SQA makes index states and refresh operations more transparent by showing when refreshes occur. This observability makes it easier to understand the current state of your data and reduces uncertainty about the "refreshing" status, so you can base decisions on the actual system state. | ||
|
||
### Better Control Over Refresh Scheduling | ||
|
||
With SQA, you have more flexible scheduling options. You can set refresh intervals based on your specific requirements, ensuring efficient data management without overburdening driver nodes. This means greater control over resource usage and refresh frequency that better aligns with your needs. | ||
|
||
### Simplified Index Management | ||
|
||
SQA also simplifies index management by allowing you to adjust index settings without multiple, complicated queries. You can easily update refresh intervals or other parameters, thereby streamlining your workflow and reducing manual effort. | ||
|
||
## Getting Started | ||
|
||
### Prerequisites | ||
|
||
- **OpenSearch Version Requirements**: OpenSearch 2.17 or later. | ||
- **Required Plugins and Dependencies**: SQL plugin. | ||
- **System Requirements**: EMR Serverless, Amazon S3. | ||
|
||
### Recommended Reading | ||
|
||
- [Materialized Views / Cached Views / Scheduled Indexing (MV / CV / SI)](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/) | ||
- [Scheduled Refresh vs. Continuous Streaming](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#flint-index-refresh) | ||
- [Index State Management](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#flint-index-refresh) | ||
|
||
## Configuration | ||
|
||
### Cluster Settings | ||
|
||
To enable and configure SQA for efficient query management and cost-effective resource utilization, adjust the following two cluster settings: | ||
|
||
#### Enable Async Query Execution | ||
|
||
This setting enables asynchronous query execution, which handles queries in a non-blocking manner. With async query execution enabled, tasks run in the background without tying up system resources, which improves query efficiency and resource allocation. | ||
|
||
[Setting Link](https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#pluginsqueryexecutionengineasync_queryenabled) | ||
|
||
#### Configure Async Query External Scheduler Interval | ||
|
||
This setting specifies the interval at which the external scheduler checks for tasks, allowing users to customize how often the scheduler initiates refresh operations. Adjusting this interval based on workload requirements can optimize resource usage and control the frequency of driver node operations, helping to manage costs effectively. | ||
|
||
[Setting Link](https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#pluginsqueryexecutionengineasync_queryexternal_schedulerinterval) | ||
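The two settings above could be applied together through the cluster settings API. The following is a minimal Python sketch that only builds and prints the request body; it does not contact a cluster. The setting names are inferred from the anchors of the linked settings page, and the `10 minutes` value is purely illustrative, so treat both as assumptions and confirm them against the settings reference for your version.

```python
import json

# Setting names inferred from the anchors of the linked settings page (assumption).
ASYNC_QUERY_ENABLED = "plugins.query.executionengine.async_query.enabled"
EXTERNAL_SCHEDULER_INTERVAL = (
    "plugins.query.executionengine.async_query.external_scheduler.interval"
)


def build_cluster_settings(interval: str, persistent: bool = False) -> dict:
    """Build a body for PUT _cluster/settings covering both SQA settings."""
    # "transient" settings are lost on a full cluster restart; "persistent" survive it.
    scope = "persistent" if persistent else "transient"
    return {
        scope: {
            ASYNC_QUERY_ENABLED: True,
            EXTERNAL_SCHEDULER_INTERVAL: interval,
        }
    }


# Illustrative interval value, not a recommendation.
body = build_cluster_settings("10 minutes")
print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

The printed body can be pasted into a console request or sent with any HTTP client.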
|
||
### Spark Configurations | ||
|
||
- `spark.flint.job.externalScheduler.enabled`: Default is `false`. Enables the external scheduler for Flint auto-refresh so that refresh jobs are scheduled outside of Spark. | ||
Review comment: In SQL plugin 2.17, this value is passed as `true` by default.
Review reply: So is the default for
|
||
- `spark.flint.job.externalScheduler.interval`: Default is `5 minutes`. A string specifying the interval at which the external scheduler triggers an index refresh. | ||
|
||
### Sample Accelerated Query | ||
|
||
```sql | ||
CREATE SKIPPING INDEX example_index | ||
WITH ( | ||
auto_refresh = true, | ||
refresh_interval = '15 Minutes', | ||
scheduler_mode = 'external' | ||
); | ||
``` | ||
|
||
### Index Options | ||
|
||
When creating indexes, users can specify the following options in the `WITH` clause to control refresh behavior, scheduling, and timing: | ||
|
||
- **`auto_refresh`** | ||
- **Description**: Enables or disables automatic refresh for the index. When set to `true`, the index refreshes automatically at the specified interval; if `false`, the user must manually trigger a refresh using the `REFRESH` statement. | ||
- **Default**: `false` | ||
|
||
- **`refresh_interval`** | ||
- **Description**: Defines the time interval between refresh operations for the index. This setting is only applicable when `auto_refresh` is enabled and can be specified in formats such as `1 minute` or `10 seconds`. It controls how frequently new data is integrated into the index. | ||
- **Note**: Check `org.apache.spark.unsafe.types.CalendarInterval` for valid time duration identifiers. | ||
|
||
- **`scheduler_mode`** | ||
- **Description**: Specifies the scheduling mode for the auto-refresh feature, allowing users to choose between internal or external scheduling. The external scheduler requires a `checkpoint_location` for managing state. | ||
- **Valid Values**: `internal`, `external` | ||
|
||
For more comprehensive information, including additional settings, please refer to the [Flint Index Refresh Documentation](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#flint-index-refresh). | ||
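The rules in the option list above can be checked before a statement is submitted. The following Python sketch is one way to do that, not part of Flint or the SQL plugin: `build_with_clause` is a hypothetical helper that renders only the `WITH (...)` clause, and the S3 checkpoint path is a made-up placeholder. It follows the statement above that the external scheduler requires a `checkpoint_location`.

```python
VALID_SCHEDULER_MODES = {"internal", "external"}


def build_with_clause(options: dict) -> str:
    """Render a WITH (...) clause after checking the documented option rules."""
    auto_refresh = options.get("auto_refresh", False)  # documented default is false
    mode = options.get("scheduler_mode")

    if "refresh_interval" in options and not auto_refresh:
        raise ValueError("refresh_interval only applies when auto_refresh is true")
    if mode is not None and mode not in VALID_SCHEDULER_MODES:
        raise ValueError(f"scheduler_mode must be one of {sorted(VALID_SCHEDULER_MODES)}")
    if mode == "external" and "checkpoint_location" not in options:
        raise ValueError("external scheduler_mode requires checkpoint_location")

    def render(value):
        # Booleans are written bare (true/false); other values are single-quoted.
        if isinstance(value, bool):
            return "true" if value else "false"
        return "'" + str(value) + "'"

    body = ",\n".join(f"  {key} = {render(value)}" for key, value in options.items())
    return f"WITH (\n{body}\n)"


print(build_with_clause({
    "auto_refresh": True,
    "refresh_interval": "15 Minutes",
    "scheduler_mode": "external",
    "checkpoint_location": "s3://example-bucket/checkpoints/",  # hypothetical path
}))
```

The helper does not validate the interval string itself; valid duration identifiers are defined by Spark's `CalendarInterval`, as noted above.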
|
||
## Usage Guide | ||
|
||
### Creating Scheduled Refresh Jobs | ||
|
||
```sql | ||
-- Example command for creating an index with scheduled refresh | ||
CREATE SKIPPING INDEX example_index | ||
WITH ( | ||
auto_refresh = true, | ||
refresh_interval = '15 Minutes', | ||
scheduler_mode = 'external' | ||
); | ||
``` | ||
|
||
### Modifying Refresh Settings | ||
|
||
```sql | ||
-- Example of ALTER command to modify refresh settings | ||
ALTER INDEX example_index | ||
WITH (refresh_interval = '30 Minutes'); | ||
``` | ||
|
||
### Monitoring Index Status | ||
|
||
- **Common Monitoring Queries** | ||
```sql | ||
SHOW FLINT INDEXES IN spark_catalog.default; | ||
``` | ||
|
||
### Managing Scheduled Jobs | ||
|
||
- **Enabling/Disabling Jobs** | ||
- Altering an index from automatic to manual refresh disables the external scheduler: | ||
```sql | ||
ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (auto_refresh = false); | ||
``` | ||
- Altering an index from manual to automatic refresh enables the external scheduler: | ||
```sql | ||
ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (auto_refresh = true); | ||
``` | ||
|
||
- **Updating Schedules** | ||
```sql | ||
ALTER INDEX example_index | ||
WITH (refresh_interval = '30 Minutes'); | ||
``` | ||
|
||
- **Updating Scheduler Mode** | ||
```sql | ||
ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (scheduler_mode = 'internal'); | ||
``` | ||
|
||
### How to Check Scheduler Job Status | ||
|
||
- Use the following command to check scheduler job status: | ||
``` | ||
GET /.async-query-scheduler/_search | ||
``` | ||
|
||
## Best Practices | ||
|
||
### Performance Optimization | ||
|
||
#### Recommended Refresh Intervals | ||
|
||
- Choosing the right refresh interval is crucial for balancing resource usage and system performance. Consider your workload requirements and the freshness of data you need when setting intervals. | ||
|
||
#### Concurrent Job Limits | ||
|
||
- Limit the number of concurrent jobs running to avoid overloading system resources. Monitor system capacity and adjust job limits accordingly to ensure optimal performance. | ||
|
||
#### Resource Consideration | ||
|
||
- Efficient resource allocation is key to maximizing performance. Properly allocate memory, CPU, and I/O based on the workload and the type of queries being run. | ||
|
||
### Cost Management | ||
|
||
#### External Scheduler Cost Analysis | ||
|
||
- The use of an external scheduler can provide cost benefits by offloading refresh operations, reducing the demand on core driver nodes. | ||
|
||
#### Understanding Billing Impacts | ||
|
||
- It's important to understand the costs associated with refresh operations, particularly with varying refresh intervals. Longer intervals can mean reduced costs but may impact data freshness. | ||
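To make the interval trade-off concrete, the short Python sketch below counts how many refreshes a fixed interval triggers per day. It deliberately models no pricing: the cost of each refresh depends on your workload and is not something this document specifies, so the numbers show only how refresh frequency scales with the interval.

```python
MINUTES_PER_DAY = 24 * 60


def refreshes_per_day(interval_minutes: int) -> int:
    """Number of scheduled refreshes triggered in 24 hours at a fixed interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY // interval_minutes


for interval in (5, 15, 30):
    print(f"{interval:>2}-minute interval -> {refreshes_per_day(interval)} refreshes/day")
```

Moving from a 5-minute to a 30-minute interval reduces the count from 288 to 48 refreshes per day, at the price of data that can be up to 30 minutes stale.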
|
||
#### Optimizing Refresh Schedules | ||
|
||
- Adjust refresh intervals based on workload patterns to reduce unnecessary refresh operations, which can lead to significant cost savings. | ||
|
||
#### Cost Monitoring Tips | ||
|
||
- Regularly monitor the costs related to scheduled queries and refresh operations. Using observability tools can help you gain insights into resource usage and costs over time. | ||
|
||
## FAQ | ||
|
||
### General Questions | ||
|
||
#### Common Questions About Functionality | ||
|
||
- **How does SQA handle refresh operations?** | ||
- SQA manages refresh operations using either an internal or external scheduler, which you can configure based on your needs. | ||
|
||
#### Performance-Related Questions | ||
|
||
- **What factors impact the performance of scheduled refreshes?** | ||
- Factors such as the refresh interval, system resource availability, and number of concurrent jobs can all impact performance. | ||
|
||
#### Configuration Questions | ||
|
||
- **Can I change the refresh interval after creating an index?** | ||
- Yes, you can modify the refresh interval using the `ALTER INDEX` command. | ||
|
||
### Technical FAQ | ||
|
||
#### Detailed Technical Questions | ||
|
||
- For in-depth technical questions, refer to the following [RFC for OpenSearch-Spark](https://github.com/opensearch-project/opensearch-spark/issues/416). | ||
|
||
#### Troubleshooting Scenarios | ||
|
||
- **Refresh Not Triggering as Expected**: Ensure the `auto_refresh` setting is enabled and the refresh interval is properly configured. | ||
|
||
#### Validations | ||
|
||
- **Adding Validations**: You can validate your settings by running test queries and verifying the scheduler configurations. | ||
|
Review comment: The doc should also cover the prerequisites:
Step 1: Set up a data source (see https://opensearch.org/docs/latest/dashboards/management/S3-data-source/). After that, a Dashboards user can run queries through the workbench (see https://opensearch.org/docs/latest/dashboards/management/query-data-source/).
Step 2: The user can accelerate queries using a secondary index (see https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/).
During this process, the user can change the configuration to try out everything described in this doc.