[SPARK-49944][DOCS] Fix broken main.js import and fix image links for streaming documentation

### What changes were proposed in this pull request?

We prepend the `rel_path_to_root` Jekyll variable to all asset paths that require it.
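
For context, the variable expands to a chain of `../` segments that walks from the current page back up to the documentation root. Below is a minimal Liquid sketch of the idea, with a hypothetical depth computation (illustrative only; the actual logic in Spark's layouts may differ):

```liquid
{% comment %}
Hypothetical sketch: compute a "../" prefix from the page's URL depth.
The actual computation in Spark's layouts may differ.
{% endcomment %}
{% assign depth = page.url | split: "/" | size | minus: 2 %}
{% assign rel_path_to_root = "" %}
{% for i in (1..depth) %}
  {% assign rel_path_to_root = rel_path_to_root | append: "../" %}
{% endfor %}

<!-- For /streaming/getting-started.html, depth is 1, so this renders "../js/main.js" -->
<script src="{{ rel_path_to_root }}js/main.js"></script>
```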

### Why are the changes needed?

Currently, our imports of `main.js` and AnchorJS are broken in the Spark 4.0.0-preview2 docs. Also, images aren't appearing on the Structured Streaming doc pages. See the [ASF issue](https://issues.apache.org/jira/browse/SPARK-49944) for more detail.
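
Concretely, the preview docs are served under a versioned prefix, so a root-absolute path escapes the documentation root. For a page such as `docs/4.0.0-preview2/streaming/getting-started.html`, the resolution looks roughly like this (URLs shown for illustration):

```html
<!-- Before: a root-absolute src resolves against the site root and 404s -->
<script src="/js/main.js"></script>
<!-- -> https://spark.apache.org/js/main.js (does not exist) -->

<!-- After: {{ rel_path_to_root }} renders "../" for this page -->
<script src="../js/main.js"></script>
<!-- -> https://spark.apache.org/docs/4.0.0-preview2/js/main.js -->
```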

You can see how the pages are broken [here](https://spark.apache.org/docs/4.0.0-preview2/streaming/getting-started.html); here's a screenshot, for example:

<img width="1168" alt="image" src="https://github.com/user-attachments/assets/d0dbc970-a5aa-445a-ae21-f4e32973f031">

### Does this PR introduce _any_ user-facing change?

The preview documentation will now have correctly rendered code blocks, and images will appear.

### How was this patch tested?

Local testing. Please build the docs site if you would like to verify; it now looks like this:

<img width="1271" alt="image" src="https://github.com/user-attachments/assets/08b69f58-d6f4-41b0-bcb5-1af80782c133">
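
(For reference, a local build can typically be produced from the `docs/` directory with Jekyll, e.g. `SKIP_API=1 bundle exec jekyll serve`, per `docs/README.md`; the exact invocation may vary by environment.)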

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48438 from neilramaswamy/nr/fix-broken-streaming-links-images.

Authored-by: Neil Ramaswamy <neil.ramaswamy@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
neilramaswamy authored and yaooqinn committed Oct 22, 2024
1 parent 2c904e4 commit e7cdb5a
Showing 4 changed files with 13 additions and 12 deletions.
6 changes: 3 additions & 3 deletions docs/_layouts/global.html
@@ -28,7 +28,7 @@
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,700;1,400;1,500;1,700&Courier+Prime:wght@400;700&display=swap" rel="stylesheet">
<link href="{{ rel_path_to_root }}css/custom.css" rel="stylesheet">
-<script src="/js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+<script src="{{ rel_path_to_root}}js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>

<link rel="stylesheet" href="{{ rel_path_to_root }}css/pygments-default.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" />
@@ -198,8 +198,8 @@ <h1 class="title">{{ page.title }}</h1>
crossorigin="anonymous"></script>
<script src="https://code.jquery.com/jquery.js"></script>

-<script src="/js/vendor/anchor.min.js"></script>
-<script src="/js/main.js"></script>
+<script src="{{ rel_path_to_root }}js/vendor/anchor.min.js"></script>
+<script src="{{ rel_path_to_root}}js/main.js"></script>

<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.js"></script>
<script type="text/javascript">
10 changes: 5 additions & 5 deletions docs/streaming/apis-on-dataframes-and-datasets.md
@@ -436,7 +436,7 @@ Imagine our [quick example](./getting-started.html#quick-example) is modified an

The result tables would look something like the following.

-![Window Operations](/img/structured-streaming-window.png)
+![Window Operations](../img/structured-streaming-window.png)

Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py)/[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java).
@@ -512,7 +512,7 @@ naturally in our window-based grouping – Structured Streaming can maintain the
for partial aggregates for a long period of time such that late data can update aggregates of
old windows correctly, as illustrated below.

-![Handling Late Data](/img/structured-streaming-late-data.png)
+![Handling Late Data](../img/structured-streaming-late-data.png)

However, to run this query for days, it's necessary for the system to bound the amount of
intermediate in-memory state it accumulates. This means the system needs to know when an old
@@ -605,7 +605,7 @@ the engine will keep updating counts of a window in the Result Table until the w
than the watermark, which lags behind the current event time in column "timestamp" by 10 minutes.
Here is an illustration.

-![Watermarking in Update Mode](/img/structured-streaming-watermark-update-mode.png)
+![Watermarking in Update Mode](../img/structured-streaming-watermark-update-mode.png)

As shown in the illustration, the maximum event time tracked by the engine is the
*blue dashed line*, and the watermark set as `(max event time - '10 mins')`
@@ -628,7 +628,7 @@ This is illustrated below.
Note that using `withWatermark` on a non-streaming Dataset is no-op. As the watermark should not affect
any batch query in any way, we will ignore it directly.

-![Watermarking in Append Mode](/img/structured-streaming-watermark-append-mode.png)
+![Watermarking in Append Mode](../img/structured-streaming-watermark-append-mode.png)

Similar to the Update Mode earlier, the engine maintains intermediate counts for each window.
However, the partial counts are not updated to the Result Table and not written to sink. The engine
@@ -641,7 +641,7 @@ appended to the Result Table only after the watermark is updated to `12:11`.

Spark supports three types of time windows: tumbling (fixed), sliding and session.

-![The types of time windows](/img/structured-streaming-time-window-types.jpg)
+![The types of time windows](../img/structured-streaming-time-window-types.jpg)

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input
can only be bound to a single window.
7 changes: 4 additions & 3 deletions docs/streaming/getting-started.md
@@ -448,14 +448,15 @@ table, and Spark runs it as an *incremental* query on the *unbounded* input
table. Let’s understand this model in more detail.

## Basic Concepts
+
Consider the input data stream as the "Input Table". Every data item that is
arriving on the stream is like a new row being appended to the Input Table.

-![Stream as a Table](/img/structured-streaming-stream-as-a-table.png "Stream as a Table")
+![Stream as a Table](../img/structured-streaming-stream-as-a-table.png "Stream as a Table")

A query on the input will generate the "Result Table". Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.

-![Model](/img/structured-streaming-model.png)
+![Model](../img/structured-streaming-model.png)

The "Output" is defined as what gets written out to the external storage. The output can be defined in a different mode:

@@ -476,7 +477,7 @@ will continuously check for new data from the socket connection. If there is
new data, Spark will run an "incremental" query that combines the previous
running counts with the new data to compute updated counts, as shown below.

-![Model](/img/structured-streaming-example-model.png)
+![Model](../img/structured-streaming-example-model.png)

**Note that Structured Streaming does not materialize the entire table**. It reads the latest
available data from the streaming data source, processes it incrementally to update the result,
2 changes: 1 addition & 1 deletion docs/streaming/performance-tips.md
@@ -26,7 +26,7 @@ license: |

Asynchronous progress tracking allows streaming queries to checkpoint progress asynchronously and in parallel to the actual data processing within a micro-batch, reducing latency associated with maintaining the offset log and commit log.

-![Async Progress Tracking](/img/async-progress.png)
+![Async Progress Tracking](../img/async-progress.png)

## How does it work?

