Skip to content

Latest commit

 

History

History
109 lines (76 loc) · 8.68 KB

File metadata and controls

109 lines (76 loc) · 8.68 KB

Supercharge Your Streamlit Visualizations: Batch Processing Iceberg Data with Apache Flink

After you have all your fun running the Java-based Flink application DataGeneratorApp to kickstart the data pipeline. This app powers up your Kafka topics—airline.skyone and airline.sunset—by generating sample records that fuel the rest of the process. Once the data flows into Kafka, launch the FlightImporterApp Flink application. This crucial step reads the enriched data from Kafka and writes it into the apache_kickstarter.airlines.flight Apache Iceberg table, seamlessly preparing your data for advanced analytics and insight generation.

datageneratorapp-flightimporterapp

Get ready to see the magic of Flink in action! Now, it's time to share those insights with the world! One fantastic way to do that is with Streamlit, which allows you to easily create interactive visualizations. Streamlit is intuitive, powerful, and designed with Python developers in mind, making it a breeze to turn raw data into captivating dashboards. 😉

iceberg-flink-streamlit-drawing

Table of Contents

1.0 Power up the Apache Flink Docker containers

Prerequisite

Before you can run scripts/run-flink-locally.sh Bash script, you need to install the aws2-wrap utility. If you have a Mac machine, run this command from your Terminal:

brew install aws2-wrap

If you are not using a Mac, make sure you have Python3.x installed on your machine, and run this command from your Terminal:

pip install aws2-wrap

This section guides you through the local setup (on one machine but in separate containers) of the Apache Flink cluster in Session mode using Docker containers with support for Apache Iceberg. Run the bash script below to start the Apache Flink cluster in Session Mode on your machine:

scripts/run-flink-locally.sh <DOCKER_SWITCH> --profile=<AWS_SSO_PROFILE_NAME>
                                             --chip=<amd64 | arm64>
                                             --aws-s3-bucket=<AWS_S3_BUCKET_NAME>
Argument placeholder Replace with
<DOCKER_SWITCH> on to start up your very own local Apache Cluster running in Docker containers, otherwise off to stop the Docker containers.
<AWS_SSO_PROFILE_NAME> your AWS SSO profile name for your AWS infrastructue that host your AWS Secrets Manager.
<CHIP> if you are running on a Mac with M1, M2, or M3 chip, use arm64. Otherwise, use amd64.
<AWS_S3_BUCKET_NAME> specify the name of the AWS S3 bucket used to store Apache Iceberg files.

To learn more about this script, click here.

2.0 Supercharge Your Streamlit Visualizations

To access the Flink JobManager (supercharge_streamlit-apache_flink-jobmanager-1) container, open the interactive shell by running:

docker exec -it -w /opt/flink/python_apps supercharge_streamlit-apache_flink-jobmanager-1 /bin/bash

Jump right into the container and take charge! You’ll have full control to run commands, explore the file system, and tackle any tasks you need. You’ll land directly in the /opt/flink/python_apps directory—this is the headquarters for the Python script in the repo.

To illustrate, I created a Streamlit script that queries the apache_kickstarter.airlines.flight Apache Iceberg Table, harnessing Flink SQL to extract valuable insights. These insights are then brought to life through a Streamlit dashboard, transforming raw data into an accessible, visual experience.

Here you go, run this in the docker container terminal command line:

uv run streamlit run app.py -- --aws-s3-bucket <AWS_S3_BUCKET> --aws-region <AWS_REGION_NAME>

Notice the extra -- between streamlit run app.py and the actual script arguments. This is necessary to pass arguments to the Streamlit script without causing conflicts with Streamlit's own CLI options.

When you run the script, for instance, it produces the following output:

streamlit-run-from-terminal-screenshot

Open your host web browser, enter the local URL, localhost:8501, and in a few moments this web page will be displayed:

streamlit-screenshot

"After many years in this industry, I’m still amazed by what we can achieve today! The possibilities are endless—enjoy the ride!"

---J3

2.1 Did you notice we prepended uv run to streamlit run?

You maybe asking yourself why. Well, uv is an incredibly fast Python package installer and dependency resolver, written in Rust, and designed to seamlessly replace pip, pipx, poetry, pyenv, twine, virtualenv, and more in your workflows. By prefixing uv run to a command, you're ensuring that the command runs in an optimal Python environment.

Now, let's go a little deeper into the magic behind uv run:

  • When you use it with a file ending in .py or an HTTP(S) URL, uv treats it as a script and runs it with a Python interpreter. In other words, uv run file.py is equivalent to uv run python file.py. If you're working with a URL, uv even downloads it temporarily to execute it. Any inline dependency metadata is installed into an isolated, temporary environment—meaning zero leftover mess! When used with -, the input will be read from stdin, and treated as a Python script.
  • If used in a project directory, uv will automatically create or update the project environment before running the command.
  • Outside of a project, if there's a virtual environment present in your current directory (or any parent directory), uv runs the command in that environment. If no environment is found, it uses the interpreter's environment.

So what does this mean when we put uv run before streamlit run? It means uv takes care of all the setup—fast and seamless—right in your local Docker container. If you think AI/ML is magic, the work the folks at Astral have done with uv is pure wizardry!

Curious to learn more about Astral's uv? Check these out:

3.0 Local Integration: How This App Harnesses Apache Flink

This app.py Python script streamlines Apache Flink integration by leveraging PyFlink directly in the app.py script. Running the Flink job using the same Python process as the Streamlit app offers an intuitive setup for local testing and debugging. It's an ideal solution for quickly iterating on data processing logic without additional infrastructure.

However, a more advanced setup is recommended for production environments where scalability is critical. Running Flink jobs in a separate process and enabling communication through Kafka or a REST API provides greater scalability, albeit at the cost of added complexity.

The simplicity of this app’s current approach makes it perfect for prototyping and development while offering a foundation to explore more sophisticated integration methods as your requirements grow.

4.0 Resources

Flink Python Docs

PyFlink API Reference

Apache Iceberg in Action with Apache Flink using Python

Streamlit Documentation