This Git repository features use cases of good and bad practices when using Spark-based tools to process and analyze data.
- From a dedicated terminal window/tab, launch the Spark Connect server. Note that the `SPARK_REMOTE` environment variable should not be set at this stage; otherwise, the Spark Connect server will try to connect to the corresponding Spark Connect server and will therefore not start:
$ sparkconnectstart
- From another terminal window/tab (not the one in which the Spark Connect
server was launched), launch PySpark from the command line, which in
turn launches JupyterLab
- Follow the details given by PySpark to open Jupyter in a web browser
$ export SPARK_REMOTE="sc://localhost:15002"; pyspark
...
[C 2023-06-27 21:54:04.720 ServerApp]
To access the server, open this file in a browser:
file://$HOME/Library/Jupyter/runtime/jpserver-21219-open.html
Or copy and paste one of these URLs:
http://localhost:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
http://127.0.0.1:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
...
- Open Jupyter in a web browser. For instance, on macOS:
$ open ~/Library/Jupyter/runtime/jpserver-*-open.html
- Open a notebook, for instance `ipython-notebooks/simple-connect.ipynb`
- Run the cells. The third cell should give a result like:
+-------+--------+-------+-------+
|User ID|Username|Browser|     OS|
+-------+--------+-------+-------+
|   1580|   Barry|FireFox|Windows|
|   5820|     Sam|MS Edge|  Linux|
|   2340|   Harry|Vivaldi|Windows|
|   7860|  Albert| Chrome|Windows|
|   1123|     May| Safari|  macOS|
+-------+--------+-------+-------+
- Notes:
  - The first cell stops the initial Spark session when the latter has been started by Spark without making use of Spark Connect, for instance when the `SPARK_REMOTE` environment variable has not been set properly. There is a try-catch clause because, once the Spark session has been started through Spark Connect, it cannot be stopped that way; the first cell may thus be re-executed at will with no further side effect on the Spark session
  - The same first cell then starts the Spark session through Spark Connect, or reuses it when it already exists
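The `SPARK_REMOTE` requirements described above (unset when starting the server, set to an `sc://host:port` URL when starting the client) can be checked from the shell before launching anything. The following is a minimal sketch and not part of this repository: the function names `spark_remote_is_unset` and `spark_remote_endpoint` are hypothetical, and the fallback to port 15002 assumes the default Spark Connect port, which is also the one used in the example above.

```shell
# Hypothetical helpers (not part of this repository or of Spark).

# Succeeds only when SPARK_REMOTE is unset or empty,
# i.e. when it is safe to start the Spark Connect server.
spark_remote_is_unset() {
  [ -z "${SPARK_REMOTE:-}" ]
}

# Prints "host port" for a Spark Connect URL such as sc://localhost:15002.
# Uses the first argument when given, otherwise the SPARK_REMOTE variable.
spark_remote_endpoint() {
  local url="${1:-${SPARK_REMOTE:-}}"
  case "$url" in
    sc://*) ;;
    *) echo "not a Spark Connect URL: '$url'" >&2; return 1 ;;
  esac
  local hostport="${url#sc://}"   # strip the scheme
  hostport="${hostport%%/*}"      # drop any trailing path or parameters
  local host="${hostport%%:*}"
  local port="15002"              # assumption: default Spark Connect port
  case "$hostport" in
    *:*) port="${hostport##*:}" ;;
  esac
  echo "$host $port"
}
```

For instance, `spark_remote_is_unset && sparkconnectstart` would only start the server when the variable is not set, and `spark_remote_endpoint "sc://localhost:15002"` prints `localhost 15002`.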
- As per the official Apache Spark documentation, PyPI-installed PySpark (`pip install pyspark[connect]`) comes with Spark Connect from Spark version 3.4 onwards. However, up to Spark version 3.4.1, the PySpark installation lacks the two new administration scripts used to start and stop the Spark Connect server. For convenience, these two scripts have therefore been copied into this Git repository, in the `tools/` directory. They may then simply be copied into the PySpark `sbin` directory, once PySpark has been installed with `pip`
- Install PySpark and JupyterLab, along with a few other Python libraries, from PyPI:
$ pip install -U "pyspark[connect,sql,pandas_on_spark]" plotly pyvis jupyterlab
- Add the following in the Bash/Zsh init script:
$ cat >> ~/.bashrc << _EOF
# Spark
PY_LIBDIR="\$(python -mpip show pyspark|grep "^Location:"|cut -d' ' -f2,2)"
export SPARK_VERSION="\$(python -mpip show pyspark|grep "^Version:"|cut -d' ' -f2,2)"
export SPARK_HOME="\$PY_LIBDIR/pyspark"
export PATH="\$SPARK_HOME/sbin:\$PATH"
export PYSPARK_PYTHON="\$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'
_EOF
- Re-read the Shell init scripts:
$ exec bash
- Copy the two Spark Connect administrative scripts into the PySpark installation:
$ cp tools/st*-connect*.sh $SPARK_HOME/sbin/
- Check that the scripts are installed correctly:
$ ls -lFh $SPARK_HOME/sbin/*connect*.sh
-rwxr-xr-x 1 user staff 1.5K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/start-connect-server.sh*
-rwxr-xr-x 1 user staff 1.0K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/stop-connect-server.sh*
- Add the following Shell aliases to start and stop Spark Connect server:
$ cat >> ~/.bash_aliases << _EOF
# Spark Connect
alias sparkconnectstart='unset SPARK_REMOTE; start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:\$SPARK_VERSION,io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
alias sparkconnectstop='stop-connect-server.sh'
# PySpark
alias pysparkdelta='pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
_EOF
- Re-read the Shell aliases:
$ . ~/.bash_aliases
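The setup above derives `SPARK_HOME` and `SPARK_VERSION` from the output of `pip show pyspark`, and relies on the two Spark Connect scripts being reachable through `PATH`. As a hedged sketch, the two helpers below could be used to double-check both points. The function names `pip_show_field` and `missing_commands` are hypothetical and not part of this repository; unlike the `cut`-based pipeline, the `sed` expression keeps the whole value, which only matters if the installation path contains spaces.

```shell
# Hypothetical helpers (not part of this repository).

# Reads `pip show` output on stdin and prints the value of the field
# named by the first argument (e.g. Location or Version).
pip_show_field() {
  sed -n "s/^$1: //p" | head -n 1
}

# Prints, one per line, the given command names that cannot be found
# on PATH; prints nothing when all of them are available.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Possible usage, once PySpark has been installed with pip:
#   python -mpip show pyspark | pip_show_field Location
#   missing_commands start-connect-server.sh stop-connect-server.sh
```

An empty output from `missing_commands` suggests that both scripts were copied into `$SPARK_HOME/sbin` and that this directory is on the `PATH`.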
- This section is kept for reference only. It is normally not needed
- Install Spark/PySpark manually, e.g. with Spark 3.4.1:
$ export SPARK_VERSION="3.4.1"
wget https://dlcdn.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar zxf spark-$SPARK_VERSION-bin-hadoop3.tgz && \
mv spark-$SPARK_VERSION-bin-hadoop3 ~/ && \
rm -f spark-$SPARK_VERSION-bin-hadoop3.tgz
- Add the following in the Bash/Zsh init script:
$ cat >> ~/.bashrc << _EOF
# Spark
export SPARK_VERSION="${SPARK_VERSION}"
export SPARK_HOME="\$HOME/spark-\$SPARK_VERSION-bin-hadoop3"
export PATH="\$SPARK_HOME/bin:\$SPARK_HOME/sbin:\${PATH}"
export PYTHONPATH=\$(ZIPS=("\$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "\${ZIPS[*]}"):\$PYTHONPATH
export PYSPARK_PYTHON="\$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'
_EOF
exec bash
- Add the following Shell aliases to start and stop Spark Connect server:
$ cat >> ~/.bash_aliases << _EOF
# Spark Connect
alias sparkconnectstart='start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:${SPARK_VERSION}'
alias sparkconnectstop='stop-connect-server.sh'
_EOF
. ~/.bash_aliases
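The `PYTHONPATH` line in the init script above joins all the `*.zip` archives found under `$SPARK_HOME/python/lib` with `:` by means of a Bash array. For readers less familiar with that idiom, here is a rough equivalent written as a function using only POSIX shell features. The name `join_zips` is hypothetical and the function is not part of this repository; it is a sketch of the same idea, not a replacement for the line above.

```shell
# Hypothetical helper (not part of this repository): prints all the
# *.zip files found in the directory given as first argument, joined
# with ':'. Prints nothing when the directory contains no zip file.
join_zips() {
  (
    # Runs in a subshell so that IFS and the positional parameters
    # of the caller are left untouched.
    set -- "$1"/*.zip
    # When the glob matches nothing, it is kept as a literal pattern,
    # which does not exist as a file.
    [ -e "$1" ] || exit 0
    IFS=:
    # "$*" joins the positional parameters with the first character of IFS.
    printf '%s\n' "$*"
  )
}

# Possible usage:
#   export PYTHONPATH="$(join_zips "$SPARK_HOME/python/lib"):$PYTHONPATH"
```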