[SPARK-50647][INFRA] Add a daily build for PySpark with old dependencies
### What changes were proposed in this pull request?
Add a daily build for PySpark with old dependencies

### Why are the changes needed?
To guard the installation described in https://apache.github.io/spark/api/python/getting_started/install.html.

The installation guide is outdated:

- pyspark-sql/connect requires:
  - pyarrow>=11.0
  - numpy>=1.21
  - pandas>=2.0.0

- pyspark-pandas requires even newer versions of pandas/pyarrow/numpy:
  - pyarrow>=11.0
  - numpy>=1.22.4
  - pandas>=2.2.0

This PR excludes PS (pandas API on Spark); we can either:

- make PS work with the old versions, and then add it to this workflow;
- or upgrade the minimum requirements, and add a separate workflow for it.
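
For illustration only (not part of this PR), a minimal Python sketch of the kind of floor check implied by the pyspark-sql/connect minimums listed above, assuming Python 3.9 and that the `packaging` module is installed:

```python
# Hypothetical sanity check, not part of the committed files: verify that the
# installed pyspark-sql/connect dependencies meet the documented minimums.
from importlib.metadata import version
from packaging.version import Version

MIN_VERSIONS = {
    "pyarrow": "11.0.0",
    "numpy": "1.21",
    "pandas": "2.0.0",
}

for pkg, floor in MIN_VERSIONS.items():
    installed = Version(version(pkg))
    assert installed >= Version(floor), f"{pkg} {installed} is below the minimum {floor}"
    print(f"{pkg} {installed} >= {floor}")
```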

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
PR build with
```
envs:
  default: '{"PYSPARK_IMAGE_TO_TEST": "python-minimum", "PYTHON_TO_TEST": "python3.9"}'

jobs:
  default: '{"pyspark": "true"}'
```

https://github.com/zhengruifeng/spark/runs/34827211339

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49267 from zhengruifeng/infra_py_old.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
zhengruifeng committed Dec 26, 2024
1 parent 9c9bdab commit 4ad7f3d
Showing 3 changed files with 140 additions and 0 deletions.
13 changes: 13 additions & 0 deletions .github/workflows/build_infra_images_cache.yml
@@ -109,6 +109,19 @@ jobs:
      - name: Image digest (SparkR)
        if: hashFiles('dev/spark-test-image/sparkr/Dockerfile') != ''
        run: echo ${{ steps.docker_build_sparkr.outputs.digest }}
      - name: Build and push (PySpark with old dependencies)
        if: hashFiles('dev/spark-test-image/python-minimum/Dockerfile') != ''
        id: docker_build_pyspark_python_minimum
        uses: docker/build-push-action@v6
        with:
          context: ./dev/spark-test-image/python-minimum/
          push: true
          tags: ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-minimum-cache:${{ github.ref_name }}-static
          cache-from: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-minimum-cache:${{ github.ref_name }}
          cache-to: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-minimum-cache:${{ github.ref_name }},mode=max
      - name: Image digest (PySpark with old dependencies)
        if: hashFiles('dev/spark-test-image/python-minimum/Dockerfile') != ''
        run: echo ${{ steps.docker_build_pyspark_python_minimum.outputs.digest }}
      - name: Build and push (PySpark with PyPy 3.10)
        if: hashFiles('dev/spark-test-image/pypy-310/Dockerfile') != ''
        id: docker_build_pyspark_pypy_310
46 changes: 46 additions & 0 deletions .github/workflows/build_python_minimum.yml
@@ -0,0 +1,46 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

name: "Build / Python-only (master, Python with old dependencies)"

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  run-build:
    permissions:
      packages: write
    name: Run
    uses: ./.github/workflows/build_and_test.yml
    if: github.repository == 'apache/spark'
    with:
      java: 17
      branch: master
      hadoop: hadoop3
      envs: >-
        {
          "PYSPARK_IMAGE_TO_TEST": "python-minimum",
          "PYTHON_TO_TEST": "python3.9"
        }
      jobs: >-
        {
          "pyspark": "true"
        }
81 changes: 81 additions & 0 deletions dev/spark-test-image/python-minimum/Dockerfile
@@ -0,0 +1,81 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Image for building and testing Spark branches. Based on Ubuntu 22.04.
# See also in https://hub.docker.com/_/ubuntu
FROM ubuntu:jammy-20240911.1
LABEL org.opencontainers.image.authors="Apache Spark project <dev@spark.apache.org>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark with old dependencies"
# Overwrite this label to avoid exposing the underlying Ubuntu OS version label
LABEL org.opencontainers.image.version=""

ENV FULL_REFRESH_DATE=20241223

ENV DEBIAN_FRONTEND=noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN=true

RUN apt-get update && apt-get install -y \
    build-essential \
    ca-certificates \
    curl \
    gfortran \
    git \
    gnupg \
    libcurl4-openssl-dev \
    libfontconfig1-dev \
    libfreetype6-dev \
    libfribidi-dev \
    libgit2-dev \
    libharfbuzz-dev \
    libjpeg-dev \
    liblapack-dev \
    libopenblas-dev \
    libpng-dev \
    libpython3-dev \
    libssl-dev \
    libtiff5-dev \
    libxml2-dev \
    openjdk-17-jdk-headless \
    pkg-config \
    qpdf \
    tzdata \
    software-properties-common \
    wget \
    zlib1g-dev


# Should keep the installation consistent with https://apache.github.io/spark/api/python/getting_started/install.html

# Install Python 3.9
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y \
    python3.9 \
    python3.9-distutils \
    && apt-get autoremove --purge -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*


ARG BASIC_PIP_PKGS="numpy==1.21 pyarrow==11.0.0 pandas==2.0.0 six==1.16.0 scipy scikit-learn coverage unittest-xml-reporting"
# Python deps for Spark Connect
ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 googleapis-common-protos==1.65.0 graphviz==0.20 protobuf"

# Install Python 3.9 packages
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
RUN python3.9 -m pip install --force $BASIC_PIP_PKGS $CONNECT_PIP_PKGS && \
    python3.9 -m pip cache purge
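
As an illustrative sanity check (not part of the committed Dockerfile), the versions pinned by `BASIC_PIP_PKGS` and `CONNECT_PIP_PKGS` could be confirmed from inside the built image with its `python3.9`:

```python
# Illustrative check of the versions pinned above; expected values come from
# the ARG lines in this Dockerfile.
import grpc
import numpy
import pandas
import pyarrow

print("numpy  :", numpy.__version__)    # expected 1.21.0
print("pandas :", pandas.__version__)   # expected 2.0.0
print("pyarrow:", pyarrow.__version__)  # expected 11.0.0
print("grpcio :", grpc.__version__)     # expected 1.67.0
```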
