Python Interpreter in Docker and PySpark Tests in Docker
Overview
There are two main ideas behind this article: security and mobility. When you create your environment only on your server, laptop, Raspberry Pi, etc., it may be great, but without any backup, regular updates, or automation, it can easily become a SnowflakeServer anti-pattern. Python offers virtual environments to avoid this issue.
Another point is the security of the environment. Virtual environments are great, but they can easily become insecure. You should use the latest major version of Python and be careful when installing and using external libraries. It is possible for a malicious module carrying the same name as a module from a popular open-source library to find its way into the system path. If the malicious module is found before the actual module, it will be imported and can be used to exploit applications that have it in their dependency tree. To prevent this, make sure you use either an absolute import or an explicit relative import, since both guarantee that the actual, intended module is imported.
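To illustrate, here is a minimal sketch of the two safe import styles; the package layout (a hypothetical mypackage with a utils module) is a placeholder for illustration, not from any project mentioned in this article.
# Hypothetical layout:
#   mypackage/
#     __init__.py
#     main.py
#     utils.py
# Inside mypackage/main.py:

# Absolute import: names the full path from the package root, so a
# stray top-level utils.py elsewhere on sys.path is not picked up.
from mypackage.utils import helper

# Explicit relative import: the leading dot pins the lookup to the
# current package, never to an unrelated module on sys.path.
from .utils import helper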
Fortunately, Docker can help us keep many of these concerns under control.
Using Docker as a Python Interpreter
There are a few different approaches to using Docker as a Python interpreter, depending on your needs. If you have an existing Python application that you want to run in a Docker container, you can “dockerize” your application by creating a Docker image.
Dockerize your Python application
Docker images are a great way to ensure that your application runs consistently across different machines and servers. By containerizing your application, you can separate it from the underlying hardware and operating system as much as possible, making it easier to move and deploy.
To dockerize your Python application, you'll need to create a Dockerfile that specifies the base image, any dependencies you need to install, and the command to run your application. For more information, see Dockerize your Python application.
Benefits of using Docker as a Python interpreter
Using Docker as a Python interpreter has a number of benefits. For one, it makes it easier to ensure that your application runs consistently across different environments. Additionally, it can simplify the process of managing dependencies, as all the dependencies for your application can be included in the Docker image. Finally, by using Docker, you can avoid some of the pitfalls of maintaining a “snowflake server” - a server that is difficult to reproduce and maintain over time.
Using Docker as a remote Python interpreter
What if you want to develop and build your application from scratch, using a separate Python interpreter? An existing solution is PyCharm from JetBrains, but it requires the Professional edition of PyCharm.
Visual Studio Code can handle this job with its Dev Containers extension. Basically, you can attach to a running Docker container that contains Python.
The first step is to create a simple Dockerfile:
# start from the latest official Python image
FROM python:latest
# set the working directory inside the container
WORKDIR /app
# copy the current folder into the image
COPY . ./
Build the image with docker build -t pyground . and start the container with docker run -it --rm --name pyground pyground:latest. After these commands, your Python interpreter is alive; you just need to attach to it with your code editor (VS Code in our example). However, since you have a terminal, you can basically connect and run whatever you want.
In VS Code, use the Dev Containers extension to pick up the new interpreter and attach to your running Docker container. After that, a new Visual Studio Code window should open, with the attached running container shown in the lower-left corner.
Via the terminal, I can get the current Python version:
root@704b87b076d8:~# python --version
Python 3.11.2
If you have files in the same folder as your Dockerfile, you can run and use them in the container:
root@704b87b076d8:~# ls /app/
Dockerfile solver.py
root@704b87b076d8:~# python /app/solver.py
a: 1
b: 10
c: 1
(-0.10102051443364424, -9.898979485566356)
The code for this solver.py is available here: JetBrains Example.
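Judging from the output above, solver.py computes the roots of a quadratic equation. Here is a minimal sketch of what such a script might look like; the actual JetBrains example may differ in detail.
import math

# read the three coefficients of ax^2 + bx + c = 0 from the user
a = int(input("a: "))
b = int(input("b: "))
c = int(input("c: "))

d = b ** 2 - 4 * a * c              # discriminant; assumed non-negative here
root1 = (-b + math.sqrt(d)) / (2 * a)
root2 = (-b - math.sqrt(d)) / (2 * a)
print((root1, root2))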
After you are done with your work, you can stop the Docker container, and VS Code should automatically detach from it. Keep in mind that the --rm option removes the Docker container after the run. For a more persistent solution, take a look at Docker's bind mount feature.
We can improve the Dockerfile so that it comes with preinstalled libraries. The best practice for Python libraries is to use a requirements.txt file:
numpy==1.24.2
pandas==1.5.3
And the improved Dockerfile:
# start from the latest official Python image
FROM python:latest
# make sure pip itself is up to date
RUN pip install --upgrade pip
# COPY is preferred over ADD for plain local files
COPY requirements.txt .
# install the pinned dependencies
RUN pip install -r requirements.txt
Build the Docker image with a specific Dockerfile filename: docker build -t pyground2 -f .\Dockerfile-with-requirements.dockerfile .
And then we can test it:
Python 3.11.2 (main, Mar 1 2023, 14:46:02) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>>
We can put some pre-configuration in the Dockerfile or add sample files. This way, we can prepare the same environment for colleagues or students: everyone will have the same version of Python along with the same versions of all libraries and dependencies.
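As a quick check that everyone really is on the same setup, a small script along these lines (a hypothetical check.py, not part of the original article) can be run inside the container:
import sys
import numpy
import pandas

# print the interpreter and library versions so they can be
# compared across machines at a glance
print("Python:", sys.version.split()[0])
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)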
One last note: don't overfill your Dockerfile. Always check whether you really need everything you put into your images. For example, my base pyground image with the latest version of Python (3.11.2) was 925 MB, and with the pandas and numpy libraries it grew to over 1 GB!
> docker images
REPOSITORY   TAG      IMAGE ID       CREATED          SIZE
pyground     latest   c69a7214f5e5   33 minutes ago   925MB
pyground2    latest   bfe8fc2400e4   7 minutes ago    1.12GB
Docker containers as test instances for PySpark
We used the previous example as a starting point for testing purposes. There are several goals we need to achieve. First, we need to be able to test our code locally (on any operating system and any hardware). We also need to check our code during the deployment pipeline. Docker is perfect for all of these tasks. Moreover, our code is written for PySpark.
First, we need to configure a Dockerfile containing PySpark and Java.
ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.9.8
FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}
# bring the Python installation from the first stage into the final Java image
COPY --from=py3 / /
As you can see, a few lines here need explanation. We use the basic slim-buster image variant. A slim image generally contains only the minimal packages needed to run Python, and buster is the codename for the stable version of Debian (release 10.4 at the time) on which this Python image is based.
Next, we use a second base image, openjdk, with a similar codename and the Java version OPENJDK_VERSION=8. The final COPY --from=py3 / / line brings the Python installation from the first stage into the Java image, so the resulting image contains both Python and Java.
ARG PYSPARK_VERSION=3.2.0
RUN pip --no-cache-dir install pyspark==${PYSPARK_VERSION}
With the RUN command, we install PySpark itself (version 3.2.0), and then we install all the system dependencies and frequently used libraries:
# work from the /app directory inside the container
WORKDIR /app
# system packages needed to build some Python dependencies
RUN apt-get update && apt-get install -y build-essential libxml2
# copy the project into the image
COPY . /app
# commonly used test and data libraries
RUN pip3 install cython numpy pytest pandas coverage pyspark_test dummy_spark IPython pytest-cov
# project-specific dependencies
RUN pip3 install -r requirements.txt
# run the unit tests with coverage when the container starts
ENTRYPOINT python3 -m coverage run --source=. -m pytest -v test/ && coverage report && coverage xml && cat coverage.xml
The last command is for the test run itself. Here we call the coverage and pytest tools: the command runs all the unit tests in the test/ folder, generates a test report, and prints the coverage output.
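For completeness, here is a minimal sketch of what a unit test in the test/ folder might look like; the transformation under test (add_doubled) is a hypothetical placeholder, not code from the original project.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

@pytest.fixture(scope="session")
def spark():
    # one local SparkSession shared by the whole test session
    return SparkSession.builder.master("local[1]").appName("sparktest").getOrCreate()

def add_doubled(df):
    # hypothetical transformation under test; in a real project this
    # would be imported from the application code under src/
    return df.withColumn("doubled", F.col("value") * 2)

def test_add_doubled(spark):
    source = spark.createDataFrame([(1,), (2,)], ["value"])
    result = add_doubled(source)
    assert [row["doubled"] for row in result.collect()] == [2, 4]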
Lift and shift
As mentioned earlier, we can reuse a Dockerfile in multiple environments. Here are some examples of where we can use our previous Docker image (sparktest).
In an Azure DevOps pipeline, we can use the following approach to run this image and get a coverage report:
- task: Docker@2
  displayName: 'Build an image'
  inputs:
    repository: 'sparktest'
    command: 'build'
    Dockerfile: '**/Dockerfile'
    tags: 'latest'
- script: |
    docker run -e PYTHONPATH=./src -v :/app --name sparktest sparktest
    CONTAINER_ID=`docker ps -aqf "name=sparktest"`
    docker cp $CONTAINER_ID:/app/coverage.xml test-coverage.xml
  displayName: "Image unittest by Docker"
  workingDirectory: ${{ parameters.appPath }}
In docker-compose.yaml, we can specify the Dockerfile together with its build context:
version: "3.9"
services:
test:
environment:
- PYTHONPATH=./src
image: "sparktest"
build:
context: .
dockerfile: ./Dockerfile
volumes:
- ./our_current_project:/app
command: python3 -m coverage run --source=. -m pytest -v test/ && coverage report && coverage xml && cat coverage.xml
And of course, you can build and run that image from your local machine.
In summary, using Docker as a Python interpreter can be a powerful tool for managing your Python applications and development environment. By containerizing your application or development environment, you can ensure that it runs consistently across different machines and servers, simplify dependency management, and avoid the pitfalls of maintaining a snowflake server.
Sources
- https://code.visualstudio.com/docs/devcontainers/attach-container
- https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
- https://learn.microsoft.com/en-us/visualstudio/docker/tutorials/docker-tutorial
- https://stackoverflow.com/questions/69326427/select-interpreter-of-docker-container-in-the-vscode
- https://martinfowler.com/bliki/SnowflakeServer.html
- https://dev.to/alvarocavalcanti/setting-up-a-python-remote-interpreter-using-docker-1i24
- https://stackoverflow.com/questions/54954187/docker-images-types-slim-vs-slim-stretch-vs-stretch-vs-alpine