Dockerization project

Joseph Chazalon, Clément Demoulins {firstname.lastname@lrde.epita.fr}

February 2021

Assignment

For this project, you have to package a document analysis service into a Docker image. This service is composed of two components:

a Python server using Flask/Gunicorn and exposing a REST API;
a C++ library which does all the hard work and relies on several other libraries.

This is an example of a “semi-industrial” product: a prototype used in SODUCO, a collaborative research project the LRDE takes part in. As a consequence, the code is quite messy and hard to deploy. Docker can help users a lot, though it raises challenges for whoever builds the Docker image.

You will have several subtasks to complete. Each of them is graded, so you can earn points along the way.

Step 1: Create a build image for the C++ code

This image will not contain any part of our code: it will only contain the necessary tools to build the C++ code of the project. Hence, the final CMD should be a shell invocation.

Using this image, with the appropriate bind-mount of the current directory, it should be possible to run the compilation of the C++ code using the following procedure (which should be put in a build.sh file):

# Assume we are in the "soduco-server" directory
echo "Creating build dir"
mkdir -p build && cd build
echo "conan: install deps"
conan install -u .. --build missing -s compiler.libcxx=libstdc++11 -s compiler.cppstd=20
echo "cmake: generate build scripts"
cmake .. -G Ninja -DPYTHON_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release
echo "cmake: launch build"
cmake --build . --parallel --config Release
echo "cpack: create artefacts"
cpack -G ZIP -G TGZ .    
echo "**********************"
echo "BUILD COMPLETE"
echo "**********************"

When run in a container based on the correct image, this sequence of commands produces a package of the C++ code at build/soduco-py37-*.tar.gz. This artifact will be used in later stages, either manually or using a multi-stage build. The important files it contains are:

soduco-py37-0.1.1-Linux/back/soducocxx.cpython-37m-x86_64-linux-gnu.so:
The image processing library.
soduco-py37-0.1.1-Linux/lib/libblend2d.so:
A dependency which is also rebuilt.
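
To confirm that a build produced the files listed above, a small check along the following lines may help. This is only a sketch: the function name check_artifact is hypothetical, and it assumes GNU tar and grep are available where it runs.

```shell
#!/bin/sh
# Hypothetical helper: verify that a build archive contains the two
# important files named in the assignment.
check_artifact() {
    archive=$1
    # List the archive contents once; fail if tar cannot read it.
    listing=$(tar -tzf "$archive") || return 1
    for wanted in 'back/soducocxx.cpython-37m-x86_64-linux-gnu.so' \
                  'lib/libblend2d.so'; do
        # Each wanted path must appear at the end of some archive entry.
        if ! printf '%s\n' "$listing" | grep -q "/$wanted\$"; then
            echo "missing: $wanted" >&2
            return 1
        fi
    done
    echo "artifact OK: $archive"
}
```

It could be called as check_artifact build/soduco-py37-0.1.1-Linux.tar.gz (the exact file name depends on the version produced by cpack).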

As you can immediately see, this project depends on the conan C++ package manager, and on cmake: you will have to install them on the build image.

We suggest you use the base image gcc:9.3 which contains the right version of GCC for our project.

Also, you will have to install the following dependencies using APT, the system package manager (this list is not exhaustive):

cmake: build tools
libboost-dev: what is missing from the C++ standard library
libfreeimage-dev: to read and write images
libpoppler-cpp-dev: to manipulate PDFs
libtesseract-dev: to OCR images
ninja-build: build tool
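
Installing these packages could be sketched as the shell function below, meant to be called from a single RUN instruction. It only covers the (non-exhaustive) list above. The function name install_build_deps and the APT_GET and APT_LISTS_DIR override variables are hypothetical; they are there so that the sequence can be dry-run outside a container. Removing the APT lists in the same step keeps the package index out of the resulting layer.

```shell
#!/bin/sh
# Hypothetical helper: install the build dependencies listed above and
# drop the APT package index afterwards.
install_build_deps() {
    apt_get=${APT_GET:-apt-get}                       # override for dry runs
    lists_dir=${APT_LISTS_DIR:-/var/lib/apt/lists}    # APT index location
    $apt_get update &&
    $apt_get install -y --no-install-recommends \
        cmake ninja-build libboost-dev libfreeimage-dev \
        libpoppler-cpp-dev libtesseract-dev &&
    # Remove the downloaded package index so it is not kept in the image.
    rm -rf "$lists_dir"/*
}
```

Running it with APT_GET=echo prints the two apt-get invocations instead of executing them.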

You will also need a working installation of Python 3 (version 3.7 is OK here) with pip.

You can install conan using pip. Beware: its dependencies were ill-defined the last time we installed it, so make sure the following Python packages are already installed before you try to install conan:

setuptools
wheel

Finally, you also need to tell conan where the Pylene library (developed at the LRDE) can be downloaded (you can run this after conan is installed):

conan remote add lrde-public https://artifactory.lrde.epita.fr/artifactory/api/conan/lrde-public
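
Put together, the conan setup described above might look like the following sketch. The function name install_conan and the PIP and CONAN override variables are hypothetical (they allow a dry run), and pip3 is assumed to be the pip executable in the image. The --no-cache-dir option asks pip not to keep its download cache.

```shell
#!/bin/sh
# Hypothetical helper: install conan's prerequisites first, then conan,
# then register the LRDE remote that hosts Pylene.
install_conan() {
    pip_cmd=${PIP:-pip3}        # override for dry runs
    conan_cmd=${CONAN:-conan}   # override for dry runs
    $pip_cmd install --no-cache-dir setuptools wheel &&
    $pip_cmd install --no-cache-dir conan &&
    $conan_cmd remote add lrde-public https://artifactory.lrde.epita.fr/artifactory/api/conan/lrde-public
}
```

The order matters here: setuptools and wheel are installed in a separate pip call so that they are already present when conan is installed.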

With all these steps, you should have a working toolchain in a Docker image. Doing this on your personal system would probably have messed up a lot of things.

The workflow to test this first image would be the following:

Create the “builder” image using something like
docker build -t soduco/back_builder -f soduco-back-cppbuilder.dockerfile .
Use this image to build the C++ library. If we put the sequence of build commands in a build.sh script, then when we run:
docker run --rm -it -v ${PATH_TO_CODE}:/app/ -w /app soduco/back_builder sh build.sh
we should have our artifacts ready in the build/ folder.

Step 2: Package the server in a Docker image (single stage build)

This is the most complex part because, to our knowledge, there is no public image which has Python pre-installed and is compatible with C++ libraries built with GCC 9.3.

The goal here is to produce an image which can run a Python server (Flask and Gunicorn) and a C++ library built with a recent version of GCC. The server will receive incoming requests, identify the name of the document to process (documents should be stored under /data/directories/) and ask the library to process it. The library returns the results to the server, which caches a copy in /data/annotations and returns the result to the client.

You will need to install in your image the dependencies of two programs: the Python server and the C++ image processing library.

For the Python server, the required modules are the following:

unidecode
filelock
regex
gunicorn
fastspellchecker
pillow
numpy
python-dotenv
flask
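
One way to keep this list in a requirements.txt file is sketched below. The function name write_requirements is hypothetical, and no versions are pinned because the assignment does not give any.

```shell
#!/bin/sh
# Hypothetical helper: write the Python dependencies of the server
# to the file given as first argument, one module per line.
write_requirements() {
    cat > "$1" <<'EOF'
unidecode
filelock
regex
gunicorn
fastspellchecker
pillow
numpy
python-dotenv
flask
EOF
}
```

After write_requirements requirements.txt, the modules can be installed in one step with pip install --no-cache-dir -r requirements.txt. Copying this file into the image before the rest of the server code lets Docker reuse the cached dependency layer when only the code changes.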

We recommend using a base image with Python 3.7 preinstalled.

For the C++ image processing library, we need the runtime version of the development libraries we installed in the build image, plus some others:

tesseract-ocr-fra
libfreeimage3
enchant
libpoppler-cpp0v5
aspell-fr
libtesseract4

The recommended layout for the application files in the image is:

/app/resources
/app/back
/app/server
/app/lib

There will be two extra directories which should be provided when the container is run:

/data/directories: contains PDF files to process and visualize;
/data/annotations: contains information extracted from PDFs (and possibly modified by the client).

We suggest you proceed as follows:

From a base image containing Python 3.7, install all the required dependencies for the C++ image processing library.
Install the library and check the status of the image: does ldd report any missing dependencies?
You may need to extract libgcc_s.so and libstdc++.so from the build image.
Then, install all Python dependencies.
Install all server (Python) files and resources.
Use LD_LIBRARY_PATH to activate the libraries under /app/lib.
Make sure you set LC_ALL=C to avoid issues with locales and encodings.
Set the working directory.
Configure the final command for server startup:
gunicorn -t 500 --bind 0.0.0.0:8000 --proxy-allow-from='*' server:app
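
The ldd check mentioned in the steps above can be automated with a small filter such as the one below. It is a sketch: the function name missing_libs is hypothetical, and it relies on ldd marking unresolved libraries with "=> not found".

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on standard input and print the
# name of every library that could not be resolved.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

For example, assuming the library was installed under /app/back as in the recommended layout, the following prints nothing when all dependencies are found:

LD_LIBRARY_PATH=/app/lib ldd /app/back/soducocxx.cpython-37m-x86_64-linux-gnu.so | missing_libs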

When successfully built, the server should start and listen on port 8000 upon container launch. To start the server properly, bind-mount the sample data (the data/directories directory of the resources we provided) to the right mount point. Also, make sure the /data/annotations directory is writable within the container.
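
A launch command with both mount points could be sketched as follows. The image name soduco/server, the function name run_server, and the DOCKER and IMAGE override variables are hypothetical. The host directory passed as argument is assumed to contain a directories/ folder with the sample PDFs and a writable annotations/ folder.

```shell
#!/bin/sh
# Hypothetical helper: start the server container with the two data
# directories bind-mounted and port 8000 published on the host.
run_server() {
    data_dir=$1   # absolute host path holding directories/ and annotations/
    ${DOCKER:-docker} run --rm -p 8000:8000 \
        -v "$data_dir/directories:/data/directories" \
        -v "$data_dir/annotations:/data/annotations" \
        "${IMAGE:-soduco/server}"
}
```

Calling it with DOCKER=echo prints the arguments that would be passed to docker, which is convenient for checking the mount paths before starting a container.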

You can test the server using some sample curl commands like:

# Get the list of directories
$ curl -X GET "http://localhost:8000/directories/"   
{
  "directories": [
    "Didot_1842a_sample.pdf": {}, 
    "Didot_1848a_sample.pdf": {}, 
    "Didot_1851a_sample.pdf": {}
  ]
}

# Get a corrected image for a given page from a given PDF
$ curl -X GET http://localhost:8000/directories/Didot_1851a_sample.pdf/3/image > img.jpg
# ...
# You can view the result using any image viewer.

# Get the annotations (extracted data) for a given page from a given PDF
$ curl -X GET http://localhost:8000/directories/Didot_1851a_sample.pdf/3/annotation
{
    "content": [
        ...
    ],
    "mode": "cached"
}

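Note: if SELinux blocks access to the mounted directories (e.g. on Fedora), you may need to switch it to permissive mode (as root; this lasts until the next reboot):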

setenforce 0

Step 3: Create a `docker-compose.yml` file

Wrap up all the configuration of the server’s build and launch in a docker-compose.yml file so we can build and launch everything in a single command.

You can assume that the C++ library is built manually, i.e. that the artifact is available when you run the Dockerfile of step 2.
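
As a starting point, a minimal compose file could be generated as in the sketch below. Everything specific in it is an assumption: the service name server, the Dockerfile name soduco-server.dockerfile, the host paths ./data/directories and ./data/annotations, and the function name write_compose.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal docker-compose file covering
# build, port publishing and volumes to the path given as argument.
write_compose() {
    cat > "$1" <<'EOF'
version: "3.8"
services:
  server:
    build:
      context: .
      dockerfile: soduco-server.dockerfile   # hypothetical file name
    ports:
      - "8000:8000"
    volumes:
      - ./data/directories:/data/directories
      - ./data/annotations:/data/annotations
EOF
}
```

After write_compose docker-compose.yml, running docker-compose up --build in the same directory builds the image and starts the server.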

Step 4: Merge steps 1 and 2 into a multi-stage build

Instead of assuming that the C++ library is built manually, use the build image to construct a build stage, and add a runtime stage based on the Python image to produce the final image.
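
The overall shape of such a multi-stage Dockerfile might resemble the skeleton generated below. It is a sketch under several assumptions, not a complete solution: it reuses a locally built soduco/back_builder image as the build stage, takes python:3.7-slim as the runtime base, expects build.sh at the root of the build context, and unpacks the archive into a hypothetical /out directory. The function name write_multistage_dockerfile is hypothetical too, and the runtime stage is deliberately left incomplete.

```shell
#!/bin/sh
# Hypothetical helper: write a two-stage Dockerfile skeleton to the
# path given as argument.
write_multistage_dockerfile() {
    cat > "$1" <<'EOF'
# Build stage: toolchain from step 1 plus the sources.
FROM soduco/back_builder AS builder
WORKDIR /app
COPY . /app
# Build, then unpack the archive without its top-level directory.
RUN sh build.sh \
 && mkdir /out \
 && tar -xzf build/soduco-py37-*.tar.gz -C /out --strip-components=1

# Runtime stage: only the built files are copied from the build stage.
FROM python:3.7-slim
COPY --from=builder /out/back /app/back
COPY --from=builder /out/lib /app/lib
# Runtime packages, Python dependencies, server files and the final
# command still have to be added here, as in step 2.
EOF
}
```

The point of the second FROM is that compilers, headers and the conan cache stay in the build stage, so they do not count towards the size of the final image.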

Update the final docker-compose.yml file, if needed (the reference to the Dockerfile may be different).

Grading grid

This grid is subject to change.

Description	Points
Step 1: correct Dockerfile	4
Step 1: build OK (image can compile the C++ code)	4
Step 2: correct Dockerfile	3
Step 2: Use `requirements.txt` to store Python dependencies	1
Step 2: server is launched and works	2
Steps 1&2: Don’t keep cached data from package manager (Python)	1
Steps 1&2: Don’t keep cached data from package manager (system)	1
Steps 1&2: Good build cache management, minimization of the number of layers (using `requirements.txt` among others)	1
Step 3: docker-compose file with build, ports and volume management	2
Step 4: working solution	1
TOTAL	20

Bonuses and penalties	Points
Step 1: image size > 1.5 GB	-1
Step 2: image size > 500 MB	-1
Step 2: image size < 200 MB	+1
Step 2: image size < 100 MB	+1
Step 3: container resource capping (CPU, MEM)	+1
TOTAL	-2/+3

For your information, here are the image sizes we obtain with our solution:

builder: 1.44 GB
deploy: 357 MB

Submission

You MUST submit your project using Moodle.

Your submission MUST be a compressed archive (.tar.gz) containing:

a Dockerfile for step 1
a Dockerfile for step 2
a docker-compose.yml file for step 3
a Dockerfile and a docker-compose.yml file for step 4
all the files required to build your Docker images

Please do not include the PDF file of test data in your archive!

Good luck.