Dockerization project

Joseph Chazalon, Clément Demoulins {firstname.lastname@lrde.epita.fr}

March 2022

Assignment

For this project you have to deploy a very recent OCR engine and expose its service as a REST API. You will also have to set up a task queue to manage several workers, and cope with long-running tasks. We will reach Docker’s limits here, as we will run all our workers on a single machine, whereas at this point it would make sense to scale out, i.e. run workers on different machines. Going further would require Docker Swarm or Kubernetes skills; that is the job of DevOps engineers, whom you should be able to talk to at this point.

Please enjoy this great component diagram, which illustrates our architecture design: [Figure: Simplified architecture]

We split the project into 4 stages, and you must complete them all to get the maximum grade. Each stage builds on the previous one, progressively increasing the difficulty, so start with stage 1.

What you will have to do:

For each stage, you must produce the appropriate Dockerfiles and docker-compose.yaml files, as instructed.
The code we give you is a bit messy, but you should not have to change it. You will have to read it, though.
You will also have to write some minimal Python code at the beginning to help you understand the basics of Flask and Gunicorn.

Stage 1: Mock image processing service in development mode

Using the files provided in the resources/stage1 folder, create a mock web service (Flask application) which:

accepts images on a POST /imgshape route
returns the shape of the image as a JSON payload (synchronous call).

You will need to:

Complete the code of the resources/stage1/OCR_route.py file.
Create a requirements.txt file.
Write a Dockerfile file.
Write a docker-compose.yaml file.

Deliverables

solutions/stage1/Dockerfile: the description of the steps required to build the image containing your application
solutions/stage1/docker-compose.yaml: a file which enables anyone to run your service using docker-compose up
solutions/stage1/sources/*: all the files needed to build and run a container with your application

Acceptance conditions

When running, the server should be listening on port 5000 on the host machine.
The following test cases should produce the same output:

Connectivity check

$ curl   --url http://localhost:5000/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>

Correct usage

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:5000/imgshape -T testimage.jpg
{
  "content": {
    "depth": 3, 
    "height": 500, 
    "width": 500
  }
}

Wrong HTTP method

$ curl -X GET  --url http://localhost:5000/imgshape 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>

Submit bad file

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:5000/imgshape -T requirements.txt  # any buggy file
{
  "error": "Cannot open image."
}
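
The curl calls above can also be reproduced from Python, which is handy for scripting the checks. The following is a minimal sketch using only the standard library; the helper names `build_imgshape_request` and `post_image` are illustrative and not part of the provided resources.

```python
import json
import urllib.request


def build_imgshape_request(image_bytes, base_url="http://localhost:5000"):
    """Build (but do not send) the POST request used in the test cases."""
    return urllib.request.Request(
        url=base_url + "/imgshape",
        data=image_bytes,  # raw file content as the request body, like curl -T
        method="POST",
        headers={"Content-Type": "image/jpeg"},
    )


def post_image(path, base_url="http://localhost:5000"):
    """Send the file at `path` to the service and decode the JSON answer."""
    with open(path, "rb") as image_file:
        request = build_imgshape_request(image_file.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `post_image("testimage.jpg")` against a running container should return the same payload as the "Correct usage" case. Note that `urlopen` raises `urllib.error.HTTPError` whenever the server answers with an error status code.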

Hints

Select an appropriate base image from https://hub.docker.com/_/python/ (we obtained a 464 MB image for this stage without any particular effort).
Identify the 2 required dependencies and create a requirements.txt.
If you use OpenCV to open the image, 1) make sure you use a “headless” version, and 2) you may need to install libglib2.0-0 on Debian-based images.
Check the imports of the OCR_route.py file, they contain all the functions we used.
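
Since requirements.txt is also graded on version pinning, a quick self-check may help. The sketch below flags every requirement line that lacks an exact `==` version. It is a simplified check, not a full parser of the pip requirements format, and the helper name `unpinned_requirements` is hypothetical.

```python
import re

# A requirement counts as pinned when it uses the exact-version operator "==".
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[A-Za-z0-9.*+!_-]+")


def unpinned_requirements(text):
    """Return the requirement lines of `text` that lack an exact version."""
    result = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            result.append(line)
    return result
```

For instance, `unpinned_requirements(open("requirements.txt").read())` returns an empty list when every dependency is pinned. It does not cover the base image tag or the system packages installed in the Dockerfile, which have to be checked by hand.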

Grading

1 point(s): server code has correct image saving
1 point(s): server code has correct image opening
1 point(s): server code computes result correctly
1 point(s): server code returns error when the file cannot be processed
1 point(s): requirements.txt is present and correct
1 point(s): correct base image in Dockerfile
1 point(s): correct version pinning in requirements.txt and Dockerfile image and packages
1 point(s): correct base command in Dockerfile or docker-compose.yaml
1 point(s): correct minimal docker-compose.yaml which runs a working server, correctly listening on port 5000 on the host, when running docker-compose up

Total for stage 1: 9 points

Stage 2: Mock image processing service with production-ready server

Flask's built-in development server is not ready for production, i.e. it cannot efficiently and safely handle a large number of connections. We will add a production-ready Python WSGI HTTP server, Gunicorn, to our image.

This server will have exactly the same features as the one from stage 1, but will listen on port 8000 (more common for Python servers).
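
To make the term "WSGI" concrete: a WSGI server such as Gunicorn only needs a callable with the signature shown below, and a Flask application object is itself such a callable. This is a minimal standard-library sketch, unrelated to the provided code; `call_app` is a hypothetical helper that invokes the application once without any network.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A bare WSGI application: the kind of callable a WSGI server invokes."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(application, path="/"):
    """Invoke a WSGI application once, without any network or server."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fill in the remaining required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body
```

Gunicorn is told which callable to serve with a `module:variable` argument, so the same mechanism applies to the Flask object defined in your route file.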

You will need to:

Update your dependencies to install Gunicorn.
Update your Dockerfile and/or docker-compose.yaml to update the launch command.

Deliverables

Same as stage 1, but organized under a solutions/stage2/ directory.

Acceptance conditions

Same as stage 1, but with connection to port 8000 instead of 5000.

Hints

The documentation on the homepage of Gunicorn should be sufficient.

Grading

1 point(s): correct installation of Gunicorn
1 point(s): correct Dockerfile and docker-compose.yaml which runs a working server, correctly listening on port 8000 on the host, when running docker-compose up

Total for stage 2: 2 points
Total so far: 11 points

Stage 3: Containerized OCR with synchronous Gunicorn serving

This is a tricky part as the OCR we want to deploy requires a super-heavy image. However, it is pretty easy to install: you can install it directly from the GitHub repo using pip. Warning: if you plan to test it on your computer, make sure you have at least 10 GB of free space and at least 4 GB of free RAM.

You will need to:

Include the pero_ocr_driver.py file in your project.
Update your Dockerfile to install the Python package at https://github.com/jchazalon/pero-ocr/archive/refs/heads/master.zip and its dependencies. pip should do it automatically.
Update your OCR_route.py file to call the OCR using the code provided below.
Update your Dockerfile to install the content of https://www.lrde.epita.fr/~jchazalo/SHARE/pero_eu_cz_print_newspapers_2020-10-09.tar.gz under some well-defined path in your image.
Export the appropriate environment variable (check the code below) to point to the location of the files contained in the previous archive:
- ParseNet.pb
- checkpoint_350000.pth
- config.ini
- ocr_engine.json
Expose a new /ocr route on the web server which will accept images and return their transcription.
Optionally, cache the content of https://download.pytorch.org/models/vgg16-397923af.pth to /root/.cache/torch/hub/checkpoints/vgg16-397923af.pth in your image, or any appropriate place if you set up users, to avoid re-downloading VGG16 weights each time you run your container.

Code to launch the OCR

import os
import time

from pero_ocr_driver import PERO_driver

# TODO reuse previous code from the /imgshape/ route to read the image content
# `img` should be a valid numpy array representing an image in what follows

# Init the OCR engine if needed
start_time = time.time()
ocr_engine = PERO_driver(os.environ['PERO_CONFIG_DIR'])
elapsed_time = (time.time() - start_time) * 1000
print("init 'pero ocr engine' performed in %.1f ms." % elapsed_time)

# Perform the actual computation
ocr_results = ocr_engine.detect_and_recognize(img)
ocr_results = "\n".join([textline.transcription for textline in ocr_results])
print(ocr_results)

# TODO return result as json payload
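
A missing or misplaced parameter file usually surfaces as an obscure error deep inside the engine, so it can help to validate PERO_CONFIG_DIR before initializing the driver. The sketch below assumes that the four files listed earlier sit directly in that directory; `missing_config_files` and `check_pero_config` are hypothetical helpers, not part of pero_ocr_driver.py.

```python
import os
from pathlib import Path

# File names taken from the list of archive contents given above.
EXPECTED_FILES = (
    "ParseNet.pb",
    "checkpoint_350000.pth",
    "config.ini",
    "ocr_engine.json",
)


def missing_config_files(config_dir):
    """Return the expected files that are absent from `config_dir`."""
    root = Path(config_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def check_pero_config(environ=os.environ):
    """Raise a clear error if PERO_CONFIG_DIR is unset or incomplete."""
    config_dir = environ.get("PERO_CONFIG_DIR")
    if not config_dir:
        raise RuntimeError("PERO_CONFIG_DIR is not set")
    missing = missing_config_files(config_dir)
    if missing:
        raise RuntimeError(f"{config_dir} is missing: {', '.join(missing)}")
    return config_dir
```

Running `check_pero_config()` inside the container, for example in an interactive Python session, is a cheap way to confirm that both the archive extraction and the environment variable are correct.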

Deliverables

Same as stage 1, but organized under a solutions/stage3/ directory.

Acceptance conditions

When running, the server should be listening on port 8000 on the host machine.
The following test cases should produce the same output:

Check running

$ curl   --url http://localhost:8000/check
Hello

Check OCR service

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:8000/ocr -T text2.jpg | jq .content
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22051  100   340  100 21711     57   3657  0:00:05  0:00:05 --:--:--    71
"Goujet, Cordeliers, 7.\nGoupil, spécialité pour la fonderie\nde caractères, rabotte toutes\nsortes de métaux, fait les prèces\ndétachées, Monsieur-le-Prince,\n20."

Hints

No need to change the command to run the server in your container.
You may want to start by simply installing the OCR in a base Python container, and check its installation with a simple import of the driver file, or even a simple, direct, function call before exposing it with the web server.

Grading

1 point(s): correct update of OCR_route.py
1 point(s): correct installation of PERO OCR Python code
1 point(s): correct installation of PERO OCR parameter files (incl. env. variable config.)
1 point(s): pre-deployment of the VGG16 weights

Total for stage 3: 5 points
Total so far: 16 points

Stage 4: Decoupling OCR and web services, then enable task queuing

Here we are getting serious. This is actually the step which motivated us to propose this project. The trouble with the previous stage is that requests to the web server trigger a long-running, CPU-bound computation on the server, possibly freezing the web server, or even leading to HTTP timeouts.

We will use Celery to decouple the web frontend from the OCR worker(s). In this project we will use only one OCR worker, but the idea is that this can scale out (be distributed to multiple computers) quite easily once this is done. This will require setting up two images: 1 for the web frontend (super light) and 1 for the OCR worker (super heavy).

This works as follows:

Celery lets us send tasks to workers (a form of RPC).
Messages are distributed using a broker: we will use RabbitMQ here.
Each component must know how to connect to the broker: we will use environment variables here.
The OCR image will contain a celery worker.
The web frontend image will simply send images to process to the task queue, and recover the results when needed.

The new web frontend will contain the following routes:

- GET /check: simply answers “Hello” to check for a working server
- POST /ocr: enqueues a task with the image submitted and returns the task id
- GET /results/<task_id>: returns the status of the task with the given id, and the result if it is available.

You will need to:

Split the code into two components: “web” and “ocr”.
Use the new code from resources/stage4/web/OCR_route.py for the web component.
Use the new code from resources/stage4/ocr/*.py for the OCR component.
Write a Dockerfile for each component.
Write a global docker-compose.yaml.
Find which environment variables need to be defined (there are 3 of them), and where to define them.

Deliverables

Same as stage 1, but organized under a solutions/stage4/ directory.

Acceptance conditions

When running, the server should be listening on port 8000 on the host machine.
The following test cases should produce the same output:

Check running

$ curl   --url http://localhost:8000/check
Hello

Check OCR service

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:8000/ocr -T text2.jpg 
{
  "submitted": "8404359a-6b2f-4ae1-a479-66d1fc09819b"
}
$ sleep 10
$ curl  --url http://localhost:8000/results/8404359a-6b2f-4ae1-a479-66d1fc09819b 
{
  "content": {
    "content": "Goujet, Cordeliers, 7.\nGoupil, sp\u00e9cialit\u00e9 pour la fonderie\nde caract\u00e8res, rabotte toutes\nsortes de m\u00e9taux, fait les pr\u00e8ces\nd\u00e9tach\u00e9es, Monsieur-le-Prince,\n20."
  }
}
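
The fixed `sleep 10` in the example above can be replaced by polling the results route. The sketch below assumes that the JSON payload contains a "content" key only once the task has finished, as in the output shown above. The shape of the answer for a pending task is not specified here, so adapt the test to what the provided web code actually returns. `fetch_json` and `wait_for_result` are illustrative names.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """GET `url` and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_result(task_id, base_url="http://localhost:8000",
                    fetch=fetch_json, interval=2.0, max_attempts=30,
                    sleep=time.sleep):
    """Poll /results/<task_id> until a "content" key shows up, or give up."""
    url = f"{base_url}/results/{task_id}"
    for _ in range(max_attempts):
        payload = fetch(url)
        if "content" in payload:  # assumption: only finished tasks carry it
            return payload["content"]
        sleep(interval)
    raise TimeoutError(
        f"no result for task {task_id} after {max_attempts} attempts"
    )
```

The `fetch` and `sleep` parameters exist only so that the loop can be exercised without a running server. With the defaults, `wait_for_result("8404359a-6b2f-4ae1-a479-66d1fc09819b")` queries the service once every two seconds.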

Hints

Start by getting the simple mock server to work with the Celery setup.

Run the OCR using the following command:

celery --app=worker.celery worker --concurrency=1 -P threads --loglevel=INFO

The RabbitMQ image comes with a default user and password when none is configured: guest:guest, so you can use the following URI for your Celery broker: amqp://guest:guest@rabbitmq:5672 (provided you named the RabbitMQ service rabbitmq).
You can use the default rpc:// backend for Celery results.
Your docker-compose.yaml will contain 3 services: ocr, web and rabbitmq.
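
To see what the broker URI above encodes, it can be split with the standard library. The host part is the name of the Compose service, which other containers on the same Compose network can resolve by name. This is only an illustrative sketch; `describe_broker` is a hypothetical helper and is not needed in your solution.

```python
from urllib.parse import urlsplit


def describe_broker(uri):
    """Split a broker URI into the pieces each component must agree on."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # protocol spoken with the broker
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,      # Compose service name of the broker
        "port": parts.port,
    }
```

If you rename the broker service in docker-compose.yaml, the host part of the URI has to change accordingly in every component that connects to it.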

Grading

1 point(s): correct Dockerfile for web component
1 point(s): correct Dockerfile for OCR component
1 point(s): correct configuration of the components (env. variables)
1 point(s): correct docker-compose.yaml which runs a working server, correctly listening on port 8000 on the host, when running docker-compose up

Total for stage 4: 4 points
Total so far: 20 points

BONUS Stage 5 (docker-compose) — Almost-ready product (missing auth)

If you want to make money.

Gunicorn front (+ PostgreSQL to keep track of submissions)
Celery master + RabbitMQ
Celery worker + OCR
supervisord to perform regular clean-up of old submissions
deploy Flower for monitoring
add nginx or Caddy to have a safe reverse proxy and TLS endpoint
scale out using Docker Swarm on several machines

Final Submission

You MUST submit your project using Moodle.

Your submission MUST be a compressed archive (.tar.gz) containing the following files:

solutions/stage1/Dockerfile
solutions/stage1/docker-compose.yaml
solutions/stage1/sources/OCR_routes.py
solutions/stage1/sources/requirements.txt
solutions/stage2/Dockerfile
solutions/stage2/docker-compose.yaml
solutions/stage2/sources/OCR_routes.py
solutions/stage2/sources/requirements.txt
solutions/stage3/Dockerfile
solutions/stage3/docker-compose.yaml
solutions/stage3/sources/OCR_routes.py
solutions/stage3/sources/pero_ocr_driver.py
solutions/stage3/sources/requirements.txt
solutions/stage4/Dockerfile-ocr
solutions/stage4/Dockerfile-web
solutions/stage4/docker-compose.yaml
solutions/stage4/sources-ocr/pero_ocr_driver.py
solutions/stage4/sources-ocr/requirements.txt
solutions/stage4/sources-ocr/celeryconfig.py
solutions/stage4/sources-ocr/worker.py
solutions/stage4/sources-web/requirements.txt
solutions/stage4/sources-web/OCR_routes.py

Do not put the test resources (images) in the archive!
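
Before uploading, the archive can be checked against the list above. The sketch below reports which required paths are absent from a .tar.gz file. It assumes the archive root is the directory that contains solutions/, and `missing_from_archive` is a hypothetical helper name.

```python
import tarfile


def missing_from_archive(archive_path, required):
    """Return the entries of `required` that the .tar.gz archive lacks."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Normalize a possible leading "./" added by some tar invocations.
        present = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return [path for path in required if path not in present]
```

Pass the full list of expected paths as `required`; an empty result means that nothing is missing. The check does not detect extra files such as test images.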

Good luck.