Dockerization project

Joseph Chazalon, Clément Demoulins {firstname.lastname@lrde.epita.fr}

March 2022

Assignment

For this project you have to deploy a very recent OCR engine and expose it as a REST API. You will also have to set up a task queue to manage several workers and cope with long-running tasks. We will reach Docker’s limits here: all our workers will run on a single machine, whereas at this point it would make sense to scale out, i.e. run workers on different machines. Going further would require Docker Swarm or Kubernetes skills; that is the work of DevOps engineers, whom you should be able to talk to at this point.

Please enjoy this great component diagram, which illustrates our architecture design.

[Figure: Simplified architecture]

We split the project into 4 stages, and you must complete them all to get the maximal grade. The stages build on one another, progressively increasing in difficulty, so start with stage 1.

What you will have to do:

Stage 1: Mock image processing service in development mode

Target deployment for stage 1

Using the files provided in the resources/stage1 folder, create a mock web service (Flask application) which:

You will need to:

Deliverables

Acceptance conditions

Connectivity check

$ curl   --url http://localhost:5000/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>

Correct usage

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:5000/imgshape -T testimage.jpg
{
  "content": {
    "depth": 3, 
    "height": 500, 
    "width": 500
  }
}

Wrong HTTP method

$ curl -X GET  --url http://localhost:5000/imgshape 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>

Submit bad file

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:5000/imgshape -T requirements.txt  # any buggy file
{
  "error": "Cannot open image."
}
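
The checks above pin down the response bodies of /imgshape. As a minimal sketch, the function below isolates that logic from Flask and from the image library, so it runs with the standard library alone. The names `imgshape_response` and `decode` are hypothetical: `decode` stands in for whatever image decoder the provided resources use. The 400 status code is also an assumption, since the checks only show the response body.

```python
from typing import Callable, Sequence, Tuple


def imgshape_response(
    raw: bytes,
    decode: Callable[[bytes], Sequence[int]],
) -> Tuple[dict, int]:
    """Build the JSON body and HTTP status for a POST /imgshape request.

    `decode` is a hypothetical helper: it returns the image shape as
    (height, width, depth), or fails if the bytes are not an image.
    """
    try:
        height, width, depth = decode(raw)
    except Exception:
        # Status 400 is an assumption; the handout only specifies the body.
        return {"error": "Cannot open image."}, 400
    return {"content": {"depth": depth, "height": height, "width": width}}, 200
```

In a Flask view, the pair could be returned as `jsonify(body), status`, with `raw` taken from `request.get_data()`.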

Hints

Grading

Total for stage 1: 9 points

Stage 2: Mock image processing service with production-ready server

Target deployment for stage 2

Flask's built-in development server is not ready for production: it cannot efficiently and safely handle a large number of connections. We will add a production-ready Python WSGI HTTP server, Gunicorn, to our image.

This server will have exactly the same features as the one from stage 1, but will listen on port 8000 (more common for Python servers).
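
Gunicorn serves any WSGI callable, and a Flask application object is one. As a rough illustration of that interface (not the assignment's application), here is a hand-written WSGI callable exercised in-process, without a server. The names `app` and `call_app` and the response text are placeholders.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Smallest possible WSGI application: answers every request with 200."""
    body = b"Hello\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(wsgi_app, path="/"):
    """Invoke a WSGI app in-process, the way a WSGI server would."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = wsgi_app(environ, start_response)
    return captured["status"], b"".join(chunks)
```

If this lived in a file named `wsgi_demo.py` (a hypothetical name), `gunicorn --bind 0.0.0.0:8000 wsgi_demo:app` would serve it on port 8000. The same `module:variable` form applies to a Flask application object.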

You will need to:

Deliverables

Same as stage 1, but organized under a solutions/stage2/ directory.

Acceptance conditions

Same as stage 1, but with connection to port 8000 instead of 5000.

Hints

Grading

Total for stage 2: 2 points
Total so far: 11 points

Stage 3: Containerized OCR with synchronous Gunicorn serving

Target deployment for stage 3

This is a tricky part as the OCR we want to deploy requires a super-heavy image. However, it is pretty easy to install: you can install it directly from the GitHub repo using pip. Warning: if you plan to test it on your computer, make sure you have at least 10 GB of free space and at least 4 GB of free RAM.

You will need to:

Code to launch the OCR

import os
import time

from pero_ocr_driver import PERO_driver

# TODO reuse previous code from the /imgshape/ route to read the image content
# `img` should be a valid numpy array representing an image in what follows

# Init the OCR engine if needed
start_time = time.time()
ocr_engine = PERO_driver(os.environ['PERO_CONFIG_DIR'])
elapsed_time = (time.time() - start_time) * 1000  # milliseconds
print("init 'pero ocr engine' performed in %.1f ms." % elapsed_time)

# Perform the actual computation
ocr_results = ocr_engine.detect_and_recognize(img)
ocr_results = "\n".join([textline.transcription for textline in ocr_results])
print(ocr_results)

# TODO return result as json payload
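
The comment "Init the OCR engine if needed" suggests building the engine once and reusing it across requests, since initialization is the slow part. A minimal sketch of that lazy-initialization pattern follows. `LazyEngine` is a hypothetical helper, and the demo uses a plain `object()` as a stand-in so it runs without the OCR installed; in the real service the factory would presumably be `lambda: PERO_driver(os.environ['PERO_CONFIG_DIR'])`.

```python
import threading
from typing import Any, Callable, Optional


class LazyEngine:
    """Create an expensive object on first use, then keep returning it."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._engine: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        if self._engine is None:
            with self._lock:
                # Re-check inside the lock: another thread may have
                # finished the initialization while we were waiting.
                if self._engine is None:
                    self._engine = self._factory()
        return self._engine


# Stand-in factory for the demo; it records how often it is called.
build_calls = []
lazy_engine = LazyEngine(lambda: build_calls.append("built") or object())

first = lazy_engine.get()
second = lazy_engine.get()
print(first is second, len(build_calls))
```

Each Gunicorn worker process would hold its own instance, so memory use is likely to grow with the number of workers.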

Deliverables

Same as stage 1, but organized under a solutions/stage3/ directory.

Acceptance conditions

Check running

$ curl   --url http://localhost:8000/check
Hello

Check OCR service

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:8000/ocr -T text2.jpg | jq .content
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22051  100   340  100 21711     57   3657  0:00:05  0:00:05 --:--:--    71
"Goujet, Cordeliers, 7.\nGoupil, spécialité pour la fonderie\nde caractères, rabotte toutes\nsortes de métaux, fait les prèces\ndétachées, Monsieur-le-Prince,\n20."

Hints

Grading

Total for stage 3: 5 points
Total so far: 16 points

Stage 4: Decoupling OCR and web services, then enable task queuing

Target deployment for stage 4

Here we are getting serious. This is actually the step which motivated us to propose this project. The trouble with the previous stage is that each request to the web server triggers a long-running, CPU-bound computation on the server, which can freeze the web server or even lead to HTTP timeouts.

We will use Celery to decouple the web frontend from the OCR worker(s). In this project we will use only one OCR worker, but the idea is that this can scale out (be distributed to multiple computers) quite easily once this is done. This will require setting up two images: 1 for the web frontend (super light) and 1 for the OCR worker (super heavy).

This works as follows:

The new web frontend will contain the following routes:

- GET /check: simply answers “Hello” to check for a working server
- POST /ocr: enqueues a task with the submitted image and returns the task id
- GET /results/<task_id>: returns the status of the task with the given id, and the result if it is available
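
The submit-then-poll contract behind POST /ocr and GET /results/<task_id> can be sketched without Celery. The toy class below swaps in a standard-library thread pool for the real broker and workers, purely to show the shape of the interaction. The class name, method names and dictionary layout are all assumptions, not Celery's API.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class InProcessQueue:
    """Toy stand-in for a task queue: submit now, fetch the result later."""

    def __init__(self, workers: int = 1) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._tasks: Dict[str, Future] = {}

    def submit(self, func: Callable, *args) -> str:
        """Enqueue work and return immediately with a task id."""
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id: str) -> dict:
        """Report the task state, plus the result once it is available."""
        future = self._tasks.get(task_id)
        if future is None:
            return {"state": "UNKNOWN"}
        if not future.done():
            return {"state": "PENDING"}
        if future.exception() is not None:
            return {"state": "FAILURE", "error": str(future.exception())}
        return {"state": "SUCCESS", "content": future.result()}

    def shutdown(self) -> None:
        """Wait for queued tasks to finish and release the threads."""
        self._pool.shutdown(wait=True)
```

With Celery, the thread pool is replaced by a broker plus separate worker processes, which is what lets the light frontend and the heavy OCR worker live in different images.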

You will need to:

Deliverables

Same as stage 1, but organized under a solutions/stage4/ directory.

Acceptance conditions

Check running

$ curl   --url http://localhost:8000/check
Hello

Check OCR service

$ curl -X POST --header "Content-type: image/jpeg" --url http://localhost:8000/ocr -T text2.jpg 
{
  "submitted": "8404359a-6b2f-4ae1-a479-66d1fc09819b"
}
$ sleep 10
$ curl  --url http://localhost:8000/results/8404359a-6b2f-4ae1-a479-66d1fc09819b 
{
  "content": {
    "content": "Goujet, Cordeliers, 7.\nGoupil, sp\u00e9cialit\u00e9 pour la fonderie\nde caract\u00e8res, rabotte toutes\nsortes de m\u00e9taux, fait les pr\u00e8ces\nd\u00e9tach\u00e9es, Monsieur-le-Prince,\n20."
  }
}
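
The fixed `sleep 10` above works for a demo, but a client would usually poll until the result appears. A hedged sketch follows: `wait_for_result`, `http_fetch` and `fetch` are hypothetical names. The code assumes that a finished task answers with a top-level "content" key (as in the output above) while an unfinished one does not. The handout does not show the pending response, so adapt the check to your own API.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def http_fetch(url: str) -> dict:
    """GET a URL and decode the JSON body (standard library only)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_result(
    url: str,
    fetch: Callable[[str], dict] = http_fetch,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> Optional[dict]:
    """Poll `url` until the reply has a "content" key or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        reply = fetch(url)
        if "content" in reply:  # assumption: only finished tasks have it
            return reply
        if time.monotonic() + interval > deadline:
            return None  # gave up: the task did not finish in time
        time.sleep(interval)
```

A call could look like `wait_for_result("http://localhost:8000/results/" + task_id)`, where `task_id` is the value returned by POST /ocr.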

Hints

Grading

Total for stage 4: 4 points
Total so far: 20 points

BONUS Stage 5 (docker-compose) — Almost-ready product (missing auth)

If you want to make money.

  1. Gunicorn front (+ PostgreSQL to keep track of submissions)
  2. Celery master + RabbitMQ
  3. Celery worker + OCR
  4. supervisord to perform regular clean-up of old submissions
  5. deploy Flower for monitoring
  6. add nginx or Caddy to have a safe reverse proxy and TLS endpoint
  7. scale out using Docker Swarm on several machines

Final Submission

You MUST submit your project using Moodle.

Your submission MUST be a compressed archive (.tar.gz) containing the following files:

Do not put the test resources (images) in the archive!

Good luck.