EPITA 2021 MLRF practice_03-01_ORB_AR v2021-06-02_172455 by Joseph CHAZALON
We will demonstrate a simple technique, light enough to run on an old smartphone, which detects an instance of a known document in a video frame and overlays some dynamic content over this document in the frame.
We will use an excerpt of a dataset we created for a funny little app a few years ago, which allows children to point at a songbook page and play the associated song using a tablet. This is illustrated below.
This is much like marker-based Augmented Reality (AR), where the marker is a complex image.
This approach requires preparing a document model prior to matching documents within frames.
We will proceed in 5 steps:
1. load the model and frame images and convert them to grayscale;
2. detect ORB keypoints and compute their descriptors;
3. match the frame descriptors against the model descriptors;
4. estimate the model → frame homography with RANSAC;
5. project some content over the frame (and, as a bonus, dewarp the detected document).
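To give a bird's-eye view before we dive in, here is a minimal end-to-end sketch of the pipeline using plain OpenCV calls; the paths are this notebook's resources, and all parameter values are indicative only (the ratio test and other refinements come later in this notebook).
import cv2
import numpy as np

# 1. Prepare the model once, offline: ORB keypoints + descriptors.
model = cv2.imread("./resources/model.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB.create(nfeatures=2000)
model_kpts, model_desc = orb.detectAndCompute(model, None)

# 2. For each frame: detect, describe, and match against the model.
frame = cv2.imread("./resources/frame_0010.jpeg", cv2.IMREAD_GRAYSCALE)
frame_kpts, frame_desc = orb.detectAndCompute(frame, None)
matcher = cv2.BFMatcher_create(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(frame_desc, model_desc)  # query = frame, train = model

# 3. Estimate the model -> frame homography from the matched coordinates, with RANSAC.
pts_model = np.float32([model_kpts[m.trainIdx].pt for m in matches])
pts_frame = np.float32([frame_kpts[m.queryIdx].pt for m in matches])
H, inliers = cv2.findHomography(pts_model, pts_frame, cv2.RANSAC, 3.0)

# 4. Overlay content: warp any model-sized image into the frame using H.
overlay = cv2.warpPerspective(model, H, (frame.shape[1], frame.shape[0]))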
The resources for this session are packaged directly within this notebook's archive: you can access them under the resources/ folder:
- model.png: the model image we will use;
- frame_0010.jpeg: a frame image extracted from a video.
# deactivate buggy jupyter completion
%config Completer.use_jedi = False
import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
import os
cv2.__version__
'4.0.0'
PATH_TO_RESOURCES = "./resources"
model_img = cv2.imread(os.path.join(PATH_TO_RESOURCES, "model.png"))
model_img.shape, model_img.dtype
((1654, 2340, 3), dtype('uint8'))
# to remain sane
def bgr2rgb(img):
return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(bgr2rgb(model_img), cmap='gray')
<matplotlib.image.AxesImage at 0x7f4a68fc0d30>
We need to convert it to grayscale to extract ORB keypoints from it.
model_img_gray = cv2.cvtColor(model_img, cv2.COLOR_BGR2GRAY)
plt.imshow(model_img_gray, cmap='gray')
<matplotlib.image.AxesImage at 0x7f4a68ee15f8>
frame_img = cv2.imread(os.path.join(PATH_TO_RESOURCES, "frame_0010.jpeg"))
frame_img.shape, frame_img.dtype
((1080, 1920, 3), dtype('uint8'))
plt.imshow(bgr2rgb(frame_img))
<matplotlib.image.AxesImage at 0x7f4a68e0cd68>
We also need to convert it to grayscale, for the same reason.
frame_img_gray = cv2.cvtColor(frame_img, cv2.COLOR_BGR2GRAY)
plt.imshow(frame_img_gray, cmap='gray')
<matplotlib.image.AxesImage at 0x7f4a68d74470>
First, we will detect and display some keypoints using the ORB method.
# Run me!
cv2.ORB.create?
# TODO create the ORB detector and descriptor
# orb = cv2.ORB.create(...) # FIXME
# prof
orb = cv2.ORB.create(nfeatures=2000,
scaleFactor=1.2,
nlevels=10,
edgeThreshold=5,
firstLevel=0,
WTA_K=2,
scoreType=cv2.ORB_HARRIS_SCORE,
patchSize=15)
# TODO detect keypoints
# model_kpts = # FIXME
# len(model_kpts)
#prof
model_kpts = orb.detect(model_img_gray)
len(model_kpts)
2000
Expected result:
# because the function from OpenCV's python wrapper is buggy
def draw_keypoints(color_image, keypoints, color=(0,255,0)):
    '''
    Display keypoints in some color over an image.

    Parameters
    ----------
    color_image: ndarray, shape=(rows, cols, 3 channels)
        color image in BGR order
    keypoints: list of cv2.KeyPoint
        keypoints detected in the image
    color: tuple of uint8 (optional)
        color of the keypoints to draw, in BGR order
    '''
    if color_image.ndim != 3:
        raise ValueError(
            "draw_keypoints: parameter `color_image` must be a... (wait for it) color image!")
    draw = color_image.copy()
    for k in keypoints:
        # KeyPoint.angle is in degrees; convert it before computing the orientation segment
        angle = np.deg2rad(k.angle)
        pt_x, pt_y = k.pt
        pt_int = int(pt_x), int(pt_y)
        size = k.size
        # circle showing the keypoint scale
        cv2.circle(draw, pt_int, int(size), color)
        # segment showing the keypoint orientation
        pt2 = int(pt_x + np.cos(angle)*size), int(pt_y + np.sin(angle)*size)
        cv2.line(draw, pt_int, pt2, color, thickness=2)
    plt.imshow(bgr2rgb(draw))
# TODO draw the keypoints detected in the model image
# draw_keypoints(...) # FIXME
# prof
draw_keypoints(model_img, model_kpts)
# TODO compute the descriptors
# model_kpts, model_desc = ... # FIXME
# len(model_kpts), model_desc.shape
# prof
model_kpts, model_desc = orb.compute(model_img_gray, model_kpts)
len(model_kpts), model_desc.shape, model_desc.dtype
(2000, (2000, 32), dtype('uint8'))
TODO your answer here
Storing an ORB descriptor takes ... bytes (without indexing overhead).
PROF
Storing an ORB descriptor takes 32 bytes (without indexing overhead).
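As a quick check, the total memory footprint of the model descriptors can be read directly from the array:
# 2000 descriptors × 32 bytes each (uint8), excluding indexing overhead
model_desc.nbytes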
Expected result of draw_keypoints():
# TODO detect keypoints and compute descriptors for the frame
# frame_kpts, frame_descr = # FIXME
# len(frame_kpts), frame_descr.shape
# prof
frame_kpts, frame_descr = orb.detectAndCompute(frame_img_gray, mask=None)
len(frame_kpts), frame_descr.shape
(2000, (2000, 32))
# Run me!
draw_keypoints(frame_img, frame_kpts)
TODO your answer here
PROF
Keypoints are detected in textured areas. Uniform areas do not allow the extraction of any discriminant element.
A matcher object is used to compare two sets of descriptors.
The relevant OpenCV documentation is available at the DescriptorMatcher documentation page.
There are two matchers available in OpenCV: the brute-force (BF) matcher and the FLANN-based matcher.
In both cases, we need to specify the distance the matcher will use to compare descriptors. There are several built-in norms:
- NORM_L1 and NORM_L2: the usual norms for float descriptors (e.g. SIFT);
- NORM_HAMMING: the Hamming distance, to be used with binary descriptors such as ORB;
- NORM_HAMMING2: same as NORM_HAMMING, but in the calculation, each two bits of the input sequence will be added and treated as a single bit to be used in the same calculation as NORM_HAMMING (only useful if you set the WTA_K parameter of ORB to something else than 2).

The BF matcher has only one parameter besides the distance function: crossCheck. It allows performing a symmetry test, i.e. keeping only the descriptor pairs where each descriptor is the closest to the other one in its own set, or more formally:
$$
\{
(\hat{d_i},\hat{d_j}) \mid
\hat{d_j} = \underset{d_j \in D_2}{\mathrm{argmin}} \operatorname{dist}(\hat{d_i}, d_j)
\land
\hat{d_i} = \underset{d_i \in D_1}{\mathrm{argmin}} \operatorname{dist}(d_i, \hat{d_j})
\},
$$
otherwise (when crossCheck is disabled), we get the following set, $\forall d_i \in D_1$:
$$
\{
(d_i,\hat{d_j}) \mid
\hat{d_j} = \underset{d_j \in D_2}{\mathrm{argmin}} \operatorname{dist}(d_i, d_j)
\}.
$$
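To make the symmetry test concrete, here is a tiny NumPy-only sketch on toy binary descriptors (purely illustrative, not part of the exercise):
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.integers(0, 2, size=(5, 16), dtype=np.uint8)  # toy descriptor set 1
D2 = rng.integers(0, 2, size=(7, 16), dtype=np.uint8)  # toy descriptor set 2

# Hamming distance between every (d_i, d_j) pair
dist = (D1[:, None, :] != D2[None, :, :]).sum(axis=2)

nn_12 = dist.argmin(axis=1)  # nearest d_j for each d_i
nn_21 = dist.argmin(axis=0)  # nearest d_i for each d_j

# cross-check: keep only the mutual nearest neighbours
mutual_pairs = [(i, j) for i, j in enumerate(nn_12) if nn_21[j] == i]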
We recommend creating a BF matcher using cv2.BFMatcher_create(normType, crossCheck).
FLANN stands for Fast Library for Approximate Nearest Neighbors.
The FLANN-based matcher is much more complex than the BF one, as it can use multiple indexing strategies (which may or may not be compatible with your descriptor type!), each of which has, in turn, parameters to set.
This matcher may be faster than the brute-force one when matching against a large train collection.
Good but old documentation is available for the OpenCV 2.4 implementation.
OpenCV supports several indexing algorithms:
- Randomized KD-trees, with one parameter:
  - trees: the number of parallel kd-trees to use. Good values are in the range [1..16].
- Composite: the index created combines the randomized kd-trees and the hierarchical k-means tree.
- KD-Tree: the index is constructed using a single KD-tree.
- LSH (Locality Sensitive Hashing), with three parameters:
  - table_number: the number of hash tables to use (usually between 10 and 30);
  - key_size: the size of the hash key in bits (usually between 10 and 20);
  - multi_probe_level: the number of bits to shift to check for neighboring buckets (0 is regular LSH, 2 is recommended).

For ORB (binary) descriptors, the values used below are: table_number = 6, key_size = 12, multi_probe_level = 1.
This matcher also has search parameters (like whether to sort the results), but there is very little reason to change the default values.
To create a FLANN-based matcher, we recommend using the following technique:
# Create a dictionary for indexing parameters:
flann_index_params= dict(algorithm = 6, # LSH
table_number = 6, # LSH parameters
key_size = 12,
multi_probe_level = 1)
# Then create the matcher
matcher = cv2.FlannBasedMatcher(indexParams=flann_index_params)
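As mentioned above, a second dictionary of search parameters can also be passed; this is only a sketch (the checks value, i.e. the number of candidates examined per query, is illustrative, and the defaults are usually fine):
flann_search_params = dict(checks=50)  # illustrative value
matcher = cv2.FlannBasedMatcher(indexParams=flann_index_params,
                                searchParams=flann_search_params)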
Hint: keep in mind that your ORB descriptors will be binary.
# TODO
# matcher_BF = ...
# matcher_FLANN = ...
# prof
# BF
matcher_BF = cv2.BFMatcher_create(normType=cv2.NORM_HAMMING, crossCheck=True)
# FLANN
# Create a dictionary for indexing parameters:
flann_index_params = dict(algorithm=6, # LSH
table_number=6, # LSH parameters
key_size=12,
multi_probe_level=1)
# Then create the matcher
matcher_FLANN = cv2.FlannBasedMatcher(indexParams=flann_index_params)
While it is possible to directly call matcher.match(descriptors1, descriptors2),
we usually index descriptors before matching them.
This is useful in real conditions for the case we are working on: we have to match each frame against every possible model (there were several songs available), so this allows us to build the index over the model descriptors only once, and to match each frame against all the models in a single call.
This is performed using matcher.add(list_of_list_of_descriptors), which adds sets of descriptors for several training (or "model") images.
The index then retains, for each single descriptor, the train image it belongs to and its position within that image's descriptor set (see the sketch below).
We will therefore distinguish between the query descriptors (extracted from the frame we want to recognize) and the train descriptors (extracted from the model images and stored in the index).
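To illustrate why this is convenient, here is a purely illustrative sketch (not needed for the exercise): model_desc_a and model_desc_b are hypothetical ORB descriptor arrays for two different model pages. Indexing several models lets a single call tell which model a frame matches best, through DMatch.imgIdx:
# Hypothetical example: two songbook pages indexed at once.
multi_matcher = cv2.FlannBasedMatcher(indexParams=flann_index_params)
multi_matcher.add([model_desc_a, model_desc_b])   # one entry per model image

multi_matches = multi_matcher.match(frame_descr)  # query = descriptors of the current frame
votes = {}
for m in multi_matches:
    votes[m.imgIdx] = votes.get(m.imgIdx, 0) + 1  # imgIdx = index of the train (model) image
best_model_index = max(votes, key=votes.get)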
# TODO
# matcher_BF.add(...)
# matcher_FLANN.add(...)
# prof
matcher_BF.add([model_desc])
matcher_FLANN.add([model_desc])
We are now ready to match descriptors.
We suggest using a FLANN-based matcher, to be able to perform a ratio test.
Matching descriptors is performed using one of the following functions:
- matcher.match(query_descriptors): actually knnMatch with k=1.
- matcher.knnMatch(query_descriptors, k=...): performs a K-nearest neighbor search for each query descriptor using the index; use k > 1 (the ratio test below needs the 2 nearest neighbors). Note that k > 1 is not possible with the BF matcher when crossCheck is True!
- matcher.radiusMatch(query_descriptors, maxDistance=...): performs a radius nearest neighbor search for each query descriptor, i.e. it returns only the results within the specified radius.
# TODO compute the matches
# matches = ...
# len(matches)
# prof
matches = matcher_FLANN.match(frame_descr)
len(matches)
2000
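For reference, radiusMatch (listed above) can be used in a similar way; here is a small optional sketch using a plain Hamming BF matcher, where the threshold of 40 bits is an arbitrary illustration value:
bf_plain = cv2.BFMatcher_create(normType=cv2.NORM_HAMMING, crossCheck=False)
radius_matches = bf_plain.radiusMatch(frame_descr, model_desc, maxDistance=40.0)
# radius_matches[i] is the (possibly empty) list of model descriptors lying within
# Hamming distance 40 of frame descriptor i
sum(len(r) for r in radius_matches)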
Here is a simple way to display the matches using cv2.drawMatches().
We could keep only the closest matches, but we will keep this simple for now.
# Run me
def draw_matches(img1, kpts1, img2, kpts2, matches, color=(0,0,255), title=""):
    '''img1 and img2 are color images.'''
    # output canvas large enough for both images side by side
    img_matches = np.empty((max(img1.shape[0], img2.shape[0]),
                            img1.shape[1] + img2.shape[1],
                            3),
                           dtype=np.uint8)
    img_matches = cv2.drawMatches(img1, kpts1, img2, kpts2,
                                  matches,
                                  img_matches,
                                  matchColor=color,
                                  flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
    plt.figure(figsize=(12,4))
    plt.imshow(bgr2rgb(img_matches))
    plt.title(title + " - %d matches" % (len(matches),))
Expected result of draw_matches():
# TODO draw the first matches
# draw_matches(...)
# prof
draw_matches(model_img, model_kpts,
frame_img, frame_kpts,
matches,
color=(0,0,255),
title="Frame → Model")
Let us now use the BF matcher to ask for a cross check.
Expected result of draw_matches():
# TODO
# matches = ...
# draw_matches(...)
# prof
# bruteforce version with crosscheck: fewer matches!
matches = matcher_BF.match(frame_descr, model_desc)
draw_matches(model_img, model_kpts,
frame_img, frame_kpts,
matches,
color=(255,0,0),
title="Frame ⇔ Model")
Let's stop using the BF matcher now, and use the FLANN matcher for what remains.
Hint: matches will contain a list of pairs of matches, as opposed to the single matches of the previous steps.
# TODO
# matches = ...
# prof
matches = matcher_FLANN.knnMatch(frame_descr, k=2)
len(matches), len(matches[0])
(2000, 2)
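Optionally, we can peek at the first pair of neighbours returned by knnMatch; the DMatch attributes used here are described just below:
# Inspect the two nearest model descriptors found for the first frame descriptor
m1, m2 = matches[0]
print(m1.queryIdx, m1.trainIdx, m1.imgIdx, m1.distance)
print(m2.queryIdx, m2.trainIdx, m2.imgIdx, m2.distance)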
The result of the matches = matcher.match*(query_descriptors) line is a list of DMatch objects.
A DMatch object has the following attributes:
- DMatch.distance: distance between descriptors. The lower, the better it is.
- DMatch.trainIdx: index of the descriptor in the train descriptors.
- DMatch.queryIdx: index of the descriptor in the query descriptors.
- DMatch.imgIdx: index of the train image.
# TODO filter matches
# good_matches = ...
# len(good_matches)
# prof
RATIO_TEST_VALUE = 0.75
# Note: some OpenCV versions may return fewer than k matches for a query descriptor,
# so we only keep entries where both neighbors are present.
good_matches = [m[0] for m in matches
                if len(m) == 2 and m[0].distance < m[1].distance * RATIO_TEST_VALUE]
len(good_matches)
202
Expected result of draw_matches():
# TODO
# draw_matches(...)
# prof
draw_matches(model_img, model_kpts,
frame_img, frame_kpts,
good_matches,
color=(255,255,0),
title="Frame → Model (ratio test)")
TODO write your answer here
PROF
The ratio test filters out more matches. It is cheaper to compute and more reliable.
Recommended by David Lowe.
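To get a feeling for how sensitive this filtering is to the threshold, here is a small optional sketch sweeping a few illustrative values:
# How many matches survive the ratio test for different thresholds?
for ratio in (0.6, 0.7, 0.75, 0.8, 0.9):
    kept = sum(1 for m in matches
               if len(m) == 2 and m[0].distance < m[1].distance * ratio)
    print(ratio, kept)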
Finally, using the good matches obtained with the ratio test, we can estimate the perspective transform between the model and the frame (in this direction, because we will project a modified model image over the scene/frame).
First we need to build two corresponding lists of point coordinates, for the source and for the destination.
# prof
pts_mdl = []
pts_frame = []
# for m in good_matches:
# # TODO
len(pts_mdl), len(pts_frame)
(0, 0)
# prof
pts_mdl = []
pts_frame = []
for m in good_matches:
pts_mdl.append(model_kpts[m.trainIdx].pt)
pts_frame.append(frame_kpts[m.queryIdx].pt)
len(pts_mdl), len(pts_frame)
(202, 202)
As the RANSAC implementation in OpenCV requires float numbers, we will convert our coordinates.
# Run me
pts_mdl, pts_frame = np.float32(pts_mdl), np.float32(pts_frame)
We are now ready to estimate the homography using RANSAC.
cv2.findHomography?
# TODO
# H, pts_inliers_mask = cv2.findHomography(...)
# H
# Note: there probably are keypoint duplicates (same coordinates) at different octaves
H, pts_inliers_mask = cv2.findHomography(pts_mdl, pts_frame, cv2.RANSAC, 3.0)
H
array([[ 4.12308437e-01, -5.34564698e-02,  3.95275174e+02],
       [ 1.53889198e-02,  3.13182834e-01,  2.09917143e+02],
       [ 3.69411999e-05, -7.92524624e-05,  1.00000000e+00]])
# prof
# sanity check: we usually want at least 15 inliers for the homography to be trustworthy
np.count_nonzero(pts_inliers_mask)
129
# TODO
# matches_ransac_inliers = ...
# len(matches_ransac_inliers)
# prof
matches_ransac_inliers = [gm for gm, ok in zip(good_matches, pts_inliers_mask) if ok == 1]
len(matches_ransac_inliers)
129
Expected result of draw_matches():
# TODO
# draw_matches(...)
# prof
draw_matches(model_img, model_kpts,
frame_img, frame_kpts,
matches_ransac_inliers,
color=(0,255,0),
title="Frame → Model (ratio test + RANSAC)")
Finally, we can project some image over the frame.
# TODO
# model_quad = np.float32([[[0, 0],
# ...]])
# prof
model_quad = np.float32([[[0, 0],
[model_img.shape[1]-1, 0],
[model_img.shape[1]-1, model_img.shape[0]-1],
[0, model_img.shape[0]-1]]])
model_quad
array([[[   0.,    0.],
        [2339.,    0.],
        [2339., 1653.],
        [   0., 1653.]]], dtype=float32)
# TODO
# frame_quad = cv2.perspectiveTransform(...)
# frame_quad
# prof
frame_quad = cv2.perspectiveTransform(model_quad, H)
frame_quad
array([[[ 395.27518,  209.91714],
        [1251.5259 ,  226.35364],
        [1330.6464 ,  799.2486 ],
        [ 353.1797 ,  837.29803]]], dtype=float32)
We can now draw the detected object over the frame.
dbg_img = frame_img.copy()
cv2.polylines(dbg_img, np.int32(frame_quad), True, (0, 255, 0), 10)
plt.imshow(bgr2rgb(dbg_img))
<matplotlib.image.AxesImage at 0x7f7860b9de10>
# prof
# extra illustration for the lecture
draw_matches(model_img, model_kpts,
dbg_img, frame_kpts,
matches_ransac_inliers,
color=(0,255,0),
title="Frame → Model (ratio test + RANSAC)")
Let us use a very simple modified model image, to indicate we detected it:
model_img_modified = np.uint8(model_img * (1,1,0))
plt.imshow(bgr2rgb(model_img_modified))
<matplotlib.image.AxesImage at 0x7f785c02e4e0>
Expected output:
cv2.warpPerspective?
# TODO
# warped_img = cv2.warpPerspective(...)
# plt.imshow(bgr2rgb(warped_img))
# prof
warped_img = cv2.warpPerspective(model_img_modified,
H,
(frame_img.shape[1], frame_img.shape[0])) # xy coord, not shape!!!
plt.imshow(bgr2rgb(warped_img))
<matplotlib.image.AxesImage at 0x7f7854f854a8>
We need to use a mask to blend this warped image with the original frame.
Expected output:
cv2.fillPoly?
# TODO
# warped_img_msk= np.zeros(...)
# warped_img_msk = cv2.fillPoly(...)
# plt.imshow(warped_img_msk)
# prof
warped_img_msk= np.zeros(frame_img.shape[:2], dtype=np.uint8)
warped_img_msk = cv2.fillPoly(warped_img_msk, np.int32(frame_quad), 255)
plt.imshow(warped_img_msk)
<matplotlib.image.AxesImage at 0x7f7854f67358>
Expected output:
# TODO
# prof
frame_ar = frame_img.copy()
frame_ar[warped_img_msk>0] = warped_img[warped_img_msk>0]
plt.imshow(bgr2rgb(frame_ar))
<matplotlib.image.AxesImage at 0x7f7854ec0eb8>
Assume you have the four coordinates of the corners of the document in the frame (they are in frame_quad.squeeze()), and that it is a landscape A4 page (you have its corners in model_quad.squeeze()): create a dewarped (cropped, without perspective) document image.
Said differently: knowing the model shape, from the coordinates of the object in the frame,
produce the following cropped image:
Hints:
- cv2.getPerspectiveTransform

Extra kudos: reuse the previously computed homography H (by inverting it).
H_inv = cv2.getPerspectiveTransform(frame_quad.squeeze(), model_quad.squeeze())
H_inv
array([[ 2.53803668e+00,  1.70295108e-01, -1.03897076e+03],
       [-5.87479569e-02,  3.06044481e+00, -6.19218226e+02],
       [-9.84140923e-05,  2.36256993e-04,  1.00000000e+00]])
# More fun: invert H (previously computed)
# NOTE: it may be more stable to recompute the homography in the other direction rather than inverting it
# NOTE2: usually we find the corners then we dewarp the image (no model, we just know the document aspect ratio),
# so we use cv2.getPerspectiveTransform(from_4_points, to_4_points)
H_inv = np.linalg.inv(H)
H_inv /= H_inv[2,2] # because we know H_inv[2,2] should be equal to 1
H_inv
array([[ 2.53803661e+00,  1.70295061e-01, -1.03897072e+03],
       [-5.87479464e-02,  3.06044461e+00, -6.19218185e+02],
       [-9.84140372e-05,  2.36256867e-04,  1.00000000e+00]])
frame_dewarped = cv2.warpPerspective(frame_img,
H_inv,
(model_img.shape[1], model_img.shape[0])) # xy coord, not shape!!!
plt.imshow(bgr2rgb(frame_dewarped))
<matplotlib.image.AxesImage at 0x7f7854ea0dd8>
But this introduces a lot of interpolation; we could keep a smaller image, using the shape of the region detected in the frame. We will compute an optimal surface and adjust the homography.
# destination base size
dst_size = np.int32(np.ptp(frame_quad.squeeze(), axis=0))
dst_size
array([977, 627], dtype=int32)
# model aspect ratio
model_ar = model_img.shape[1] / model_img.shape[0]
model_ar
1.414752116082225
# take a mean surface with the same AR
dst_size[0] = (dst_size[0] + dst_size[1]*model_ar) // 2
dst_size[1] = dst_size[0] // model_ar
dst_size, dst_size[0] / dst_size[1]
(array([932, 658], dtype=int32), 1.4164133738601823)
We need to introduce a scaling in H.
scale_factor = dst_size[1] / model_img.shape[0] # yes, rowcol vs xy coordinates…
scale_factor
0.3978234582829504
# For the lazy or when in doubt with homographies…
H_scaling = cv2.getPerspectiveTransform(
np.float32([[0.,0.], [0.,1.], [1.,1.], [1.,0.]]), # base in frame's referential
np.float32([[0.,0.], [0.,scale_factor], [scale_factor,scale_factor], [scale_factor,0]])) # in target's ref
H_scaling
array([[0.39782345, 0.        , 0.        ],
       [0.        , 0.39782345, 0.        ],
       [0.        , 0.        , 1.        ]])
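Note that H_scaling is simply an isotropic scaling, so it could also have been written directly as a diagonal matrix; a quick check:
H_scaling_direct = np.diag([scale_factor, scale_factor, 1.0])
np.allclose(H_scaling, H_scaling_direct)  # expected: True, up to numerical precision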
H_scaling.dot(H_inv) # apply H_inv then H_scaling
array([[ 1.00969049e+00,  6.77473691e-02, -4.13326918e+02],
       [-2.33713109e-02,  1.21751664e+00, -2.46339516e+02],
       [-9.84140372e-05,  2.36256867e-04,  1.00000000e+00]])
H_scaling @ H_inv # @ is matrix multiplication, was not possible in early NumPy versions
array([[ 1.00969049e+00,  6.77473691e-02, -4.13326918e+02],
       [-2.33713109e-02,  1.21751664e+00, -2.46339516e+02],
       [-9.84140372e-05,  2.36256867e-04,  1.00000000e+00]])
frame_dewarped = cv2.warpPerspective(frame_img,
H_scaling @ H_inv,
tuple(dst_size)) # xy coord, not shape!!!
plt.imshow(bgr2rgb(frame_dewarped))
<matplotlib.image.AxesImage at 0x7f7854e11978>
And now we have an image very close to the frame area.
Note that if the perspective distortion is very small, it is usually better to crop the image (without any perspective correction) to avoid introducing interpolation errors. This makes a noticeable difference when running an OCR engine on the resulting image.
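As a quick illustration of that alternative (an optional sketch, not part of the exercise), a plain axis-aligned crop can be taken from the bounding box of the detected quadrilateral:
# Axis-aligned crop of the detected region: no perspective correction, hence no interpolation
x, y, w, h = cv2.boundingRect(np.int32(frame_quad.squeeze()))
frame_cropped = frame_img[y:y+h, x:x+w]
plt.imshow(bgr2rgb(frame_cropped))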