EPITA 2022 MLRF practice_01-04_color-histogram v2022-03-25_104727 by Joseph CHAZALON

Creative Commons License This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

Practice 1 part 4: (Indexed) color histograms as global descriptors

The principle is very simple and illustrated below: from a set of pixels, we count how many times each color appears and build a histogram of the frequencies of occurrence of each color (frequencies rather than raw counts, because we normalize the values).

Here are some examples of the histograms we can compute from some bubbles: *(figures hist1, hist2, hist3: example histograms for three bubbles)*

Using such descriptors, we can very easily group similar bubbles with a reasonable confidence.

This part contains the following steps:

  1. Color quantization: reduce the colors of the bubbles.
  2. Compute the color histogram of each bubble.
  3. Compute the distance matrix between each bubble, using its color histogram.
  4. Visualize the bubbles in an interesting way using hierarchical clustering.

1. Color quantization

It is hard to compare the full RGB histogram of an image (a bubble) with the histogram of another, so we will first reduce the number of colors used to represent each image.

Color quantization is a practical application of vector quantization where each color is replaced by the closest color in a pre-defined palette.

We have two options here:

  1. build a palette manually
  2. use a clustering technique to build it automatically

We will use K-Means clustering to discover a reduced set of representative colors.

stop **If you do not manage to build the palette automatically, then build your codebook manually and convert the color of the images using [`scipy.cluster.vq.vq`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.vq.html).**
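As a sketch of the manual fallback, `scipy.cluster.vq.vq` assigns each pixel the index of the closest entry in a codebook. The palette below is a hypothetical hand-picked one, not the one you should necessarily use:

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical hand-built palette: a few colors we might expect in the bubbles.
palette = np.array([
    [255, 255, 255],  # white
    [200,  30,  30],  # red
    [ 30, 160,  60],  # green
    [ 40,  60, 200],  # blue
    [  0,   0,   0],  # black
], dtype=np.float64)

# Pixels as an (N, 3) array of RGB values.
pixels = np.array([[250, 250, 248], [10, 5, 0], [190, 40, 35]], dtype=np.float64)

# vq assigns each pixel the index of the closest palette entry (Euclidean distance).
labels, dists = vq(pixels, palette)
print(labels)  # [0 4 1]: white-ish, black-ish, red-ish
```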

1.1. Sample some pixels

Because K-Means is a costly algorithm, we will first sample our pixels (viewing them as plain 3-dimensional vectors) to avoid filling up our memory during KMeans fitting.

We will use the base image to facilitate the sampling, because otherwise we would have to select pixels from every bubble image and merge the results.

work **Use the large (200 DPI) poster image WITH ITS MASK to sample some pixels. 5000 is a good number of samples.** *Tips:*

  - The code for loading the image is provided below to save you time.
  - `poster[poster_mask]` returns the set of valid pixels.
  - You can use [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to select **random pixels** from the set of pixels belonging to the bubbles.
  - The overall idea is to run two "select" operations on the original image: 1) select the pixels which belong to bubbles (i.e. not in the background), then 2) select some random pixels among these.
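A sketch of the two-step selection, using a small synthetic image and mask as stand-ins for the real poster files (in the lab, use the loaded 200 DPI poster and its mask instead):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the real poster and its mask (replace with the loaded files).
poster = rng.integers(0, 256, size=(100, 120, 3), dtype=np.uint8)
poster_mask = rng.random((100, 120)) > 0.5  # True where a pixel belongs to a bubble

# Select 1: keep only the pixels inside bubbles -> (N, 3) array of RGB vectors.
bubble_pixels = poster[poster_mask]

# Select 2: draw a random subset of row indices (use 5000 in the lab).
n_samples = 500
idx = rng.choice(len(bubble_pixels), size=n_samples, replace=False)
samples = bubble_pixels[idx]
print(samples.shape)  # (500, 3)
```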

warning

WARNING: All our images should be in RGB format in this session! Convert them when loading them to avoid mistakes.

1.2. K-Means clustering stage 1/2: fitting

We are now ready to perform K-Means clustering.

work **Use K-Means clustering to compute the color palette. Use a small value like 7 for the number of target colors.** *Tips:*

  - Set the `random_state` parameter to some fixed value to ensure the reproducibility of your results.
  - Use the `fit` method to compute cluster centers.
  - The cluster centers will be available with `kmeans.cluster_centers_`.
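A minimal sketch of the fitting stage, on random pixels standing in for your samples:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
samples = rng.integers(0, 256, size=(500, 3)).astype(np.float64)  # sampled RGB pixels

# Fixed random_state for reproducibility; n_clusters is the palette size.
kmeans = KMeans(n_clusters=7, random_state=0, n_init=10)
kmeans.fit(samples)

palette = kmeans.cluster_centers_  # shape (7, 3), float values
print(palette.shape)  # (7, 3)
```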

warning

WARNING: K-Means is a RAM-hungry algorithm. Save your work regularly (and start now)!

1.3. K-Means clustering stage 2/2: projection

The KMeans class provided by scikit-learn has two methods for transforming our data:

  - `predict(X)` returns, for each sample, the index of the closest cluster center;
  - `transform(X)` returns, for each sample, its distance to every cluster center.

Make sure you understand the difference between those two functions.

work **Use the `KMeans` object and super fancy Numpy indexing to create a new image where the color of each pixel is the color of the closest cluster to the original pixel color.** *Tips:*

  - Do not forget to mask the background of the image during prediction (because we did not train the predictor on it).
  - Start by producing a new image with cluster labels instead of color values.
  - Then create a color lookup table (LUT) using the cluster centers (make sure to use `np.uint8` values to avoid issues with later conversion).
  - Finally, use Numpy advanced indexing to apply the LUT to the image of cluster labels.
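One possible shape for the label-map + LUT approach, again on synthetic stand-ins for the real image and mask:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(40, 50, 3), dtype=np.uint8)  # stand-in image
mask = rng.random((40, 50)) > 0.3                             # stand-in foreground mask

kmeans = KMeans(n_clusters=7, random_state=0, n_init=10)
kmeans.fit(img[mask].astype(np.float64))

# 1) Label map: predict the closest cluster for foreground pixels only
#    (background pixels keep label 0 here; re-mask them afterwards if needed).
labels = np.zeros(img.shape[:2], dtype=np.intp)
labels[mask] = kmeans.predict(img[mask].astype(np.float64))

# 2) LUT built from the cluster centers, as uint8 to get a valid image.
lut = kmeans.cluster_centers_.astype(np.uint8)  # shape (7, 3)

# 3) Advanced indexing applies the LUT in one shot.
quantized = lut[labels]
print(quantized.shape, quantized.dtype)  # (40, 50, 3) uint8
```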
work **Does this look correct?**

(you can write some observations here)

(prof) As we did not train the predictor on the white color, we observe a significant color shift in the "Twin it!" bubble. The color difference in the other bubbles is hard to notice.

work **Save the image and compare its size with the original one. Make sure you use the right color space when saving!** *Tip:* if saving the RGB image with OpenCV's `cv2.imwrite()`, do not forget that it expects a BGR image!

1.4. Load and convert all the bubbles

We can now load and convert all the bubbles. We need to:

  1. load them
  2. convert them to the same color space as used during training (RGB here)
  3. project their colors using the previous method

We will need to ignore the areas where pixels are pure black $(0,0,0)$: they do not belong to the bubble, and counting them would distort the histograms.

work **First define a function to compute the mask of a bubble.** *Tip:* it is a boolean image where $(x,y)$ is `True` whenever any channel of the original image at $(x,y)$ is $> 0$.
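Such a mask function can be a one-liner with `np.any` over the channel axis; a minimal sketch:

```python
import numpy as np

def bubble_mask(img):
    """Boolean mask: True where any RGB channel is > 0 (i.e. not pure black)."""
    return (img > 0).any(axis=-1)

# Tiny example: an all-black image with a single colored pixel.
bubble = np.zeros((3, 3, 3), dtype=np.uint8)
bubble[1, 1] = (120, 40, 200)
m = bubble_mask(bubble)
print(m)  # True only at (1, 1)
```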
work **Now load all the bubbles and convert them to RGB (or whichever color space you chose).**
work **And reduce the color of all bubbles.**

2. Compute the color histograms

To compute the color histogram of a bubble, we do not need to recolorize it; we just need its "label map" to count the number of pixels belonging to each cluster.

work **Compute the color histogram for each bubble. Use [`np.bincount`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html#numpy.bincount) to count the number of occurrences of each cluster label. Do not forget to normalize the histogram using the number of non-zero values in the mask with [`np.count_nonzero`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.count_nonzero.html).**
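A toy sketch of this counting-and-normalizing step, on a hand-written label map (the `minlength` argument pads clusters that never appear, so every histogram has the same size):

```python
import numpy as np

n_colors = 7
labels = np.array([[0, 0, 2],
                   [2, 2, 6],
                   [0, 0, 0]])          # toy label map of a bubble
mask = np.array([[True,  True,  True],
                 [True,  True,  True],
                 [False, False, False]])  # last row is background

# Count cluster occurrences among foreground pixels only.
counts = np.bincount(labels[mask], minlength=n_colors)

# Normalize by the number of foreground pixels to get frequencies.
hist = counts / np.count_nonzero(mask)
print(hist)  # [0.333 0.  0.5  0.  0.  0.  0.167] (approximately)
```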

3. Compute the distance matrix

Because color histograms are very compact, it is very fast to compute the distance matrix (even if the complexity is $O(n^2)$).

work **Compute the distance matrix between each pair of bubbles. Use an appropriate distance from [`scipy.spatial.distance`](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html).**

Correct the diagonal to avoid each bubble matching itself over and over: set the distance of an element against itself to the maximum distance.

work **Display the best matches for some (all?) bubbles, i.e. the bubbles which are the closest to a given one. Use `np.argsort` along the appropriate axis to get the indices of the closest elements along each row.**
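A sketch of the distance matrix, diagonal correction, and row-wise ranking, on random stand-in histograms (Euclidean distance is used here for simplicity; other metrics from `scipy.spatial.distance` may suit histograms better):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
hists = rng.random((10, 7))
hists /= hists.sum(axis=1, keepdims=True)  # 10 toy normalized histograms

# Pairwise distances between all histograms.
dist = cdist(hists, hists, metric="euclidean")

# Put the maximum distance on the diagonal so each bubble
# does not match itself first.
np.fill_diagonal(dist, dist.max())

# For each bubble (row), indices of the other bubbles from closest to farthest.
best = np.argsort(dist, axis=1)
print(best[0, 0])  # index of the closest bubble to bubble 0
```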
work **Write some notes about the advantages and the limitations of the color histogram approach.**

TODO write some notes here

Advantages of color hist:

Limitations:

4. Hierarchical clustering and dendrograms

Instead of computing a distance matrix, it is also possible to aggregate the elements starting with the closest pair, then iterating. The trick is to be able to compute the distance between two clusters; a simple solution is to average the descriptors of two clusters to form the descriptor of their new parent cluster.
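This averaging-based merge is essentially what scipy's `"average"` linkage implements; a minimal sketch on toy histograms (the provided cells do the full job):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)
hists = rng.random((8, 7))
hists /= hists.sum(axis=1, keepdims=True)  # 8 toy color histograms

# 'average' linkage repeatedly merges the two closest clusters, scoring a
# merged cluster by the average pairwise distance between its members.
Z = linkage(hists, method="average", metric="euclidean")
print(Z.shape)  # (n - 1, 4): each row describes one merge

# dendrogram(Z) would draw the tree (inside a matplotlib figure).
```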

The code below does all this work and produces a dendrogram.

work **Call the functions (2nd and 3rd cells below) appropriately to generate a beautiful image.**

Job done!

Do not forget to submit your notebooks (and maybe a scaled version of your dendrogram)!