Improving the hand gesture stability of the Vision framework using filters

This is Shuochen Wang from the R&D department at Flect. In this blog post, I will explain how to improve the stability of hand gesture detection in Apple's Vision framework by using filters.

The Vision framework performs face and face landmark detection, text detection, barcode recognition, image registration, and general feature tracking. Vision also allows the use of custom Core ML models for tasks like classification or object detection. In this blog, we will use the body and hand pose detection function. Specifically, we will use the hand pose detection function to obtain the coordinates of the finger joints.

In my previous blog post, I explained how to manipulate an AR object using hand gestures. We used the Vision Framework to obtain the hand position. Because we do not use physical sensors or trackers that can measure the finger positions accurately, the measurements contain errors.

In this blog post, I describe an experiment that examines how filters can reduce the uncertainty of the Vision Framework's hand pose detection. Before explaining the experiment, I would like to demonstrate the problems with the current hand gesture detection.

The problems of the Vision Framework
Experiment overview
What is a filter?
Different types of filters
Implementation of the filters
- SMA
- EWMA
- SG
- Kalman filter
Filter evaluators
Results
Conclusion
Appendix

The problems of the Vision Framework

The Vision Framework hand pose detection uses a machine learning model that first recognizes whether the object is a hand and then tries to identify the joint positions. It is highly accurate at recognizing hands, but when it comes to tracking the coordinates of the finger joints, the predicted positions can sometimes be very wrong. The accuracy of the detection varies with the situation.

Stationary

To illustrate the accuracy of the Vision Framework, I will label 4 finger joints. Let us first look at an example where it is working correctly. In Figure 1 below, the red point is the thumb tip, the blue point is the index tip, the green point is the thumbIP and the orange point is the indexDIP. ThumbIP stands for the thumb's interphalangeal (IP) joint, which is the first joint of the thumb. Likewise, indexDIP is the index finger's distal interphalangeal (DIP) joint, which is the first joint of the index finger.

f:id:shuochenwang:20210910105403p:plain — Figure 1. Normal recognition

Moving

When the hand is moving, the tracking sometimes cannot keep up and the location is incorrect. In Figure 2 below, the thumb tip location is incorrect.

f:id:shuochenwang:20210910110838p:plain — Figure 2. Moving

Occlusion

In the case of occlusion (when a finger is partially blocked), the location is also incorrect. In Figure 3 below, only the thumbIP location is correct; every other location is incorrect.

f:id:shuochenwang:20210910112146p:plain — Figure 3. Occlusion

Incomplete fingers

When the fingers are only partially shown, the joint coordinates are at their worst. In Figure 4 below, all the locations are incorrect.

f:id:shuochenwang:20210910135745p:plain — Figure 4. Incomplete fingers

Therefore, it is clear that the hand pose detection of the Vision Framework is best suited to relatively stationary fingers. It does not perform well when the fingers are moving quickly, are blocked, or are not shown completely. To improve usability, the best solution is to design scenarios that avoid moving fingers and gestures that block other fingers, and to instruct the user to keep their whole fingers inside the camera view.

Even when the image is relatively still, the coordinates can still "vibrate". I have provided the video below to demonstrate this effect with a stationary hand. This is where a filter can help, by reducing the random noise in the observations. Filters will not help with moving, occluded or incomplete fingers.

I will use the data from this same video for the filter experiment.

Experiment overview

The goal of the experiment is to find the filter that best reduces noise when the hand is stationary. This is done by applying each filter algorithm to the data obtained from the captured video file. We will test the effect of the filters in Python and then decide which filter to implement in Swift.

Before examining the types of filters, it is necessary to explain the concept of window size. The window size is the number of data points that the filter considers at a time. For example, a window size of 5 uses the data from up to the previous 5 frames for the filter calculation. The larger the window size, the stronger the effect of the filter becomes, but at the same time the output is more likely to drift further from the true values. There is no universally optimal window size. A window size of 5 is commonly used, so we will use 5 for all our filters with the exception of the SG filter.

The experiment uses the same video data file, which contains the thumb tip, index tip, thumbIP and indexDIP point data determined by the Vision Framework. I will apply the 4 different types of filters to the same data and compare their performance using the 2 evaluation metrics described below.

What is a filter?

To quote Wikipedia's description of smoothing, which is what the filters in this blog do:

In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. In smoothing, the data points of a signal are modified so individual points higher than the adjacent points (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased leading to a smoother signal.

By reducing the potential noise, we can expect a smaller deviation: the readings become more stable and closer to the true underlying value. Filters are important in many areas, and there are many different types, ranging from simple to complex.

In this blog I will investigate 4 types of filters: SMA (Simple Moving Average), EWMA (Exponentially Weighted Moving Average), SG (Savitzky–Golay) and the Kalman filter. We will briefly examine each filter before moving on to the main experiment.

Different types of filters

SMA (Simple moving average)

The simple moving average, or SMA, is the simplest of all the filters. Instead of taking the current observation value, it takes the average of the observations inside the window.

This is the formula for the SMA:

$SMA = \dfrac{A_1 + A_2 + \ldots + A_n}{n} \\$

Where A_1, ..., A_n are the observations inside the window and n is the number of observations.

EWMA (Exponentially Weighted Moving Average)

The main weakness of the previous filter is that all the observations have the same weight, which is not appropriate in most situations. EWMA addresses this issue by giving the observations different weights.

The Exponentially Weighted Moving Average (EWMA) is a quantitative or statistical measure used to model or describe a time series. EWMA is designed such that older observations are given lower weights. The weights fall exponentially as the data point gets older, hence the name exponentially weighted.

The only decision a user of the EWMA must make is the parameter α. The parameter decides how important the current observation is in the calculation of the EWMA. The higher the value of α, the more closely the EWMA tracks the original time series. The formula of the EWMA is described as:

$EWMA(t) = a \ast x(t) + (1-a) \ast EWMA(t-1) \\$

Where:

EWMA(t) = moving average at time t
a or α = degree of mixing parameter value between 0 and 1
x(t) = value of signal x at time t

The optimal α depends on the data.

SG (Savitzky–Golay) filter

To quote the definition from Wikipedia:

A Savitzky–Golay filter is a digital filter that can be applied to a set of digital data points for the purpose of smoothing the data, that is, to increase the precision of the data without distorting the signal tendency. This is achieved, in a process known as convolution, by fitting successive sub-sets of adjacent data points with a low-degree polynomial by the method of linear least squares.

The SG filter is much more complex than the previous filters. Basically, the SG filter fits the data inside each window to a polynomial of the order of your choice.

The core of the process is to calculate the polynomial coefficients. To find the coefficients, the following least-squares formula needs to be solved:

$a = (J^{T}J)^{-1} J^{T}y \\$

Where a is the vector of polynomial coefficients, J is a Vandermonde matrix, and y is the observation data inside the current window.

The resulting coefficients are then applied to the data to produce the smoothed values. When the order is 1, it has the same effect as SMA. One limitation is that the chosen order must be smaller than the window size.

The figure below demonstrates the effect of the SG filter, taken from the original publication.

f:id:shuochenwang:20210910175201p:plain — Figure 5. Demonstrating the effect of SG filter

Prior to the experiment, I found that the effect of the SG filter is usually quite weak when the window size is small. Therefore, in the experiment I will use window size 7 and order 3 for a stronger effect.
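To make the formula above more concrete, here is a minimal sketch of how the smoothing weights for the centre point of a window could be derived from the normal equations. It is not the implementation used in the experiment. The function name savgol_center_weights is hypothetical, and the sketch assumes an odd window with sample offsets centred on zero.

```python
from fractions import Fraction


def savgol_center_weights(window, order):
    """Return the smoothing weights for the centre point of an odd window."""
    if window % 2 == 0 or order >= window:
        raise ValueError("window must be odd and larger than order")
    half = window // 2
    offsets = range(-half, half + 1)
    size = order + 1

    # Normal-equation matrix J^T J, where J[i][k] = offset_i ** k (Vandermonde).
    jtj = [[Fraction(sum(i ** (r + c) for i in offsets)) for c in range(size)]
           for r in range(size)]
    # The right-hand side selects the constant coefficient,
    # which is the fitted polynomial's value at offset 0.
    rhs = [Fraction(1)] + [Fraction(0)] * order

    # Gauss-Jordan elimination with exact fractions.
    for col in range(size):
        pivot = next(r for r in range(col, size) if jtj[r][col] != 0)
        jtj[col], jtj[pivot] = jtj[pivot], jtj[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(size):
            if r != col and jtj[r][col] != 0:
                factor = jtj[r][col] / jtj[col][col]
                jtj[r] = [a - factor * b for a, b in zip(jtj[r], jtj[col])]
                rhs[r] -= factor * rhs[col]
    z = [rhs[k] / jtj[k][k] for k in range(size)]

    # Weight of each sample: first row of (J^T J)^-1 J^T.
    return [sum(z[k] * Fraction(i) ** k for k in range(size)) for i in offsets]


# Window size 7 and order 3, as chosen for the experiment.
weights = savgol_center_weights(7, 3)
print([str(w) for w in weights])  # equal to (-2, 3, 6, 7, 6, 3, -2) / 21
```

With order 1 the same function returns equal weights of 1 / window, which is consistent with the remark that an order of 1 behaves like SMA.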

Kalman filter

Kalman filtering is an algorithm that uses a series of measurements observed over time, including statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.

The algorithm works by a two-phase process. For the prediction phase, the Kalman filter produces estimates of the current state variables, along with their uncertainties. Once the outcome of the next measurement (necessarily corrupted with some error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with greater certainty. The algorithm is recursive.

The diagram below gives a brief overview of the Kalman filter, cited from.

f:id:shuochenwang:20210910180719p:plain — Figure 6. Overview of 1d Kalman filter

Optimality of Kalman filtering assumes that errors have a normal (Gaussian) distribution.

The Kalman filter also works in multiple dimensions, but for our use case we will use a 1-dimensional Kalman filter.

Implementation of the filters

Swift does not include any filter library by default, so all of our filters will first be implemented in Python. We will test the effect of each filter, then decide on the best filter to implement in Swift.

SMA

Because SMA is a very simple filter, it can be written with just a simple loop.

Here we use a deque from the collections module instead, for speed and conciseness.

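#SMA
from collections import deque

def simplemovingaverage(period):
    assert period == int(period) and period > 0, "Period must be an integer >0"

    summ = n = 0.0                     # running sum and count of values seen (capped at period)
    values = deque([0.0] * period)     # old value queue, pre-filled with zeros

    def sma(x):
        nonlocal summ, n

        values.append(x)
        summ += x - values.popleft()   # add the newest value, drop the oldest
        n = min(n+1, period)           # divide by fewer values until the window is full
        return summ / n

    return sma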

EWMA

The Pandas library includes an EWMA implementation. However, we will use an implementation from Stack Overflow that is based on NumPy, because it runs faster than the Pandas implementation.

As I mentioned before, the best value of α depends on the nature of the data. We will first test the default value used by that NumPy-based implementation, which is 2 / float(window + 1). Here, window refers to the window size.
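For readers who want to see the recursion itself, the following is a minimal pure-Python sketch of the EWMA formula with the default α described above. It is not the Stack Overflow implementation used in the experiment. The function name ewma and the choice to seed the average with the first observation are assumptions made for illustration, and the sample values are made up.

```python
def ewma(values, window=5, alpha=None):
    """Exponentially weighted moving average of a sequence of numbers."""
    if alpha is None:
        alpha = 2 / float(window + 1)  # default alpha quoted in the text
    smoothed = []
    previous = None
    for x in values:
        # EWMA(t) = alpha * x(t) + (1 - alpha) * EWMA(t-1).
        # Seeding with the first observation is an assumption of this sketch.
        previous = x if previous is None else alpha * x + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed


# Hypothetical normalized x coordinates of one joint over 5 frames.
print(ewma([0.50, 0.53, 0.48, 0.51, 0.49]))
```

Passing alpha explicitly overrides the default, which makes it easy to compare different values on the same data.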

SG

The SG filter is a complicated filter that is already supported by SciPy. However, we will use this GitHub implementation because the code is concise and you can see what is going on inside the function.

Kalman filter

Again, there are several variations of the Kalman filter. We will use this Kaggle implementation.
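As a rough illustration of the predict and update phases described earlier, here is a minimal 1-dimensional Kalman filter sketch. It is not the Kaggle implementation. The function name kalman_1d, the constant-position model, and the values of process_var, measurement_var and the initial uncertainty are all assumptions chosen for illustration.

```python
def kalman_1d(measurements, process_var=1e-5, measurement_var=1e-3):
    """Filter a 1-D series, assuming the true value is nearly constant."""
    if not measurements:
        return []
    estimate = measurements[0]   # initial state: first measurement (assumption)
    error_var = 1.0              # initial uncertainty of the estimate (assumption)
    filtered = []
    for z in measurements:
        # Prediction: the state is assumed unchanged, only its uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1 - gain)
        filtered.append(estimate)
    return filtered


# Hypothetical jittery coordinate of a stationary joint.
noisy = [0.5 + 0.01 * (-1) ** i for i in range(50)]
smoothed = kalman_1d(noisy)
print(max(smoothed[10:]) - min(smoothed[10:]))  # spread after the filter settles
```

A larger measurement_var relative to process_var makes the filter trust new measurements less, which gives stronger smoothing but a slower response to real movement.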

Filter evaluators

Now that we have introduced the different types of filters and their implementations, how do we decide which filter is better? We can evaluate filter performance in 2 ways.

The first and most straightforward way is to measure the standard deviation of the filtered data. Because the hand in this video is relatively still, the finger positions are supposed to be relatively constant. Therefore, the smaller the standard deviation, the better the effect of the filter. The formula for the standard deviation is:

$\Large \sigma = \sqrt{\frac{\sum(x_i - \mu)^{2}}{N}} \\$

Where:

$\sigma = \text{population standard deviation} \\ N = \text{the size of the population} \\ x_i = \text{each value from the population} \\ \mu = \text{the population mean} \\$
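The formula translates directly into a few lines of Python. This is only a sketch: the function name population_std is hypothetical, and the sample values are made-up numbers rather than data from the experiment.

```python
import math


def population_std(values):
    """Population standard deviation, following the formula above."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


# Made-up sample, chosen so that the result is easy to check by hand.
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

The standard library function statistics.pstdev computes the same quantity.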

The second evaluator I used is the distance to the true finger positions, measured as the mean squared error (MSE). The formula for the MSE is:

$\Large MSE = \frac{1}{n}\displaystyle\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^{2} \\$

Where:

$\mathrm{MSE} = \text{mean squared error} \\ {n} = \text{number of data points} \\ Y_{i} = \text{true values} \\ \hat{Y}_{i} = \text{observed values} \\$

The observed values are the values produced by the filter, and the true values are the positions that I labelled myself.

How do I determine the true positions? I broke the video down into single frames, determined the true positions by hand, and recorded them in a separate file. For example, this is one frame that I labelled. There are 30 labelled frames in total.
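The MSE can be computed in the same spirit. In this sketch the function name mean_squared_error is hypothetical, and the two short lists are made-up values standing in for the labelled and filtered coordinates.

```python
def mean_squared_error(true_values, observed_values):
    """Mean of the squared differences between true and observed values."""
    if len(true_values) != len(observed_values):
        raise ValueError("both sequences must have the same length")
    n = len(true_values)
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(true_values, observed_values)) / n


# Made-up labelled positions and filter outputs for three frames.
labelled = [0.50, 0.50, 0.50]
filtered = [0.52, 0.49, 0.50]
print(mean_squared_error(labelled, filtered))
```

A lower value means the filtered coordinates are closer to the labelled positions.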

f:id:shuochenwang:20210910201956p:plain — Figure 7. Labelled image

Because labelling the true positions of the fingers is very time consuming, I have only labelled the true coordinates of the thumb tip and index tip.

Results

Standard deviation

These are the standard deviations for the different filters. For ease of presentation, I used the sum of the standard deviations of the thumb x position, thumb y position, index x position and index y position.

f:id:shuochenwang:20210910202449p:plain — Figure 8. Comparison of filter standard deviations

As stated before, the lower the value the better. In this sense, the Kalman filter has the best performance, followed by EWMA, SMA and finally the SG filter.

Mean square difference

f:id:shuochenwang:20210910204017p:plain — Figure 9. Least square difference

These are the results of the mean square difference for the different filters. Again, I have combined the values of the 4 positions into one value. The results are very close to each other: since the filters do not have access to the true values, there is no way a filter can dramatically shift the values towards them.

Because the differences are so small, the percentages all round to 100%.

As for the filter performance, the Kalman filter again performs best, followed by SMA, EWMA and finally the SG filter.

Before stating the conclusion, it is important to note that this experiment assumes that the labelled positions are the true values.

There are 3 possible scenarios in this experiment.

After filtering, the standard deviation decreases, but the MSE increases.
After filtering, the standard deviation decreases, and the MSE remains the same.
After filtering, the standard deviation decreases, and the MSE decreases.

The first scenario is not ideal, because an increase in MSE means the filtered values are further from the true values. In that case it is better not to use a filter. In scenarios 2 and 3, using a filter is better than not using one.

Our experiment result is scenario 3, which means that filtering brings the values closer to the true values and at the same time reduces the standard deviation. Therefore, the conclusion is that having a filter is better than not having one.

Second experiment with adjusted EWMA filter

From the previous results, it is clear that the SG filter has the worst performance and should not be used in our case. SMA and EWMA have comparable performance and the Kalman filter has the best performance. However, our Kalman filter uses all the observation data and is difficult to run in real time because it is relatively computationally intensive.

In this respect, SMA and EWMA are the preferred filters if possible. With the window size fixed, SMA has no adjustable parameter and therefore cannot be tuned further. However, for EWMA, we can improve its performance by adjusting the α value. If the α value is high, the current observation has greater weight, and vice versa.

We already know that the data contains rather heavy noise, so I will lower α by halving the default value, giving α = 1 / float(window + 1).

As seen in the figure below, the Kalman filter still has the best (lowest) standard deviation, but the standard deviation of EWMA is very close to it.
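To see what halving α does, it helps to look at how much weight each past sample receives. Unrolling the EWMA recursion gives a weight of α(1 - α)^k for the sample that is k frames old. The sketch below prints these weights for both α values with a window size of 5. The function name ewma_sample_weights is hypothetical.

```python
def ewma_sample_weights(alpha, count):
    """Weights of the newest sample, the one before it, and so on."""
    return [alpha * (1 - alpha) ** age for age in range(count)]


window = 5
default_alpha = 2 / float(window + 1)   # one third
halved_alpha = 1 / float(window + 1)    # one sixth

for name, alpha in (("default", default_alpha), ("halved", halved_alpha)):
    weights = ewma_sample_weights(alpha, 5)
    print(name, [round(w, 3) for w in weights])
```

With the halved α, the newest sample contributes about one sixth of the output instead of one third, and the weights decay more slowly, so the average is spread over more past frames and single noisy frames have less influence.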

f:id:shuochenwang:20210917114634p:plain — Figure 10. Adjusted EWMA standard deviation

This time the adjusted EWMA has the best MSE of all the filters.

f:id:shuochenwang:20210917114703p:plain — Figure 11. Adjusted EWMA MSE

Experiment with moving data

One important assumption that I have made is that the hand is relatively stationary. What happens if the hand is moving all the time? I repeated the same procedure with the hand making a circular motion.

This is the result of the MSE for the moving data:

f:id:shuochenwang:20210917114033p:plain — MSE with moving data

As you can see, applying a filter significantly increases the MSE, which means it is better not to use any filter at all.

Conclusion

After adjusting the α parameter, EWMA has the best MSE and a standard deviation very close to that of the Kalman filter. EWMA also has the advantage of being easy to implement. Therefore, the best filter to apply to reduce noise is EWMA. However, this rests on one important assumption, that the hand is not moving; otherwise it is better not to apply any filter at all. Another possible solution is to use a multi-dimensional Kalman filter, which is difficult to set up correctly.

By applying a filter to stationary data we have reduced the noise, making our data more stable. This creates one problem, however: how do we determine whether the hand is stationary? I think there are two approaches to this problem.

The first approach is to design the app according to the use cases. For example, switch off the filter when the user is expected to move their fingers around. When the hand is stationary, switch the filter back on. This is the approach I will use for now.

The second approach is to measure the velocity of the hand in addition to the position data, then turn the filter on and off according to that velocity. The velocity can be obtained with a Kalman filter, or estimated from the positional data. This is the preferred approach, but it will take time to implement.

In the future, I will investigate how to reduce noise for moving data.
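A very rough sketch of the second approach, estimating the speed from consecutive positions, might look like the following. This has not been tested on real hand data. The function name gated_ewma, the use of the raw frame-to-frame difference as the speed estimate, and the threshold value are all assumptions made for illustration.

```python
def gated_ewma(positions, frame_interval, speed_threshold, alpha):
    """Apply EWMA while the point is slow; pass raw values through when fast."""
    if not positions:
        return []
    output = [positions[0]]
    for previous_raw, current_raw in zip(positions, positions[1:]):
        # Speed estimated from two consecutive raw positions (assumption).
        speed = abs(current_raw - previous_raw) / frame_interval
        if speed > speed_threshold:
            output.append(current_raw)  # moving: filter switched off
        else:
            # Stationary: smooth against the previous output value.
            output.append(alpha * current_raw + (1 - alpha) * output[-1])
    return output


# Hypothetical 1-D track: still for two frames, then moving quickly.
print(gated_ewma([0.0, 0.0, 1.0, 2.0], frame_interval=1.0,
                 speed_threshold=0.5, alpha=0.5))
```

Because the raw positions are noisy, the speed estimate is noisy too, so in practice the threshold would need to be tuned against the jitter observed while the hand is stationary.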