This is Shuochen Wang from the R&D department at Flect. In this blog I am going to explain how to improve the stability of hand gesture detection in Apple's Vision framework by using filters.
The Vision framework performs face and face landmark detection, text detection, barcode recognition, image registration, and general feature tracking. Vision also allows the use of custom Core ML models for tasks like classification or object detection. In this blog, we will use the body and hand pose detection function. Specifically, we will use the hand pose detection function to obtain the coordinates of the finger joints.
In my previous blog I explained how to manipulate an AR object using hand gestures. We used the Vision Framework to obtain the hand position. Because we do not use actual sensors or trackers that can measure the finger positions accurately, there will be errors in the measurement.
In this blog, I describe an experiment that examines how filters can reduce the uncertainty of the Vision Framework's hand pose detection. Before explaining the experiment, I would like to demonstrate the problems with the current hand gesture detection.
- The problems of the Vision Framework
- Experiment overview
- What is a filter?
- Different types of filters
- Implementation of the filters
- Filter evaluators
- Results
- Conclusion
- Appendix
The problems of the Vision Framework
The Vision Framework's hand pose detection uses a machine learning model that first recognizes whether the object is a hand and then tries to identify the joint positions. It has high accuracy on hand recognition, but when it comes to tracking the coordinates of the finger joints, the predicted positions can sometimes be very wrong. The accuracy of the detection varies with the situation.
Stationary
To illustrate the accuracy of the Vision Framework, I label 4 finger joints. Let us first look at an example where it works correctly. In Figure 1 below, the red point is the thumb tip, the blue point is the index tip, the green point is the thumbIP and the orange point is the indexDIP. ThumbIP stands for the thumb's interphalangeal (IP) joint, which is the first thumb joint. IndexDIP stands for the index finger's distal interphalangeal (DIP) joint, which is the first joint of the index finger.
Moving
When the hand is moving, the tracking sometimes cannot keep up and the location is incorrect. In Figure 2 below, the thumb tip location is incorrect.
Occlusion
In the case of occlusion (when a finger is partially blocked), the location is also incorrect. In Figure 3 below, only the thumbIP location is correct; all the other locations are incorrect.
Incomplete fingers
When the fingers are only partially shown, the joint coordinates have the worst result. In Figure 4 below, all the locations are incorrect.
Therefore, it is clear that the hand pose detection of the Vision Framework is best suited to relatively stationary fingers. It does not perform well when the fingers are moving quickly, are blocked, or are not shown completely. To improve its usability, the best solution is to design the scenario so that it avoids moving fingers and gestures that block other fingers, and to instruct users to keep their whole fingers inside the camera view.
Even when the image is relatively still, the coordinates can still "vibrate". The video below demonstrates this effect for a stationary hand. This is where a filter can help, by reducing the random noise in the observations. Filters will not help in the moving, occlusion and incomplete-finger situations.
I will use the data from this same video for the filter experiment.
Experiment overview
The goal of the experiment is to find the filter that best reduces noise when the hand is stationary. This is done by applying each filter algorithm to the data obtained from the captured video file. We will test the effect of the filters in Python and then decide which filter to implement in Swift.
Before examining the types of filters, it is necessary to explain the concept of window size. The window size is the number of most recent frames that the filter considers. For example, a window size of 5 uses the 5 most recent frames of data for the filter calculation. The larger the window size, the stronger the effect of the filter becomes, but at the same time the output is more likely to be further away from the true values. There is no universally optimal window size. A window size of 5 is commonly used, so we will use 5 for all our filters with the exception of the SG filter.
The experiment uses a single video data file, which contains the thumb tip, index tip, thumbIP and indexDIP point data determined by the Vision Framework. I apply the 4 different types of filters to the same data and compare their performance using the 2 evaluation metrics described below.
What is a filter?
The filters in this blog are smoothing filters. To quote the definition of smoothing from Wikipedia:
In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. In smoothing, the data points of a signal are modified so individual points higher than the adjacent points (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased leading to a smoother signal.
By reducing the potential noise, we can expect a smaller deviation: the readings become more stable and closer to the true underlying value. Filters are very important in many areas, and there are many different types, ranging from simple to complex.
In this blog I will investigate 4 types of filters: SMA (Simple Moving Average), EWMA (Exponentially Weighted Moving Average), SG (Savitzky–Golay) and the Kalman filter. We will briefly examine each filter before moving on to the main experiment.
Different types of filters
SMA (Simple moving average)
The simple moving average, or SMA, is the simplest of all the filters. Instead of taking the current observation value, it takes the average of the observations inside the window.
This is the formula for the SMA:
$$\mathrm{SMA} = \frac{A_1 + A_2 + \dots + A_n}{n}$$
Where $A_i$ is the $i$-th observation in the window and $n$ is the number of observations.
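As a minimal sketch of the formula above, assuming the window holds the most recent frames including the current one, the SMA can be written with a plain for loop. The function name `sma_for_loop` and the sample values are made up for illustration and are not taken from the experiment.

```python
def sma_for_loop(data, window):
    """Simple moving average over the most recent `window` values.

    For the first few frames, where fewer than `window` values exist,
    the average is taken over the values seen so far (an assumption).
    """
    smoothed = []
    for i in range(len(data)):
        start = max(0, i - window + 1)      # first index inside the window
        chunk = data[start:i + 1]           # current value plus previous ones
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Illustrative values only.
print(sma_for_loop([1, 2, 3, 4, 5, 6], window=3))
# → [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
```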
EWMA (Exponential Weighted Moving Average)
The main weakness of the previous filter is that all the observations have the same weight, which is not appropriate in most situations. EWMA addresses this issue by weighting the observations differently.
The Exponentially Weighted Moving Average (EWMA) is a quantitative or statistical measure used to model or describe a time series. EWMA is designed so that older observations are given lower weights. The weights fall exponentially as the data point gets older, hence the name exponentially weighted.
The only decision a user of the EWMA must make is the parameter α. The parameter decides how important the current observation is in the calculation of the EWMA. The higher the value of α, the more closely the EWMA tracks the original time series. The formula of the EWMA is:
$$\mathrm{EWMA}(t) = \alpha \, x(t) + (1 - \alpha) \, \mathrm{EWMA}(t-1)$$
Where:
EWMA(t) = moving average at time t
a or α = degree of mixing parameter, with a value between 0 and 1
x(t) = value of signal x at time t
The optimal α depends on the data.
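The recursion can be sketched in a few lines of Python. This is only an illustration, not the Stack Overflow implementation used in the experiment. Seeding the filter with the first observation is an assumption, and the function name `ewma` is hypothetical.

```python
def ewma(data, alpha):
    """EWMA(t) = alpha * x(t) + (1 - alpha) * EWMA(t - 1).

    Assumption: EWMA(0) is set to the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [data[0]]
    for x in data[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Illustrative values only: a step from 0 to 1.
print(ewma([0.0, 1.0, 1.0], alpha=0.5))
# → [0.0, 0.5, 0.75]
```

With a high α the output jumps to the new value quickly, while a low α smooths more but reacts more slowly.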
SG (Savitzky–Golay) filter
To quote the definition from Wikipedia:
A Savitzky–Golay filter is a digital filter that can be applied to a set of digital data points for the purpose of smoothing the data, that is, to increase the precision of the data without distorting the signal tendency. This is achieved, in a process known as convolution, by fitting successive sub-sets of adjacent data points with a low-degree polynomial by the method of linear least squares.
The SG filter is much more complex than the previous filters. Basically, the SG filter fits the data inside each window to a polynomial of the order of your choice.
The overall process is to calculate the coefficients of the fitted polynomial. To find the coefficients, the following least-squares formula needs to be solved:
$$\mathbf{a} = (J^{\mathsf{T}} J)^{-1} J^{\mathsf{T}} \mathbf{y}$$
Where $J$ is a Vandermonde matrix, $\mathbf{y}$ is the current observation data and $\mathbf{a}$ is the vector of polynomial coefficients.
The resulting coefficients are then applied to the data to produce the smoothed values. When the order is 1, it has the same effect as SMA. One limitation is that the order chosen must be smaller than the window size.
The figure below demonstrates the effect of the SG filter, taken from the original publication.
Prior to the experiment, I found that the SG filter's effect is usually quite weak when the window size is small. Therefore, in the experiment I use a window size of 7 and order 3 for a stronger effect.
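For a fixed window size and order, the least-squares fit reduces to a weighted average with constant weights. As a rough sketch for window size 7 and order 3, the code below uses the tabulated smoothing weights for that case, (-2, 3, 6, 7, 6, 3, -2) / 21. It is not the GitHub implementation used in the experiment. Leaving the first and last 3 samples unfiltered is a simplifying assumption, and the name `savgol_7_3` is hypothetical.

```python
# Tabulated Savitzky–Golay smoothing weights for a 7-point window
# and a cubic (order 3) fit; they sum to 21.
SG_WEIGHTS_7_3 = [-2, 3, 6, 7, 6, 3, -2]
SG_NORM_7_3 = 21


def savgol_7_3(data):
    """Smooth `data` with a centred 7-point, order-3 SG filter.

    Assumption: the 3 samples at each edge are returned unchanged,
    because a centred window does not fit there.
    """
    half = len(SG_WEIGHTS_7_3) // 2
    smoothed = list(data)
    for i in range(half, len(data) - half):
        window = data[i - half:i + half + 1]
        weighted = sum(w * v for w, v in zip(SG_WEIGHTS_7_3, window))
        smoothed[i] = weighted / SG_NORM_7_3
    return smoothed


# A single noise spike is reduced to a third of its height.
print(savgol_7_3([0, 0, 0, 1, 0, 0, 0])[3])
```

Because the fit is cubic, data that already follows a cubic curve passes through unchanged, which is one reason the smoothing effect can look weak.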
Kalman filter
Kalman filtering is an algorithm that uses a series of measurements observed over time, including statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.
The algorithm works by a two-phase process. For the prediction phase, the Kalman filter produces estimates of the current state variables, along with their uncertainties. Once the outcome of the next measurement (necessarily corrupted with some error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with greater certainty. The algorithm is recursive.
The diagram below gives a brief overview of the Kalman filter, cited from.
Optimality of Kalman filtering assumes that errors have a normal (Gaussian) distribution.
The Kalman filter also works in multiple dimensions, but for our use case we use a 1-dimensional Kalman filter.
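The two phases can be sketched for the 1-dimensional case as follows. This is a minimal illustration that assumes the true position is constant apart from a small random drift. It is not the Kaggle implementation used in the experiment. The parameters `process_var` and `measurement_var`, and their default values, are hypothetical tuning values.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=1.0):
    """Recursive 1-D Kalman filter for a (nearly) constant value.

    process_var: assumed variance of the drift between frames.
    measurement_var: assumed variance of the measurement noise.
    """
    estimate = measurements[0]   # assumption: start from the first reading
    error_var = 1.0              # assumption: initial uncertainty
    filtered = [estimate]
    for z in measurements[1:]:
        # Prediction phase: the state stays the same, uncertainty grows.
        error_var += process_var
        # Update phase: blend prediction and measurement by certainty.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1 - gain)
        filtered.append(estimate)
    return filtered
```

The gain plays the role of the weighted average described above: when the estimate is uncertain the new measurement dominates, and as the estimate becomes more certain the measurement is trusted less.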
Implementation of the filters
Swift does not include a library for any of these filters by default, so all of our filters are first implemented in Python. We will test the effect of each filter first, then decide which filter is best to implement in Swift.
SMA
Because SMA is a very simple filter, it is possible to write it using just a simple for loop.
Here we use a deque instead, both for faster execution and for conciseness.
```python
# SMA
from collections import deque


def simplemovingaverage(period):
    assert period == int(period) and period > 0, "Period must be an integer >0"

    summ = n = 0.0
    values = deque([0.0] * period)  # old value queue

    def sma(x):
        nonlocal summ, n

        values.append(x)
        # Add the newest value and drop the oldest one from the running sum.
        summ += x - values.popleft()
        # Until the window is full, average only over the values seen so far.
        n = min(n + 1, period)
        return summ / n

    return sma
```
EWMA
The Pandas library includes an EWMA implementation. However, we will use an implementation from Stack Overflow that is based on NumPy, because it runs faster than the Pandas implementation.
As I mentioned before, the value of α depends on the nature of the data. We will first test the default value used by the NumPy-based implementation, which is 2 / float(window + 1). Here, window refers to the window size.
SG
The SG filter is a very complicated filter that is already supported by SciPy. However, we will use this GitHub implementation because the code is concise and you can see what is going on inside the function.
Kalman filter
Again, there are a few variations of the Kalman filter. We will use this Kaggle implementation.
Filter evaluators
Now that we have introduced the different types of filters and their implementations, how do we decide which filter is better? We can evaluate filter performance in 2 ways.
The first and most straightforward way is to measure the standard deviation of the filtered data. Because this is a relatively still video, the finger positions are supposed to be relatively still. Therefore, the smaller the standard deviation, the better the filter's effect. The formula for calculating the standard deviation is:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
Where:
- $x_i$ is the $i$-th filtered value, $\mu$ is the mean of the filtered values and $N$ is the number of values.
The second evaluator I have used is the distance to the true finger positions (squared difference). The formula for the squared difference / MSE (mean square error) is:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\mathrm{observed}_i - \mathrm{true}_i)^2$$
Where:
- Observed values are the values obtained from the filter, true values are the values labelled by myself, and $n$ is the number of labelled frames.
How do I determine the true position? I broke the video down into single frames, determined the true positions by hand, and recorded the positions and values in a separate file. For example, this is one frame that I have labelled. There are a total of 30 labelled frames.
Because labelling the true positions of the fingers is very time consuming, I have only labelled the true coordinates of the thumb tip and index tip.
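Both evaluators can be sketched with the Python standard library. The function names, the dictionary keys and the sample numbers below are made up for illustration. The sketch assumes the population form of the standard deviation (`statistics.pstdev`).

```python
from statistics import pstdev


def summed_std(series_by_coordinate):
    """Sum of the standard deviations of each coordinate series."""
    return sum(pstdev(series) for series in series_by_coordinate.values())


def mean_square_error(observed, true):
    """Mean of the squared differences between filtered and labelled values."""
    if len(observed) != len(true):
        raise ValueError("observed and true must have the same length")
    return sum((o - t) ** 2 for o, t in zip(observed, true)) / len(observed)


# Illustrative values only, not data from the experiment.
filtered = {"thumb_x": [0.0, 2.0], "thumb_y": [1.0, 1.0]}
print(summed_std(filtered))                               # → 1.0
print(mean_square_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```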
Results
Standard deviation
These are the standard deviations for the different filters. For ease of presentation, I use the sum of the standard deviations of the thumb x position, thumb y position, index x position and index y position.
As I have stated before, the lower the value the better. In this sense the Kalman filter has the best performance, followed by EWMA, SMA and finally the SG filter.
Mean square difference
These are the results of the mean square difference for the different filters. Again, I have combined the values of the 4 positions into one value. The results are very close to each other: because the filters do not have access to the true values, there is no way a filter can dramatically shift the values towards the true values.
Because the differences are so small, the percentages all round to 100%.
As for filter performance, the Kalman filter again has the best performance, followed by SMA, EWMA and finally the SG filter.
Before stating the conclusion, it is important to note that this experiment assumes that the labelled positions are the true values.
There are 3 possible scenarios in this experiment.
- After filtering, the standard deviation decreases, but the MSE increases.
- After filtering, the standard deviation decreases, and the MSE remains the same.
- After filtering, the standard deviation decreases, and the MSE also decreases.
The first scenario is not ideal, because an increased MSE means the filtered values are further from the true values. In that case it is better not to use a filter. In scenarios 2 and 3, using a filter is better than not using one.
Our experimental result is scenario 3, which means the filter brings the values closer to the true values while also reducing the standard deviation. Therefore, the conclusion is that using a filter is better than not using one.
Second experiment with adjusted EWMA filter
From the previous results, it is clear that the SG filter has the worst performance and should not be used in our case. SMA and EWMA have comparable performance, and the Kalman filter has the best performance. However, our Kalman filter uses all the observation data and is difficult to implement in real time because it is relatively computationally intensive.
In this respect, SMA and EWMA are the preferred filters if possible. With the window size fixed, SMA has no adjustable parameter and therefore cannot be tuned further. For EWMA, however, we can improve its performance by adjusting the α value. If the α value is high, the current observation has greater weight, and vice versa.
We already know that the data contains rather heavy noise, so I will lower the α value by halving the default value to α = 1 / float(window + 1).
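To see what halving α does, the EWMA recursion can be unrolled: the observation from k frames ago receives the weight α(1 − α)^k. The sketch below compares the two α values for a window size of 5. The helper name `ewma_weights` is hypothetical.

```python
def ewma_weights(alpha, n):
    """Weights given to the current frame and the n - 1 frames before it."""
    return [alpha * (1 - alpha) ** k for k in range(n)]


window = 5
default_alpha = 2 / float(window + 1)   # about 0.333
halved_alpha = 1 / float(window + 1)    # about 0.167

# Share of the total weight carried by the 5 most recent frames.
print(sum(ewma_weights(default_alpha, window)))   # about 0.87
print(sum(ewma_weights(halved_alpha, window)))    # about 0.60
```

With the halved α, older frames keep more of the weight, so the output is smoother but also slower to follow a real change in position.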
As seen in the figure below, the Kalman filter still has the best (lowest) standard deviation, but EWMA's standard deviation is very close to it.
This time the adjusted EWMA has the best MSE out of all the filters.
Experiment with moving data
One important assumption that I have made is that the hand is relatively stationary. What happens if the hand is moving all the time? I repeated the same procedure with the hand making a circular motion.
This is the MSE result for the moving data:
As you can see, applying a filter significantly increases the MSE, which means it is better not to use any filter at all.
Conclusion
After adjusting the α parameter, EWMA has the best mean square difference and a standard deviation very close to that of the Kalman filter. EWMA also has the advantage of being easy to implement. Therefore, the best filter to apply to reduce noise is EWMA. However, this relies on one important assumption, that the hand is not moving. Otherwise it is better not to apply any filter at all. Another possible solution is to use a multi-dimensional Kalman filter, which is difficult to set up correctly.
By applying a filter to stationary data, we have reduced the noise, making our data more stable. This creates one problem, however: how do we determine whether the hand is stationary? I think there are two approaches to this problem.
The first approach is to design the app according to the use cases. For example, switch off the filter when the user is expected to move their fingers around. When the hand is stationary, switch the filter back on. This is the approach I will use for now.
The second approach is to measure the velocity of the hand in addition to the position data, then turn the filter on and off according to the velocity. This can be done with a Kalman filter, or by estimation based on the positional data. This is the preferred approach, but it will take time to implement.
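As a rough sketch of the second approach, the velocity can be estimated from the displacement between consecutive frames, and the EWMA can be bypassed whenever that displacement exceeds a threshold. This is an untested idea, not something evaluated in the experiment. The function name `gated_ewma` and the `speed_threshold` parameter are hypothetical, and a suitable threshold would have to be found by experiment.

```python
import math


def gated_ewma(points, alpha, speed_threshold):
    """Apply EWMA to (x, y) points only while the hand is nearly still.

    Speed is approximated by the distance between consecutive raw points
    (units per frame). Above the threshold the raw point is passed through.
    """
    filtered = [points[0]]
    previous_raw = points[0]
    for point in points[1:]:
        speed = math.dist(point, previous_raw)
        if speed > speed_threshold:
            filtered.append(point)          # moving: no filtering
        else:
            last = filtered[-1]             # stationary: smooth with EWMA
            filtered.append((
                alpha * point[0] + (1 - alpha) * last[0],
                alpha * point[1] + (1 - alpha) * last[1],
            ))
        previous_raw = point
    return filtered
```

Because the raw displacement is itself noisy, a real implementation would probably need to smooth the speed estimate as well.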
In the future, I will investigate how to reduce noise for moving data.
Appendix
Original publication of the SG filter https://aip.scitation.org/doi/pdf/10.1063/1.4822961
Detailed explanation of 1d Kalman filter https://www.kalmanfilter.net/kalman1d.html
Definition of EWMA https://www.wallstreetmojo.com/ewma/