CS205 Computing Foundations for Computational Science Final Project
Group 4: Weihang Zhang, Xuefeng Peng, Jiacheng Shi and Ziqi Guo
Harvard University, Spring 2018
In this project, we use big compute techniques to parallelize image stitching algorithms, so that we can stitch video streams from adjacent cameras into a single panoramic view in real time.
Image stitching or photo stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image (example below).
Image stitching is a rather complicated application in computer vision. It is composed of several key stages, some of which involve heavy computation. We will give an intuitive explanation of each of the key procedures below. Please follow the links provided for more technical details, and please refer to the Design Approach section for a complexity profiling of these tasks.
As a first step, keypoints need to be identified on the two images to be stitched. These keypoints usually correspond to the most distinctive features of an image, such as corners and edges of objects. Several well-known algorithms tackle this task, including Harris Corner Detection, Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). SIFT and SURF are the state of the art due to their robustness.
Not only do these algorithms identify the keypoints, they also generate a descriptor vector for each keypoint. These descriptors capture information about the keypoint's location, orientation and relation to its surroundings.
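We parallelized a standalone SURF implementation (see the Design Approach section), but for reference, detection and description can also be reproduced through OpenCV's contrib module. A minimal sketch, assuming opencv_contrib is installed; the file name and Hessian threshold are illustrative:

```cpp
#include <opencv2/imgcodecs.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("left.jpg", cv::IMREAD_GRAYSCALE);

    // Create a SURF detector; 400 is a commonly used Hessian threshold.
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(400);

    std::vector<cv::KeyPoint> keypoints;  // locations, scales, orientations
    cv::Mat descriptors;                  // one 64-dim row per keypoint
    surf->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
    return 0;
}
```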
After keypoint detection, each image has a set of keypoints along with their descriptor vectors. We then need to establish matching relations between these descriptors; most methods are based on the Euclidean distance between them. For each keypoint in image 1, if its best match in image 2 is significantly better than its second-best match, we consider the best match valid.
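A minimal sketch of this ratio test using OpenCV's brute-force matcher (the 0.7 ratio is a typical choice, not our tuned value):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Keep a match only when its best distance is significantly smaller than
// the second-best distance (the ratio test described above).
std::vector<cv::DMatch> ratioTestMatch(const cv::Mat& desc1,
                                       const cv::Mat& desc2,
                                       float ratio = 0.7f) {
    cv::BFMatcher matcher(cv::NORM_L2);      // Euclidean distance
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);  // two nearest neighbors

    std::vector<cv::DMatch> good;
    for (const auto& pair : knn)
        if (pair.size() == 2 && pair[0].distance < ratio * pair[1].distance)
            good.push_back(pair[0]);
    return good;
}
```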
Once we have established matching keypoints between images, we want to estimate a transformation matrix H that will be used to warp the image. Normally, Random Sample Consensus (RANSAC) is used to derive a reliable transformation matrix after removing false matches: the algorithm iteratively tries random subsets of matches and keeps the transformation that best fits all of them.
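Conceptually, each RANSAC iteration samples four matches, fits an exact homography to them, and scores the model by counting inliers. A bare-bones sketch (the iteration count and pixel threshold are illustrative, and degenerate samples are not handled):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>
#include <cstdlib>
#include <vector>

cv::Mat ransacHomography(const std::vector<cv::Point2f>& src,
                         const std::vector<cv::Point2f>& dst,
                         int iters = 1000, double thresh = 3.0) {
    cv::Mat bestH;
    size_t bestInliers = 0;
    for (int it = 0; it < iters; ++it) {
        // 1. Randomly sample four candidate correspondences.
        cv::Point2f s[4], d[4];
        for (int k = 0; k < 4; ++k) {
            int i = std::rand() % static_cast<int>(src.size());
            s[k] = src[i];
            d[k] = dst[i];
        }
        // 2. Fit an exact homography to the four-point sample.
        cv::Mat H = cv::getPerspectiveTransform(s, d);

        // 3. Score the model: how many matches does it explain?
        std::vector<cv::Point2f> proj;
        cv::perspectiveTransform(src, proj, H);
        size_t inliers = 0;
        for (size_t i = 0; i < src.size(); ++i)
            if (std::hypot(proj[i].x - dst[i].x, proj[i].y - dst[i].y) < thresh)
                ++inliers;

        // 4. Keep the best-fitting model seen so far.
        if (inliers > bestInliers) { bestInliers = inliers; bestH = H; }
    }
    return bestH;
}
```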
With the transformation matrix, we can project the right image onto the plane of the left image. This is called warping.
Finally, we have the original image 1 and the warped image 2, and we can stitch them together by placing pixels from both images onto a blank canvas.
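A minimal sketch of this step using inverse mapping, assuming 8-bit BGR images, nearest-neighbor sampling, and a canvas pre-filled with image 1 (Hinv is the inverse of the estimated homography):

```cpp
#include <opencv2/core.hpp>

// For every canvas pixel, ask where it came from in image 2 and copy
// that pixel over if the source location is valid.
void warpAndStitch(const cv::Mat& img2, const cv::Mat& Hinv, cv::Mat& canvas) {
    for (int y = 0; y < canvas.rows; ++y) {
        for (int x = 0; x < canvas.cols; ++x) {
            // Apply the inverse homography in homogeneous coordinates.
            double w  = Hinv.at<double>(2,0)*x + Hinv.at<double>(2,1)*y + Hinv.at<double>(2,2);
            double sx = (Hinv.at<double>(0,0)*x + Hinv.at<double>(0,1)*y + Hinv.at<double>(0,2)) / w;
            double sy = (Hinv.at<double>(1,0)*x + Hinv.at<double>(1,1)*y + Hinv.at<double>(1,2)) / w;

            // Nearest-neighbor sampling keeps the sketch short.
            int ix = static_cast<int>(sx + 0.5), iy = static_cast<int>(sy + 0.5);
            if (ix >= 0 && ix < img2.cols && iy >= 0 && iy < img2.rows)
                canvas.at<cv::Vec3b>(y, x) = img2.at<cv::Vec3b>(iy, ix);
        }
    }
}
```

Both loops here are embarrassingly parallel, which is what makes warping a good target for the parallelization described later.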
Border blending smooths out the differences in light and hue across the stitching seam, so that the stitched image looks more homogeneous.
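A simple form of this is linear (alpha) blending: the weight of image 1 ramps from 1 to 0 across the overlap. A sketch assuming the overlap is the vertical band of columns [overlapStart, overlapEnd):

```cpp
#include <opencv2/core.hpp>

// Linearly blend the two images over the overlapping columns so the
// seam fades from pure left-image pixels to pure right-image pixels.
void blendSeam(const cv::Mat& left, const cv::Mat& right,
               int overlapStart, int overlapEnd, cv::Mat& out) {
    for (int y = 0; y < out.rows; ++y) {
        for (int x = overlapStart; x < overlapEnd; ++x) {
            float a = static_cast<float>(overlapEnd - x) /
                      static_cast<float>(overlapEnd - overlapStart);
            const cv::Vec3b& L = left.at<cv::Vec3b>(y, x);
            const cv::Vec3b& R = right.at<cv::Vec3b>(y, x);
            for (int c = 0; c < 3; ++c)   // blend each color channel
                out.at<cv::Vec3b>(y, x)[c] =
                    cv::saturate_cast<uchar>(a * L[c] + (1.0f - a) * R[c]);
        }
    }
}
```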
The key idea of our project is to focus on the word ‘real-time’. Although there are existing algorithms for the application, doing it in ‘real-time’ can still be challenging.
Imagine a scenario where we need a real-time panoramic streaming view: the backend application has to stitch the views of different cameras together frame by frame. Even for a laggy video of 5 frames per second, the processing time for each pair of images must be at most 0.2 seconds, which is very difficult to achieve with regular computational resources. The faster we can stitch one pair of images, the more pairs we can process per second, resulting in a higher frame rate (Frames Per Second, FPS). FPS is a direct measure of how fluid the video looks to human eyes.
In addition, most existing real-time stitching solutions assume that all the cameras stay still during the whole streaming process, so that only a single round of feature detection, feature matching and transformation estimation is required. This assumption, however, places too heavy a limitation on the applications of the system. We make no such assumption in our project: ideally, we want to be able to stitch the views of several handheld cameras. The price is that we must detect features, match features and estimate transformations at every frame of the video. This requires us to accelerate the code further while preserving the quality of the overall stitching result.
In addition, with high performance computing, we can stitch images of higher resolution at a visually fluid speed. This is the whole motivation for parallelizing the task of image stitching.
There are several challenges associated with our project.
We based our parallelization on a sequential implementation of keypoint detection and matching using Speeded-Up Robust Features (SURF). The code can be found at https://github.com/abhinavgupta/SURF.
To complete the stitching process, we implemented the remaining stages: transformation estimation, warping and stitching. For transformation estimation, we relied on OpenCV's implementation, as this step is not among the computationally heaviest. Warping and stitching we implemented without any vision libraries, so that we have full control over their parallelization.
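For reference, the OpenCV call we rely on is essentially the following (the 3.0-pixel reprojection threshold shown here is the library's typical default, not necessarily our tuned value):

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Estimate the homography with OpenCV's RANSAC-based solver.
// src/dst hold the matched keypoint coordinates from the two images.
cv::Mat estimateTransform(const std::vector<cv::Point2f>& src,
                          const std::vector<cv::Point2f>& dst) {
    return cv::findHomography(src, dst, cv::RANSAC, 3.0);
}
```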
As for data, we shot images and videos with two Logitech HD Webcam C615 cameras, each capable of capturing video at up to 1080p.
We ran the sequential stitching code on two images of different resolutions and plotted the execution time of each stage. We omit transformation estimation, as OpenCV's implementation is very efficient and takes negligible time compared to the other stages.
With the larger image, keypoint detection, keypoint description and keypoint matching are the three most time-consuming steps. With the smaller image, the running time of matching drops so much that it takes even less time than warping. This is expected: a larger image yields many more interest points, and brute-force matching time grows quadratically with the number of keypoints. Given that the maximum video resolution we target is 1080p, which is smaller than the smaller test image, we need not be too concerned about parallelizing matching. As for stitching, it is clearly negligible compared to the other steps.
This profiling tells us that keypoint detection, keypoint description and warping are the computationally heaviest stages, and therefore the ones we should parallelize the most.
On the code level we use OpenMP and OpenACC, and as an advanced feature we employ multi-threading for pipelining. However, we do not delve into distributed-memory parallelization with MPI: with some experimentation, we found that the cost of communicating high-resolution frames and keypoint vectors is too high for a real-time program.
Below we detail our use of these parallel computing paradigms.
For shared memory parallelization, we mainly need to add directives to independent loops, taking care to differentiate shared variables from private variables and to mark regions that must be executed by one thread at a time as critical.
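The shape of these directives is illustrated below on a toy per-row loop (not one of our actual stitching loops): iterations are independent, per-thread temporaries stay private, and the update of the shared result is marked critical:

```cpp
#include <omp.h>
#include <algorithm>
#include <vector>

// Scale every pixel and track the global maximum. The image and the
// result are shared; 'y' and 'rowMax' are private to each thread.
void scaleAndTrackMax(std::vector<float>& img, int width, int height,
                      float& globalMax) {
    #pragma omp parallel for shared(img)
    for (int y = 0; y < height; ++y) {
        float rowMax = 0.0f;                // private per thread
        for (int x = 0; x < width; ++x) {
            img[y * width + x] *= 0.5f;     // independent iterations
            rowMax = std::max(rowMax, img[y * width + x]);
        }
        #pragma omp critical                // one thread at a time
        globalMax = std::max(globalMax, rowMax);
    }
}
```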
GPU acceleration is very well suited to this type of task: the many cores of a GPU provide fundamentally greater computing power. However, we need to pay special attention to a few aspects, most importantly keeping host-device memory copies under control, since transfer overhead can easily cancel out the speedup.
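As an illustration of the kind of OpenACC code involved (a toy loop, not our actual kernel), explicit data clauses keep the host-device transfers visible:

```cpp
// Compiled with an OpenACC compiler such as PGI's pgc++ (-acc).
// copyin/copyout make the host-device transfers explicit, since
// redundant copies can easily erase the GPU's advantage.
void brighten(const float* in, float* out, int n) {
    #pragma acc data copyin(in[0:n]) copyout(out[0:n])
    {
        #pragma acc parallel loop   // each iteration maps to a GPU thread
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * 1.5f;
    }
}
```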
The whole task of image stitching can be broken down into various stages. In addition to the algorithms mentioned previously, we also need to extract frames from the webcam video streams and output video from the stitched images.
In the sequential version, after one stage finishes, it must wait for all downstream stages to complete before it can start again. The critical path is therefore the sum of all stages in the workflow.
By pipelining the different stages across the available CPU threads, we achieve task-level parallelization and reduce idle time. Each pair of adjacent task threads shares a pointer between them; a thread updates the pointer once it finishes its current iteration and jumps right into the next. In this way, the total execution time is constrained by the slowest of the four threads. We designed our pipeline as shown in the diagram below:
In addition to the main thread, the pipeline utilizes three dedicated threads: two for capturing and one for stitching. The main thread has to render images due to a constraint of OpenCV; it continuously renders the stitched images inside a critical section secured by a mutex lock, which protects the threads from potential race conditions. The capture threads share pointers with the stitching thread to pass the captured camera frames along.
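A skeleton of the capture-to-stitch handoff, simplified from the pipeline above (names are illustrative, and the rendering main thread and second capture thread are omitted):

```cpp
#include <mutex>
#include <thread>
#include <opencv2/core.hpp>
#include <opencv2/videoio.hpp>

std::mutex frameLock;
cv::Mat latestFrame;  // shared slot: capture thread writes, stitch thread reads

void captureLoop(int cameraId) {
    cv::VideoCapture cap(cameraId);
    cv::Mat frame;
    while (cap.read(frame)) {
        std::lock_guard<std::mutex> guard(frameLock);  // critical section
        frame.copyTo(latestFrame);  // publish the newest frame, keep going
    }
}

void stitchLoop() {
    for (;;) {
        cv::Mat input;
        {
            std::lock_guard<std::mutex> guard(frameLock);
            input = latestFrame.clone();  // take a stable private copy
        }
        if (input.empty()) continue;
        // ... detect keypoints, match, warp and stitch 'input' here ...
    }
}

int main() {
    std::thread cam(captureLoop, 0), stitcher(stitchLoop);
    cam.join();
    stitcher.join();
}
```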
To achieve maximum speedup, we can integrate task-level parallelization with procedure/loop level parallelization.
The way we designed our task-level pipeline allows for integration with GPU acceleration. In our pipeline, the stitching thread is the only thread that does heavy computation, so we can accelerate it by running it on the GPU. The other threads, such as image I/O, are less compute-intensive and run fine on the CPU. The main constraint is that OpenACC kernels can only be launched from a single host thread, and OpenACC does not expose multiple CUDA streams. As future work, we aim to implement a pure CUDA version of this code, which would let us split the computationally expensive stitching thread into more concurrent threads running different stages in different CUDA streams.
To evaluate the performance of our parallelized program, we use videos captured at different resolutions: the higher the resolution, the more computationally heavy the stitching. We use the average execution time per frame to compute speedup, and FPS as a measure of throughput.
As for the testing platform, we use an AWS EC2 t2.2xlarge instance for the baseline and the shared-memory parallel program. For the GPU programs, we evaluate performance on two devices: a local PC with a discrete GPU and an AWS EC2 g3.4xlarge instance. Their specifications are tabulated below:
Specification | Local PC | AWS g3.4xlarge |
---|---|---|
Operating system | Ubuntu 16.04 | Ubuntu 16.04 |
CPU model | Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz | Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz |
vCPU(s) | 8 | 8 |
GPU model | NVIDIA GeForce GTX 1070 | NVIDIA Tesla M60 |
GPU architecture | Pascal | Maxwell 2.0 |
Core clock / boost clock (MHz) | 1506/1683 | 930/1180 |
Floating-point performance (GFLOPS) | 6463 | 4833 |
All code and test cases are documented with detailed instructions in our GitHub repository.
In this section, we present the performance results of the serial version as a baseline.
First, let's look at the execution time of each stage in the stitching process for videos of different resolutions. We only include the stages that involve heavier computation.
Stage (time in s) | 480p | 720p | 1080p |
---|---|---|---|
Keypoint detection | 0.255 | 0.565 | 1.299 |
Keypoint description | 0.187 | 0.239 | 0.347 |
Keypoint matching | 0.025 | 0.040 | 0.081 |
Homography | 0.005 | 0.007 | 0.034 |
Warping | 0.051 | 0.115 | 0.261 |
Stitching | 0.001 | 0.004 | 0.007 |
Total | 0.524 | 0.970 | 2.029 |
From this benchmark of the serial version, we can see that most stages of the stitching process grow roughly linearly with the number of pixels in the frame, which is the problem size. Among all the stages, keypoint detection, keypoint description and warping are the most computationally heavy, so they are the ones we focus on parallelizing.
Then let's see how the FPS varies for videos of different resolutions.
Resolution | 480p | 720p | 1080p |
---|---|---|---|
FPS | 1.7 | 1 | 0.5 |
We can see that with the serial version, even for 480p videos, we only achieve 1.7 FPS. This is clearly not enough for a visually fluid video.
In this section, we show the performance results after implementing shared memory parallelization with OpenMP.
Stage (time in s) | 480p | 720p | 1080p |
---|---|---|---|
Keypoint detection | 0.064 | 0.129 | 0.229 |
Keypoint description | 0.272 | 0.502 | 0.928 |
Keypoint matching | 0.017 | 0.021 | 0.031 |
Homography | 0.001 | 0.003 | 0.008 |
Warping | 0.010 | 0.024 | 0.057 |
Stitching | 0.001 | 0.004 | 0.007 |
Total | 0.365 | 0.683 | 1.260 |
Below we plotted the speedup of each stage against the serial version:
From the graph above, we can see that OpenMP achieves a moderate speedup on most stages. However, for keypoint description, a rather costly stage, OpenMP actually slowed things down. This may be because keypoint description is memory-bound, so assigning loop iterations to different threads does not fundamentally speed up the program.
Then let's see how the FPS varies for videos of different resolutions.
Resolution | 480p | 720p | 1080p |
---|---|---|---|
FPS | 2.5 | 1.4 | 0.7 |
Based on FPS, OpenMP produces a more fluid video, but this is still not enough for our goal of "real-time".
In this section, we show the performance results after implementing GPU acceleration with OpenACC.
Stage (time in s) | AWS g3.4xlarge, 480p | AWS, 720p | AWS, 1080p | GTX 1070, 480p | GTX 1070, 720p | GTX 1070, 1080p |
---|---|---|---|---|---|---|
Keypoint detection | 0.016 | 0.033 | 0.073 | 0.011 | 0.027 | 0.061 |
Keypoint description | 0.020 | 0.024 | 0.030 | 0.010 | 0.014 | 0.018 |
Keypoint matching | 0.003 | 0.004 | 0.009 | 0.0007 | 0.003 | 0.006 |
Homography | 0.003 | 0.006 | 0.024 | 0.009 | 0.009 | 0.034 |
Warping | 0.005 | 0.011 | 0.026 | 0.003 | 0.008 | 0.023 |
Stitching | 0.002 | 0.0025 | 0.007 | 0.0007 | 0.001 | 0.013 |
Total | 0.055 | 0.125 | 0.267 | 0.048 | 0.109 | 0.264 |
Below we plot the speedup for the two types of GPU:
For both GPUs, we achieve significant speedup on keypoint detection, description, matching and warping. We did not focus much on homography and stitching, since they take relatively little time to begin with (homography relies on OpenCV's already efficient implementation).
Looking across video resolutions, the speedup is also fairly uniform. This suggests that we kept overheads such as host-device memory copies well under control, so they do not erode the speedup.
Finally, comparing the two GPUs, the GTX 1070, which has more computing power, achieves a higher speedup. This demonstrates the scalability of the program: with a GPU that has a higher clock rate and more GFLOPS, we can scale up to accommodate higher-quality videos and potentially incorporate more complicated video processing techniques.
Metric | AWS g3.4xlarge, 480p | AWS, 720p | AWS, 1080p | GTX 1070, 480p | GTX 1070, 720p | GTX 1070, 1080p |
---|---|---|---|---|---|---|
FPS | 18 | 8 | 3.74 | 20 | 10 | 3.8 |
Based on the FPS table, with the GTX 1070 we can stream stitched video at 20 frames per second for 480p and 10 frames per second for 720p. Video at these frame rates is fluid and shows no apparent delay to the human eye. For 1080p video, the FPS is not as good, but we have shown that the program scales well with the GPU's computing power.
In this section, we show the performance results after implementing task-level parallelization with our multi-threading pipeline, on top of GPU acceleration.
With task-level parallelization, we cannot further reduce the execution time of any individual task. In fact, we even observe slight slowdowns on individual tasks, potentially due to frequent context switches between threads.
However, the important benefit of task-level parallelization is that it decouples image capturing and rendering from the rest of the pipeline so that they can run independently. As a result, the rendering speed of the stitched images improves, as the table below shows:
Resolution | 480p | 720p | 1080p |
---|---|---|---|
FPS | 35 | 14 | 5 |
One option in our software is to stitch frames without recomputing the keypoints and homography at every frame. In that case, image rendering becomes the bottleneck of the pipeline, and faster image rendering libraries could further improve the video FPS.
In this section, we compare the performance results across all parallel versions. First, we tabulate the speedup of each stage for the different parallel versions at a resolution of 720p:
Stage (speedup vs. serial) | OpenMP | OpenACC (Tesla M60) | OpenACC (GeForce GTX 1070) |
---|---|---|---|
Keypoint detection | 4.4 | 17.1 | 20.9 |
Keypoint description | 0.5 | 10.0 | 17.1 |
Keypoint matching | 1.9 | 10.0 | 13.3 |
Homography | 2.3 | 1.2 | 0.8 |
Warping | 4.8 | 10.5 | 14.4 |
Stitching | 1 | 1.6 | 4 |
Total | 1.4 | 8 | 9 |
We can see that on the time-consuming stages, namely keypoint detection, keypoint description and warping, OpenACC achieves significant speedups over the serial version.
In the previous sections, we showed that with parallel computing techniques, chiefly GPU acceleration, we are able to achieve a 9x speedup on 720p videos.
Below we show what stitching two videos from two webcams looks like with different versions of the software. In each video below, the two frames on top are the webcam inputs, and the frame beneath them is the stitched output.
Before parallelization, the stitched video looks very poor due to significant delays.
With OpenACC, the stitched video is very fluid; almost no delay is detectable by the human eye.
With moving cameras, the stitched video shows some perturbation. This can be improved either by mechanically stabilizing the camera movement or by setting the moving-average coefficient higher, an option built into our software.
With blending, the color difference between the cameras is blurred across the stitching seam, and the output video looks more homogeneous. The downside is the extra computation blending incurs; with our pipeline structure, future work could parallelize blending as well.
Through this project, we overcame quite a few obstacles and gained valuable insights along the way.
There are, of course, areas for improvement that call for future work. For example, although we do not favor distributed-memory parallelization with MPI because of the amount of data that would have to be communicated, it might still be worth a try.
In addition, a lateral comparison of the scalability of different keypoint detection algorithms (SIFT, SURF, corner detection) could be interesting, and more advanced operations such as bundle adjustment could be assigned to the remaining available CPU threads.