Introduction

Problem Statement

In this project, we want to use big compute techniques to parallelize the algorithms of image stitching, so that we can stream videos from adjascent camera into a single panoramic view.

Image stitching or photo stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image (example below).

Procedures of Image Stitching

Image stitching is a rather complicated application in computer vision. It is composed of several key stages, some of which involve heavy computation. We will give an intuitive explanation of each of the key procedures below. Please follow the links provided for more technical details, and please refer to the Design Approach section for a complexity profiling of these tasks.

Keypoint Detection and Description

As a first step, keypoints on the two images you want to stitch together need to be identified. These keypoints usually correspond to the most distinguish features of an image, such as corners and edges of an object. There are several famous algorithms that dedicately achieve this task, including Harris Corner Detection, Scale Invariant Feature Transform (SIFT) and Speed-Up Robust Features (SURF). SIFT and SURF are the state-of-the-art due to their robustness.

Not only do these algorithms identify the keypoints, they will also generate a descriptor vector for each of the keypoints. These descriptors will capture information about the keypoints' location, orientation and relation to their surroundings.

Keypoint Matching

After keypoint detection, each image will have a set of keypoints together with their descriptor vectors. We will then need to establish matching relations between these descriptors. Most methods are based on Euclidean distance between descriptors. For each keypoint in image 1, if its best match is significantly better than the second best match, then we consider the best match valid.

Transformation Estimation

Once we have establish matching keypoints between images, we want to estimate a transformation matrix H that will be used to warp the image. Normally, Random Sample Consensus (RANSAC) will be used to derive reliable transformation matrix after removing false matches. The algorithm basically iteratively try out different match combinations and only keep the one that is the best-fitting to the matches.

Warping

With the transformation matrix, we can project the image to the right to the plane that the image to the left is at. This is called warping.

Stitching

Finally, we have the original image 1, and the warped image 2. We can stitch them together by placing pixels from both images on a blank canvas.

Border Blending

Border blending is to smooth out the differences in light and hue across the stitching seam, so that the stitched image can look more homogeneous.

Need for High Performance Computing

The key idea of our project is to focus on the word ‘real-time’. Although there are existing algorithms for the application, doing it in ‘real-time’ can still be challenging.

Imagine the scenario where we need a real-time panorama streaming view, this would require the backend application to stitch the view of different cameras together frame by frame. For a laggy video of 5 frames per second, we will need processing time of at most 0.2 seconds for each pair of images. This would be very difficult to achieve with regular computational resources. With a faster stitching process of one pair of images, we can process more image pairs per second, which would result in a higher Frames Per Second (FPS). FPS is a direct measure of how fluid the video looks to human eyes.

In addition, most of the existing real-time stitching solution assume all of the cameras stay still during the whole streaming process. In this way it only require one single iteration of detecting features, matching features and estimating transformation. However, this assumption posed too much limitation on the application of this problem. We, however, do not have such assumption in our project. Ideally, we want to be able to stitch the view of several handheld cameras. The price we need to pay, is that we will need to detect features, match features and estimate transformations at each frame of the video. This will require us to further accelerate the code while ensuring the quality of the overall stitching result.

In addition, with high performance computing, we can stitch images of higher resolution, at a visually fluid speed. This is the whole motivation of parallelizing the task of image stitching.

Challenges

There are several challenges associated with our project:

Some algorithm such as SIFT or SURF is rather complicated to understand as well as to re-implement. They are not composed solely of independent loops, but have many data dependencies that need to be dismantled before parallelization.
The desirable speedup might be challenging to achieve. The sequential version will take only a few seconds to run once. This means overheads such as memory access, synchronization and communication are by no means negligible. Speeding it up to achieve multiple frames per second entails careful optimization of overheads.
The program might be heavily memory bound, as a high resolution image will have millions of pixels, and we can potentially find a large amount of keypoints. Accessing and communicating the original images and these keypoints together with their descriptors will add to the execution time.
Our parallelization target is not a single algorithm. It is a sequence of algorithms achieving different tasks. We need to consider the possibility of task-level scheduling and parallelization of these different tasks.
This problem is nauturally a realtime programming problem, which implies it might be easily affected by many subtle problems in design and scheduling of the whole process.

Design Approach

Code and Data Source

We base our parallelization based on a sequential implementation of keypoint detection using Speed-Up Robust Features (SURF) as well as matching. The code can be found at https://github.com/abhinavgupta/SURF.

To finish up the whole stitching process, we implemented our own transformation estimation, warping and stitching algorithm. For transformation estimation, we relied on OpenCV's implementation as this step is not one of the computationally heaviest ones. For warping and stitching, we implemented them without using any vision libraries so that we can have full control of the parallelization.

As for data, we shot images and videos on 2 Logitech HD Laptop Webcam C615. It is capable of shooting up to 1080p video.

Profiling

We run the sequential code of stitching on two images of different resolutions and plot the execution time of each stage. We omitted transformation estimation, as OpenCV has a very efficient implementation of it, which takes negligible time compared to the other stages.

With the larger image, keypoint detection, keypoint description and keypoint matching are the three most time-consuming steps. With the smaller image, the running time of matching decreases so much that it even takes shorter than warping. This is expected, as when the image is larger, there will be exponentially more interest points, which takes exponentially more time to match. Given that the maximum video resolution we are targeting is 1080p, which is smaller than the small image, we shouldn't be concerned too much about parallelizing it. As for stitching, it is clearly negligible compared to the other steps.

From this profiling, it tells us we should be more concerned about keypoint detection, keypoint description and warping.

Parallelization Implementation

Parallelization Targets

Based on the profiling results shown above, we can see that keypoint detection, descriptor extraction and warping are the computationally heaviest and thus should be parallelized the most.

Technologies

Programming language: C++
Tools: OpenACC, OpenMP, POSIX pthread

Programming Models

On a code level we will use OpenMP and OpenACC, and as an advanced feature we will employ multi-threading for pipeline. However, we will not delve too deep into distributed memory parallelization with MPI. With some experimentation, we found out that the cost of communicating high resolution videos and keypoint vectors is too high for achieving a real-time program.

Below we detailed our use of the parallel computing paradigms.

Shared Memory Parallelization (OpenMP)

For shared memory parallelization, we just need to add directives to independent loops. Shared variables and private variables need to be differentiated. We also need to specify region that should be executed one thread at a time to be critical.

GPU Acceleration (OpenACC)

GPU acceleration is very suitable for this type of task. The many cores in GPU provide fundamentally larger computing power. However, we need to pay special attention to the following aspects:

Many loops are dealing with image or matrix variables that take up much memory. Redundant memory copies will create huge overhead that might just negate the speedup from parallelization. We need to limit memory copies as much as possible by copying in variables before parallel regions.
OpenACC (PGI compiler) currently does not support many special data structures on device (GPU), for example the mat object in OpenCV, IplImage, and supprisingly, std::vector. In order to parallelize regions where these data structures are used, we created many walk around by rewriting these parts with the most common data structures (supported by PGI compiler) in c++.
For variables that have different values in each iteration of the loop, we need to define those as private variables to ensure the correctness of code results.
Since we did most of our implementation and part of our testing on a Pascal based GPU (GTX 1070), while PGI compiler is solely desinged for Tesla based GPU, we need to pay special attention to compile the OpenACC code on Pascal GPUs.

Advanced Feature: Task-Level Parallelization (Threading)

The whole task of image stitching can be broken down to various stages. In addition to the algorithms mentioned previously, we still need to extract frames from webcam video stream, and output video from stitched images.

In the sequential version, after one task is finished, it needs to wait for all the downstream steps to be completed in order to be restarted. In this case, the critical path is all the stages of the entire workflow.

By pipelining the different stages with the multiple CPU threads available, we can achieve task-level parallelization so that idle time is reduced. Each adjacent task thread share a variable pointer between them and update the pointer once they finish their current iteration and jump right into the next. In this way, the total execution time will be constrained by the bottleneck of the four threads. We designed our pipeline to be the following diagram:

The pipeline utilizes three threads that focus on different tasks. The main thread has to render images due to the constraint of OpenCV. The main thread will continuously render images inside a critical section secured by mutex lock that contains the stitched images, which prevent threads from potential race conditions. The capture threads shared pointers with stitching thread to pass the captured image in the cameras to the stitiching thread.

Hybrid Parallelization

To achieve maximum speedup, we can integrate task-level parallelization with procedure/loop level parallelization.

The way we designed our task-level pipeline allows for integration with GPU acceleration. In our pipeline, the stitching thread is the only thread that needs to do heavy computation. Therefore, we can accelerate the stitching thread by letting it operate on GPU. The other threads such as image i/o are less compute-intensive, and thus will be just fine operating on CPU. The only constraint here is that OpenACC only allows to be called from only one thread. It does not support the existence of CUDA stream. For one of the future work, we are aiming at implementing pure CUDA version of this code and therefore could further divide the computationally expensive stitiching thred into more concurrent threads running tasks of different stages in different CUDA streams.

Parallelism Taxonomy

Performance Results

Evaluation Approach

To evaluate the performance of our parallelized program, we will use videos captured at different resolutions. The higher the resolution, the more computationally heavy it is to stitch the videos. We will use the average execution time of each frame to calculate speedup, and use the FPS as a measure of throughput.

As for testing platform, we will use the AWS EC2 t2.2xlarge instance for baseline and shared memory parallel program. For GPU involved programs, we will evaluate the performance on two devices, a local PC with GPU and an AWS EC2 g3.4xlarge instance. The specifications of the local PC in comparison to the AWS instance are tabulated below:

Table 1. Device specifications for GPU programs
Variable Name	Local PC	AWS g3.4xlarge
Operating system	Ubuntu 16.04	Ubuntu 16.04
CPU model	Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz	Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
vCPU(s)	8	8
GPU model	NVIDIA GeForce GTX 1070	NVIDIA Tesla M60
GPU architecture	Pascal	Maxwell 2.0
Core clock / boost Clock (MHz)	1506/1683	930/1180
Floating-point performance (gflops)	6463	4833

All used codes and test cases are documented with detailed instructions, which can be found at our Github repository.

Serial Version

In this section, we see the performance results of the serial version as a baseline.

First let's see the execution time of each stage in the stitching process for videos of different resolutions. We only include stages that involve heavier computation.

Table 2. Execution time of serial version
Resolution	480p	720p	1080p
Keypoint detection	0.255	0.565	1.299
Keypoint description	0.187	0.239	0.347
Keypoint matching	0.025	0.040	0.081
Homography	0.005	0.007	0.034
Warping	0.051	0.115	0.261
Stitching	0.001	0.004	0.007
Total	0.524	0.970	2.029

From this benchmarking of serial version, we can see that most stages in the stitching process grows linearly with the height of the video, which indicates the problem size. Among all the stages, keypoint detection, keypoint description and warping are the most computationally heavy stages that we should focus on parallelizing.

Then let's see the how Frames Per Second varies for videos of different resolutions.

Table 3. FPS of serial version
Resolution	480p	720p	1080p
FPS	1.7	1	0.5

We can see that, with the serial version, even with 480P videos, we can only achieve a FPS of 1.7. This is clearly not enough for a visually fluid video.

OpenMP Version

In this section, we show the performance results after implementing shared memory parallelization with OpenMP.

Table 4. Execution time of OpenMP version
Resolution	480p	720p	1080p
Keypoint detection	0.064	0.129	0.229
Keypoint description	0.272	0.502	0.928
Keypoint matching	0.017	0.021	0.031
Homography	0.001	0.003	0.008
Warping	0.010	0.024	0.057
Stitching	0.001	0.004	0.007
Total	0.365	0.683	1.260

Below we plotted the speedup of each stage against the serial version:

From the graph above, we can see that OpenMP achieves some speedup to some extent. However, for keypoint description, which is a rather costly stage, OpenMP did not do very well accelerating it. This might be due to the fact that keypoint description is memory bound, so assigning loops to different threads will not fundamentally speedup the program.

Then let's see the how Frames Per Second varies for videos of different resolutions.

Table 5. FPS of OpenMP version
Resolution	480p	720p	1080p
FPS	2.5	1.4	0.7

Based on FPS, OpenMP results in a more fluid video. But this is still not enough for our goal of "real-time".

OpenACC Version

In this section, we show the performance results after implementing GPU acceleration with OpenACC.

Table 6. Execution time of OpenACC version
Platform	AWS g3.4xlarge			GTX 1070
Resolution	480p	720p	1080p	480p	720p	1080p
Keypoint detection	0.016	0.033	0.073	0.011	0.027	0.061
Keypoint description	0.020	0.024	0.030	0.010	0.014	0.018
Keypoint matching	0.003	0.004	0.009	0.0007	0.003	0.006
Homography	0.003	0.006	0.024	0.009	0.009	0.034
Warping	0.005	0.011	0.026	0.003	0.008	0.023
Stitching	0.002	0.0025	0.007	0.0007	0.001	0.013
Total	0.055	0.125	0.267	0.048	0.109	0.264

Below we plot the speedup for the two types of GPU:

For both GPUs, we achieve significant speedup on keypoint detection, description, matching as well as warping. We did not focus too much on homography and stitching because their OpenCV implementations are already very efficient.

If we look at different video resolutions, we can also see that the speedup is pretty homogeneous. This demonstrates that we controlled overhead like memory copies pretty well so that it does not shrink the speedup.

Finally, if we compare the results of the two GPUs, with GTX 1070, which has more computing power, a higher speedup can be achieved. This shows the scalability of the program - with GPU that has higher clock rate and more Gflops, we can always scale up to accommodate videos of higher quality and potentially incorporate more complicated video processing techniques.

Table 7. FPS of OpenACC version
Platform	AWS g3.4xlarge			GTX 1070
Resolution	480p	720p	1080p	480p	720p	1080p
FPS	18	8	3.74	20	10	3.8

Based on the FPS table, we can see that with GTX 1070, we can stream stitched images at a rate of 20 frames per second for 480P, and 10 frames per second for 720P. Videos of this FPS is very fluid and will not show apparent delays to human eyes. For video of 1080P, the FPS is not as good, but we have shown that the program is highly scalable with the GPU's computing power.

Pipeline Version

In this section, we show the performance results after implementing task-level parallelization with our multi-threading pipeline, on top of GPU acceleration.

With task-level parallelization, we will not be able to further speed up the execution time of individual task. In fact, we even observe some slight slowdown on individual tasks, potentially due to frequent context switches between threads.

However, what's important about task-level parallelization is that it isolates the image capturing and rendering from the entire pipeline so that they can be carried out independently. As a result, we see some improvement in the rendering speed of the stitched images, which can be seen from the table below:

Table 8. FPS of pipeline version
Resolution	480p	720p	1080p
FPS	35	14	5

One option of our developed software is stitching the images without recomputing the keypoints and homography at every frame. In such cases, image rendering will become the bottleneck of the pipeline. With faster image rendering libraries, we could further improve the video FPS.

Comparison

In this section, we compare the performance results across all parallel versions. First, we tabulated the speedup of all stages for different parallel versions at a resolution of 720P:

Table 9. Speedup comparison of different parallel versions
Parallel Version	OpenMP	OpenACC (Tesla M60)	OpenACC (GeForce GTX 1070)
Keypoint detection	4.4	17.1	20.9
Keypoint description	0.5	10.0	17.1
Keypoint matching	1.9	10.0	13.3
Homography	2.3	1.2	0.8
Warping	4.8	10.5	14.4
Stitching	1	1.6	4
Total	1.4	8	9

We can see that on time-consuming stages including keypoint detection, keypoint description and warping, OpenACC is able to achieve significant speedup from the serial version.

Then let's directly compare the resulting FPS of the stitched video stream. Compared with GPU acceleration, the effect of OpenMP is not so significant. GPU is able to fundamentally speed up the program to a real-time level. The highest FPS is resulted from pipelining on top of OpenACC.

Discussions

Insights

Through this project, we conquered quite a few obstacles and obtained some precious insights from the process:

With sufficient optimization, GPU acceleration is very suitable for image stitching. Sub-procedures like keypoint detection, keypoint description and perspective warping all involve repetitive loop operations that benefit tremendously from GPU's computing prowess.
We managed to control overhead in GPU memory copies, by reimplementing the code such that one memory copy before the parallel region is sufficient.
Through the simple idea of using a moving average of homography, video stability is significantly improved without utilizing any complicated stabilization algorithms.
We managed to design a task-level pipeline that exploits the computing resources from both CPU and GPU. It also balances workload to a great extent. This paradigm is extensively useful, as just by initializing additional threads, it can handle additional operations that can improve the stitching results such as bundle adjustment, without incurring too much additional execution time.
The success of this application tells the possibility of making a lot of other computer vision tasks real-time. Sub-procedures such as keypoint detection, perspective warping can be transplanted in a lot of other important applications, such as video stabilization, object contouring and object straightening.

Future Work

There are of course some areas for improvement that call for future endeavors. For example, although we do not favor the idea of distributed memory parallelization with MPI due to the amount of data that have to be communicated, it might still be worth a try.

In addition, a lateral comparison of the scalability of different keypoint detection algorithms (SIFT, SURF, corner detection) might be interesting. Possibly more advanced operations like bundle adjustment can be assigned to the rest of the CPU threads available.

References

SURF code source: https://github.com/julapy/ofxOpenSurf
Data source: videos taken from Logitech webcams
Knowledge references:
- Image stitching overview: http://ppwwyyxx.com/2016/How-to-Write-a-Panorama-Stitcher/
- Keypoint detection: Speed-Up Robust Features (SURF)
- Transformation estimation: Random Sample Consensus (RANSAC)
- Perspective warping: https://github.com/stheakanath/panorama
- Blending: A Multiresolution Spline With Application to Image Mosaics

Real-Time Image Stitching