Real-time Imaging and Mosaicking of Planar Surfaces

LAPPEENRANTA UNIVERSITY OF TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY

Real-time Imaging and Mosaicking of Planar Surfaces

The topic of the master’s thesis has been accepted in the departmental council of the Department of Information Technology, May 17th, 2006.

Examiners: Professor Heikki Kälviäinen and Lasse Lensu, D.Sc. (Tech.)
Supervisor: Joni Kämäräinen, D.Sc. (Tech.)

Lappeenranta, October 19th, 2006

Pekka Paalanen

Korpimetsänkatu 4 B 17 53850 Lappeenranta Tel. 050 3210 527 paalanen@lut.fi http://www.iki.fi/pq/


ABSTRACT

Lappeenranta University of Technology
Department of Information Technology

Pekka Paalanen

Real-time Imaging and Mosaicking of Planar Surfaces

Thesis for the Degree of Master of Science in Technology

2006

116 pages, 64 figures, 13 tables and 2 appendices.

Examiners: Professor Heikki Kälviäinen and Lasse Lensu, D.Sc. (Tech.)

Keywords: mosaic, mosaicking, mosaicing, real-time, video, tracking, egomotion, illumination, visual inspection, imaging, machine vision, computer vision

In a confined environment, imaging a large surface at a sufficient resolution can be difficult. The imaging must be done in parts, and the partial images later joined into a seamless full-view image, or a mosaic. If a user moves the imaging device by hand, the user must get instant feedback to avoid leaving holes in the mosaic and to complete the job quickly.

The objective of this study was to construct a small, portable and accurate imaging device for the paper and printing industry, and develop methods for providing feedback in the form of a real-time view of a rough mosaic image as it is accumulated.

Two imaging devices were built: the first from consumer grade components, and the second from industrial grade components. The images were processed with a standard personal computer. A simple tracking method estimated the transformation between successive video frames, which were composed into the mosaic image at the camera frame rate. Numerical illumination compensation was investigated and implemented.

The first imaging device exhibits problems with illumination and lens distortion, producing poor quality mosaics. The second device solves these issues. The tracking method performs well considering its simplicity, and several improvements are proposed. The results of the study show that real-time mosaicking using megapixel video imagery is feasible on current consumer grade computer hardware.


TIIVISTELMÄ

Lappeenrannan teknillinen yliopisto Tietotekniikan osasto

Pekka Paalanen

Real-time Imaging and Mosaicking of Planar Surfaces Diplomityö

2006

116 sivua, 64 kuvaa, 13 taulukkoa ja 2 liitettä.

Tarkastajat: Professori Heikki Kälviäinen TkT Lasse Lensu

Hakusanat: mosaiikki, kuvamosaiikki, tosiaikainen, video, seuranta, liike, valaistus, visuaalinen tarkastus, kuvaus, konenäkö, tietokonenäkö

Keywords: mosaic, mosaicking, mosaicing, real-time, video, tracking, egomotion, illumi- nation, visual inspection, imaging, machine vision, computer vision

Laajojen pintojen kuvaaminen rajoitetussa työskentelytilassa riittävällä kuvatarkkuudella voi olla vaikeaa. Kuvaaminen on suoritettava osissa ja osat koottava saumattomaksi kokonaisnäkymäksi eli mosaiikkikuvaksi. Kuvauslaitetta käsin siirtelevän käyttäjän on saatava välitöntä palautetta, jotta mosaiikkiin ei jäisi aukkoja ja työ olisi nopeaa.

Työn tarkoituksena oli rakentaa pieni, kannettava ja tarkka kuvauslaite paperi- ja painoteollisuuden tarpeisiin sekä kehittää palautteen antamiseen menetelmä, joka koostaa ja esittää karkeaa mosaiikkikuvaa tosiajassa.

Työssä rakennettiin kaksi kuvauslaitetta: ensimmäinen kuluttajille ja toinen teollisuuteen tarkoitetuista osista. Kuvamateriaali käsiteltiin tavallisella pöytätietokoneella. Videokuvien välinen liike laskettiin yksinkertaisella seurantamenetelmällä ja mosaiikkikuvaa koottiin kameroiden kuvanopeudella. Laskennallista valaistuksenkorjausta tutkittiin ja kehitetty menetelmä otettiin käyttöön.

Ensimmäisessä kuvauslaitteessa on ongelmia valaistuksen ja linssivääristymien kanssa tuottaen huonolaatuisia mosaiikkikuvia. Toisessa kuvauslaitteessa nämä ongelmat on korjattu. Seurantamenetelmä toimii hyvin ottaen huomioon sen yksinkertaisuuden ja siihen ehdotetaan monia parannuksia. Työn tulokset osoittavat, että tosiaikainen mosaiikkikuvan koostaminen megapikselin kuvamateriaalista on mahdollista kuluttajille tarkoitetulla tietokonelaitteistolla.


PREFACE

I wish to thank my supervisor Docent Joni Kämäräinen D.Sc. (Tech.) for keeping me working on this project and also on unrelated fascinating topics, for sharing some of his enthusiasm, and for believing that one day I will get my projects done. Thanks also to Professor Heikki Kälviäinen for keeping me in his research group for so long, although I was just an undergraduate. Lasse Lensu D.Sc. (Tech.), thanks for being there when Joni was unavailable, giving valuable comments and agreeing to examine my thesis.

This work was done in the Laboratory of Information Processing, Department of Information Technology, Lappeenranta University of Technology, from 2005 till October 2006.

The work was part of the Papvision project, ”Paper and Board Printability Testing using Machine Vision.”

The work was funded by TEKES, the European Union, and the companies: Future Printing Center (Ciba Specialty Chemicals, Hansaprint, Omya), Labvision Technologies, Stora Enso, Metso Paper, Myllykoski Paper, and UPM-Kymmene (projects 70049/03, 70056/04, and 40483/05). I am grateful for the financial support.

Special thanks to Docent Ville Kyrki D.Sc. (Tech.) for several important ideas and parts of the software implementation, including the point tracking method, RANSAC framework, and the complex build system.

Acknowledgements go to Pertti Silfsten (PhD), docent of the University of Joensuu, for measuring the spectra and making sure I did not write nonsense about them. Also to Juha Turunen, who took the photographs of the second imaging device.

Thanks, #itlab, for being such a charm.

Finally, thank you to my parents, to whom I owe my existence.

Lappeenranta, October 19th, 2006


CONTENTS

1 INTRODUCTION
1.1 Background
1.2 Imaging Surfaces
1.3 Objectives and Restrictions
1.4 Structure of the Thesis

2 PREVIOUS WORK ON REAL-TIME MOSAICKING

3 THE FIRST IMAGING DEVICE
3.1 Overview
3.2 Camera
3.3 Light Source

4 MOSAICKING METHOD
4.1 Overview
4.2 Point Tracking
4.3 Initializing Tracking Points
4.4 From Points to Camera Egomotion
4.5 Rough Mosaicking

5 THE SECOND IMAGING DEVICE
5.1 Overview
5.2 Acquired Hardware
5.2.1 Camera
5.2.2 Lens
5.2.3 Lights
5.2.4 Frame
5.3 Polarizing Filters
5.4 Focusing and Lens Distortions
5.5 Illumination
5.6 Adjusting Image
5.7 Evaluating Camera Response Linearity

6 COMPENSATING FOR UNEVEN ILLUMINATION
6.1 Radiometric Theory of Image Formation
6.2 Recovering the Surface Color
6.3 Functional-Analytical Approach
6.4 Discussion of the Assumptions
6.5 Practice and Results

7 IMPLEMENTATION DETAILS
7.1 Background
7.2 Image Capture
7.2.1 Video4Linux and PWC driver
7.2.2 Firewire Camera
7.3 Matrix Libraries
7.4 Run-time Performance
7.4.1 The First Imaging Device
7.4.2 The Second Imaging Device
7.5 OpenGL
7.6 Hardware Acceleration

8 EXPERIMENTS
8.1 Inspecting the Real-time Mosaicking Process
8.1.1 Geometrical Quality
8.1.2 Visual Quality
8.1.3 System Performance Feedback
8.2 The Test Setup
8.3 The First Imaging Device
8.3.1 Loop Closing
8.3.2 Operational Limits
8.4 The Second Imaging Device
8.4.1 Loop Closing
8.4.2 Operational Limits
8.5 Stationary Drift
8.6 Example Scans

9 DISCUSSION
9.1 Comparison of the Imaging Devices
9.2 Future Work

10 CONCLUSIONS

REFERENCES

APPENDICES


ABBREVIATIONS AND SYMBOLS

AMD Advanced Micro Devices, Inc.
API application programming interface
ATLAS Automatically Tuned Linear Algebra Software
BLAS Basic Linear Algebra Subprograms
BRDF bidirectional reflectance distribution function
CCD charge-coupled device, a photosensitive device
DCAM digital camera, sometimes related to the Firewire camera standard API
DMA direct memory access
FPGA field-programmable gate array
fr unit: a video frame
GCC GNU Compiler Collection, previously GNU C Compiler
GigE gigabit Ethernet
GPL GNU General Public License
GPU graphics processing unit
IEEE The Institute of Electrical and Electronics Engineers, Inc.
IEEE 1394 a serial bus, also known as Firewire
IIDC Instrumentation & Industrial Digital Camera
LAPACK Linear Algebra Package
LED light emitting diode
MSE mean squared error
PC originally the IBM personal computer
PWC the Philips webcam Linux driver
px unit: a pixel
quad quadrilateral, a four-vertex surface object in 3-space
RANSAC random sample consensus (algorithm)
RGB red-green-blue, three-component color value
RMSE square root of the mean squared error
SSE sum of squared errors
TVMET Tiny Vector Matrix library using Expression Templates
USB Universal Serial Bus
YUV420P a color image format with reduced chrominance resolution

A ∗ B convolution of signal A with signal B
diag(α, β, …) diagonal matrix with the given elements
I identity matrix
^B M_A transformation matrix M from coordinate frame A to frame B
|M| determinant of square matrix M, product of eigenvalues
Tr M trace of square matrix M, sum of diagonal elements or eigenvalues
x a scalar
x a vector
^A x vector x given in coordinate frame A
X a matrix


1 INTRODUCTION

1.1 Background

The need to take digital images is ubiquitous. Whether it is for taking photographs on vacation, environmental images from a satellite, or scanning a paper document into a digital form, digital imaging is everywhere. Digital imaging is especially useful in automatic visual inspection, which is used in almost every field of industry, creating a demand for specialized and accurate imaging technologies. Sometimes restrictions posed on imaging procedures may require that the image is taken in small pieces and then combined into a larger view.

The Papvision project [1] is a joint effort of Lappeenranta University of Technology and companies from the Finnish paper and printing industry. The project is focused on paper printability assessment and has developed automatic evaluation methods for different standard tests in the field. Common to practically all of the tests is that they are based on visual inspection.

Testing equipment had to be acquired to test the methods used in the Papvision project.

In his MSc thesis [2], Sami Lydén describes the imaging system he built for printability assessments, and also basic principles of digital imaging. The imaging system was called Papproto 1, and it was designed as a laboratory device, not meant to be moved outside the laboratory. All tests performed using the device required samples to be imaged in parts to get accurate high resolution images.

One motivation for the work was the need to develop a compact portable measuring instrument that can be used with a modern laptop computer. When a customer complains about paper or print quality, a vendor's representative could take the instrument and go to the customer to accurately measure the delivered product. This could save the effort of transporting the product back to the vendor or factory for laboratory analysis, and also lets the vendor gather evidence about the condition in which the customer received the product.

The traditional definition of a mosaic is a surface decoration made of small pieces of colored glass or stones fitted together to form a pattern or a picture. A digital mosaic image is a digital image composed of several smaller images. One type of mosaic image is a panorama, a wide-angle view stitched together from multiple photographs. The process of aligning and blending small images together to form a coherent mosaic image is called mosaicking.

When a user takes snapshots for a mosaic image, the user has to remember what parts of a scene have not yet been photographed. It would be inconvenient or unacceptable to leave holes in a mosaic. A real-time approach eases the job if a rough version of the mosaic is constructed after every shot, and it can clearly be seen what still needs to be imaged. This can be extended to live video feeds where the requirement of real-time operation becomes a clear necessity. This kind of live video system resembles painting with a brush, but the brush is a camera.

1.2 Imaging Surfaces

A photograph, a picture and a mosaic are inherently two-dimensional and cannot really capture three-dimensional constructs. It is natural therefore to image surfaces, as they can be represented well with a two-dimensional presentation. The main item of interest is usually the surface texture, not so much the surface structure. An interesting application is the panorama image, where a camera rotates around its focal point. No perspective effects can be seen in the images, and they form a two-dimensional mosaic image with the topology of (a part of) a sphere. Correctly photographed and composed panorama images do not contain depth information or distortions due to varying depth.

Surfaces can be imaged with photographic cameras (like the everyday camera used to take family snapshots), line scanners and even point scanners. With a photographic camera you may need to step back to get the whole surface into view and usually the pose (position and orientation) of the camera with respect to the surface is unknown. Line and point scanners rely on physical motion to cover the surface area, either the imaging sensor or the surface (the object) has to be moved. The motion has to be very accurately controlled to get a solid image.

These imaging techniques process the surface in one shot or one predetermined batch of sweeps. Some of the problems with the techniques are that there might not be enough space to get a full view with a photographic camera, or the image resolution might be too low. Scanners usually require that the object is small enough to fit inside the scanner device, and they press the object against glass, which may even destroy the object. On the other hand, scanners achieve very high image resolutions. Line and point scanners can even be used to take spectral images.

There are cases when the imaging device should be small, portable, and operate in confined spaces. That is the motivation of this thesis. Additionally, it may be required that the surface is uniformly illuminated. It can be very difficult to illuminate the whole surface at once, but scanners accomplish uniform illumination very well by illuminating a small portion at a time.

One technique not yet mentioned in this context is mosaicking. It is a higher level technique that can be applied to all of the aforementioned imaging methods. It must be used with a two-dimensional image, i.e., not directly with a line camera, if the mosaic is to be constructed based on image information only. The other option is to use additional sensors to detect the camera pose on each shot. Ordinary scanners use stepper motors or feedback circuits to detect the sensor pose, but the difference from mosaicking is that the overlap between adjacent images is nonexistent.

In this thesis, the term system refers to an imaging device, processing unit, software, and everything needed for creating mosaic images of a surface. The imaging device is a scanner-like device including a camera, light sources and a frame, a compact instrument that is easy to hold and move by hand. Camera refers to the camera housing, sensor array, circuitry, lens, and the necessary software inside the camera module (firmware). A sensor array is the photosensitive element inside a camera, a collection of photosensitive cells or pixels.

1.3 Objectives and Restrictions

The objective of this thesis is to construct a device that can be used to image (scan) relatively large surfaces in small pieces, and to develop a method to automatically create a rough mosaic image on-line, in real-time. The mosaic is a color image.

Camera motion is restricted to two-dimensional translation and rotation on the plane of the surface. The camera sees the surface from directly above at a straight angle. The surface is a plane with no height variations or three-dimensional structures much larger than the image pixel scale. The camera produces a video stream at a minimum rate of 25 fr/s to make the motion appear continuous and immediate to a person operating the device.

The imaging device must be small enough to be portable with a laptop computer. The device should provide its own controlled light sources, eliminating the effect of external light. The device may and should touch the imaged surface to maintain orientation and camera distance. The device is slid across the surface by hand.

The method is developed with a lower quality imaging device, called the first imaging device. The second imaging device is built from industrial grade parts, once sufficient knowledge about selecting suitable components has been gained.

1.4 Structure of the Thesis

This thesis concerns hardware, theory and software implementation required for a functional proof-of-concept level system for real-time mosaicking from a live video stream.

Section 2 takes a look at existing real-time mosaicking applications.

The first imaging device hardware is presented in Section 3. This hardware is used in developing and testing the methods described in Section 4, which gives an overview of the system structure, and presents the methods and algorithms used.

Section 5 introduces the second imaging device hardware in detail and discusses imaging, illumination and camera response. Better hardware made it feasible to programmatically correct for uneven illumination. Theory of uneven illumination compensation and the results are given in Section 6.

Mosaicking process evaluation is described in Section 8 along with documented experiments using both of the imaging devices. The tests evaluate mosaicking accuracy and performance limits of the system.

Technical implementation details are reported in Section 7, including a list of matrix computation libraries and run-time performance tests with respect to hardware acceleration and different imaging devices.

Section 9 discusses the differences between the two imaging devices and the system performance, and presents a plethora of ideas for future work. Section 10 concludes the thesis.


2 PREVIOUS WORK ON REAL-TIME MOSAICKING

Mosaicking or image stitching is an old idea, roughly from the time when photographing was invented, to create larger pictures than can be imaged in a single shot. A good introduction to digital mosaicking is an article by Szeliski [3]. While not considering real-time processing, the article describes well the basics behind two-dimensional mosaicking, creation of panoramas, and even three-dimensional scene reconstruction. Szeliski mentions image intensity based local optimization, hierarchical matching and phase correlation methods for image registration (finding the transformation between images), but leaves out interest point based techniques. A comprehensive review of image registration methods is presented by Zitová and Flusser [4].

VideoBrush™ was a commercial software product for creating digital mosaics and panoramas at the end of the 1990s, but the product and the company seem to have disappeared. Sawhney et al. [5] describe some aspects of the software: it was able to create a rough mosaic in real-time using a pure translation model, and then refined the mosaic off-line. The system accounted for lens distortions of consumer grade cameras.

In a recent article Baudisch et al. [6] present an interactive panorama construction tool that shows the mosaic image after every shot, helping the user to see what still needs to be photographed. While this tool is real-time in the sense that it gives immediate feedback, it still operates with still shots, not video.

Video mosaicking is used in many fields. Marks et al. [7] imaged the ocean floor with an underwater robot vehicle, determining proper snapshot locations based on video images.

Vercauteren et al. [8] use advanced mosaicking methods for fibered confocal microscope images. Bevilacqua et al. [9] construct a background image in real-time for background subtraction and moving object segmentation for pan-tilt-zoom surveillance cameras.

Hafiz et al. [10] developed a hardware accelerated system for real-time registration of aerial video imagery. They use a field-programmable gate array (FPGA) at 30 MHz for computationally intensive parts of their algorithms and assume that the transformation between adjacent video frames is small and limited to translation, rotation and uniform scaling of the image. A registration rate of 12 fr/s is reported for 512-by-512 pixel images. Apparently this does not include blending images into a mosaic.


3 THE FIRST IMAGING DEVICE

3.1 Overview

The first imaging device used in this study (Figure 1) has a wooden frame, a consumer grade Universal Serial Bus (USB) webcam and a light source. The system uses a standard PC and the GNU/Linux operating system. The distance from the target plane to the camera lens is approximately 5 cm and the viewable area using a full image is about 44 by 34 mm.

Figure 1. The first imaging device: wooden frame, camera and light source. Laboratory power supply for the light is seen in the rear.

3.2 Camera

The camera is a Logitech Quickcam Pro 4000, a typical webcam with a USB connector. The largest image resolution the camera can provide is 640 by 480 pixels at 15 fr/s. To get the maximum frame rate of 30 fr/s, the largest possible image size is 320 by 240 pixels. This gives a resolution of 7 px/mm.
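The quoted resolution follows directly from the image size and the viewable area; as a quick check, assuming the 320 by 240 pixel image covers the same 44 by 34 mm view as the full image:

\frac{320\ \text{px}}{44\ \text{mm}} \approx 7.3\ \text{px/mm}, \qquad \frac{240\ \text{px}}{34\ \text{mm}} \approx 7.1\ \text{px/mm},

both of which round to the stated 7 px/mm.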

The only image output format found to be supported in the Video4Linux 1 application programming interface (API) is YUV 4:2:0 planar (YUV420P), which contains a full-resolution luminance channel and two half-by-half resolution chrominance channels.

The camera white balance can be set manually. Settings for brightness, hue, color, contrast and whiteness are provided by the API, although only brightness has any effect on the image. A manual shutter setting is possible, but there is, nevertheless, some automatic adjustment done according to the amount of light, either in the Linux kernel driver or in the camera itself.

Due to the consumer grade of the camera, image noise is clearly observable, but it is low enough for prototype development under good lighting conditions. The small lens of the camera produces noticeable lens distortions when using the full image, and parts of the image are always out of focus. These flaws are actually useful for method development, because they discourage naive solutions that might work only with good quality equipment, and thus any method developed becomes usable on a wider variety of hardware.

3.3 Light Source

The light source (Figure 2a) is a Luxeon Star light emitting diode (LED). Model LXHL-MW1D used in the prototype emits white light in a Lambertian beam pattern and does not include additional optics. According to the specifications [11], the typical color temperature is 5500 K and the emission spectrum has a relatively high peak at 440 nm.

The typical operating voltage for the LED is 3.4 V and the maximum allowed average current is 350 mA. The LED is fed through a typical LM317-based current regulator circuit (Figure 2b), limiting the operating current to 0.32 A. The complete circuit operating voltage range is 6.5–12 V.
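As a plausibility check, assuming the standard LM317 current-regulator topology (the regulator keeps its 1.25 V reference voltage across a single current-sense resistor R; the actual resistor value is not given in the text), the limit current would be

I = \frac{V_\text{ref}}{R} = \frac{1.25\ \text{V}}{3.9\ \Omega} \approx 0.32\ \text{A},

which matches the stated 0.32 A limit if a 3.9 Ω sense resistor is used.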

The emission spectrum of the Luxeon Star LED is presented in the specifications [11], but the spectrum was also directly measured with a VIS-LIGA-Microspectrometer manufactured by the German company microParts. In the micro-spectrometer, light travels through a 102/122 µm multimode step-index fiber to a 625 lines/mm grating whose spectral resolution (half-width) is about 12 nm. The detector is a Hamamatsu S 5463-256 N line sensor containing one cell per 3 nm band. The sensitivity of the micro-spectrometer varies over the wavelengths and is compensated for computationally.


Figure 2. Light source of the first imaging device is a single LED powered through a current limiting circuit: (a) Luxeon Star LED is attached to copper plate for cooling; (b) Current limiting circuit schematic.

The grating resolution limits the discrimination of wavelengths, spreading sharp peaks over a wider range. This does not affect the total power (the area under the curve), and therefore high sharp peaks appear lower and wider. All readings are relative; the absolute intensity was not measured.

The measured spectrum for the LXHL-MW1D LED is presented in Figure 3. The cross-talk lowers the 450 nm peak, which should actually be at 440 nm, but otherwise the distribution resembles that in the data sheet. Characteristic of this type of white LED is the strong peak in the blue range.


Figure 3. Emission intensity distribution for the white Luxeon Star.


4 MOSAICKING METHOD

4.1 Overview

The goal of the system is to construct a ”big picture”, a mosaic, from a live video stream of a static scene. The camera travels an unknown path above a planar surface to be imaged.

The camera motion (egomotion) has to be determined from the image sequence to align the individual images with respect to each other. Then the images are blended together to form the mosaic image. Mosaicking should be done in real time so that a user operating the imaging device can see what is already imaged and what is not.

The egomotion estimate is based on a set of point trackers. A point tracker attempts to track a single spatial point in a scene through time in the video stream. Individual trackers are likely to fail sometimes and give distorted readings because of imaging noise, for instance. That is why a large set of trackers is required, so that enough trackers indicate the true motion. Outliers are pruned from the set of point trackers and the egomotion parameters are estimated from the motion of inlier trackers.

At the beginning of and during scene tracking (due to tracker pruning), new point trackers have to be created. An interest point detector is used to initialize new locations for point trackers to give them a better chance to detect motion correctly. Initializing a tracker to a flat area in an image is useless as different locations of a flat image are indistinguishable; using an interest point detector avoids this. To save execution time, point trackers are reinitialized in batches, not usually on every video frame. This reinitialization phase is called point tracker resurrection.

A flow chart of the method is presented in Figure 4. Image capture works in the background so that the next image is being acquired while processing the current image. The point trackers are updated and failed trackers are removed. Then Random Sample Consensus (RANSAC) [12] is used to prune trackers that do not move uniformly. The remaining trackers are used to estimate camera egomotion. If the egomotion estimate cannot be determined or there are too few trackers left, the track is considered lost. If there are plenty of trackers left, there is no need to initialize further. If the track is lost, the camera position is not updated. Finally, the image is blended into the mosaic at the current estimate of the camera position, and the mosaic image is shown to the user.


Figure 4. Flow chart of the real-time mosaicking method.

4.2 Point Tracking

Image template matching with an exhaustive search inside a search window provides the simplest possible tracking mechanism. Feature based methods (e.g. [13, 14] and a comparison of several more local descriptors in [15]) could also be used, but they are likely to be heavier to compute, and most of the more advanced properties they offer are not required in this work.

Transformation from image to image is constrained to 2-dimensional translation. The transformation model is sufficient because the change between adjacent images is assumed to be small and the effects of other transformations including rotation on the plane are negligible in image areas that are of the template size. The assumptions are supported by the fact that the system is supposed to process images at speeds of 30 fr/s in real-time and the sensor moves relatively slowly.

A point tracker initializes the template image G, of size n_G by m_G pixels, to the part of an image around the point to be tracked. The template image dimensions are odd. In the next image I, a search window is established around the previous known position of the point. For every position (x, y) inside the search window, the sum of squared errors (SSE) E is computed with respect to the template image as

E(x, y) = \sum_{i=0}^{n_G - 1} \sum_{j=0}^{m_G - 1} \left[ I\!\left( x + i - \frac{n_G - 1}{2},\; y + j - \frac{m_G - 1}{2} \right) - G(i, j) \right]^2 .   (1)

The position (x̃, ỹ) producing the minimum E is the integer approximation of the new tracker position.
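A self-contained sketch of this exhaustive SSE search (gray-level images as plain arrays; the names and the search-window handling are illustrative, not the thesis implementation):

#include <limits>
#include <vector>

// Simple single-channel image: row-major pixel values.
struct Image {
    int width = 0, height = 0;
    std::vector<double> pix;
    double at(int x, int y) const { return pix[y * width + x]; }
};

// Exhaustive template matching by the sum of squared errors (Eq. 1).
// Searches a (2*radius+1)^2 window around (px, py) in image I for the best
// match of the odd-sized template G; returns the integer position (bx, by).
void matchTemplateSSE(const Image& I, const Image& G, int px, int py, int radius,
                      int& bx, int& by)
{
    const int hx = (G.width - 1) / 2, hy = (G.height - 1) / 2;
    double bestE = std::numeric_limits<double>::max();
    bx = px; by = py;
    for (int y = py - radius; y <= py + radius; ++y) {
        for (int x = px - radius; x <= px + radius; ++x) {
            if (x - hx < 0 || y - hy < 0 || x + hx >= I.width || y + hy >= I.height)
                continue;                        // template would fall outside the image
            double E = 0.0;
            for (int j = 0; j < G.height; ++j)
                for (int i = 0; i < G.width; ++i) {
                    const double d = I.at(x + i - hx, y + j - hy) - G.at(i, j);
                    E += d * d;                  // Eq. 1
                }
            if (E < bestE) { bestE = E; bx = x; by = y; }
        }
    }
}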

An ad hoc refined position (x̂, ŷ) is computed by comparing the SSE values of the neighboring positions and weighting the positions by the inverse of the SSE:

\hat{x} = \begin{cases}
\dfrac{E(\tilde{x},\tilde{y})^{-1}\tilde{x} + E(\tilde{x}-1,\tilde{y})^{-1}(\tilde{x}-1)}{E(\tilde{x},\tilde{y})^{-1} + E(\tilde{x}-1,\tilde{y})^{-1}} & \text{if } E(\tilde{x}-1,\tilde{y}) < E(\tilde{x}+1,\tilde{y}) \\[2ex]
\dfrac{E(\tilde{x},\tilde{y})^{-1}\tilde{x} + E(\tilde{x}+1,\tilde{y})^{-1}(\tilde{x}+1)}{E(\tilde{x},\tilde{y})^{-1} + E(\tilde{x}+1,\tilde{y})^{-1}} & \text{otherwise}
\end{cases}   (2)

and respectively

\hat{y} = \begin{cases}
\dfrac{E(\tilde{x},\tilde{y})^{-1}\tilde{y} + E(\tilde{x},\tilde{y}-1)^{-1}(\tilde{y}-1)}{E(\tilde{x},\tilde{y})^{-1} + E(\tilde{x},\tilde{y}-1)^{-1}} & \text{if } E(\tilde{x},\tilde{y}-1) < E(\tilde{x},\tilde{y}+1) \\[2ex]
\dfrac{E(\tilde{x},\tilde{y})^{-1}\tilde{y} + E(\tilde{x},\tilde{y}+1)^{-1}(\tilde{y}+1)}{E(\tilde{x},\tilde{y})^{-1} + E(\tilde{x},\tilde{y}+1)^{-1}} & \text{otherwise.}
\end{cases}   (3)

This is based on the assumptions that the true position of the minimum E is not in the middle of the corresponding pixel but is shifted towards the pixel position with the next smallest SSE, and that the inverses of the SSEs roughly indicate the weighting between the positions. This hopefully reduces the error in the matched template position, but it has not been proven.
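The refinement along one axis, as a small sketch (Eq. 2; Eq. 3 is the same with the roles of x and y swapped; the zero-SSE guards are an added safety measure, not part of the original formulation):

// Inverse-SSE weighted sub-pixel refinement along one axis (Eq. 2 / Eq. 3).
// eC, eL, eR are the SSE values at the best integer position xc and at its
// two neighbors xc-1 and xc+1.
double refineAxis(double xc, double eC, double eL, double eR)
{
    if (eC <= 0.0) return xc;                  // perfect match, nothing to refine
    const double wc = 1.0 / eC;
    if (eL < eR) {
        if (eL <= 0.0) return xc - 1.0;
        const double wl = 1.0 / eL;
        return (wc * xc + wl * (xc - 1.0)) / (wc + wl);
    } else {
        if (eR <= 0.0) return xc + 1.0;
        const double wr = 1.0 / eR;
        return (wc * xc + wr * (xc + 1.0)) / (wc + wr);
    }
}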

It is possible that a point tracker cannot successfully track a point. Reasons for this are violations of the assumptions: the scene itself changes, the image changes due to lighting, noise or motion blur, the point moves too far from its previous position, or the camera motion is not close enough to a 2-D translation. However, the point tracker always returns the best fit. To detect tracking failure, E is thresholded: a too high error value causes the tracker to be removed. The threshold should not be set too low; its aim is to detect major tracking errors. In practice the threshold can be estimated from tracking performance charts (see Section 8.1). This pruning is required to reduce the probability that misleading point trackers would gain a majority in the RANSAC phase.

The point tracking method used is very simple and, in theory, computationally heavy due to the exhaustive search. Nevertheless, its performance in both tracking and execution speed is sufficient.

4.3 Initializing Tracking Points

The point trackers should be initialized to scene points that are spatially discriminative in their neighborhood. The size of a neighborhood depends on the point tracker search window. Regardless of scene motion, the sum of squared errors between the template and the image should be the lowest in the correct scene point location in any image and clearly higher everywhere else within the search window. Corner-like structures in an image have the desired quality, and corner detectors should produce good candidates for points to be tracked. A comparison of several different feature point detectors, including corner detectors, can be found in [16, 17, 18].

The choice of a corner detector was driven by ease of applicability. The Open Computer Vision Library (OpenCV) [19, 20] offers a function called cvGoodFeaturesToTrack that takes a single-channel image and returns a list of corner points. The system uses color images, so the single-channel image is computed by simple averaging of the R, G and B channels. Corner detection is based on examining the eigenvalues of the derivative images' covariance matrix. The theory of corner detection is presented next, but the actual implementation in OpenCV has not been verified.

The covariance matrix and eigenvalues can be computed as follows [21]. First the derivative images

I_x = \frac{\partial I}{\partial x} = I \ast \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \qquad I_y = \frac{\partial I}{\partial y} = I \ast \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}^T   (4)

of the input image I(x, y) are computed, where \ast denotes convolution. The sum of squared errors generated by a small dislocation \Delta x, \Delta y can be written as

E(\Delta x, \Delta y) = A (\Delta x)^2 + 2 C \Delta x \Delta y + B (\Delta y)^2   (5)

where

A = I_x^2 \ast w, \qquad B = I_y^2 \ast w, \qquad C = (I_x I_y) \ast w .   (6)

The Gaussian window function w is defined as

w(x, y) = \exp\!\left( -\frac{x^2 + y^2}{2} \right) .   (7)

Note that A, B and C are now images computed with convolution. If a point (x, y) in the original image needs to be inspected, A(x, y), B(x, y) and C(x, y) are used. In the following formulation of E(\Delta x, \Delta y) the three variables are regarded as values at a single point.

According to [21], it can be written that

E(\Delta x, \Delta y) = \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} M \begin{bmatrix} \Delta x & \Delta y \end{bmatrix}^T   (8)

where the matrix

M = \begin{bmatrix} A & C \\ C & B \end{bmatrix} .   (9)

The eigenvalues of the matrix M describe the curvature of the autocorrelation function in a rotation invariant manner. An edge is detected if one eigenvalue is big and the other one is small. If both eigenvalues are big, it corresponds to a corner.

According to the OpenCV documentation, the function cvGoodFeaturesToTrack first computes an eigenvalue image containing the smaller eigenvalue for each pixel, corresponding to the theory presented above. Non-maxima suppression using a 3-by-3 neighborhood is applied to the eigenvalue image, where all values that have higher values in their neighborhood are set to zero. Then the eigenvalues are thresholded to remove too weak corners. The threshold is proportional to the maximum eigenvalue found in the eigenvalue image, which ensures that at least one corner is always found. Then the function discards all weaker corners that are too close to a stronger corner. As a result, the function returns a list of coordinates ordered in descending corner strength (minimal eigenvalue).

The OpenCV function also has another mode of operation in which it uses the Harris [21] corner measure instead of the smaller eigenvalue. There the corner measure function is defined as

R = |M| - k\,(\mathrm{Tr}\, M)^2 .   (10)

The value R is negative on edge points and positive on corner points. If Tr M is near zero, the point is not interesting. The parameter k is not discussed in the original paper [21], but other publications [22, 23] seem to use k = 0.04 without further justification. Using the properties of the determinant and trace, Eq. 10 can be written as

R = AB - C^2 - 0.04\,(A + B)^2 .   (11)

Images A, B and C are used element-wise, and R is the Harris corner measure image, which could then be used in place of the eigenvalue image. This operation mode, however, is not used in the thesis.
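For illustration, a minimal self-contained sketch (not the OpenCV implementation) of both corner measures at a single pixel, given the smoothed products A, B and C from Eq. 6:

#include <algorithm>
#include <cmath>

// Corner measures at one pixel from the smoothed derivative products
// A = Ix*Ix, B = Iy*Iy, C = Ix*Iy after windowing (Eq. 6). Hypothetical helper.
struct CornerMeasures {
    double minEigenvalue;  // smaller eigenvalue of M, used by the default mode
    double harris;         // Harris measure R = |M| - k (Tr M)^2, Eq. 10/11
};

CornerMeasures cornerMeasures(double A, double B, double C, double k = 0.04)
{
    // Eigenvalues of the 2x2 symmetric matrix M = [A C; C B].
    const double trace = A + B;
    const double det   = A * B - C * C;
    const double disc  = std::sqrt(std::max(0.0, trace * trace / 4.0 - det));
    CornerMeasures m;
    m.minEigenvalue = trace / 2.0 - disc;        // large only at corner-like points
    m.harris        = det - k * trace * trace;   // Eq. 11 with k = 0.04
    return m;
}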


4.4 From Points to Camera Egomotion

The camera motion model considers translation and rotation in two dimensions. For a point ^A x in coordinate frame A (video frame t) the transformed point in coordinate frame B (video frame t+1) is

{}^B\mathbf{x} = {}^B R_A \, {}^A\mathbf{x} + {}^B\mathbf{t}_A ,   (12)

where ^B t_A = [t_x\; t_y]^T is the translation vector (the origin of frame A in the coordinates of frame B) and

{}^B R_A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}   (13)

is the rotation matrix with rotation angle θ. The task is to solve for ^B R_A and ^B t_A, given a point pair set \{({}^A\mathbf{x}, {}^B\mathbf{y})_i\}, i = 1, \dots, n. Ideally ^B y_i = ^B x_i, but since the measured new position ^B y has some measurement error, a solution based on minimizing the error must be sought.

Shinji Umeyama in [24] presents a closed-form solution of the least-squares problem of similarity transformation (translation, rotation and uniform scaling) parameter estimation. He defines the mean squared error (MSE)

e^2({}^B R_A, {}^B\mathbf{t}_A, {}^B c_A) = \frac{1}{n} \sum_{i=1}^{n} \left\| {}^B\mathbf{x}_i - ({}^B c_A \, {}^B R_A \, {}^A\mathbf{x}_i + {}^B\mathbf{t}_A) \right\|^2 ,   (14)

where ^B c_A is the scale parameter. His solution to minimizing Eq. 14 uses the covariance matrix Σ_AB between the point sets from frames A and B,

\Sigma_{AB} = \frac{1}{n} \sum_{i=1}^{n} ({}^B\mathbf{y}_i - {}^B\boldsymbol{\mu}) ({}^A\mathbf{x}_i - {}^A\boldsymbol{\mu})^T ,   (15)

where

{}^A\boldsymbol{\mu} = \frac{1}{n} \sum_{i=1}^{n} {}^A\mathbf{x}_i \quad \text{and} \quad {}^B\boldsymbol{\mu} = \frac{1}{n} \sum_{i=1}^{n} {}^B\mathbf{y}_i .   (16)

Singular value decomposition \Sigma_{AB} = U D V^T gives the rotation matrix as

{}^B R_A = U S V^T ,   (17)

where

S = \begin{cases} I & \text{if } |U||V| = 1 \\ \mathrm{diag}(1, 1, \dots, -1) & \text{if } |U||V| = -1 . \end{cases}   (18)

The other parameters are

{}^B c_A = \frac{\mathrm{Tr}(DS)}{\frac{1}{n} \sum_{i=1}^{n} \left\| {}^A\mathbf{x}_i - {}^A\boldsymbol{\mu} \right\|^2} ,   (19)

the matrix trace Tr being the sum of the diagonal elements, and

{}^B\mathbf{t}_A = {}^B\boldsymbol{\mu} - {}^B c_A \, {}^B R_A \, {}^A\boldsymbol{\mu} .   (20)

The last equation is the easiest to understand: ^B t_A is the origin of frame A in the coordinates of frame B. The translation is computed from the mean points ^A µ and ^B µ by first rotating (and scaling) the point ^A µ into frame B and then subtracting it from ^B µ.

Since the camera motion model assumes no change in scale, ^B c_A can be set to 1, and Eq. 14 is exactly the error function for the task using Eq. 12. Therefore Umeyama's solution can be applied without modifications.
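A compact, self-contained sketch of this estimation step in the planar, no-scale case. Instead of a general SVD it uses the equivalent 2-D closed form: after centering the point sets, the least-squares proper rotation angle comes from the summed dot and cross products of the centered pairs. This is an assumed rewrite for the plane, not the author's implementation; it requires at least two point pairs.

#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

struct Motion2D {          // ^B R_A as an angle, plus ^B t_A
    double theta;          // rotation angle
    double tx, ty;         // translation
};

// Least-squares rigid 2-D motion (rotation + translation, no scale) from point
// pairs (a[i] in frame A, b[i] in frame B). Equivalent to the Umeyama solution
// restricted to the plane with the scale fixed to 1.
Motion2D estimateRigidMotion(const std::vector<Vec2>& a, const std::vector<Vec2>& b)
{
    const std::size_t n = a.size();
    Vec2 ma{0, 0}, mb{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        ma.x += a[i].x; ma.y += a[i].y;
        mb.x += b[i].x; mb.y += b[i].y;
    }
    ma.x /= n; ma.y /= n; mb.x /= n; mb.y /= n;

    double dot = 0.0, cross = 0.0;               // symmetric and skew parts of Eq. 15
    for (std::size_t i = 0; i < n; ++i) {
        const double ax = a[i].x - ma.x, ay = a[i].y - ma.y;
        const double bx = b[i].x - mb.x, by = b[i].y - mb.y;
        dot   += ax * bx + ay * by;
        cross += ax * by - ay * bx;
    }
    Motion2D m;
    m.theta = std::atan2(cross, dot);            // optimal proper rotation
    const double c = std::cos(m.theta), s = std::sin(m.theta);
    m.tx = mb.x - (c * ma.x - s * ma.y);         // ^B t_A = ^B mu - R ^A mu  (Eq. 20, c = 1)
    m.ty = mb.y - (s * ma.x + c * ma.y);
    return m;
}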

Point trackers give a set of 2-dimensional translations and positions as illustrated in Figure 5a. There can be a large number of trackers that do not move uniformly according to the camera motion model. These outliers are due to tracking errors, because the tracking method cannot discard all failing trackers. The point trackers do not even care about the global camera motion model.

MSE solutions are sensitive to outliers [12]; therefore, using Umeyama's method on the tracker set in Figure 5a would produce bad estimates of the camera egomotion. Outliers have to be pruned first and the MSE solution computed from the coherently moving set (Figure 5b). RANSAC [12] is the method to use on a data set that has many random outliers when a single global solution is required.

Figure 5. Apparent motion of a set of point trackers. Positions in frame t are circles and positions in frame t+1 are dots: (a) All trackers; (b) Only coherent trackers.

RANSAC is a simple and robust algorithm for fitting a model to given data that contains a significant portion of gross errors [12]. The model introduced above requires at least two point pairs to produce an estimate. The data pool is the collection of all point pairs produced by the point trackers, i.e., the positions in video frames t and t+1. In this task the RANSAC paradigm works as follows (a minimal sketch of the loop is given after the list):

1. Randomly draw the minimum number of point pairs required to form a solution from the data pool. Solve the task with the drawn points, producing a candidate solution (an estimate of the egomotion).

2. With the candidate solution, compute the error for all data with respect to this solution. All data points (pairs) that have an error smaller than a threshold, the RANSAC inlier threshold, belong to a consensus set.

3. If the consensus set is large enough (RANSAC immediate acceptance threshold in number of inliers), use the whole consensus set to solve the task and produce a final solution. Stop.

4. Otherwise, if not too many attempts have been made, go to step 1 to draw a new set of points from the data pool.

5. The maximum number of attempts has been reached. Take the largest consensus set encountered and solve the task to produce a final solution. Stop.

After the algorithm, if the selected consensus set contains too few point pairs (fewer than the specified minimum number of RANSAC inliers), the egomotion estimation is considered to have failed.
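A minimal sketch of this RANSAC loop, reusing the estimateRigidMotion() sketch above as the model solver; the threshold and count parameters are illustrative placeholders, not the values used in the thesis.

#include <cmath>
#include <cstdlib>
#include <optional>
#include <vector>

// Residual of one point pair under a candidate motion: distance between the
// predicted and the measured position in frame B.
double residual(const Motion2D& m, const Vec2& a, const Vec2& b)
{
    const double c = std::cos(m.theta), s = std::sin(m.theta);
    const double px = c * a.x - s * a.y + m.tx;
    const double py = s * a.x + c * a.y + m.ty;
    return std::hypot(px - b.x, py - b.y);
}

// RANSAC over the point pairs (a[i], b[i]); returns the final motion or
// nothing if the consensus set stays too small (track lost).
std::optional<Motion2D> ransacMotion(const std::vector<Vec2>& a, const std::vector<Vec2>& b,
                                     double inlierThreshold, std::size_t acceptCount,
                                     std::size_t minInliers, int maxAttempts)
{
    const std::size_t n = a.size();
    std::vector<std::size_t> best;
    for (int attempt = 0; attempt < maxAttempts && n >= 2; ++attempt) {
        // 1. Draw the minimum sample (two pairs) and solve a candidate motion.
        const std::size_t i = std::rand() % n, j = std::rand() % n;
        if (i == j) continue;
        const Motion2D cand = estimateRigidMotion({a[i], a[j]}, {b[i], b[j]});

        // 2. Collect the consensus set.
        std::vector<std::size_t> inliers;
        for (std::size_t k = 0; k < n; ++k)
            if (residual(cand, a[k], b[k]) < inlierThreshold) inliers.push_back(k);

        if (inliers.size() > best.size()) best = inliers;
        if (best.size() >= acceptCount) break;   // 3. immediate acceptance
    }
    if (best.size() < minInliers) return std::nullopt;   // egomotion estimation failed

    // 5. Re-solve using the whole selected consensus set.
    std::vector<Vec2> ain, bin;
    for (std::size_t k : best) { ain.push_back(a[k]); bin.push_back(b[k]); }
    return estimateRigidMotion(ain, bin);
}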

While this is a robust algorithm and usually produces stable results, it can be fooled. If there are many point trackers moving coherently, but not according to the true camera egomotion, the false movement is detected and used, leading to false tracking.


4.5 Rough Mosaicking

The transformation from video frame t to video frame t+1 has been discovered to be (^B R_A, ^B t_A). The world transformation is updated accordingly (rotation first, then translation):

{}^W R_C \leftarrow {}^W R_C \, {}^B R_A^T , \qquad {}^W\mathbf{t}_C \leftarrow {}^W\mathbf{t}_C - {}^W R_C \, {}^B\mathbf{t}_A .   (21)

W is the world coordinate frame, with origin in the middle of the mosaic image and in the same alignment, and C is the camera coordinate frame, with origin in the middle of the captured image and in the respective alignment.
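A minimal sketch of this pose update, with the camera-to-world pose kept as an angle and a translation (hypothetical names; the update order follows Eq. 21 as reconstructed above):

#include <cmath>

// Camera-to-world pose: ^W R_C as an angle, ^W t_C in mosaic pixels.
struct Pose2D {
    double theta = 0.0;
    double tx = 0.0, ty = 0.0;

    // Apply the frame-to-frame motion (^B R_A, ^B t_A) from the tracker, Eq. 21.
    void apply(double frameTheta, double frameTx, double frameTy)
    {
        theta -= frameTheta;                     // ^W R_C <- ^W R_C ^B R_A^T
        const double c = std::cos(theta), s = std::sin(theta);
        tx -= c * frameTx - s * frameTy;         // ^W t_C <- ^W t_C - ^W R_C ^B t_A
        ty -= s * frameTx + c * frameTy;
    }
};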

With the camera-to-world transformation (^W R_C, ^W t_C) defined in pixels, it is relatively easy to draw the captured image on top of the mosaic image in the right pose. It is beneficial to keep the mosaic image pixels the same size as in the captured image: information loss is minimal without extra processing overhead.

The mosaic image starts as full black when the system comes into operation. Video frame by video frame new images are composed into the mosaic image. When the mosaic is shown to a user, only one image needs to be drawn: the mosaic, not each and every video frame. The individual video frames can be stored to disk for off-line processing, but they take too much space to be kept in memory.

The motion per frame is very small, only a few pixels, so the effective new area of each captured image is near its border. Illumination variations, vignetting (peripheral darkening of the image due to camera optics or structure) and geometrical distortions are usually stronger near the image borders. Even if they are not visible to the naked eye in the captured images, they become clearly visible in the mosaic.

Rough mosaicking is implemented using OpenGL. The mosaic is drawn without blending or antialiasing, both of which might be useful techniques to hide the individual images' edges. Blending uses the target image (mosaic) pixel value in addition to the source image pixel value to compute a new target pixel value. Antialiasing is a technique that smoothes jagged edges that are supposed to look straight. The reason for not using these techniques is to be able to better see alignment errors and illumination problems. Details about the implementation are described in Section 7.5.


5 THE SECOND IMAGING DEVICE

5.1 Overview

The second imaging device in Figure 6 is assembled from industrial quality components into a self-made frame. The camera is connected to a computer with Firewire, and the four segments of the square LED light are individually controllable. The power supply for the LED light (not seen in the figure) allows the segments to be controlled separately.

Figure 6. The second imaging device. Square LED light at the bottom, camera in the middle. Gray cable for the lights, black cable for the camera.

5.2 Acquired Hardware

The camera, the lens and the lights were bought after a bidding competition. The required properties were presented to vendors and the devices were selected based on the received offers.

5.2.1 Camera

The requirement was to get at least 25 fr/s at 0.7–1.0 megapixel resolution. This seems to be near the high end performance of standard industrial cameras that are connected with a standardized digital interface. The platform, a Linux PC, restricted the usable connection types and practically the only choice was IEEE 1394 (Firewire), with IIDC standards compliant devices. IIDC is a general interface standard for industrial (raw data) Firewire cameras allowing them to be used with general driver software. An alternative to Firewire is Gigabit Ethernet (GigE), but at the time GigE cameras were not generally available, nor was open software support.

Image compression usually destroys data, which is not acceptable in this application. Affordable color cameras use a Bayer filter [2] on a single sensor array, providing essentially a single-channel image where each pixel is either red, green or blue. These two facts combined mean that the camera should transmit raw 8-bit-per-pixel Bayer data.

Other criteria were a global shutter (expose all pixels at the same time because of the moving target), progressive scan (not interlaced), square pixels, fast exposure, a C or CS lens mount, and all settings manually controllable.

The selected camera is the FOculus FO323C manufactured by New Electric Technology GmbH. At 1024x768 resolution it can deliver 30 fr/s raw Bayer images via Firewire. Gamma correction is 1.0 by default, providing a linear intensity response (in theory); any other gamma value would introduce a non-linear (power) mapping of the pixel values. The camera can be used with the open source drivers in the Linux kernel and the libdc1394 library. The promised signal-to-noise ratio is ”56 dB or better.” The camera includes an RS-232 serial interface that could be controlled via the Firewire interface, but it is not required here.

Basically the camera works fine and it can be utilized, although there are problems with some settings or image adjustments:

• The camera reports that it cannot adjust the white balance, but when using the corresponding adjustment call, the white balance is clearly changed as it should be.

• The camera reports it can adjust image brightness, but when trying to do so, the software hangs for a couple of seconds, reports failure, and there is no change in the image.


• The camera reports that setting the gamma correction value is supported. When set, the software hangs for a couple of seconds and reports failure, but the gamma setting is effective. The image changes as it should.

According to the manufacturer's representative, these problems are due to defective camera firmware and no update is available.

Another mysterious property is image flickering when illuminated with an incandescent lamp. With LED lights the image is stable. The flickering seems to be affected by the camera shutter speed control. The shutter speed is basically controlled with an integer value. There are small ranges where image intensity stays constant (according to an RGB-histogram), but within those ranges flickering changes as a function of the shutter speed.

The Firewire bus in the used computer hardware configuration is susceptible to electromagnetic interference. The bus simply jams when, for instance, the aforementioned incandescent lamp is switched off. The camera has to be disconnected, or a complete bus reset must be issued with the gscanbus program.

5.2.2 Lens

A lens suitable for the camera was requested. The maximum acceptable working distance was set at 200 mm, but preferably less than 100 mm. The imaged area must be at least 20 by 20 mm, and at most 60 by 60 mm. Geometrical and color distortions should be minimal.

The accepted lens is the VS-LD10 manufactured by V.S. Technology, Japan. It is a small, distortionless macro lens with a C-mount. When installed in the second imaging device, the lens working distance is approximately 11 cm and the viewable area is 57 by 43 mm. The lens can be seen attached to the camera in Figure 7. There is also a polarizing filter that can be attached to the lens.

Figure 7. The second imaging device opened. From the left: square LED light, macro lens, camera, cables.

5.2.3 Lights

Two different light sources were desired, both using white LED technology: one illuminating the surface evenly (general texture imaging) and the other from a grazing angle with separately adjustable segments. The lights need a power source that should be electronically controllable, for automatic imaging with illumination from different directions.

The first light source is an LDR2-70SW ring light by CCS. The lens fits through the ring light if the lock screws are removed from the lens. There is a polarizing filter for this light. The ring light with its polarizing filter is shown in Figure 8.

For high angle illumination, an LDQ-100A-SW square four-segment light, also by CCS, is used. Each segment, one side of the square, has its own power cable. The angle of the segments is adjustable as shown in Figure 7.

The power source for all the white LED lights is a four-channel PD-3024-4 manufactured by CCS. Each channel is adjustable separately with a 16-position rotary switch, but the channels cannot be turned completely off with the switch. The power cables in the lights are too short to be used as such. Extension cables were offered at a relatively high price, hence an extension cable was made from category 5 Ethernet cable. The power connectors were supposed to be special, but common pin headers were compatible enough. The downside is that the new power cables allow connection in the wrong orientation, possibly destroying the LEDs.


Figure 8. The second imaging device viewed from the bottom. The ring light and polarizing filter on the lens were removed later.

As for the Luxeon Star (Section 3.3), the emission spectra of these lights were measured, at both minimum and maximum power. The results are shown in Figures 9 and 10. Note that in all the graphs the intensity maximum is scaled to one, and therefore the vertical scales are not comparable. It appears that the illumination power does not affect the color, and both lights may even use the same LEDs. The intensity distribution is very similar to that of the white Luxeon Star.


Figure 9. The square LED light emission intensity distributions: (a) Minimum power; (b) Maximum power. The vertical scales are arbitrary.


Figure 10. The ring LED light emission intensity distributions: (a) Minimum power; (b) Maximum power. The vertical scales are arbitrary.

5.2.4 Frame

The frame of the second imaging device is made of wood and sheet metal (Figure 11). The square light is fastened to the wooden support with metal parts, and the ring light is attached with a U-shaped metal sheet secured by the same screws holding the camera.

The metal sheet parts were manufactured at the Department of Mechanical Engineering, Lappeenranta University of Technology, Finland. Schematics are in Appendix A.

5.3 Polarizing Filters

An electromagnetic wave, light, is a transverse wave, as explained in [25]. The electric and magnetic field components are perpendicular to the wave propagation direction and also to each other. Light is linearly polarized if the electric field direction varies on a single axis orthogonal to the line of propagation. Unpolarized light is a mixture of polarized light waves in all directions. A wave consisting of two electromagnetic components can also be elliptically polarized, where the polarization axis rotates along the propagation. In the context of imaging and optics, the elliptical case is ignored here and light is said to be polarized or not, referring to a high degree of linear polarization.

The ring light in the second prototype is around the camera objective. A shiny surface produces severe specular reflections as light is reflected directly back at the camera. The specular effect is seen in Figure 12a.


Figure 11. The second imaging device in its original design. The ring light is just under the camera, and the lens’s polarizing filter is sticking out.


Figure 12. The effect of polarization filters in the ring light and the camera. The imaging target is a hardcover book. (a) Both filters at the same orientation, severe specular reflection; (b) Filters are almost orthogonal; (c) Filters are orthogonal and the specular reflections are eliminated.


Polarizing filters can be used to eliminate specular reflections, as the polarization axis does not change in specular reflection at an almost zero angle of incidence (orthogonal to the surface). When both the ring light and the camera have polarizing filters, the specular part of the reflection can be practically filtered out by rotating the filters, as in Figure 12. The diffuse reflection is generally unpolarized, so it can be imaged.

The downside of polarization filtering is that given unpolarized light, the intensity of light passing through is at most half of the incoming intensity [25]. In this case two filters are required so the intensity drops to less than a quarter of that without the filters, not accounting for the specular reflections. It is assumed that the wanted light reflected from the surface is unpolarized, but this might not be the case. Metallic objects, for instance, violate this assumption.

5.4 Focusing and Lens Distortions

Focusing and lens distortions were examined by imaging a millimeter grid printed on paper (Figure 13). The nine highlighted areas were inspected more closely.

The ”distortionless” (marketing term) macro lens seems to keep its promise as there are no perceivable distortions in Figure 13. Therefore there is no need to implement geometrical distortion correction.

Focusing is easier using the nine zoom-in areas on different parts of the full image. Figure 14 contains two collages of these nine areas. Sharpness in different parts of the image is easier to compare in a collage.

Lens aperture has a known effect on sharpness: the smaller the aperture, the sharper the image. In Figure 14a the aperture is completely open; the center is in focus, but especially the lower right corner is out of focus. By adjusting only the aperture of the lens, the whole image can be brought to equal sharpness (Figure 14b). Changes in the aperture must be compensated for in the camera exposure time and gain to get equal image brightness.


Figure 13. Millimeter grid as test pattern, the inverted areas are the zoom-in areas.


Figure 14. Effect of aperture on focus. A smaller aperture yields better depth of field, as can be seen at the edges: (a) Aperture F/2.2; (b) Aperture F/8.


5.5 Illumination

The original design was to use the ring light for general surface texture imaging and the square light for high angle illumination revealing the surface profile. Polarizing filters were attached to the ring light and the camera to hide specular reflections (Section 5.3).

Unfortunately the intensity of light reaching the camera cell was too low, forcing a long exposure time. This resulted in higher image noise and in motion blur already at low movement speeds (the camera view moves too much during an exposure, resulting in a directionally blurred image).

The ring light produced fairly even illumination, but was too weak. Therefore the square light was used to improve intensity, but at sufficient intensity levels the illumination pattern is far from flat. The combined illumination pattern of the ring and square lights is shown in Figure 15a.


Figure 15. The illumination pattern is not flat, and creates color distortions, brighter areas reflecting more red: (a) The ring light and the square light; (b) The square light only.

The square light does not have polarizing filters as the incoming light angle is high enough not to produce any specular reflections on an almost smooth surface.

As the ring light contributes little to the illumination, it can be removed altogether. The illumination unevenness compensation described in Section 6 cannot be avoided, as it is almost impossible to create sufficiently even lighting with the acquired hardware and the physical limitations.


Removing the ring light rendered the polarizing filter in the camera unnecessary, allowing more light to reach the camera cell. Without the ring light in the way, the lens can be adjusted after the device has been assembled.

Figure 15b shows the illumination pattern without the ring light, after device recalibration. As can be seen, the patterns in Figure 15 do not differ significantly. Both images are averaged illumination images used in computing the illumination compensation image (Section 6.5). With ideal illumination, an averaged illumination image is completely flat, apart from noise.

5.6 Adjusting Image

Image adjustments are fairly easy to do with an RGB-histogram display. The camera is pointed at a plain white target; in this thesis a stack of ordinary Xerox paper was used as the target. All settings are preadjusted to make the image readable, not much over- or underexposed.

Of the camera features, the gamma correction factor is set to 1.0. Due to defective firmware or missing implementation in the firmware, the FO323C camera cannot change the hue, saturation or brightness adjustments. White balance or similar adjustments are set to ”neutral”. White balance controls the red and blue (and green, if implemented properly) channel gains, affecting image color tones.

The lens aperture is completely opened as a short exposure time is required in the application. The image is focused with a textured target. The polarizing filter, if present in the lens, is rotated until the image is at its dimmest. This means the filter in the lens and the filter in the light source have polarizing angles perpendicular to each other. An object with specular reflections is a good target for adjusting the polarizer, as all specular reflections should disappear from the image; removing specularities is the sole purpose of the filters. All the following adjustments are performed with the plain white target.

The camera's global gain (signal amplification) is set to minimum. By varying the illumination level, a non-linearity was observed in the histogram response. The phenomenon resembles sensor saturation, but there is no strict threshold. The effect can be seen in Figure 16. The illumination level is adjusted as high as possible before the saturation-like effect appears. The histogram then shows that the pixel values fall far below the maximum of the range (Figure 16a). The gain control is adjusted so that, as in Figure 17a, the highest pixel values are near the maximum but none of them is actually saturated. At this point the white balance can be totally off, white objects having an arbitrary color in the image.


Figure 16. RGB-histograms: (a) Normal; (b) Non-linear effect. Increasing illumination reveals a non-linearity in the camera response. The color channel envelopes become narrow. The vertical scales are relative to the peak.

The effect of the camera's white balance controls is observed, and in this case the red channel control is inadequate. The coarse white balance has to be adjusted by software. The required coefficient for the red channel is estimated from the RGB-histogram (Figure 17a); a value of 550/256 is used. This brings all color channels to the same order of magnitude. The camera's white balance controls are then used to tune the white color.
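As an illustration (a sketch, not the thesis software), applying such a per-channel gain to an interleaved 8-bit RGB buffer:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Apply a software white-balance gain to the red channel of an interleaved
// 8-bit RGB buffer (channel order assumed to be R, G, B). A gain of
// 550.0 / 256.0 corresponds to the coefficient estimated from the histogram.
void applyRedGain(std::vector<std::uint8_t>& rgb, double gain)
{
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        const double r = rgb[i] * gain;
        rgb[i] = static_cast<std::uint8_t>(std::min(255.0, r));  // clamp to 8 bits
    }
}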

A short exposure time is essential to avoid motion blur in the image. The exposure time should be reduced and compensated for by increasing the illumination. The gain control is also of use, but the gain should be kept low as it amplifies image noise. However, the gain should not be decreased from the value discovered earlier, because that would reintroduce the non-linearity.

If the illumination is strong enough, and there is potential to increase the exposure time, the lens aperture should be used to bring down the level of light reaching the camera cell. A smaller aperture (bigger F-number) enhances focus and extends the depth of field. The upper limit \hat{T} (s/fr) for the exposure time can be derived from the maximum pixel velocity \hat{v} (px/s) as

\hat{T} = \frac{\Delta}{\hat{v}} ,   (22)

where \Delta (px/fr) is the allowed motion during an exposure. The optimal value of \Delta depends on the point spread function, the optical fill ratio of the camera cell array and the tolerance to motion blur.

As a rule of thumb, the author proposes \Delta = 0.5 for gray level cameras, and \Delta = 1 for Bayer filter cameras due to their reduced color resolution.
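For a sense of scale, a worked example with an assumed (not measured here) maximum pixel velocity of \hat{v} = 500 px/s and a Bayer filter camera (\Delta = 1 px/fr):

\hat{T} = \frac{\Delta}{\hat{v}} = \frac{1\ \text{px/fr}}{500\ \text{px/s}} = 2\ \text{ms},

i.e., under these assumed conditions the exposure should stay at or below about 2 ms per frame.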

After exposure time, gain and illumination adjustments, the white balance should be rechecked. The histogram of the readjusted image is presented in Figure 17b.


Figure 17. RGB-histograms before and after white balance adjustments: (a) Gain adjusted, but white balance incorrect; (b) Correctly adjusted image. The vertical scales are relative to the peak.

The last step is to account for uneven illumination. This is done by taking several snapshots of the plain white target at different locations. When the snapshots are averaged, possible texture on the white target blends away. The process explained in Section 6 produces a correction image that is used to equalize the illumination in the software.
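A minimal sketch of how such a correction image might be built and applied per pixel, assuming a simple multiplicative model; the actual compensation used is the one derived in Section 6.

#include <algorithm>
#include <cstddef>
#include <vector>

// Build a multiplicative correction image from the averaged white-target image
// and apply it to a captured frame. Single-channel float images, row-major.
std::vector<float> buildCorrection(const std::vector<float>& averagedWhite)
{
    float mean = 0.0f;
    for (float v : averagedWhite) mean += v;
    mean /= static_cast<float>(averagedWhite.size());

    std::vector<float> correction(averagedWhite.size());
    for (std::size_t i = 0; i < averagedWhite.size(); ++i)
        correction[i] = mean / std::max(averagedWhite[i], 1e-6f);  // avoid divide-by-zero
    return correction;
}

void compensate(std::vector<float>& frame, const std::vector<float>& correction)
{
    for (std::size_t i = 0; i < frame.size(); ++i)
        frame[i] *= correction[i];   // brighten the dim areas, keep the mean level
}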

5.7 Evaluating Camera Response Linearity

The camera cell integrates the incoming light over spectrum and time, and circuitry estab-
