
Requirements and methods for video summarization


Academic year: 2023






Requirements and methods for video summarization

Author(s): Henri Kivinen
Confidentiality: Consortium
Date and status: 08.03.2012

This work was supported by TEKES as part of the next Media programme of TIVIT (Finnish Strategic Centre for Science, Technology and Innovation in the field of ICT)


Next Media - a Tivit Programme Phase 2 (1.1-31.12.2011)



Version history:

Version | Date | State (draft/update/final) | Author(s) or Editor/Contributors | Remarks
0.1 | 1.12.2011 | draft | Henri Kivinen |
0.2 | 31.1.2012 | update | Henri Kivinen |
0.3 | 29.2.2012 | update | Henri Kivinen |
0.4 | 9.3.2012 | update | Henri Kivinen |

Participants | Name | Organisation
Researcher | Henri Kivinen | Aalto SCI / Department of Media Technology

next Media www.nextmedia.fi www.tivit.fi


Executive Summary

Content-based video indexing and retrieval still has a number of challenges to solve before it can be fully automated and reliable, and such applications therefore often fail. In these cases it is very practical if the retrieved videos can be browsed quickly and effortlessly to evaluate their relevance and importance. Video summarization provides the means for such browsing by reducing redundant content in video sequences.

In this work we focus on the summarization of user-generated video content, which has its own distinct characteristics compared to other video types such as professionally produced content. As a consequence, summarizing user-generated content is more challenging than summarizing more constrained video content. Because the content is unconstrained, the summarization techniques must be generic enough to cope with the practically unlimited variation in which user-generated content exists. In this report we propose that video summarization based on motion information can be implemented on such a generic level, and that motion-based techniques are therefore currently the most reliable ones to apply to user-generated video content.



Table of Contents

Executive Summary
1 Table of Tables
2 Table of Figures
3 Introduction
4 Structural video analysis
4.1 Temporal video segments
4.2 Methods for sub-shot segmentation
5 Motion information-based video summarization
5.1 Motion saliency
5.2 Camera motion characterization
6 Presentation of video summarization
6.1 Static and dynamic presentation
6.2 Hierarchical presentation
7 Discussion

1 Table of Tables

Table 1. Comparison of video domains (adapted from Wang & Ngo 2011).

2 Table of Figures

Figure 1. Temporal video segments (adapted from Petershon 2007)
Figure 2. Visual skimming based on motion attention (Ma & Zhang, 2002)
Figure 3. Detection of keyframes based on turning points of dominant motion (Liu et al. 2003)
Figure 4. White area of the rectangle represents the motion parameter estimation template (Abdollahian et al. 2010)
Figure 5. Total displacement of a trajectory compared to its cumulative length. (Guironnet et al. 2005)
Figure 6. Comic-like video summarization of a news sequence from the TRECVID 2006 corpus (Calic et al. 2007).


3 Introduction

The purpose of video summarization is to provide a compact but still informative representation of a video sequence. The need for it arises from the very nature of video content, which is always packed with redundant information. For example, a video sequence, which can be a movie, news story, commercial, home video, user-generated video clip, etc., can last anything from less than 30 seconds to several hours. With a common frame rate such as 25 frames per second, a one-minute video clip comprises 1500 still images, and usually one has to view them all before getting an overall gist of the content or making a judgment on its importance or relevance. This can be very time consuming, especially when the number of video sequences grows.

Video summarization is usually seen as complementary to video retrieval, as it makes browsing of retrieved videos faster, especially when content-based video retrieval fails and the user needs to browse through a large set of video sequences (Xiong et al. 2006, Truong & Venkatesh 2007). In these cases, the user could exploit summarized presentations of the retrieved video sequences to locate the relevant ones (Hu et al. 2011). This requires a summarization process that maintains all the salient information while removing all redundant data. The process should then compose an abstract representation, or summary, of the extracted content, which is finally presented to users in a way that facilitates browsing.

The summarization process itself can be implemented with a multitude of different methods and techniques. Recent surveys on video summarization (Truong & Venkatesh 2007; Money & Agius 2008) provide a comprehensive classification and review of various summarization techniques. In addition, Hu et al. (2011) have made an extensive round-up of the developments of the past four years in content-based video indexing and retrieval, including video summarization. These state-of-the-art techniques have usually been developed for a specific type of video content, such as movies, documentaries, commercials, sports, news, or video surveillance. Such domain-dependent video summarization commonly exploits specific features that are identified and available for that particular video domain. In this report we will, however, focus on the summarization of consumer videos, which lack the temporal structure and technical quality of professionally produced and edited video content.

Compared to domain-dependent abstraction of professional video content, such as news (Albiol et al. 2003, Choi & Jeong 2000) or sports (Tan et al. 2000, Rui et al. 2000, Babaguchi 2000), where the structure can be pre-defined and exploited in the analysis, summarization of consumer videos is much more ambiguous. Methods that can be applied to user-generated video content have to be more generic and suitable for any kind of scene regardless of its content.

Implementing machine vision algorithms that generalize to a wider domain is still very challenging. A multitude of researchers are in fact working on problems that stem from the core challenge of finding a fundamental solution for the semantic interpretation of visual information. Fortunately, according to Snoek and Smeulders (2010), broad categorization of image content as the first step for further analysis is within reach.

In the next chapters we start from the first step of video summarization, structural video analysis, and present the techniques that can be utilized for user-generated video content. After structural analysis, we present some of the most promising methods for presenting the actual video summarization. Finally, we present our recommendations for implementing a video summarization tool that can be used effortlessly and efficiently for monitoring user-generated content for news gathering and publishing.

4 Structural video analysis

4.1 Temporal video segments

At its lowest level a video is a sequence of still images that is perceived as a changing and moving visual sequence when the frames are projected in rapid succession. On a syntactic and semantic level, however, a video sequence is a much more complex aggregate than a sequence of still images. To understand the structure of a video sequence more analytically, it can be divided into temporal segments that represent the video as a hierarchical composition of syntactic and semantic units. These units are frame, shot, scene, and program, which form a hierarchy of video segments (Figure 1). A video program (film, commercial, news program) is composed of scenes, which are built from video shots, which in turn are composed of adjacent frames.

[Figure 1: hierarchy of temporal video segments. Program level (film, commercial), scene level (scene 1 to scene 4), shot level (shot 1 to shot 7), frame level (frame i, frame i+1, frame i+2, ..., frame i+n).]


Figure 1. Temporal video segments (adapted from Petershon 2007)

According to Petershon (2010), a temporal video segment can be defined as an unbroken part of a video stream for which a given set of properties is consistent. Thus, the adjacent frames within a temporal video segment are always mutually similar. What varies is the set of properties that defines the similarity, which can be visual, structural, or semantic. Similarity properties can also relate to how the video was produced. For example, a video shot has traditionally been defined as a temporal segment of consecutive frames captured during a single camera operation. Due to the strong correlation between adjacent frames within a shot, the shot has commonly been used as the fundamental unit when a video sequence is analyzed for further processing.

In computational shot boundary detection the boundaries have generally been classified either as cuts, where the transition between successive shots is abrupt, or as gradual transitions, such as dissolves, fades in and out, and wipes, which stretch over a number of frames (Hu et al. 2011). Hence shot boundary detection, as a research topic, has been directed at analyzing edited video sequences as a pre-processing step for content-based video retrieval and browsing.

Detecting shots separated by the more common transition effects is, however, no longer seen as a challenging problem. According to Smeaton et al. (2009), an average of 79% for precision and recall is enough, and computational shot detection has therefore been noted as solved. This conclusion was based on the observation that the latest novel approaches to shot boundary detection did not improve on state-of-the-art methods in the annual TRECVid challenge (footnote 1). Due to these achievements it was decided that the shot boundary detection task was no longer needed, and it was discontinued from the year 2008.

However, as noted earlier, shot boundaries are traditionally defined not by edited frame transitions but by camera operations. Shot boundary detection algorithms are therefore rarely useful for raw, non-edited user-generated video sequences, which usually comprise only one shot, and the structural analysis of user-generated video content has to focus on the sub-shot level. Similarly, Petershon (2007) has noted that the structural hierarchy should be extended with a sub-shot level, because scene changes can occur during a continuous camera act due to camera movement. According to Petershon (2007), a sub-shot is defined as “a sequence of consecutive frames showing one event or part thereof taken by a single camera act in one setting with only a small change in visual content”. A user-generated video sequence should therefore be segmented into sub-shots for extended analysis.

4.2 Methods for sub-shot segmentation

General shot boundary detection usually has three main steps: extraction of features from adjacent frames, measurement of the similarity between the features, and detection of shot boundaries based on interpretation of the feature similarities. The color histogram has commonly been used as a baseline for evaluating more advanced features. According to Hu et al. (2011), color histograms are robust to small camera motion, but they have difficulty distinguishing shots that belong to the same scene, and they are sensitive to large camera and object movement. Changes and disturbances in illumination are especially problematic for methods based on color information, because they cause variation in visual content that is falsely interpreted as a shot boundary (Yuan et al. 2007). Compared to color histograms, edge features are more robust for video sequences with illumination changes and motion, whereas motion-compensated or actual motion features can characterize object and camera motion for extended analysis.
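The three steps can be illustrated with a minimal sketch. It is not taken from any of the cited works: it assumes grayscale frames given as nested lists of 0..255 values, uses an intensity histogram in place of a full color histogram, and the bin count and the threshold of 0.5 are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale frame: rows of 0..255 intensities


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Step 1: extract a normalized intensity histogram from one frame."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Step 2: dissimilarity in [0, 1], computed as half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Step 3: report index i when frame i differs strongly from frame i-1."""
    hists = [histogram(f) for f in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]
```

A fixed threshold like this reacts to any large change in the histogram, which is exactly why illumination changes and strong camera or object motion produce false boundaries with this kind of feature.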

1 http://www-nlpir.nist.gov/projects/trecvid/



Because videos are dynamic by nature, motion information is very important for content analysis, as it describes the temporal dimension of the visual content (Ma et al. 2011). Compared to static image features such as the color histogram, features based on motion information are closer to actual semantic concepts.

The motion information of a video sequence includes both camera motion and object motion. Camera motion is visible as background motion, whereas moving objects cause foreground motion. According to Ma et al. (2011), motion-based features are therefore usually either camera-based or object-based. Camera-based features, such as those in the recent study by Wang and Ngo (2011), capture the basic types of camera movement: panning left and right, tilting up and down, zooming in and out, and parallax movement of the camera itself (dollying).

The study by Wang and Ngo (2011) focused on summarizing rushes video sequences, i.e. the raw footage produced in the professional broadcasting and filmmaking industries from which the final video product is composed. Consumer videos are obviously very different from professional movie productions, especially in camera work and editing, and a finished movie lacks redundancy, partly as a result of the editing. Like rushes, consumer videos are raw footage with a lot of redundancy, but the redundancy is caused by different factors. The comparison of rushes videos with consumer videos and movie productions is presented in Table 1.

Table 1. Comparison of video domains (adapted from Wang & Ngo 2011).

Property | Consumer video | Movie product | Rushes video
Camera work | Amateur | Professional | Professional
Editing | None or little | Professional | None
Redundancy | A lot | No | A lot

Redundancy in rushes videos is due to repetitive camera work, such as taking multiple shots of a scene when an actor forgets his lines or for other reasons (Wang & Ngo 2011). Consumer videos, on the other hand, contain redundancy caused by the lack of pre-planned camera work, which shows as unnecessary camera movements. In addition, both consumer and rushes videos contain unintentional camera movement. Unintentional movement, such as camera shakiness, is much stronger in consumer videos than in rushes, as consumer videos are commonly recorded with hand-held cameras. Rushes also contain shaky camera movement, but for different reasons, for example when a cameraman changes camera parameters between shots. In professional production this shaky footage, along with other redundant content, is replaced with stock footage in the final product. Wang and Ngo (2011) pose a relevant question on how to classify useful footage for potential reuse. The question is relevant for consumer videos as well, especially for publishing purposes. Though explicit retakes are not that common in consumer videos, similar forms of redundancy do occur, such as panning back and forth over the same scene.

Understanding the camera motion is an important step before higher-level semantic analysis of a video sequence. After segmenting a video sequence according to camera motion, we can analyze and describe the segments further with other methods. In addition, as the camera motion reflects the intentions of the camera user, and therefore the content of interest in a video sequence, the camera actions provide a subjective foundation for further objective analysis. This is especially important because judgments of the quality and efficiency of video summarizations are usually subjectively biased.

For consumer videos, whose content is exceptionally heterogeneous, we propose that the most generic feature with the lowest content dependency is motion, covering both camera and object motion. Extraction and analysis of this motion information should be the first step in analyzing consumer videos.

The analysis of motion information has, however, many challenges. One is unintentional camera motion, which should be separated from intentional camera movements. Unintentional motion, such as shakiness, is very common in consumer videos, which are usually recorded with hand-held cameras. This motion noise must be compensated for before motion features can be extracted. Computational video stabilization methods that can be utilized for motion compensation have recently evolved to a level where they can effectively remove unintentional motion. For example, the recent subspace video stabilization method incorporates promising techniques for modeling unintentional camera motion and for compensating it between video frames with content-preserving frame warping (Liu et al.


5 Motion information-based video summarization

5.1 Motion saliency

Motion features have been studied extensively. Ma and Zhang (2002) modeled generic motion attention for video skimming. Their attention model is based on the assumption that processing motion vectors through three motion inductors, the Intensity inductor, the Spatial Coherence inductor, and the Temporal Coherence inductor, yields a saliency map. The saliency map can then be used to reduce the gap between low-level visual features and human perception, and thus to facilitate computational detection of attention in video. Motion is estimated from the motion compensation vectors of an MPEG video stream, where they are used to encode P- and B-frames by prediction from reference frames, unlike intra-coded frames (I-frames), which can be decoded independently of the other frames in the video sequence. Motion compensation vectors are useful for estimating motion because their spatial layout forms a motion vector field. The motion vector field is used as the input for computing the inductor responses.



The first motion inductor, the intensity inductor, measures motion energy or activity, which is computed as the magnitude of a motion vector normalized by the maximum magnitude. The second inductor, the spatial coherence inductor, measures the spatial phase consistency of motion vectors; consistent phases suggest that the vectors belong to the same moving object, which has a high attention value. The spatial coherence of a motion vector is computed as the entropy of the probability distribution of the spatial phase histogram. Finally, the temporal coherence is computed in the same way as the spatial coherence, as the entropy of the probability distribution of the temporal phase histogram. These three inductors are then integrated into a motion saliency map that can be utilized in visual video skimming by indicating salient sub-sequences in a video shot (Figure 2) (Ma & Zhang 2002).
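As a rough sketch of how the three inductors could be computed for a single macroblock, the code below takes one motion vector together with its spatial and temporal neighbours. The combination MA = I × Ct × (1 − I × Cs) follows the motion attention measure shown in the Ma & Zhang (2002) excerpt of Figure 2. The number of phase bins, the normalization of the entropy to [0, 1], and the way the neighbourhoods are passed in are assumptions made for this illustration, not details confirmed by the source.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, float]  # motion vector (dx, dy) of one macroblock


def phase_entropy(vectors: Sequence[Vector], bins: int = 8) -> float:
    """Entropy of the phase (direction) histogram, scaled to [0, 1]."""
    counts = [0] * bins
    for dx, dy in vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        counts[min(int(angle / (2 * math.pi) * bins), bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
    return entropy / math.log(bins)  # assumed normalization


def motion_attention(
    vector: Vector,
    spatial_neighbours: Sequence[Vector],
    temporal_neighbours: Sequence[Vector],
    max_magnitude: float,
) -> float:
    """Combine the three inductors into one attention value for a block."""
    # Intensity inductor: magnitude normalized by the maximum magnitude.
    intensity = math.hypot(*vector) / max_magnitude if max_magnitude else 0.0
    # Spatial coherence: phase entropy inside a spatial window.
    cs = phase_entropy(spatial_neighbours)
    # Temporal coherence: phase entropy at the same location over time.
    ct = phase_entropy(temporal_neighbours)
    return intensity * ct * (1.0 - intensity * cs)
```

With this combination a strong vector whose spatial neighbours all point the same way, but whose direction varies over time, scores high, while strong but spatially disordered motion is suppressed.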

[Figure 2 is a page excerpt from Ma & Zhang (2002). It defines the motion attention measure MA = I × Ct × (1 − I × Cs), lists the motion attention detection steps (histogram balance, median filtering, binarization, region growing, region selection), and shows how segments around the crests of the motion attention indicator (MAI) curve are selected for the visual skim, with the frame of maximum MAI in each crest serving as a key-frame.]

Figure 2. Visual skimming based on motion attention (Ma & Zhang, 2002)

Another video abstraction technique based on motion analysis has been presented by Liu et al. (2003), who use a triangle model of perceived motion energy to model motion patterns in a video sequence. The triangle model is first used to segment a video sequence into meaningful action events based on motion patterns. The motion energy is calculated as the product of the average magnitude of the motion vectors and the ratio of the dominant motion direction. Consequently, the perceived motion energy is a metric that weights motion intensity adaptively, with more emphasis on dominant motion. In addition, by using the triangle model, which is illustrated in Figure 3, a video sequence can be segmented into sub-segments of different motion patterns in terms of acceleration and deceleration. The turning points of dominant motion, which can also be defined as motion states, are selected as the keyframes included in the video abstract.
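The following sketch illustrates the two ingredients described above: a per-frame perceived motion energy and the selection of turning points. It is a simplified reading rather than the implementation of Liu et al. (2003): the dominant direction ratio is assumed to be the share of vectors in the most populated direction bin, the number of bins is hypothetical, and a turning point is approximated as a local maximum of the energy curve (the apex of a triangle).

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]  # motion vector (dx, dy) of one macroblock


def perceived_motion_energy(vectors: Sequence[Vector], bins: int = 8) -> float:
    """Average vector magnitude weighted by the dominant direction ratio."""
    if not vectors:
        return 0.0
    counts = [0] * bins
    for dx, dy in vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        counts[min(int(angle / (2 * math.pi) * bins), bins - 1)] += 1
    dominant_ratio = max(counts) / len(vectors)
    mean_magnitude = sum(math.hypot(dx, dy) for dx, dy in vectors) / len(vectors)
    return mean_magnitude * dominant_ratio


def turning_points(energy: Sequence[float]) -> List[int]:
    """Frame indices where the energy stops rising and starts to fall."""
    return [
        i
        for i in range(1, len(energy) - 1)
        if energy[i - 1] < energy[i] >= energy[i + 1]
    ]
```

Calling `turning_points` on the sequence of per-frame energies gives candidate keyframe positions. A real system would smooth the curve first, since raw motion vectors are noisy.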


Figure 3. Detection of keyframes based on turning points of dominant motion (Liu et al. 2003)

In more recent work, Abdollahian et al. (2010) presented a new concept known as the camera view for the temporal segmentation of user-generated video. The authors define the camera view as the basic unit of user-generated video structure because such video typically consists of a single shot. The concept builds on the observation that user-generated videos are commonly edited while recording. In professional production there are usually multiple shots that are cut and edited to include only the interesting segments, whereas in user-generated video it is common that the camera is zoomed or panned to follow an interesting object. The authors argue that this kind of content-based camera motion is not typical in professionally produced video, where the same effect is obtained through editing.

The motion estimation was carried out using the Integral Template Matching algorithm (Lan et al. 2003), which was developed for fast extraction of dominant motion in video sequence. The dominant motion model is defined as

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = R \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} H \\ V \end{bmatrix} \qquad (1)
$$

where (x, y) and (x', y') are the matched pixel locations in the current and reference frame, respectively. The model parameters, horizontal (H), vertical (V), and radial (R), are estimated with the Integral Template Matching algorithm, in which a template T in the current frame is matched against the previous frame using a full search in a predefined search window (Figure 4). By excluding the central area of the frame from the template, the algorithm excludes object motion, which is commonly located in the center of the frame. The border areas of the frame are excluded as well, because pixel value errors due to video compression and camera aberration are common in these areas (Abdollahian et al. 2010).
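To make the estimation step concrete, here is a brute-force sketch of matching a template that excludes the frame centre and borders. It is not the Integral Template Matching algorithm of Lan et al. (2003), which is designed to be fast; it only shows the idea of a full search over candidate (H, V, R) values. The border and centre sizes, the search range, the candidate scales, the use of coordinates relative to the frame centre, and the mean absolute difference as matching cost are all assumptions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[int]]  # grayscale frame as rows of intensities


def template_pixels(width: int, height: int, border: int, centre: int) -> List[Tuple[int, int]]:
    """Template: all pixels except the frame border and a central box."""
    cx, cy = width // 2, height // 2
    pixels = []
    for y in range(border, height - border):
        for x in range(border, width - border):
            in_centre = abs(x - cx) < centre and abs(y - cy) < centre
            if not in_centre:
                pixels.append((x, y))
    return pixels


def matching_cost(current: Frame, previous: Frame, pixels, h: float, v: float, r: float) -> float:
    """Mean absolute difference between template pixels and their matches."""
    height, width = len(previous), len(previous[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    total, count = 0, 0
    for x, y in pixels:
        # Model (1): (x', y') = R (x, y) + (H, V), relative to the centre.
        xr = round(cx + r * (x - cx) + h)
        yr = round(cy + r * (y - cy) + v)
        if 0 <= xr < width and 0 <= yr < height:
            total += abs(current[y][x] - previous[yr][xr])
            count += 1
    return total / count if count else float("inf")


def estimate_motion(current: Frame, previous: Frame, search: int = 3,
                    scales: Sequence[float] = (0.9, 1.0, 1.1),
                    border: int = 2, centre: int = 4) -> Tuple[int, int, float]:
    """Full search for the (H, V, R) triple with the lowest matching cost."""
    height, width = len(current), len(current[0])
    pixels = template_pixels(width, height, border, centre)
    shifts = range(-search, search + 1)
    candidates = product(shifts, shifts, scales)
    return min(candidates, key=lambda p: matching_cost(current, previous, pixels, *p))
```

The full search is quadratic in the search range and linear in the number of candidate scales, which is why the cited algorithm relies on faster matching and on a good initial estimate.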




[Figure 4 is a page excerpt from Abdollahian et al. (2010). It shows the template used for motion parameter estimation as the white area of the frame, with the central region and the frame borders excluded.]

Figure 4. White area of the rectangle represents the motion parameter estimation template (Abdollahian et al. 2010)

The novelty of camera view-based segmentation lies in how it structures a video sequence into the different views that the camera is “observing” while panning across a scene. The segmentation technique measures the displacement of the camera with an eight-dimensional motion feature. The motion feature is based on the work by Wu et al. (2005), who extracted feature vectors defined by three basic properties: speed, direction, and acceleration. These properties were formed into four dimensions on each of the x- and y-axes: average speed of translational motion, average acceleration, variance of acceleration, and frequency of direction change. Wu et al. used Support Vector Machines to classify the video quality properties of a video sequence.
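As an illustration of the eight-dimensional motion feature described above, the following minimal sketch computes the four quantities on each axis from per-frame translational displacements. It is not the implementation of Wu et al. (2005): the function name, the input format, and the exact way direction changes are counted are assumptions made for this example.

```python
import numpy as np

def motion_feature(dx, dy):
    """Build an 8-D feature from per-frame translational displacements.

    dx, dy: 1-D sequences of horizontal and vertical inter-frame
    displacement over a window of frames (hypothetical input format).
    """
    feature = []
    for v in (np.asarray(dx, dtype=float), np.asarray(dy, dtype=float)):
        acc = np.diff(v)                          # frame-to-frame acceleration
        signs = np.sign(v[v != 0])                # ignore frames without motion
        flips = np.count_nonzero(np.diff(signs))  # number of direction changes
        feature += [
            np.mean(np.abs(v)),                   # average speed
            np.mean(acc) if acc.size else 0.0,    # average acceleration
            np.var(acc) if acc.size else 0.0,     # variance of acceleration
            flips / max(len(v) - 1, 1),           # frequency of direction change
        ]
    return np.array(feature)

# A shaky horizontal motion combined with a steady vertical pan.
f = motion_feature([1, -1, 1, -1], [2, 2, 2, 2])
```

A vector of this kind would then be passed to a trained classifier; the classifier itself is outside the scope of the sketch.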

Similarly, Abdollahian et al. (2010) used trained SVM classifiers with eight-dimensional motion features to label sets of frames as camera zooming, motion blur caused by fast camera motion, or shakiness due to unintentional camera motion. After motion labeling, segment borders are defined such that the frame before the segmented camera view and the successive frame are not correlated, i.e. they do not overlap each other. The correlation in this case is defined as the magnitude of a displacement vector between two frames. The displacement vector between frames i and j is defined as $D_{i,j} = [D^{x}_{i,j}, D^{y}_{i,j}, D^{r}_{i,j}]$, where $D^{x}$, $D^{y}$, and $D^{r}$ are the total horizontal, vertical, and radial inter-frame displacements. A camera view boundary frame is identified whenever the magnitude of the displacement vector between the current frame and the previously detected boundary frame is larger than $T_d = 1$.
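The boundary rule can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes that the per-frame displacements are already estimated and normalized so that a magnitude of 1 means the two frames no longer overlap, and it uses the Euclidean norm as the magnitude. The function and parameter names are hypothetical.

```python
import math

def camera_view_boundaries(displacements, t_d=1.0):
    """Return the indices of frames that start a new camera view.

    displacements[k] = (dx, dy, dr) is the horizontal, vertical and radial
    displacement from frame k to frame k + 1 (assumed to be normalized).
    """
    boundaries = [0]                  # the first frame opens the first view
    acc_x = acc_y = acc_r = 0.0       # displacement since the last boundary
    for k, (dx, dy, dr) in enumerate(displacements):
        acc_x += dx
        acc_y += dy
        acc_r += dr
        if math.sqrt(acc_x ** 2 + acc_y ** 2 + acc_r ** 2) > t_d:
            boundaries.append(k + 1)  # frame k + 1 exceeds the threshold
            acc_x = acc_y = acc_r = 0.0
    return boundaries

# A steady pan of 0.3 frame widths per frame over nine frames.
views = camera_view_boundaries([(0.3, 0.0, 0.0)] * 8)
```

In this example a new view starts every fourth frame, because the accumulated displacement first exceeds the threshold after four steps of 0.3.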

After the boundaries of the video segments have been detected, the corresponding keyframes are identified according to simple rules that were formulated on the basis of a user study. In the study (Abdollahian et al. 2010), a number of users were asked to select a set of representative frames from several user-generated video clips that were segmented with the camera view algorithm, while also having access to the individual video frames. In the test setup the users were given the following question: “If you wanted to summarize the important content of the video segment in minimum number of frames, which frame(s) would you choose?” The users also had the option not to select any frames if they thought the segment had no significant information. Users were also asked to provide the reason why they selected each of the keyframes.


From the results, the authors drew the following conclusions. Users preferred to select frames when there was a “nice” view of the object of interest, when a new action occurred, after a zoom-in (close-up view), after a zoom-out (overall view), when a new semantically significant object entered the camera view (text, people, buildings), or when there was a pause after a continuous camera movement or a significant change in the background. In addition, users tended not to select any frames when there was high blurring or shaky movement, or when the camera motion did not change and the scene was static (e.g. the camera pans to the left on a homogeneous background).

According to Abdollahian et al. (2010), the following rules were used to select a keyframe:

• The frame after a zoom-in, which is usually a close-up view of an interesting object

• The frame after a large zoom-out, which gives an overview of the scene

• The frame after the camera stops moving, and a frame where the camera is on hold, which are both indicative patterns

• In segments where the camera has constant motion, all frames are equally important, and therefore the frame closest to the middle of the segment with the least amount of motion is selected as the keyframe
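The rules above can be expressed as a small selection function. The sketch below is only one possible reading of the rules: the segment labels are hypothetical names, and the frame "after" a zoom or a stop is approximated by the last frame of the labeled segment.

```python
def select_keyframe(start, end, label, motion=None):
    """Pick one keyframe index for a segment spanning frames start..end.

    label:  hypothetical motion label of the segment
            ("zoom_in", "zoom_out", "stop" or "constant").
    motion: optional per-frame motion magnitudes for frames start..end.
    """
    if label in ("zoom_in", "zoom_out", "stop"):
        return end                     # frame reached when the motion is over
    middle = (start + end) // 2
    if not motion:
        return middle                  # no motion data: fall back to the middle
    least = min(motion)
    # Among the frames with the least motion, take the one nearest the middle.
    candidates = [start + i for i, m in enumerate(motion) if m == least]
    return min(candidates, key=lambda frame: abs(frame - middle))

keyframe = select_keyframe(0, 6, "constant", [5, 1, 4, 4, 1, 5, 5])
```

For the constant-motion example, frames 1 and 4 share the lowest motion value, and frame 4 is chosen because it lies closer to the middle frame 3.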

In addition to camera motion-based keyframe detection, Abdollahian et al. (2010) also presented a novel location-based saliency map for analyzing the region of interest (ROI) of the keyframes. The location-based saliency map is a combination of a color contrast saliency map, a moving objects saliency map, and highlighted faces. The color contrast saliency map is based on the work by Ma & Zhang (2003), who used the color components in the LUV color space as a stimulus for calculating saliency maps for images. According to Abdollahian et al. (2010), the RGB color space yielded better results, as it combines the luminance and color contrast. The technique begins by using the K-nearest neighbor algorithm to cluster the color vectors of the pixels into 32 color clusters. After color clustering, the image is down-sampled by a factor of 16 in each dimension in order to reduce the computational complexity. Finally, the contrast saliency map is calculated for each pixel in the down-sampled image as a weighted sum of the distances between the pixel and the pixels in its neighborhood.
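The last two steps, down-sampling and the neighborhood contrast sum, can be sketched with NumPy as shown below. The sketch deliberately skips the color clustering step and makes several assumptions that are not taken from the cited papers: block averaging for the down-sampling, an 8-connected neighborhood, uniform weights, and Euclidean distance in RGB.

```python
import numpy as np

def contrast_saliency(image, factor=16):
    """Contrast saliency of an H x W x 3 colour image on a down-sampled grid."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape[0] // factor, img.shape[1] // factor
    # Down-sample by averaging every factor x factor block.
    blocks = img[:h * factor, :w * factor].reshape(h, factor, w, factor, 3)
    small = blocks.mean(axis=(1, 3))
    padded = np.pad(small, ((1, 1), (1, 1), (0, 0)), mode="edge")
    saliency = np.zeros((h, w))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            # Uniform weights (an assumption): add the colour distance.
            saliency += np.linalg.norm(small - neighbour, axis=2)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency

# A single bright block on a black background.
frame = np.zeros((64, 64, 3))
frame[16:32, 16:32] = 255
s = contrast_saliency(frame)
```

The bright block receives the highest saliency because it differs from all of its neighbors, whereas cells far from the block stay at zero.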

The moving objects saliency map is generated on the assumption that regions with significant motion relative to the background motion attract more of the viewer's attention. The generation of the map follows the same principles as the previously discussed motion inductors by Ma & Zhang (2002).

For keyframe selection, Abdollahian et al. (2010) argue that even a simple selection strategy, such as selecting one keyframe per detected segment, makes it possible to cover all the camera views. Hence, a video summary will include all the scenes captured by the camera.



5.2 Camera motion characterization

In the work by Guironnet et al. (2007), the Transferable Belief Model (Smets & Kennes 1994) is applied to characterize motion states as different camera operations: pan and tilt, zoom, and static camera motion. The belief model itself builds on a structure that can process imprecise data by combining various sources of information and managing the conflict between the sources. The analysis of camera motion comprises three steps: motion parameter extraction, camera motion classification, and description of motion.

Camera motion extraction (Guironnet et al. 2006) is accomplished by estimating an affine parametric model for the dominant motion between successive frames. The affine parameters are then converted into symbolic values in a classification process that builds on a rule-based system divided into three stages. In the first stage, previously gathered data is combined to obtain frame-level belief masses on camera motion, which are filtered according to the transferable belief model to ensure the temporal coherence of the belief masses.

In the second stage, static and dynamic sub-segments are separated, and in the third stage, a more detailed analysis of motion is carried out with temporal integration, which preserves the frames with significant motion magnitude and duration. Motions can thus be studied at the segment level by gathering frames with a certain belief in a type of motion. Finally, the camera motion description phase extracts different motion features from each video segment that contains a characterized camera motion.

According to the authors' evaluations, the performance of the TBM-based method is very good (precision 100% and recall > 92%) for single camera motions, such as translation, zoom and static motion. Performance for composed motions, where the motions can be superimposed (zoom and translation) or successive in the same extract, was reduced to 79%. (Guironnet et al. 2006)

After the classification of the different camera motions in a video sequence, the keyframe extraction for the video summary is performed with three techniques. First, keyframes are selected according to the succession of camera motions, based on heuristic rules. The rules imply that only two frames are selected to describe the succession of two camera motions. In addition, if one of the segments is static, the two frames are selected at the beginning and at the end of the segment that has camera motion. One of these frames will then also represent the static segment. (Guironnet et al. 2006)

The second technique selects keyframes according to the magnitude of camera motions. This arises from the intuitive assumption that, for example, a translation motion with a strong magnitude requires more keyframes to be represented than a static segment, because the visual content changes more rapidly. The magnitude is defined as the total displacement of motion between the first and last frame of a segment. The displacement itself can be defined as the total length of the motion vector trajectory or as the absolute total displacement between the first and last frame. Here, Guironnet et al. (2005) use the notion of rectilinearity, which measures the cumulative change of direction of motion in a trajectory. If the length of the trajectory is longer than its absolute total displacement, the trajectory is not rectilinear and the motion changes direction, as in Figure 5. In that case, if the total displacement is large, the frames at the beginning, the middle, and the end of the segment are selected as keyframes, and if the magnitude is weak, only the last frame of the segment is selected. If, on the other hand, the motion trajectory is considered rectilinear, i.e. the total displacement is close to the length of the motion trajectory, and the total displacement is large, then the first and the last frames of the segment are selected as keyframes. If the total displacement is weak, only the last frame is selected. The rectilinear coefficient is defined as

$$rc = \frac{td}{dt} \qquad (1)$$

where dt is the total length of a trajectory and td is the total displacement.
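The magnitude-based rules can be illustrated with a short sketch that computes the trajectory length dt, the total displacement td, and their ratio td/dt (called rc in the code), and then applies the selection rules as they are described in the text. The threshold values and all function and parameter names are placeholders chosen for this example, not values confirmed by the cited work.

```python
import math

def trajectory_stats(steps):
    """steps: list of (dx, dy) displacements between successive frames."""
    dt = sum(math.hypot(dx, dy) for dx, dy in steps)  # length of the trajectory
    td = math.hypot(sum(dx for dx, _ in steps),
                    sum(dy for _, dy in steps))       # first-to-last displacement
    rc = td / dt if dt > 0 else 1.0                   # 1.0 = perfectly rectilinear
    return dt, td, rc

def magnitude_keyframes(first, last, steps, rc_min=0.5, large=300.0):
    """Keyframe indices for a translation segment (placeholder thresholds)."""
    dt, td, rc = trajectory_stats(steps)
    if td <= large:
        return [last]                                 # weak magnitude
    if rc < rc_min:
        return [first, (first + last) // 2, last]     # motion changes direction
    return [first, last]                              # rectilinear motion

straight = magnitude_keyframes(0, 40, [(10, 0)] * 40)
zigzag = magnitude_keyframes(0, 80, [(5, 20), (5, -20)] * 40)
```

The straight pan yields the first and last frames, while the zigzag trajectory, whose length is several times its total displacement, additionally yields the middle frame.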

Figure 5. Total displacement of a trajectory compared to its cumulative length (Guironnet et al. 2005).

The third technique combines the first and second techniques and adds the detection of static segments. If the magnitude of camera motion in a segment is weak or the motion has a short duration, the segment is defined as static. In addition, if the translation relative to its duration is weak and the duration is longer than a specific limit, the current segment is defined as static.

6 Presentation of video summarization

6.1 Static and dynamic presentation

Video summaries are commonly (Truong & Venkatesh 2007, Ma et al. 2011) presented as a set of static keyframes or as dynamic video skims. After the keyframes of a video sequence have been extracted, there are different options for presenting them to the user. One of the most common presentation techniques is the storyboard, which is usually a static grid of extracted keyframes. According to a recent study on the evaluation of video summarization techniques (Westman 2010), the storyboard is capable of giving an informative summary of the original video content. However, according to the user studies, storyboards were lacking in representativeness and in their ability to replace the original video content.



Dynamic video skimming is a technique that condenses the original video into a shorter version while preserving important content along with its time-evolving properties. Hence, video skims are in practice short video clips cut from the original video sequence. The preservation of motion information, in addition to aural information, is one of the greatest advantages of video skims, as both can enhance the expressiveness of the video summary. (Truong & Venkatesh 2007, Ma et al. 2011)

Compared to static storyboards, dynamic video skims also support the recognition of objects in the content, and they are representative enough even to replace the original video content. According to the user study (Westman 2010), the dynamic video skims were liked especially for their clarity of presentation and the normal pace of the moving imagery.

Dynamic and static video summaries have usually been generated differently, but it is still possible to transform one form into the other. Video skims can be created from keyframes by joining fixed-size segments, sub-shots, or the whole shots in which the keyframes are included, whereas the set of keyframes for a static storyboard can be created from a video skim by uniform sampling or by selecting one frame from each skim. (Truong & Venkatesh 2007)
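Both directions of this transformation are simple enough to sketch. The example below builds fixed-size skim segments around keyframes and picks one frame from each skim; the window size, the merging of overlapping segments, and the choice of the middle frame are assumptions made for the illustration.

```python
def skims_from_keyframes(keyframes, n_frames, half_window=12):
    """Turn keyframe indices into (start, end) skim segments of fixed size."""
    skims = []
    for k in sorted(keyframes):
        start = max(0, k - half_window)
        end = min(n_frames - 1, k + half_window)
        if skims and start <= skims[-1][1]:
            skims[-1] = (skims[-1][0], end)   # merge overlapping segments
        else:
            skims.append((start, end))
    return skims

def keyframes_from_skims(skims):
    """Select one frame per skim, here the middle frame of each segment."""
    return [(start + end) // 2 for start, end in skims]

skims = skims_from_keyframes([100, 30, 40], n_frames=110)
storyboard = keyframes_from_skims(skims)
```

Merging overlapping windows keeps the skim free of repeated frames when two keyframes lie close to each other.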

6.2 Hierarchical presentation

In addition to static and dynamic video abstracts, a technique for hierarchical abstraction has also been studied (Geng et al. 2006). The hierarchical abstract is a multilevel video summarization based on the importance of the structural units in a video sequence, which is defined by visual and aural attention levels. According to Geng et al. (2006), the hierarchical video summarization approach can provide different levels of granularity for video abstracts: scene level, shot level, and sub-shot level. The granularity of a video structure is defined with the importance rank, which controls the skim ratio and the keyframe ratio that are used for constructing a structure-level video summary. According to Geng et al. (2006), using an importance rank at the structural level provides more comprehensible and compact video summaries, which are also flexible in that users can tune the skim and keyframe ratios.

Calic et al. (2007) have also presented a novel approach for user-centered video abstraction, in which they exploit the universally familiar narrative structure of comics to enhance the intuitiveness and readability of the generated video abstracts. The abstract representation follows the narrative structure of comics by linking the temporal flow of the video sequence with the spatial positions in a comic strip, while the significance of a video structure unit is expressed by the size of its representation. The importance ranking is defined with an application-dependent cost function.

In Calic et al. (2007), the keyframe extraction method is based on a spectral clustering approach in which keyframes are clustered based on their similarity, which is defined with an HSV color histogram-based visual feature. The cost function is used to weight keyframes based on their distance to the cluster centre. As a result, those keyframes that are concentrated around a cluster centre are treated as repetitive, redundant content, whereas cluster outliers are presented as more important. According to Calic et al. (2007), these outliers are usually cutaways, establishing shots, and other content that should be highlighted in order to generate video summaries for professional video content. The comic-like video summary is then assembled with a layout structure that follows the values of the cost function, using panel templates that describe the frame sizes based on their significance. In the layout structure, repetitive content is always presented with the smallest frame size, whereas outliers are presented with the largest frame size. An example of comic-like video summarization is presented in Figure 6.
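The idea of mapping the distance to a cluster centre onto a panel size can be sketched as follows. This is a loose illustration and not the cost function or the layout algorithm of Calic et al. (2007): it assumes that cluster labels are already available, takes the cluster mean as the centre, and quantizes the normalized distance into equal-width size classes. All names are hypothetical.

```python
import numpy as np

def panel_sizes(features, labels, n_sizes=3):
    """Map each keyframe to a panel size from 1 (smallest) to n_sizes."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    cost = np.zeros(len(features))
    for c in np.unique(labels):
        members = labels == c
        centre = features[members].mean(axis=0)
        # Cost = distance to the cluster centre, so outliers score high.
        cost[members] = np.linalg.norm(features[members] - centre, axis=1)
    if cost.max() == 0:
        return np.ones(len(features), dtype=int)
    # Quantize the normalized cost into n_sizes equal-width bins.
    bins = np.minimum((cost / cost.max() * n_sizes).astype(int), n_sizes - 1)
    return bins + 1

# Four near-identical keyframes and one outlier in the same cluster.
sizes = panel_sizes([[0], [0], [0], [0], [10]], [0, 0, 0, 0, 0])
```

In the example the four repetitive keyframes receive the smallest panel size and the outlier receives the largest one.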


Figure 6. Comic-like video summarization of a news sequence from the TRECVID 2006 corpus (Calic et al. 2007).



7 Discussion

Many video summarization techniques try to remove redundant content from a video sequence. Therefore, they also depend on how redundant content is defined in contrast to significant and interesting content. This definition is then formulated into a similarity metric that measures either visual similarity or some more generic similarity. However, measuring similarity, or the distance between data points in a high-dimensional space such as the one that defines an image or a video, is one of the greatest challenges of any application based on computational vision. From a computational perspective, visual similarity is heavily confounded by context-dependent semantic similarity, which is highly subjective and depends on the type of application and content.

User-generated video content, in turn, differs from professionally produced content in many respects. Its low technical quality makes it challenging for low-level machine vision algorithms, and its unconstrained temporal structure, which stems from the lack of higher-level quality (camera work, narration, editing, etc.), makes it difficult to create semantically valid summarizations. For this reason, the use of spatio-temporal information, or motion, is useful, as it brings a subjective aspect to the content by modelling the camera motion and, therefore, the intentions of the person who recorded the video sequence. In addition to camera motion, another subjective aspect arises from the exploitation of motion saliency. Modelling motion attention in a video sequence provides the information needed to highlight perceptually significant content with high saliency. Hence, it is possible to make more meaningful video summarizations when these salient sub-segments are processed with a higher weight.

The use of user-generated video content will become more and more popular as a part of crowdsourcing and social media. However, there is a lot of useless content that should be filtered out, especially when the purpose is to locate newsworthy content to be used in the media. This requires laborious and time-consuming monitoring before the content can be published or used as a news source. Video summarization can be exploited especially as a tool for monitoring the content, by making it fast and effortless to assess the importance of a video sequence.

Since it is important that the monitoring is based on all the relevant information, with a guarantee that no information is lost before examination, we propose that motion-based techniques are currently the most reliable option for summarization. With them, we can track any objects or areas in a video scene and analyze them on a generic level without knowing exactly what we are tracking. Complete video summarization still depends on object and scene recognition, but it is also very likely that the exploitation of spatio-temporal information is required for general visual recognition. Thus, it is very important to understand motion information in video sequences.



Abdollahian, G., Taskiran, C. M., Pizlo, Z. & Delp, E. J. (2010) Camera Motion-Based Analysis of User Generated Video. IEEE Transactions on Multimedia, vol. 12(1), January 2010, pp. 28-41.

Babaguchi, N., Kawai, Y., Yasugi, Y. & Kitahashi, T. (2000) Linking live and replay scenes in broadcasted sports video. Proc. 2000 ACM Workshops Multimedia, Nov. 2000, pp. 205–208.

Calic, J., Gibson, D.P. & Campbell, N.W. (2007) Efficient Layout of Comic-Like Video Summaries. IEEE Transactions on Circuits and Systems for Video Technology, vol.17, no.7, pp.931-936, July 2007.


Choi, J. & Jeong, D. (2000) Story board construction using segmentation of MPEG encoded news video. Proc. 43rd IEEE Midwest Symp. Circuits and Systems, 2000, vol. 2, pp. 758–761.

Guironnet, M., Pellerin, D. & Rombaut, M. (2006) Camera motion classification based on transferable belief model. 4th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006.

Guironnet, M., Pellerin, D., Guyader, N. & Ladret, P. (2007) Video Summarization Based on Camera Motion and a Subjective Evaluation Method. EURASIP Journal on Image and Video Processing, vol. 2007, Article ID 60245, 12 pages, 2007. doi:10.1155/2007/60245

Hu, W., Xie, N., Li, L., Zeng, X. & Maybank, S. (2011) A Survey on Visual Content-Based Video Indexing and Retrieval. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on , vol.41, no.6, pp.797- 819, Nov. 2011. doi: 10.1109/TSMCC.2011.2109710

Lan, D-J., Ma, Y-F. & Zhang, H-J. (2003) A novel motion-based representation for video mining. in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 3, Baltimore, MD, Jul. 6–9, 2003, pp. 469–472.

Liu, T., Zhang, H-J. & Qi, F. (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Transactions on Circuits and Systems for Video Technology, 13(10), pp. 1006-1013.


Ma, Y-F. & Zhang, H-J. (2002) A model of motion attention for video skimming. Proceedings of the 2002 International Conference on Image Processing, vol.1, pp. I-129-I-132, 2002. doi: 10.1109/ICIP.2002.1037976



Ma, Y-F. & Zhang, H-J. (2003) Contrast-based image attention analysis by using fuzzy growing. in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 374–


Ma, Y-F., Hua, X-S., Lu, L. & Zhang, H-J. (2005) A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia, vol.7, no.5, pp. 907-919, Oct. 2005.


Money, A.G. & Agius, H. (2008) Video summarisation: a conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation, 19 (2), 121–143.

Petersohn, C. (2007) Sub-Shots – Basic Units of Video. Systems, Signals and Image Processing, 2007 and 6th EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services. pp. 323-326.

Rui, Y., Gupta, A. & Acero, A. (2000) Automatically extracting highlights for TV baseball programs. in Proc. 8th ACM Int. Conf. Multimedia, 2000, pp. 105–


Smeaton, A. F., Over, P. & Doherty, A. R. (2010) Video shot boundary detection: Seven years of TRECVid activity. Computer Vision and Image Understanding, Volume 114, Issue 4, April 2010, Pages 411-418.

Smets, P. & Kennes, R. (1994) The transferable belief model. Artificial Intelligence, 66(1994), pp. 191-243.

Tan, Y. P., Saur, D., Kulkarni, S. & Ramadge, P. (2000) Rapid estimation of camera motion from compressed video with application to video annotation. IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 1, pp. 133–146, Feb. 2000.

Truong, B. T. & Venkatesh, S. (2007) Video Abstraction: A Systematic Review and Classification. ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 3, No. 1(3).

Westman, S. (2010) Evaluation Constructs for Visual Video Summaries: user-supplied constructs and descriptions. Proceedings of the 14th European conference on Research and advanced technology for digital libraries (ECDL'10). Springer-Verlag, Berlin, Heidelberg, pp. 67-79.

Wu, S., Ma, Y-F. & Zhang, H-J. (2005) Video Quality Classification Based Home Video Segmentation. IEEE International Conference on Multimedia and Expo, 2005. ICME 2005. DOI: 10.1109/ICME.2005.1521399

Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F. & Zhang, B. (2007) A Formal Study of Shot Boundary Detection. IEEE Transactions on Circuits and Systems for Video Technology, vol.17, no.2, pp.168-186, Feb. 2007


Xiong, Z., Zhou, X. S., Tian, Q., Rui, Y. & Huang, T. S. (2006) Semantic retrieval of video - Review of research on video retrieval in meetings, movies and broadcast news, and sports. IEEE Signal Process. Mag., vol. 23, no. 2, pp. 18–27, Mar. 2006.


