Video fingerprints can help identify large amounts of video on the Internet and enable interesting services for the end user. One of the main challenges for video fingerprints is robustness against intentional or unintentional geometric modifications to the content, such as scaling, aspect ratio conversion, rotation, and cropping. In this paper, we review a number of fingerprinting methods proposed in the literature that are specifically designed to be robust against such modifications. We also present two approaches that we adopted: one based on estimation of Singular Value Decomposition (SVD) bases from a window of past video frames (Approach 1), and another based on extraction of moment invariant features from concentric circular regions that does not require any specific transform (Approach 2). While both approaches provide the desired robustness against geometric modifications, Approach 1 is computationally more intensive than Approach 2, as the SVD bases are updated for every input frame at 12 fps. It also requires a longer query clip than Approach 2 for reliable identification. We present results comparing the performance of these two approaches on a 150-hour video database.
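The exact moment features of Approach 2 are not detailed above, but the idea of extracting invariants from concentric circular regions can be sketched as follows. This is a minimal illustration, assuming grayscale frames, four rings, and the first two Hu moment invariants per ring; the actual feature set and ring layout are design choices:

```python
import numpy as np

def ring_masks(h, w, n_rings):
    # concentric circular regions around the frame centre; ring membership
    # depends only on distance from the centre, so it survives rotation
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
    edges = np.linspace(0, r.max() + 1e-9, n_rings + 1)
    return [(r >= edges[i]) & (r < edges[i + 1]) for i in range(n_rings)]

def moment_invariants(gray, mask):
    # first two Hu moment invariants of the masked region; normalised
    # central moments also discount uniform scaling of intensities/area
    yy, xx = np.mgrid[0:gray.shape[0], 0:gray.shape[1]]
    f = np.where(mask, gray, 0.0)
    m00 = f.sum() + 1e-12
    cx, cy = (f * xx).sum() / m00, (f * yy).sum() / m00
    def eta(p, q):
        return (f * (xx - cx) ** p * (yy - cy) ** q).sum() / m00 ** (1 + (p + q) / 2)
    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return h1, h2

def frame_fingerprint(gray, n_rings=4):
    h, w = gray.shape
    return np.array([v for m in ring_masks(h, w, n_rings)
                     for v in moment_invariants(gray, m)])
```

Because the rings are rotation-symmetric about the frame centre and Hu invariants are unchanged by rotation, the fingerprint of a frame and its rotated copy coincide, which is the robustness property the abstract claims.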
We tested our previously reported sports highlights playback for personal video recorders with a carefully chosen set of sports aficionados. Each subject spent about an hour with the content, going through the same basic steps: an introduction, trying out the system, and a follow-up questionnaire. The main conclusion was that the users unanimously liked the functionality, even when it made mistakes. Furthermore, the users felt that if the user interface were made much more responsive, so as to quickly compensate for false alarms and misses, the functionality would be vastly enhanced. The ability to choose summaries of any desired length turned out to be the main attraction.
Severe complexity constraints on consumer electronic devices motivate us to investigate general-purpose video summarization techniques that are able to apply a common hardware setup to multiple content genres. On the other hand, we know that high quality summaries can only be produced with domain-specific processing. In this paper, we present a time-series analysis based video summarization technique that provides a general core to which we are able to add small content-specific extensions for each genre. The proposed time-series analysis technique consists of unsupervised clustering of samples taken through sliding windows from the time series of features obtained from the content. We classify content into two broad categories: scripted content such as news and drama, and unscripted content such as sports and surveillance. The summarization problem then reduces to either finding semantic boundaries in the scripted content or detecting highlights in the unscripted content. The proposed technique is essentially an event detection technique and is thus best suited to unscripted content; however, we also find applications to scripted content. We thoroughly examine the trade-off between content-neutral and content-specific processing for effective summarization across a number of genres, and find that our core technique enables us to minimize the complexity of the content-specific processing and to postpone it to the final stage. We achieve the best results with unscripted content such as sports and surveillance video, both in terms of the quality of summaries and in minimizing content-specific processing. For other genres such as drama, we find that more content-specific processing is required. We also find that a judicious choice of key audio-visual object detectors enables us to minimize the complexity of the content-specific processing while maintaining its applicability to a broad range of genres. We will present a demonstration of our proposed technique at the conference.
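The core step above, unsupervised clustering of sliding-window samples of a feature time series, can be sketched as follows. The feature, window length, and the use of plain 2-means with the minority cluster treated as "unusual" are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def sliding_windows(x, w):
    # overlapping windows of the feature time series, one row per window
    idx = np.arange(len(x) - w + 1)
    return np.stack([x[i:i + w] for i in idx]), idx

def two_means(X, iters=25):
    # deterministic init: the most typical window vs the most extreme one
    d0 = np.linalg.norm(X - X.mean(0), axis=1)
    c = np.stack([X[d0.argmin()], X[d0.argmax()]])
    for _ in range(iters):
        lab = ((X[:, None, :] - c) ** 2).sum(-1).argmin(1)
        for k in (0, 1):
            if (lab == k).any():
                c[k] = X[lab == k].mean(0)
    return lab

def outlier_window_starts(x, w=8):
    # the smaller of the two clusters is taken to hold the "unusual" events
    X, idx = sliding_windows(np.asarray(x, dtype=float), w)
    lab = two_means(X)
    minority = 0 if (lab == 0).sum() < (lab == 1).sum() else 1
    return idx[lab == minority]
```

On a feature series that is mostly flat with a short burst, the windows overlapping the burst form the minority cluster, which is the inlier/outlier segmentation the abstracts rely on.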
KEYWORDS: Surveillance, Time series analysis, Video, Video surveillance, Feature extraction, Sensors, Data acquisition, Classification systems, Statistical analysis, Data modeling
We present a systematic framework for arriving at audio classes for the detection of crimes in elevators. We use a time series analysis framework to analyze the low-level features extracted from the audio of elevator surveillance content and perform an inlier/outlier based temporal segmentation. Since suspicious events in elevators are outliers in a background of usual events, such a segmentation helps bring out these events without any a priori knowledge. Then, by performing automatic clustering on the detected outliers, we identify consistent patterns for which we can train supervised detectors. We apply the proposed framework to a collection of elevator surveillance audio data to systematically acquire audio classes such as banging, footsteps, non-neutral speech, and normal speech. Based on the observation that the banging class and the non-neutral speech class are indicative of suspicious events in the elevator data set, we are able to detect all of the suspicious activities without any misses.
In our past work on sports highlights extraction, we have shown the utility of detecting audience reaction using an audio classification framework. The audio classes in the framework were chosen based on intuition. In this paper, we present a systematic way of identifying the key audio classes for sports highlights extraction using a time series clustering framework. We treat the low-level audio features as a time series and model the highlight segments as "unusual" events in a background of a "usual" process. The set of audio classes to characterize the sports domain is then identified by analyzing the consistent patterns in each of the clusters output by the time series clustering framework. The distribution of features from the training data so obtained for each of the key audio classes is parameterized by a Minimum Description Length Gaussian Mixture Model (MDL-GMM). We also interpret the meaning of each of the mixture components of the MDL-GMM for the key audio class (the "highlight" class) that is correlated with highlight moments. Our results show that the "highlight" class is a mixture of audience cheering and the commentator's excited speech. Furthermore, we show that the precision-recall performance for highlights extraction based on this "highlight" class is better than that of our previous approach, which uses only audience cheering as the key highlight class.
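The MDL-based model selection can be illustrated with a one-dimensional sketch: fit a GMM by EM for each candidate number of components and keep the one minimising a two-part description length (negative log-likelihood plus a parameter-cost term). The 1-D features, quantile initialisation, and the exact penalty form are simplifications of a full MDL-GMM:

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    # deterministic quantile init keeps the example reproducible
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        p /= p.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        n = p.sum(0) + 1e-12
        w, mu = n / len(x), (p * x[:, None]).sum(0) / n
        var = (p * (x[:, None] - mu) ** 2).sum(0) / n + 1e-6
    ll = np.log((w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                 / np.sqrt(2 * np.pi * var)).sum(1)).sum()
    return ll, (w, mu, var)

def mdl_select(x, kmax=5):
    # description length: -log-likelihood + (free params / 2) * log N
    scores = {}
    for k in range(1, kmax + 1):
        ll, model = em_gmm_1d(x, k)
        scores[k] = (-ll + 0.5 * (3 * k - 1) * np.log(len(x)), model)
    best_k = min(scores, key=lambda k: scores[k][0])
    return best_k, scores[best_k][1]
```

On data drawn from two well-separated Gaussians, the penalty term suppresses the single-component and over-fitted models, so the selected mixture places components on both modes, mirroring the "cheering plus excited speech" decomposition reported above.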
Robust hash functions are central to the security of multimedia content authentication systems. Such functions are sensitive to a key but robust to many allowed signal processing operations on the underlying content. The robustness of the hash function to changes in the original content implies the existence of a cluster in the feature space around the original content's feature vector, any point within which gets hashed to the same output. The shape and size of the cluster determine the trade-off between the robustness offered and the security of the authentication system based on the robust hash function. The clustering itself is based on a secret key and hence unknown to the attacker. However, we show that the specific clustering arrived at by the robust Visual Hash Function (VHF) may be possible to learn. Given just an input and its hash bits, we show how to construct a statistical model of the hash function without any knowledge of the secret key used to compute the hash. We also show how to use this model to engineer arbitrary and malicious collisions. Finally, we propose one possible modification to VHF so that constructing a model that mimics its behavior becomes difficult.
KEYWORDS: Video, Video surveillance, Mining, Machine learning, Surveillance, Feature extraction, Digital video discs, Video processing, Data mining, Binary data
We discuss the meaning and significance of the video mining problem, and present our work on some aspects of video mining. A simple definition of video mining is unsupervised discovery of patterns in audio-visual content. Such purely unsupervised discovery is readily applicable to video surveillance as well as to consumer video browsing applications. We interpret video mining as content-adaptive or "blind" content processing, in which the first stage is content characterization and the second stage is event discovery based on the characterization obtained in the first stage. We discuss the target applications and find that a purely unsupervised approach is too computationally complex to be implemented on our product platform. We then describe various combinations of unsupervised and supervised learning techniques that help discover patterns that are useful to the end user of the application. We target consumer video browsing applications such as commercial message detection and sports highlights extraction. We employ both audio and video features. We find that supervised audio classification combined with unsupervised unusual event discovery enables accurate supervised detection of desired events. Our techniques are computationally simple and robust to common variations in production styles.
Removing commercials from television programs is a much sought-after feature for a personal video recorder. In this paper, we employ an unsupervised clustering scheme (CM_Detect) to detect commercials in television programs. Each program is first divided into W-minute chunks, and we extract audio and visual features from each of these chunks. Next, we apply k-means clustering to assign a commercial/program label to each chunk. In contrast to other methods, we do not make any assumptions regarding the program content. Thus, our method is highly content-adaptive and computationally inexpensive. Through empirical studies on various content, including American news, Japanese news, and sports programs, we demonstrate that our method is able to filter out most of the commercials without falsely removing the regular program.
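The chunk-and-cluster pipeline described above can be sketched as follows. The per-chunk features (mean audio energy and shot-cut rate) and the rule that the minority cluster is labelled as commercials are illustrative assumptions; the paper's actual features and labelling rule may differ:

```python
import numpy as np

def chunk_features(energy, cut_times, chunk_len, n_chunks):
    # hypothetical per-chunk features: mean audio energy and shot-cut rate
    feats = []
    for i in range(n_chunks):
        lo, hi = i * chunk_len, (i + 1) * chunk_len
        feats.append([energy[lo:hi].mean(),
                      sum(lo <= t < hi for t in cut_times) / chunk_len])
    return np.array(feats)

def label_chunks(feats, iters=25):
    # standardise features, run 2-means, and assume the smaller cluster
    # corresponds to commercials
    X = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)
    c = X[[X.sum(1).argmin(), X.sum(1).argmax()]]
    for _ in range(iters):
        lab = ((X[:, None, :] - c) ** 2).sum(-1).argmin(1)
        c = np.array([X[lab == k].mean(0) if (lab == k).any() else c[k]
                      for k in (0, 1)])
    minority = 0 if (lab == 0).sum() < (lab == 1).sum() else 1
    return np.where(lab == minority, "commercial", "program")
```

No genre-specific thresholds appear anywhere in this pipeline, which is what makes the approach content-adaptive in the sense the abstract describes.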
In our past work, we attempted to use a mid-level feature, namely the state population histogram obtained from the Hidden Markov Model (HMM) of a general sound class, for speaker change detection, so as to extract semantic boundaries in broadcast news. In this paper, we compare the performance of our previous approach with another approach based on video shot detection and speaker change detection using the Bayesian Information Criterion (BIC). Our experiments show that the latter approach performs significantly better than the former. This motivated us to examine the mid-level feature closely. We found that the component population histogram enabled discovery of broad phonetic categories such as vowels, nasals, and fricatives, regardless of the number of distinct speakers in the test utterance. In order for it to be useful for speaker change detection, the individual components should model the phonetic sounds of each speaker separately. From our experiments, we conclude that state/component population histograms can only be useful for further clustering or semantic class discovery if the features are chosen carefully, so that the individual states represent the semantic categories of interest.
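The BIC-based speaker change detection referred to above typically scores each candidate change point by the gain of a two-Gaussian model over a one-Gaussian model of the feature sequence. A sketch of the standard ΔBIC criterion (Chen and Gopalakrishnan form), with the features and penalty weight as free choices:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    # BIC gain of modelling frames 0..t-1 and t..n-1 as two Gaussians
    # versus one Gaussian over all n frames; positive supports a change
    n, d = X.shape
    def logdet(a):
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return (0.5 * n * logdet(X) - 0.5 * t * logdet(X[:t])
            - 0.5 * (n - t) * logdet(X[t:]) - penalty)

def best_change(X, margin=10):
    # scan candidate change points, keeping a margin so each side has
    # enough frames for a covariance estimate
    ts = np.arange(margin, len(X) - margin)
    scores = np.array([delta_bic(X, int(t)) for t in ts])
    i = int(scores.argmax())
    return int(ts[i]), float(scores[i])
```

When the feature statistics genuinely shift (for example at a speaker change), the maximum ΔBIC lands near the true boundary and is positive; on homogeneous audio the penalty keeps all scores negative.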
KEYWORDS: Visualization, Multimedia, Statistical analysis, Image processing, Signal processing, Data modeling, Digital watermarking, Statistical modeling, Error control coding, Machine learning
Robust hash functions are central to the security of multimedia content authentication systems. Such functions are sensitive to a key but robust to many allowed signal processing operations on the underlying content. Robustness of the hash function to changes in the original content implies the existence of a cluster in the feature space around the original content's feature vector, any point within which gets hashed to the same output. The shape and size of the cluster determine the trade-off between the robustness offered and the security of the authentication system based on the robust hash function. The clustering itself is based on a secret key and hence unknown to the attacker. However, we show in this paper that the specific clustering arrived at by a robust hash function may be possible to learn. Specifically, we look at a well-known robust hash function for image data called the Visual Hash Function (VHF). Given just an input and its hash value, we show how to construct a statistical model of the hash function without any knowledge of the secret key used to compute the hash. We also show how to use this model to engineer arbitrary and malicious collisions. Finally, we propose one possible modification to VHF so that constructing a model that mimics its behavior becomes difficult.
Casey describes a generalized sound recognition framework based on reduced rank spectra and minimum-entropy priors. This approach enables successful recognition of a wide variety of sounds such as male speech, female speech, music, and animal sounds. In this work, we apply this recognition framework to news video to enable quick video browsing. We identify speaker change positions in broadcast news using the sound recognition framework. We combine the speaker change positions with color and motion cues from the video to locate the beginning of each of the topics covered by the news video. We can thus skim the video by merely playing a small portion starting from each of the locations where one of the principal cast begins to speak. In combination with our motion-based video browsing approach, our technique provides simple automatic news video browsing. While similar work has been done before, our approach is simpler and faster than competing techniques, and provides a rich framework for further analysis and description of content.
KEYWORDS: Multimedia, Signal processing, Feature extraction, Linear filtering, Filtering (signal processing), Image compression, Digital watermarking, Receivers, Data modeling, Data hiding
The goal of audio content authentication techniques is to separate malicious manipulations from authentic signal processing operations such as compression and filtering. The key difference between malicious operations and signal processing operations is that the latter tend to preserve the perceptual content of the underlying audio signal. Hence, in order to separate malicious operations from allowed operations, a content authentication procedure should be based on a model that approximates human perception of audio. In this paper, we propose an audio content authentication technique based on an invariant feature shared by perceptually similar audio data: the masking curve. We also evaluate the performance of this technique by embedding a hash based on the masking curve into the audio signal using an existing transparent and robust data hiding technique. At the receiver, the same content-based hash is extracted from the audio and compared with the calculated hash bits. Correlation between the calculated and extracted hash bits degrades gracefully with the perceived quality of the received audio. This implies that the threshold for authentication can be adapted to the required level of perceptual quality at the receiver. Experimental results show that this content-based hash is able to differentiate allowed signal processing applications such as MP3 compression from certain malicious operations that modify the perceptual content of the audio.
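The masking-curve computation itself is not specified above; as a crude stand-in, one can threshold log band energies of the spectrum to obtain one hash bit per band, and compare calculated and extracted bits by their agreement rate. This sketch ignores the psychoacoustic model and the data hiding channel, and the band count is an arbitrary choice:

```python
import numpy as np

def band_hash(x, n_bands=16):
    # stand-in for a masking-curve hash: log energy in uniform spectral
    # bands, thresholded at the mean to give one bit per band
    spec = np.abs(np.fft.rfft(x)) ** 2
    bands = np.array_split(spec[1:], n_bands)
    e = np.log1p(np.array([b.sum() for b in bands]))
    return (e > e.mean()).astype(int)

def agreement(h1, h2):
    # fraction of matching hash bits; degrades with perceptual distortion
    return float((h1 == h2).mean())
```

A mild distortion (light additive noise) leaves the band pattern, and hence the bits, essentially intact, while a signal with different spectral content flips bits, which is the graceful-degradation behaviour the authentication threshold exploits.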
We describe a technique for reducing the data set for principal cast and other talking head detection in broadcast news content using the spatial attributes of the MPEG-7 motion activity descriptor. The fact that these descriptors are easy to extract in the compressed domain and also work well when used for matching talking head sequences motivated us to use them to rapidly prune the data set for subsequent, more sophisticated face detection techniques. We are thus able to speed up the process of finding the principal cast in broadcast news content by reducing the number of segments on which computationally more expensive face detection and recognition is employed. We present experimental results for two clustering procedures. The first is based on a single template derived from the centroid of the ground truth set and is computationally less expensive. The second clustering procedure is based on multiple templates, which are the mean feature vectors of the component Gaussians of a Gaussian Mixture Model (GMM) trained to best fit the training data. We are able to save 50% of the computation, measured as the ratio of rejected shots to total shots, while missing 25% of the talking head shots in the news program. We also observe that the second clustering procedure, while slightly more computationally intensive, allows for higher pruning factors with more accuracy.