Pedestrian detection is a problem of particular interest in both academia and industry. However, most existing pedestrian detection methods fail to detect small-scale pedestrians due to weak contrast and motion blur in images and videos. In this paper, we propose a multi-level feature fusion strategy to detect multi-scale pedestrians, which works particularly well for small-scale pedestrians that are relatively far from the camera: the fusion makes the shallow feature maps encode more semantic and global information. In addition, we redesign the aspect ratio of the anchors to make them more robust for the pedestrian detection task. Extensive experiments on both the Caltech and CityPersons datasets demonstrate that our method outperforms state-of-the-art pedestrian detection algorithms. Our approach achieves MR−2 values of 0.84%, 23.91% and 62.19% under the “Near”, “Medium” and “Far” settings respectively on the Caltech dataset, and also achieves a better speed-accuracy trade-off than competing methods on the CityPersons dataset, at 0.28 seconds per 1024×2048 image.
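As an illustration of the anchor redesign, the sketch below generates single-ratio anchors shaped for upright pedestrians; the 0.41 width-to-height ratio and the scale set are illustrative assumptions, not necessarily the paper's exact values.

```python
import numpy as np

def pedestrian_anchors(base_size=16, scales=(2, 4, 8, 16, 32), ratio=0.41):
    """Generate centered anchor boxes (x1, y1, x2, y2) that all share one
    pedestrian-like aspect ratio (width / height), instead of the mixed
    square/wide ratios commonly used for generic object detection."""
    anchors = []
    for s in scales:
        h = base_size * s          # anchor height grows with scale
        w = h * ratio              # width fixed by the pedestrian ratio
        anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors)
```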
Object detection is one of the most popular and difficult fields in computer vision. Although deep learning methods achieve strong performance on object detection, algorithms using hand-crafted features are still widely used in specific applications. One main problem in object detection is the scale problem. Algorithms usually use an image pyramid to cover as many scales as possible, but gaps still exist between the scale levels of the pyramid. Our work inserts sub-scale levels to fill these gaps. To this end, we use a Gaussian scale pyramid to generate sub-scale images and extract HOG features on the sub-scales. We build on the framework offered by the DPM algorithm and modify it. We compare the results of our method with the DPM baseline on the Pascal VOC dataset. Our method performs well on several categories and improves overall performance. This work can be used in other object detection frameworks: we apply the multi-scale HOG feature in the pre-processing stage of our own detection framework and test it on our own dataset, where the pre-processing stage gains higher precision and recall than with the original HOG feature architecture.
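A minimal sketch of the sub-scale idea, assuming skimage's HOG and OpenCV pyramids stand in for the paper's implementation; the blur sigmas and step counts are illustrative.

```python
import cv2
from skimage.feature import hog

def sub_scale_hog(gray, octaves=3, subs_per_octave=2):
    """Compute HOG descriptors on sub-scales that sit between the usual
    octave levels of an image pyramid, narrowing the gaps between scale
    levels. Input is a 2-D grayscale image."""
    feats = []
    img = gray
    for _ in range(octaves):
        for s in range(subs_per_octave):
            scale = 2.0 ** (s / subs_per_octave)   # sub-scale inside the octave
            blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=scale)
            resized = cv2.resize(blurred, None, fx=1.0 / scale, fy=1.0 / scale)
            feats.append(hog(resized, pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2)))
        img = cv2.pyrDown(img)                     # move to the next octave
    return feats
```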
Correlation Filter (CF) based trackers have recently attracted many researchers’ attention because of their high efficiency and robustness. Nevertheless, CF trackers usually require a cosine window to counter boundary effects, which restricts them to distinguishing the target within only a small background area. In this paper, we propose an online learning algorithm that employs the global context to alleviate this problem. It is based on the Passive-Aggressive algorithm and incorporates context information within CF trackers. In addition, we train an SVM classifier to re-detect objects in case of model drift caused by occlusion, fast motion, etc. The results of extensive experiments on a large-scale benchmark dataset show that the proposed tracker outperforms state-of-the-art trackers.
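The sketch below shows the general context-aware correlation filter idea in its closed Fourier-domain form, with surrounding context patches regressed to zero; it is a simplification and does not reproduce the paper's Passive-Aggressive online update or its SVM re-detector.

```python
import numpy as np

def train_context_filter(target, contexts, y, lam1=1e-4, lam2=0.5):
    """Closed-form Fourier-domain correlation filter that regresses the
    target patch to a Gaussian label y while regressing surrounding context
    patches to zero, so background responses are suppressed."""
    xf = np.fft.fft2(target)
    yf = np.fft.fft2(y)
    denom = np.conj(xf) * xf + lam1
    for c in contexts:                       # context patches act as regularizers
        cf = np.fft.fft2(c)
        denom += lam2 * np.conj(cf) * cf
    return np.conj(xf) * yf / denom          # filter in the Fourier domain

def detect(wf, search_patch):
    """Peak of the correlation response gives the new target position."""
    response = np.real(np.fft.ifft2(wf * np.fft.fft2(search_patch)))
    return np.unravel_index(np.argmax(response), response.shape)
```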
Image quality assessment is needed in multiple image processing areas, and blur is one of the key causes of image degradation. Although effective full-reference image quality assessment metrics have been proposed in the past few years, no-reference assessment is still an area of active research. To address this problem, this paper proposes a no-reference sharpness assessment method based on the wavelet transform that focuses on the edge areas of the image. Based on two simple characteristics of the human visual system, weights are introduced to compute a weighted log-energy for each wavelet subband. The final score is given by the ratio of high-frequency energy to total energy. The algorithm is tested on multiple databases. Compared with several state-of-the-art metrics, the proposed algorithm achieves better performance with lower runtime cost.
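A minimal sketch of the score, assuming PyWavelets for the decomposition; the level weights are illustrative (the paper derives its weights from human-vision characteristics), and the restriction to edge areas is omitted for brevity.

```python
import numpy as np
import pywt

def sharpness_score(gray, levels=3, weights=(1.0, 2.0, 4.0)):
    """No-reference sharpness as the ratio of weighted high-frequency wavelet
    log-energy to total log-energy. Subband weights run from the coarsest to
    the finest detail level."""
    coeffs = pywt.wavedec2(gray.astype(float), 'db1', level=levels)
    total = np.log1p((coeffs[0] ** 2).sum())             # approximation energy
    high = 0.0
    for w, (cH, cV, cD) in zip(weights, coeffs[1:]):     # details, coarsest first
        e = w * np.log1p((cH**2).sum() + (cV**2).sum() + (cD**2).sum())
        high += e
        total += e
    return high / total
```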
Object tracking in video sequences has broad applications in both military and civilian domains. However, as the length of the input video sequence increases, a number of problems arise, such as severe object occlusion, object appearance variation, and object out-of-view (some portion or all of the object leaves the image space). To deal with these problems and distinguish the tracked object from cluttered background, we present a robust appearance model using Speeded Up Robust Features (SURF) and advanced integrated features consisting of Felzenszwalb's Histogram of Oriented Gradients (FHOG) and color attributes. Since re-detection is essential in long-term tracking, we develop an effective object re-detection strategy based on moving area detection. We employ the popular kernel correlation filters in our algorithm design, which facilitates high-speed object tracking. Our evaluation on the CVPR2013 Object Tracking Benchmark (OTB2013) dataset shows that the proposed algorithm outperforms reference state-of-the-art trackers in various challenging scenarios.
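As a rough sketch of the integrated appearance features, the snippet below concatenates gradient features with a coarse color description; skimage's HOG and a Lab histogram are stand-ins for the FHOG and color-attributes features named in the paper.

```python
import cv2
import numpy as np
from skimage.feature import hog

def integrated_features(patch_bgr):
    """Concatenate gradient features with a coarse color description of a
    target patch (BGR uint8 input)."""
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    grad = hog(gray, pixels_per_cell=(4, 4), cells_per_block=(2, 2))
    lab = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2LAB)
    # joint histogram over the two chromatic channels (a, b)
    hist = cv2.calcHist([lab], [1, 2], None, [8, 8], [0, 256, 0, 256]).ravel()
    return np.concatenate([grad, hist / (hist.sum() + 1e-9)])
```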
Object tracking is a challenging research task due to target appearance variation caused by deformation and occlusion. Keypoint matching based trackers can handle partial occlusion, but they are vulnerable to matching faults and inflexible to target deformation. In this paper, we propose an innovative keypoint matching procedure to address these issues. Firstly, the scale and orientation of corresponding keypoints are used to estimate the target’s state. Secondly, a kernel function is employed to discard mismatched keypoints, improving estimation accuracy. Thirdly, a model updating mechanism is applied to adapt to target deformation. Moreover, to avoid bad updates, backward matching is used to determine whether or not to update the target model. Extensive experiments on challenging image sequences show that our method performs favorably against state-of-the-art methods.
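A minimal sketch of the backward-matching check and the scale/orientation estimate, assuming ORB as an illustrative detector (the paper's keypoints and kernel-based rejection may differ):

```python
import cv2
import numpy as np

def backward_checked_matches(img_prev, img_curr):
    """Match keypoints forward (prev -> curr) and keep only matches whose
    backward match returns to the same source keypoint, a common guard
    against matching faults."""
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)
    fwd = bf.match(des1, des2)
    back = {m.queryIdx: m.trainIdx for m in bf.match(des2, des1)}
    good = [m for m in fwd if back.get(m.trainIdx) == m.queryIdx]
    return kp1, kp2, good

def estimate_scale_rotation(kp1, kp2, matches):
    """Estimate the target's scale and rotation change from the keypoints'
    built-in size (diameter) and angle (degrees) attributes."""
    scales = [kp2[m.trainIdx].size / kp1[m.queryIdx].size for m in matches]
    angles = [kp2[m.trainIdx].angle - kp1[m.queryIdx].angle for m in matches]
    return float(np.median(scales)), float(np.median(angles))
```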
Visual tracking is a challenging problem in computer vision. In recent years, a significant number of trackers have been proposed. Among them, tracking with dense spatio-temporal context has proved to be an efficient and accurate approach. Unlike trackers with online-trained classifiers, which struggle to meet the requirements of real-time tracking, a tracker with spatio-temporal context can run at hundreds of frames per second using the Fast Fourier Transform (FFT). Nevertheless, the performance of a spatio-temporal context tracker relies heavily on the learning rate of the context, which restricts its robustness.
In this paper, we propose a tracking method with dual spatio-temporal context trackers that hold different learning rates during tracking. The tracker with the high learning rate can track the target smoothly when the target’s appearance changes, while the tracker with the low learning rate can perceive occlusions and continue tracking when the target emerges again. To find the target among the candidates from these two trackers, we adopt the Normalized Correlation Coefficient (NCC) to evaluate the confidence of each sample. Experimental results show that the proposed algorithm performs favorably against several state-of-the-art tracking methods.
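A minimal sketch of the dual-rate scheme: the same linear model update runs with two learning rates, and NCC picks between the two experts' candidate patches. The function names and example rates are illustrative.

```python
import numpy as np

def update_model(model, new_context, rho):
    """Linear-interpolation update used by spatio-temporal context trackers;
    rho is the learning rate the two experts hold at different values
    (e.g. rho_fast = 0.075, rho_slow = 0.005 -- illustrative numbers)."""
    return (1.0 - rho) * model + rho * new_context

def ncc(template, candidate):
    """Normalized Correlation Coefficient used to score each candidate."""
    t = template - template.mean()
    c = candidate - candidate.mean()
    return float((t * c).sum() / (np.sqrt((t * t).sum() * (c * c).sum()) + 1e-12))

def select_candidate(template, cand_fast, cand_slow):
    """Keep the candidate patch whose appearance best matches the template."""
    return cand_fast if ncc(template, cand_fast) >= ncc(template, cand_slow) else cand_slow
```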
Action recognition is a very challenging task in the field of real-time video surveillance. Traditional action recognition models are built on spatio-temporal features and Bag-of-Features representations. On top of this model, current research tends to introduce dense sampling to achieve better performance. However, such approaches are computationally intractable on large video datasets. Hence, some recent works have focused on feature reduction to speed up the algorithms without sacrificing accuracy.
In this paper, we propose a novel selective feature sampling strategy for action recognition. Firstly, the optical flow field is estimated throughout the input video. Then, sparse FAST (Features from Accelerated Segment Test) points are selected within the motion regions detected using the optical flow on temporally down-sampled image sequences. These selected features, the sparse FAST points, serve as seeds to generate 3D patches. A simplified LPM (Local Part Model), which greatly speeds up the method, is then formed from the 3D patches. Moreover, MBHs (Motion Boundary Histograms) calculated from the optical flow are also adopted in the framework to further improve efficiency. Experimental results on the UCF50 dataset and our artificial dataset show that our method runs closer to real time and achieves higher accuracy than other competitive methods published recently.
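A sketch of the selective sampling step, assuming Farneback dense flow; the flow parameters and both thresholds are illustrative choices.

```python
import cv2
import numpy as np

def motion_fast_points(prev_gray, curr_gray, mag_thresh=1.0):
    """Detect FAST corners only inside regions where dense optical flow
    indicates motion, so feature extraction is restricted to moving areas."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    motion_mask = (magnitude > mag_thresh).astype(np.uint8) * 255
    fast = cv2.FastFeatureDetector_create(threshold=20)
    return fast.detect(curr_gray, mask=motion_mask)
```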
Object tracking is a challenging task in computer vision. Most state-of-the-art methods maintain an object model and update it with new examples obtained from incoming frames in order to deal with appearance variation. Updating the object model frame by frame without any supervision mechanism inevitably introduces model drift. In this paper, we adopt a multi-expert tracking framework that is able to correct the effect of bad updates after they happen, such as those caused by severe occlusion; this is exactly the ability a robust tracking method should possess. The expert ensemble consists of a base tracker and its former snapshots, and the tracking result is produced by the expert selected by means of a simple loss function. We adopt an improved compressive tracker as the base tracker in our work and modify it to fit the multi-expert framework. The proposed multi-expert tracking algorithm significantly improves the robustness of the base tracker, especially in scenes with frequent occlusions and illumination variations. Experiments on challenging video sequences with comparisons to several state-of-the-art trackers demonstrate the effectiveness of our method and show that it runs in real time.
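A skeleton of the snapshot-based ensemble, assuming a hypothetical track() interface on the base tracker and a placeholder loss function; the snapshot interval and expert count are illustrative.

```python
import copy

class MultiExpertEnsemble:
    """Keep the live base tracker plus periodic frozen copies of its former
    self, and let the expert with the lowest loss produce each frame's
    output. track() and loss_fn are hypothetical placeholders."""
    def __init__(self, base_tracker, snapshot_every=50, max_experts=4):
        self.base = base_tracker
        self.snapshots = []
        self.snapshot_every = snapshot_every
        self.max_experts = max_experts
        self.frame = 0

    def step(self, image, loss_fn):
        self.frame += 1
        if self.frame % self.snapshot_every == 0:
            self.snapshots.append(copy.deepcopy(self.base))   # freeze a former self
            self.snapshots = self.snapshots[-(self.max_experts - 1):]
        experts = [self.base] + self.snapshots
        results = [e.track(image) for e in experts]           # each expert proposes a box
        return min(results, key=loss_fn)                      # simple loss picks the output
```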
KEYWORDS: Super resolution, Video, Associative arrays, Video coding, Video processing, Image processing, Visualization, Feature extraction, Information visualization
Methods for super-resolution can be classified into three categories: (i) interpolation-based methods, (ii) reconstruction-based methods, and (iii) learning-based methods. Learning-based methods usually have the best performance, but their high computational complexity prevents them from being applied to video super-resolution. We propose a fast sparsity-based video super-resolution algorithm that utilizes inter-frame information. Firstly, the background is extracted via existing methods, such as the Gaussian Mixture Model (GMM) used in this paper. Secondly, we construct background and foreground patch dictionaries by randomly sampling patches from high-resolution video. During video super-resolution, only the foreground regions are reconstructed with the foreground dictionary via sparse coding; the background is updated separately, and only its changed regions are reconstructed with the background dictionary in the same way. Finally, the background and foreground are fused to produce the super-resolution result. Experiments show that this makes sparsity-based video super-resolution much faster, with comparable or even better performance.
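A minimal sketch of the per-patch sparse-coding step, assuming coupled low/high-resolution dictionaries (rows are atoms) and scikit-learn's SparseCoder; in the paper this step runs only on foreground regions and changed background regions, which is the source of the speed-up.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def reconstruct_patch(lr_patch, dict_lr, dict_hr, n_nonzero=5):
    """Sparse-code one low-resolution patch over the LR dictionary and
    rebuild its high-resolution counterpart with the coupled HR dictionary,
    sharing the same sparse code."""
    coder = SparseCoder(dictionary=dict_lr, transform_algorithm='omp',
                        transform_n_nonzero_coefs=n_nonzero)
    alpha = coder.transform(lr_patch.reshape(1, -1))   # sparse coefficients
    return (alpha @ dict_hr).ravel()                   # HR patch from same code
```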
In this paper, we propose a region-of-interest-based (RoI-adaptive) fusion algorithm for infrared and visible images using the Laplacian pyramid method. Firstly, we estimate the saliency map of the infrared image and, by normalizing the saliency map, divide the infrared image into two parts: the regions of interest (RoI) and the regions of non-interest (nRoI). The visible image is also segmented into two parts using a Gaussian high-pass filter: the regions of high frequency (RoH) and the regions of low frequency (RoL). Secondly, we down-sample both the nRoI of the infrared image and the RoL of the visible image as the input to the next pyramid level. Finally, we use the normalized saliency map of the infrared image as the weighting coefficient to obtain the base image at the top level, and choose the maximum gray value of the infrared RoI and the visible RoH to obtain the detail image. In this way, our method preserves the target features of the infrared image and the texture details of the visible image at the same time. Experimental results show that this fusion scheme performs better than other fusion algorithms in terms of both human visual perception and quantitative metrics.
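A simplified sketch of saliency-weighted Laplacian pyramid fusion with OpenCV, assuming grayscale float inputs and a saliency map normalized to [0, 1]; it keeps the max-of-details rule and the saliency-weighted base image but omits the paper's RoI/nRoI and RoH/RoL splitting.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Standard Laplacian pyramid built with OpenCV."""
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = [gauss[i] - cv2.pyrUp(gauss[i + 1], dstsize=gauss[i].shape[1::-1])
           for i in range(levels)]
    return lap + [gauss[-1]]                 # detail levels + top-level base image

def fuse(ir, vis, saliency, levels=4):
    """Max-of-details fusion with a saliency-weighted base image."""
    lp_ir, lp_vis = laplacian_pyramid(ir, levels), laplacian_pyramid(vis, levels)
    fused = [np.where(np.abs(a) > np.abs(b), a, b)      # keep the stronger detail
             for a, b in zip(lp_ir[:-1], lp_vis[:-1])]
    w = cv2.resize(saliency.astype(np.float32), lp_ir[-1].shape[1::-1])
    fused.append(w * lp_ir[-1] + (1.0 - w) * lp_vis[-1])
    out = fused[-1]
    for lap in reversed(fused[:-1]):                    # collapse the pyramid
        out = cv2.pyrUp(out, dstsize=lap.shape[1::-1]) + lap
    return out
```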
Detecting dim and small targets in infrared images and videos is one of the most important techniques in many computer vision applications, such as video surveillance and precise infrared imaging guidance. In this paper, we propose a real-time target detection approach for infrared imagery that combines saliency detection with local average filtering. First, we compute the log-amplitude spectrum of the infrared image. Second, we find the spikes of the amplitude spectrum using the cubic facet model and suppress the sharp spikes using local average filtering. Finally, the detection result in the spatial domain is obtained by reconstructing the 2D signal from the original phase and the filtered amplitude spectrum. Experimental results on infrared images with different types of backgrounds demonstrate the high efficiency and accuracy of the proposed method in detecting dim and small targets.
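A compact sketch of the spectral pipeline under simplifying assumptions: the whole log-amplitude spectrum is smoothed by a local average filter in place of the paper's facet-model spike detection, and a simple mean-plus-k-sigma rule stands in for its decision step.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def detect_small_targets(ir_gray, smooth_size=5, k=3.0):
    """Suppress amplitude-spectrum spikes by local averaging, then rebuild
    the image from the original phase and the filtered amplitude; regular
    background is attenuated so dim, small targets pop out."""
    f = np.fft.fft2(ir_gray.astype(float))
    log_amp = np.log1p(np.abs(f))                 # log-amplitude spectrum
    phase = np.angle(f)                           # original phase is kept
    filtered = uniform_filter(log_amp, size=smooth_size)
    recon = np.fft.ifft2(np.expm1(filtered) * np.exp(1j * phase))
    saliency = np.abs(recon) ** 2
    return saliency > saliency.mean() + k * saliency.std()
```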
Detecting abnormal human behaviors is one of the most challenging tasks in video surveillance for public security control. The Interaction Energy Potential model is an effective and competitive recently published method for detecting abnormal behaviors, but its model of abnormal behaviors is not accurate enough, so it has some limitations. To solve this problem, we propose a novel Particle Motion model. Firstly, we extract the foreground to improve the accuracy of interest point detection, since a complex background usually degrades it substantially. Secondly, we detect the interest points using image features; the movement of each human target can then be represented by the movements of the interest points detected on that target. Next, we track these interest points through the video to record their positions and velocities, from which the velocity angles, position angles and distances between each pair of points can be calculated. Finally, we propose a Particle Motion model to calculate an eigenvalue for each frame, and an adaptive threshold method to detect abnormal behaviors. Experimental results on the BEHAVE dataset and online videos show that our method detects fight and robbery events effectively and has promising performance.
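A sketch of the pairwise feature computation that feeds the model; the per-frame eigenvalue and the adaptive threshold are not reproduced here.

```python
import numpy as np

def pairwise_motion_features(positions, velocities):
    """For every pair of tracked interest points (both (N, 2) arrays),
    compute the distance, position angle and velocity-angle difference."""
    feats = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[j] - positions[i]
            dist = float(np.linalg.norm(d))
            pos_angle = float(np.arctan2(d[1], d[0]))
            a_i = np.arctan2(velocities[i][1], velocities[i][0])
            a_j = np.arctan2(velocities[j][1], velocities[j][0])
            feats.append((dist, pos_angle, float(a_i - a_j)))
    return feats
```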
KEYWORDS: Distributed interactive simulations, Image processing, Information technology, Visualization, Image segmentation, Human vision and color perception, Information visualization, Associative arrays, Control systems, Databases
The task of salient region detection aims at identifying the most important and informative regions of an image. In this work, we propose a novel method that tackles this task as a process from superpixel-level locating to pixel-level refining. Firstly, we over-segment the image into superpixels and compute an affinity matrix that estimates the similarity between each pair of superpixels according to both color contrast and spatial distribution. The matrix is then used to aggregate superpixels into several clusters via affinity propagation. To measure the saliency of each cluster, three parameters are taken into account: color contrast, cluster compactness and proximity to the focus. We designate the most salient one to three clusters as the coarse salient region. For the refining step, we regard each selected superpixel as an influential center, so the saliency value of a pixel is determined jointly by all the selected superpixels. In practice, several Gaussian curves are constructed based on the selected superpixels, and the pixel-wise saliency value is decided by the color difference and spatial distance between a pixel and the curves’ centers. We evaluate our algorithm on a publicly available dataset with human annotations, and experimental results show that our approach achieves competitive performance.
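A sketch of the locating stage, assuming SLIC superpixels and scikit-learn's affinity propagation; the similarity weighting is an illustrative stand-in for the paper's combination of color contrast and space distribution.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import AffinityPropagation

def cluster_superpixels(image, n_segments=300, spatial_weight=0.1):
    """Over-segment with SLIC, then aggregate superpixels by affinity
    propagation on a similarity built from mean color and centroid distance."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    n = labels.max() + 1
    colors = np.array([image[labels == k].mean(axis=0) for k in range(n)])
    ys, xs = np.indices(labels.shape)
    cents = np.array([[ys[labels == k].mean(), xs[labels == k].mean()]
                      for k in range(n)])
    color_d = np.linalg.norm(colors[:, None] - colors[None], axis=-1)
    space_d = np.linalg.norm(cents[:, None] - cents[None], axis=-1)
    affinity = -(color_d + spatial_weight * space_d)     # higher = more similar
    clusters = AffinityPropagation(affinity='precomputed',
                                   random_state=0).fit_predict(affinity)
    return labels, clusters
```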
Accurate and fast detection of small infrared targets is of great importance for precise infrared guidance, early warning, video surveillance, etc. Based on the human visual attention mechanism, an automatic detection algorithm for small infrared targets is presented. In this paper, instead of searching for infrared targets directly, we model the regular patches that do not attract much attention from our visual system. This is inspired by the observation that regular patches in the spatial domain correspond to spikes in the amplitude spectrum. Unlike recent approaches using global spectral filtering, we define the concept of local maxima suppression using local spectral filtering to smooth the spikes in the amplitude spectrum, thereby producing the pop-out of the infrared targets. In the proposed method, we first compute the amplitude spectrum of an input infrared image. Second, we find the local maxima of the amplitude spectrum using the cubic facet model. Third, we suppress the local maxima by convolving the local spectrum with a low-pass Gaussian kernel of an appropriate scale. Finally, the detection result in the spatial domain is obtained by reconstructing the 2D signal from the original phase and the log-amplitude spectrum with its local maxima suppressed. Experiments are performed on real-life IR images, and the results show that the proposed method achieves satisfactory detection effectiveness and robustness. Meanwhile, it has high detection efficiency and can further be used for real-time detection and tracking.
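The local suppression step might look like the sketch below, where a plain maximum filter stands in for the paper's cubic facet model and only the detected spike locations are replaced by their Gaussian-smoothed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def suppress_local_maxima(log_amp, nbhd=5, sigma=2.0):
    """Replace local maxima of the log-amplitude spectrum with their
    Gaussian-smoothed values; all other spectrum entries stay untouched."""
    spikes = log_amp == maximum_filter(log_amp, size=nbhd)
    smoothed = gaussian_filter(log_amp, sigma=sigma)   # low-pass Gaussian kernel
    out = log_amp.copy()
    out[spikes] = smoothed[spikes]
    return out
```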
Abnormal event detection in crowded scenes is one of the most challenging tasks in video surveillance for public security control. Different from previous learning-based work, we propose an unsupervised Interaction Power model with an adaptive threshold strategy to detect abnormal group activity by analyzing the steady state of individuals’ behaviors in the crowded scene. Firstly, the optical flow field of potential pedestrians is calculated only within the extracted foreground to reduce the computational cost. Secondly, each pedestrian is divided into patches of the same size, and the interaction power of the pedestrians is represented by motion particles that describe the motion status at the center pixels of the patches; the motion status of each patch is computed from the optical flow of the pixels within the patch. For each motion particle, its interaction power, defined as the steady state of its current behavior, is computed over all its neighboring motion particles. Finally, the dense crowd’s steady state can be represented as a collection of the motion particles’ interaction powers. An adaptive threshold strategy is proposed to detect abnormal events by examining the frame power field, which is a fixed-size random sample of the interaction powers of the motion particles. Experimental results on the standard UMN dataset and online videos show that our method detects crowd anomalies and achieves higher accuracy than other competitive methods published recently.
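A minimal sketch of the sampling-plus-threshold step, where a running k-sigma rule is a stand-in for the paper's adaptive threshold strategy; sample size, k, and burn-in length are illustrative.

```python
import numpy as np

def frame_is_abnormal(powers, history, k=3.0, sample_size=200, burn_in=10):
    """Score a frame from a fixed-size random sample of its motion particles'
    interaction powers and flag it when the score leaves the range seen so far."""
    powers = np.asarray(powers, dtype=float)
    idx = np.random.choice(len(powers), size=min(sample_size, len(powers)),
                           replace=False)
    frame_power = float(powers[idx].mean())
    abnormal = False
    if len(history) >= burn_in:                 # wait for a short burn-in
        mu, sd = np.mean(history), np.std(history) + 1e-9
        abnormal = abs(frame_power - mu) > k * sd
    history.append(frame_power)
    return abnormal
```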
An autonomous system must be able to estimate or control its own motion parameters. Tens of research works already exist to fulfill this task, but most of them are based on establishing motion correspondences or estimating full optical flow. Such solutions put restrictions on the scene: either enough distinct features must be present, or the texture must be dense. Different from these traditional works, and using neither motion correspondences nor epipolar geometry, we start from the normal flow data and make good use of every piece of it, because it may be only sparsely available. We apply the spherical image model to avoid ambiguity in describing the camera motion. Since each normal flow measurement gives a locus for the camera motion, the intersection of the loci offered by different data points narrows down the possible camera motions and can even pinpoint the solution. A voting scheme in the φ-θ domain is applied to reduce the 3D voting space to a 2D one. We tested the algorithm on both synthetic image data and real image sequences; the experimental results illustrate the potential of the method.
Infrared images usually suffer from low contrast, blurred edges and a large amount of noise. Aiming at improving infrared image quality, this paper presents a novel adaptive algorithm for infrared image enhancement. Firstly, the input image is decomposed via the nonsubsampled Contourlet transform (NSCT) to obtain subband coefficients at different scales and directions. Next, the high-frequency coefficients are automatically classified into three categories by an adaptive classification method that analyzes each coefficient in its local neighborhood. After that, a nonlinear mapping function is adopted to modify the coefficients, in order to highlight edges and suppress high-frequency noise. Finally, the enhanced image is obtained by reconstruction from the modified coefficients. Experimental results show that the proposed algorithm effectively enhances image contrast and highlights edges while avoiding image distortion and noise amplification.
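The coefficient modification could take a shape like the sketch below, which shrinks likely-noise coefficients and boosts likely-edge ones; the thresholds, gains, and the magnitude-based classification are illustrative stand-ins for the paper's neighborhood analysis in NSCT subbands.

```python
import numpy as np

def nonlinear_map(coeff, t_noise, t_edge, edge_gain=1.5, noise_gain=0.2):
    """Piecewise mapping for high-frequency subband coefficients: shrink the
    weak (likely noise) ones, boost the strong (likely edge) ones, and keep
    the textures in between unchanged."""
    mag = np.abs(coeff)
    return np.where(mag < t_noise, coeff * noise_gain,        # suppress noise
           np.where(mag > t_edge, coeff * edge_gain, coeff))  # highlight edges
```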
This paper proposes a novel scene matching method that aims to automatically detect unauthorized changes of a camera’s field of view (FOV). The problem is substantially difficult because FOV changes and scene content variations are mixed together in actual situations. In this work, a local viewpoint-invariant descriptor is first proposed to measure the appearance similarity of the captured scenes. Then a structural similarity constraint is adopted to further determine whether the current scene remains the same despite content changes within it. Experimental results demonstrate that the proposed method works well in the presence of viewpoint change, partial occlusion and structural similarities in real environments. The proposed scheme has proved practically applicable and reliable through its use in an actual intelligent surveillance system.
This paper presents a global sparse matching algorithm based on Delaunay triangulation to obtain reliable matching results for feature points detected in different images. Our approach can solve the matching problem for images captured under a certain range of scaling, rotation and translation, together with affine distortion, added noise, and illumination change. Considering that it is hard to obtain a high percentage of correctly matched point pairs, we present this Delaunay-triangulation-based matching method to improve the accuracy of the matching results. First, corner detection algorithms are applied to obtain accurate feature point locations. Then the relations among the feature points are constructed as a triangle net according to Delaunay triangulation. The feature point matching problem is thereby transformed into matching node angle and length vectors within the triangle net; the gray-level information of the image is not used during the matching process. The experimental results show that the method achieves accurate matching results with a higher percentage of correct matches.
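A sketch of building the triangle net and the angle/length quantities the matching operates on, assuming SciPy's Delaunay triangulation; the descriptor normalization is an illustrative choice.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangle_descriptors(points):
    """Build the Delaunay net over detected corners (an (N, 2) array) and
    describe each triangle by its sorted inner angles and normalized side
    lengths."""
    tri = Delaunay(points)
    descs = []
    for simplex in tri.simplices:
        p = points[simplex]
        sides = np.array([np.linalg.norm(p[i] - p[(i + 1) % 3]) for i in range(3)])
        a, b, c = sides
        # law of cosines gives the three inner angles
        angles = np.arccos(np.clip([
            (b**2 + c**2 - a**2) / (2 * b * c),
            (a**2 + c**2 - b**2) / (2 * a * c),
            (a**2 + b**2 - c**2) / (2 * a * b)], -1.0, 1.0))
        descs.append(np.concatenate([np.sort(angles), np.sort(sides) / sides.max()]))
    return tri.simplices, np.array(descs)
```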
Ground vehicle tracking is an important component of Aerial Video Surveillance Systems (AVS), which have important military and civilian uses. This paper presents an image-alignment-based framework for ground vehicle tracking from an airborne platform, which can precisely track a designated vehicle and recover it when it reappears. We track a set of point features on the selected vehicle using image alignment. An edge-feature-based outlier rejection criterion, a Kalman filter and a reappearance verification procedure are used to make the proposed tracking system perform well under complicated conditions. Experimental results on real aerial images show that the proposed framework is effective and robust.
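The Kalman filter component might be set up as below, a constant-velocity model over image position; the state layout and noise levels are illustrative, not the paper's.

```python
import cv2
import numpy as np

def make_vehicle_kalman():
    """Constant-velocity Kalman filter over image position, used to smooth
    the track and coast through short occlusions."""
    kf = cv2.KalmanFilter(4, 2)                  # state (x, y, vx, vy), meas (x, y)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

# Per frame: predicted = kf.predict(); after a confirmed detection at (x, y):
# kf.correct(np.array([[x], [y]], dtype=np.float32))
```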