The choice of histogram distance metric for a content-based image retrieval (CBIR) query strongly affects the accuracy of the retrieval results. This paper compares the retrieval results of a variety of commonly used CBIR distance metrics: the Euclidean distance, the Manhattan distance, the vector cosine angle distance, the histogram intersection distance, the χ² distance, the Jensen-Shannon divergence, and the Earth Mover's distance. A training set of ground-truth labeled images is used to build a classifier for the CBIR system, where the images were obtained from three commonly used benchmarking datasets: the WANG dataset (http://savvash.blogspot.com/2008/12/benchmark-databases-for-cbir.html), the Corel Subset dataset (http://vision.stanford.edu/resources_links.html), and the CalTech dataset (http://www.vision.caltech.edu/htmlfiles/). To implement the CBIR system, we use the Tamura texture features of coarseness, contrast, and directionality. We create texture histograms of the training set and the query images, and then measure the difference between a randomly selected query and the corresponding retrieved image using a k-nearest-neighbors approach. Precision and recall are used to evaluate the retrieval performance of the system, given a particular distance metric. Then, given the same query image, the distance metric is changed and the performance of the system is evaluated once again.
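As an illustration, the distances compared in this abstract can all be computed over normalized texture histograms as in the following Python sketch; the function names and the use of NumPy/SciPy are our own assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D Earth Mover's distance

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def cosine_angle(p, q):
    # Vector cosine angle distance: 1 - cos(theta)
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def histogram_intersection(p, q):
    # Intersection is a similarity in [0, 1] for normalized histograms;
    # convert it to a distance.
    return 1.0 - np.sum(np.minimum(p, q))

def chi_square(p, q, eps=1e-12):
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def jensen_shannon(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: two normalized 8-bin texture histograms
p = np.array([0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.05, 0.05])
q = np.array([0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05])
bins = np.arange(len(p))
print(euclidean(p, q), chi_square(p, q), jensen_shannon(p, q))
print(wasserstein_distance(bins, bins, p, q))  # EMD over the 1-D bins
```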
This research project describes an agglomerative image clustering technique used to automate image categorization. The system is implemented in two stages: feature vector formation and feature space clustering. The features we selected are based on texture salience (Gabor filters and a binary pattern descriptor). Global properties are encoded via a hierarchical spatial pyramid, while local structure is encoded as a bit string and retained in a set of histograms. The transform can be computed efficiently: it involves only 16 operations (8 comparisons and 8 additions) per 3x3 region. A disadvantage is that it is not invariant to rotation or scale changes; however, the spatial pyramid representing global structure helps to ameliorate this problem. An agglomerative clustering technique is implemented and evaluated against ground-truth values and a human subjective rating.
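The 16-operation count matches an LBP/census-style transform, so a minimal sketch under that assumption might look as follows; the exact bit ordering and comparison direction used by the project are not specified in the abstract.

```python
import numpy as np

def binary_pattern(image):
    """Census-style binary pattern: for each 3x3 region, compare the 8
    neighbors to the center pixel (8 comparisons) and accumulate the
    resulting bits into one code (8 additions)."""
    img = image.astype(np.int32)
    c = img[1:-1, 1:-1]                      # center pixels
    code = np.zeros_like(c)
    # Offsets of the 8 neighbors around the center, clockwise
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy : img.shape[0] - 1 + dy,
                1 + dx : img.shape[1] - 1 + dx]
        code += (n >= c).astype(np.int32) << bit  # one comparison, one addition
    return code.astype(np.uint8)

# A histogram of the codes serves as the local-texture feature
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
codes = binary_pattern(img)
hist, _ = np.histogram(codes, bins=256, range=(0, 256), density=True)
```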
An image processing pipeline is presented that applies principles from the computer graphics technique of deferred shading to composite rendered objects into a live scene viewed by a Kinect. Issues involving the presentation of the Kinect's output are addressed, and algorithms for improving the believability and aesthetic matching of the rendered scene against the real scene are proposed. An implementation of this pipeline using GLSL shaders, running at interactive framerates, is given. Experimental results are provided that suggest the approaches evaluated here can be applied to improve other implementations.
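The actual system runs as GLSL shaders; the following NumPy sketch only illustrates the core per-pixel depth-compositing step that such a pipeline performs, with all buffer names assumed.

```python
import numpy as np

def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel depth test, as in a deferred-shading composite: a
    rendered (virtual) fragment wins wherever it is closer to the
    camera than the Kinect-measured surface."""
    virt_wins = virt_depth < real_depth            # depth comparison
    out = real_rgb.copy()
    out[virt_wins] = virt_rgb[virt_wins]           # overwrite winning pixels
    return out

# Toy 4x4 buffers: the virtual object is nearer only in the top-left corner
real_rgb = np.zeros((4, 4, 3), np.uint8); real_depth = np.full((4, 4), 2.0)
virt_rgb = np.full((4, 4, 3), 255, np.uint8); virt_depth = np.full((4, 4), 3.0)
virt_depth[:2, :2] = 1.0
print(composite(real_rgb, real_depth, virt_rgb, virt_depth)[..., 0])
```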
Skeleton estimation from single-camera grayscale images is generally accomplished using model-based
techniques. Multiple cameras are sometimes used; however, skeletal points extracted from a single subject using
multiple images are usually too sparse to be helpful for localizing body parts. For this project, we use a single viewpoint
without any model-based assumptions to identify a central source of motion, the body, and its associated extremities.
Harris points are tracked using Lucas-Kanade refinement with a weighted kernel found from expectation maximization.
The algorithm tracks key image points and trajectories and re-represents them as complex vectors describing the motion
of a specific body part. Normalized correlation is calculated from these vectors to form a matrix of graph edge weights,
which is subsequently partitioned using a graph-cut algorithm to identify dependent trajectories. The resulting Harris
points are clustered into rigid component centroids using mean shift, and the extremity centroids are connected to their
nearest body centroid to complete the body-part estimation. We collected ground-truth body-part labels from seven
participants and compared them to the clusters produced by our algorithm.
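A sketch of the tracking and trajectory-affinity stages, under stated assumptions, follows; OpenCV defaults stand in for the EM-weighted kernel, and the graph-cut partitioning and mean-shift clustering steps are omitted.

```python
import cv2
import numpy as np

def track_trajectories(frames, n_points=200):
    """Detect Harris corners in the first grayscale frame and track them
    with pyramidal Lucas-Kanade across the sequence."""
    p0 = cv2.goodFeaturesToTrack(frames[0], maxCorners=n_points,
                                 qualityLevel=0.01, minDistance=7,
                                 useHarrisDetector=True)
    tracks = [p0.reshape(-1, 2)]
    for prev, curr in zip(frames, frames[1:]):
        prev_pts = tracks[-1].reshape(-1, 1, 2).astype(np.float32)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, prev_pts, None)
        tracks.append(p1.reshape(-1, 2))
    return np.stack(tracks)                     # (n_frames, n_points, 2)

def trajectory_affinity(tracks):
    """Re-represent each trajectory's frame-to-frame motion as a complex
    vector and compute normalized correlations between all pairs, giving
    the matrix of graph edge weights to be partitioned by a graph cut."""
    motion = np.diff(tracks, axis=0)            # per-frame displacements
    z = motion[..., 0] + 1j * motion[..., 1]    # complex motion vectors
    z = z - z.mean(axis=0, keepdims=True)
    z = z / (np.linalg.norm(z, axis=0, keepdims=True) + 1e-12)
    return np.abs(z.conj().T @ z)               # edge weights in [0, 1]
```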
Current face identification systems are not robust enough to accurately identify the same individual in different images
with changes in head pose, facial expression, occlusion, hair length, illumination, aging, etc. This is especially a
problem for facial images captured using low-resolution video cameras or webcams. This paper introduces a new
technique for facial identification in low-resolution images that combines facial structure with skin texture to
accommodate changes in lighting and head pose. Experiments show that combining facial structure features with skin texture features yields a facial identification system for low-resolution images that is more robust to pose and illumination conditions than either technique used alone.
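The abstract does not specify how the two cues are combined; one plausible reading is a score-level fusion, sketched below with entirely hypothetical names and weights.

```python
import numpy as np

def fused_distance(struct_a, struct_b, tex_a, tex_b, alpha=0.5):
    """Hypothetical score-level fusion of a facial-structure distance
    (e.g., between landmark-geometry vectors) and a skin-texture distance
    (e.g., chi-square between texture histograms); alpha weights the cues."""
    d_struct = np.linalg.norm(struct_a - struct_b)          # structure cue
    d_tex = 0.5 * np.sum((tex_a - tex_b) ** 2 /
                         (tex_a + tex_b + 1e-12))           # texture cue
    return alpha * d_struct + (1.0 - alpha) * d_tex
```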
Wide area airborne surveillance (WAAS) systems are a new class of remote sensing imagers which have many
military and civilian applications. These systems are characterized by long loiter times (extended imaging
time over fixed target areas) and large footprint target areas. These characteristics complicate moving object
detection and tracking due to the large image size and high number of moving objects. This research evaluates
existing object detection and tracking algorithms with WAAS data and provides enhancements to the processing
chain which decrease processing time and maintain or increase tracking accuracy. Decreases in processing time
are needed to perform real-time or near real-time tracking either on the WAAS sensor platform or in ground
station processing centers. Increased tracking accuracy benefits real-time users and forensic (off-line) users.
This research introduces a mode-specific model of visual saliency that can be used to highlight likely lesion locations
and potential errors (false positives and false negatives) in single-mode PET and MRI images and multi-modal fused
PET/MRI images. Fused-modality digital images are a relatively recent technological improvement in medical imaging;
therefore, a novel component of this research is to characterize the perceptual response to these fused images. Three
different fusion techniques were compared to single-mode displays in terms of observer error rates using synthetic
human brain images generated from an anthropomorphic phantom. An eye-tracking experiment was performed with
naïve (non-radiologist) observers who viewed the single- and multi-modal images. The eye-tracking data allowed the
errors to be classified into four categories: false positives, search errors (false negatives never fixated), recognition errors
(false negatives fixated for less than 350 milliseconds), and decision errors (false negatives fixated for more than 350
milliseconds). A saliency model consisting of a set of differentially weighted low-level feature maps is derived from the
known error and ground truth locations extracted from a subset of the test images for each modality. The saliency model
shows that lesion and error locations attract visual attention according to low-level image features such as color,
luminance, and texture.
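A minimal sketch of such a differentially weighted feature-map combination follows; the normalization scheme and example weights are assumptions, since the actual weights are learned per modality from the error and ground-truth locations.

```python
import numpy as np

def saliency_map(feature_maps, weights):
    """Combine normalized low-level feature maps (e.g., color, luminance,
    texture) into one saliency map using per-modality weights."""
    norm = [(m - m.min()) / (m.max() - m.min() + 1e-12) for m in feature_maps]
    s = sum(w * m for w, m in zip(weights, norm))
    return s / (s.max() + 1e-12)

# Illustrative maps and weights (hypothetical values)
color, luminance, texture = (np.random.rand(128, 128) for _ in range(3))
s = saliency_map([color, luminance, texture], weights=[0.5, 0.3, 0.2])
```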
The objective of the character recognition effort for the Archimedes Palimpsest is to provide a tool that allows scholars of ancient Greek mathematics to retrieve as much information as possible from the remaining degraded text. With this in mind, the current pattern recognition system does not output a single classification decision, as in typical target detection problems, but has been designed to provide intermediate results that allow the user to apply his or her own decisions (or evidence) to arrive at a conclusion. To achieve this result, a probabilistic network has been incorporated into our previous recognition system, which was based primarily on spatial correlation techniques. This paper reports on the revised tool and its recent success in the transcription process.
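The abstract names spatial correlation as the basis of the earlier recognizer; a minimal sketch of that stage, using OpenCV as an illustrative stand-in rather than the project's actual code, might look like this.

```python
import cv2

def character_scores(page, templates):
    """Spatial-correlation stage: slide each character template over the
    page image and keep the full normalized cross-correlation surface,
    which can feed a probabilistic network instead of a hard decision."""
    return {label: cv2.matchTemplate(page, tmpl, cv2.TM_CCOEFF_NORMED)
            for label, tmpl in templates.items()}  # values in [-1, 1]
```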
Eye movements are an external manifestation of selective attention and can play an important role in indicating which attributes of a scene carry the most pertinent information. Models that predict gaze
distribution often define a local conspicuity value that relies on low-level image features to indicate the perceived salience of an image region. While such bottom-up models have some success in predicting fixation densities in simple 2D images, success with natural scenes requires an understanding of the goals
of the observer, including the perceived usefulness of an object in the context of an explicit or implicit task. In the present study, observers viewed natural images while their eye movements were recorded. Eye movement patterns revealed that subjects preferentially fixated objects relevant for potential actions
implied by the gist of the scene, rather than selecting targets based purely on image features. A proto-object map is constructed from highly textured regions of the image that predict the locations of potential objects. This map is used as a mask to inhibit unimportant low-level features and enhance the
important ones, constraining the regions of potential interest. The resulting importance map correlates well with subject fixations on natural-task images.
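A sketch of the masking step follows; the texture measure (local standard deviation), window size, and threshold rule are our assumptions, not the study's stated parameters.

```python
import numpy as np
from scipy import ndimage

def importance_map(image_gray, feature_map, window=9):
    """Gate a low-level feature map with a proto-object mask derived from
    highly textured regions (local standard deviation of intensity)."""
    img = image_gray.astype(float)
    mean = ndimage.uniform_filter(img, size=window)
    mean_sq = ndimage.uniform_filter(img ** 2, size=window)
    texture = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    mask = texture > texture.mean()      # assumed threshold rule
    return feature_map * mask            # inhibit features outside proto-objects
```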
Visual perception, operating below conscious awareness, effortlessly provides the experience of a rich representation of the environment, continuous in space and time. Conscious visual perception is made possible by the 'foveal compromise,' the combination of the high-acuity fovea and a sophisticated suite of eye movements. Our illusory visual experience cannot be understood by introspection, but monitoring eye movements lets us probe the processes of visual perception. Four tasks representing a wide range of complexity were used to explore visual perception: image quality judgments, map reading, model building, and hand washing. Very short fixation durations were observed in all tasks, some as short as 33 msec. While some tasks showed little variation in eye movement metrics, differences in eye movement patterns and high-level strategies were observed in the model building and hand washing tasks. Performance in the hand washing task revealed a new type of eye movement: 'planful' eye movements were made to objects well in advance of a subject's interaction with those objects. Often occurring in the middle of another task, they provide 'overlapping' temporal information about the environment, offering a mechanism that helps produce our conscious visual experience.