In this paper, we propose a multi-modal action segmentation approach that uses three modalities, (i) video, (ii) audio, and (iii) thermal, to classify cooking behavior in the kitchen. We assume that all three modalities carry features related to cooking. However, no public dataset contains these three modalities. We therefore built an original dataset with frame-level annotations and examined the usefulness of action segmentation with multi-modal features, analyzing the effect of each modality using three evaluation metrics. As a result, accuracy, edit distance, and F1 score improved by up to about 1%, 2%, and 8%, respectively, compared with using images alone.
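The abstract does not define its three metrics. As a minimal sketch, assuming they follow the frame-wise accuracy and segmental edit score commonly used in action segmentation, two of them could be computed as below. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def frame_accuracy(pred, truth):
    """Fraction of frames whose predicted label equals the ground truth."""
    assert len(pred) == len(truth) and truth, "sequences must be non-empty and aligned"
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def to_segments(labels):
    """Collapse a frame-level label sequence into its ordered segment labels."""
    segments = []
    for label in labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def edit_score(pred, truth):
    """Normalized Levenshtein similarity between segment sequences, in [0, 1]."""
    a, b = to_segments(pred), to_segments(truth)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b), 1)


truth = ["cut", "cut", "stir", "stir", "pour"]
pred = ["cut", "stir", "stir", "stir", "pour"]
print(frame_accuracy(pred, truth), edit_score(pred, truth))
```

The edit score ignores segment boundaries and penalizes only the order of predicted segments, which is why it can move independently of frame-wise accuracy.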
To be practical, a classification model for fungi must be robust to out-of-distribution data. Therefore, in this paper, we perform out-of-distribution detection on fungi. Unlike conventional out-of-distribution detection settings, the in-distribution and out-of-distribution data in this paper have very similar characteristics. We therefore point out that conventional methods that use out-of-distribution data for validation are not effective in this setting. We also verify whether the accuracy of out-of-distribution detection can be improved using the attention branch network.
Optical flow estimation with onboard cameras is an important task in automatic driving and advanced driver-assistance systems. However, conventional estimation is prone to error in high-contrast scenes and at high speeds. Event cameras offer properties such as high dynamic range and low latency that can overcome these problems. Event cameras report only the change in logarithmic intensity per pixel rather than the absolute brightness. An existing method estimates the optical flow simultaneously with luminance restoration from the event data, but its regularization using the L1 norm of the derivative is insufficient for spatially sparse event data. Therefore, we propose using the focus of expansion (FOE) to regularize optical flow estimation for event cameras. The FOE is defined as the intersection of the translation vector of the camera and the image plane. Excluding the rotational component, the optical flow radiates from the FOE. Using this property, the optical flow can be regularized in the correct direction during the optimization process. Using a public dataset, we demonstrated that our regularization improves the optical flow.
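The abstract does not give the form of the FOE regularizer. The sketch below illustrates only the radial property it relies on: for a purely translating camera, the flow at a pixel should point along the line from the FOE to that pixel, so the component perpendicular to that line can be penalized. The function names and the squared-perpendicular penalty are assumptions made for illustration, not the paper's formulation.

```python
import math


def radial_direction(x, y, foe):
    """Unit vector pointing from the FOE to pixel (x, y); zero at the FOE itself."""
    dx, dy = x - foe[0], y - foe[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)
    return (dx / norm, dy / norm)


def tangential_penalty(flow, x, y, foe):
    """Squared flow component perpendicular to the radial direction (assumed penalty)."""
    rx, ry = radial_direction(x, y, foe)
    u, v = flow
    # The 2D cross product gives the signed perpendicular component.
    perp = u * ry - v * rx
    return perp * perp


foe = (0.0, 0.0)
print(tangential_penalty((3.0, 4.0), 3.0, 4.0, foe))   # radial flow, near-zero penalty
print(tangential_penalty((-4.0, 3.0), 3.0, 4.0, foe))  # tangential flow, large penalty
```

A penalty of this kind leaves the flow magnitude free and constrains only its direction, which matches the idea of regularizing "in the correct direction".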
This paper presents a new approach to recognizing human actions around sleep, using the locations of human body parts and the positional relationship between the person and the sleeping environment. Body parts are estimated from depth images obtained by a time-of-flight (TOF) sensor using oriented 3D normal vectors. Action recognition in sleeping situations must work in darkness and cope with the body being hidden by duvets; extracting image features is therefore difficult, since color and edge features are obscured by covers. Thus, our method first estimates the positions of four body parts (head, torso, thigh, and lower leg) using a shape model of the body surface constructed from oriented 3D normal vectors. This shape model can represent the rough surface shape of the body and is effective for robust posture estimation of a body hidden under duvets. Next, an action descriptor is extracted from the position of each body part. The descriptor includes the temporal variation of each body part and the spatial vectors between the positions of the parts and the bed. Furthermore, this paper proposes hierarchical action classes and classifiers to improve the classification of indistinct actions. The classifiers are composed of two layers and recognize human actions using the action descriptor. The first layer focuses on the spatial descriptor and classifies actions roughly. The second layer focuses on the temporal descriptor and classifies actions finely. This approach achieves robust recognition of an obscured person by using posture information and hierarchical action recognition.
Building and road detection from aerial imagery has applications in a wide range of areas, including urban design, real-estate management, and disaster relief. Extracting buildings and roads from aerial imagery has been performed manually by human experts, making it a very costly and time-consuming process. Our goal is to develop a system that automatically detects buildings and roads directly from aerial imagery. Many approaches to automatic aerial imagery interpretation have been proposed in the remote sensing literature, but much of the early work uses local features to classify each pixel or segment into an object label, so these approaches need prior knowledge of object appearance or of the class-conditional distribution of pixel values. Furthermore, some works also need a segmentation step as pre-processing. Therefore, we use convolutional neural networks (CNNs) to learn a mapping from raw pixel values in aerial imagery to three object labels (buildings, roads, and others); in other words, we generate three-channel maps from raw aerial imagery input. We take a patch-based semantic segmentation approach: we first divide large aerial images into small patches and then train the CNN with those patches and the corresponding three-channel map patches. Finally, we evaluate our system on large-scale road and building detection datasets that are publicly available.
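The patch-division step can be sketched as follows. The abstract does not state the patch size, stride, or border handling, so this minimal example assumes non-overlapping square patches and drops any incomplete border; `split_into_patches` is a hypothetical helper name.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Returns a list of ((top, left), patch) tuples. Incomplete border
    patches are dropped (an assumption; padding is another option).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(((top, left), patch))
    return patches


# Toy 4x6 "image" whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
for origin, patch in split_into_patches(image, 2):
    print(origin, patch)
```

Applying the same function to the image and to its label map keeps input patches and target patches aligned, since both use identical origins.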
This paper presents an approach for real-time human activity recognition. Three different kinds of features (flow, shape, and a keypoint-based feature) are applied in activity recognition. We use random forests for feature integration and activity classification. A forest is created for each feature and acts as a weak classifier. The International Classification of Functioning, Disability and Health (ICF) proposed by the WHO is applied to set a novel definition of activity classes. Experiments on human activity recognition using the proposed framework show recognition accuracies of 99.2% (Weizmann action dataset), 95.5% (KTH human actions dataset), and 54.6% (UCF50 dataset) at real-time processing speed. The feature integration and activity-class definition allow us to achieve high recognition accuracy that matches the state of the art in real time.
KEYWORDS: 3D modeling, Data modeling, Databases, Optimization (mathematics), Principal component analysis, Matrices, 3D image processing, Model-based design, 3D metrology, Statistical analysis
We propose a model-based system for reconstructing 3D human shape from two silhouettes. First, we synthesize a deformable body model from a 3D human shape database consisting of a hundred whole-body mesh models. The mesh models are homologous: all share the same topology and the same number of vertices. We perform principal component analysis (PCA) on the database and synthesize an Active Shape Model (ASM). The ASM allows the body type of the model to be changed with a few parameters. The pose of our model can be changed by reconstructing the skeleton structure from joints implanted in the model. By applying pose changes after body type deformation, our model can represent various body types and any pose. We apply the model to the problem of 3D human shape reconstruction from front and side silhouettes. Our approach simply compares the contours of the model's silhouettes with those of the input silhouettes; we then use only the torso contour of the model to reconstruct the whole shape. We optimize the model parameters by minimizing the difference between corresponding silhouettes using CMA-ES, a stochastic, derivative-free non-linear optimization method.
This paper proposes a method that estimates the position of clouds from VIS (visible) and IR (infrared) images of the GMS (Geostationary Meteorological Satellite). Because the brightness of land and sea, although lower than that of cloud, varies continually with the altitude of the sun, cloud areas cannot be estimated by threshold processing alone. In this study, the variation characteristics of the brightness value are classified for each area, and a processing method is proposed for each area based on these characteristics. In land areas, the brightness values of the VIS and IR images are correlated when the area is not covered by cloud. Thus, cloud areas in the target region are estimated using this correlation. In sea areas, because the temperature is stable, cloud areas are estimated by a background subtraction method. The method was applied to and evaluated on 202 GMS-5 images. The evaluation results showed that the proposed method is more accurate than the previous method, which estimates cloud areas by threshold processing (Omi, 2003).
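For the sea-area step, background subtraction can be sketched as below. The abstract does not say how the background is built or how differences are thresholded, so this example assumes a per-pixel median over earlier frames and a fixed absolute-difference threshold; both choices, the function names, and the toy values are illustrative assumptions.

```python
from statistics import median


def build_background(frames):
    """Per-pixel median over a stack of equally sized 2D frames (assumed model)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(width)]
            for r in range(height)]


def cloud_mask(frame, background, threshold):
    """Mark pixels whose value differs from the background by more than threshold."""
    return [[abs(v - b) > threshold for v, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


# Three earlier 1x2 frames over sea; one pixel was briefly covered by cloud.
history = [[[10, 20]], [[12, 20]], [[11, 50]]]
background = build_background(history)
print(background)
print(cloud_mask([[11, 45]], background, threshold=5))
```

The median keeps an occasional cloudy observation in the history from contaminating the background, which is why it is used here instead of the mean.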
In recent years, NOAA images have provided very useful information about ecosystems, climate, weather, and water from all over the world. To use NOAA images, they must be transformed from the image coordinate system into the map coordinate system. This paper proposes a method that corrects the errors caused by this transformation. First, elevation values are read from the GTOPO30 database and examined to divide the data into flat and rough blocks. The elevation errors of all blocks are then calculated from the elevation values. After the elevation errors are corrected, residual errors are determined by GCP template matching. On flat blocks, residual errors are corrected by an affine transformation; on rough blocks, they are corrected by applying a radial basis function transformation to the residual errors of the blocks that match GCP templates. With this correction method, residual errors are corrected precisely and the errors of the interpolation process are reduced. The method was applied to NOAA images received in Tokyo, Bangkok, and Ulaanbaatar. The results showed that this is a highly accurate geometric correction method.
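The affine correction used on flat blocks can be illustrated by fitting an affine map to matched control points. The abstract does not say how the affine parameters are estimated; the sketch below assumes an ordinary least-squares fit via the normal equations, and all function names and coordinates are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("control points are collinear or too few")
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine map taking src (x, y) points to dst (x, y) points."""
    # Normal equations (A^T A) p = A^T t, where each row of A is [x, y, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    atx, aty = [0.0] * 3, [0.0] * 3
    for (x, y), (tx, ty) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atx[i] += row[i] * tx
            aty[i] += row[i] * ty
    return solve3(ata, atx), solve3(ata, aty)


def apply_affine(params, point):
    """Apply x' = a*x + b*y + c, y' = d*x + e*y + f."""
    (a, b, c), (d, e, f) = params
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


# Image positions of four control points and where they should land on the map.
src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(2, -1), (4, -1), (2, 1), (4, 1)]
params = fit_affine(src, dst)
print(apply_affine(params, (3, 3)))
```

At least three non-collinear control points are needed; with more, the fit averages out matching noise.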
KEYWORDS: Radar, Signal processing, 3D image processing, Extremely high frequency, 3D metrology, Sensors, 3D acquisition, Distance measurement, 3D displays, Signal detection
In recent years, crisis management in response to terrorist attacks and natural disasters, as well as the acceleration of rescue operations, has become an important issue. We aim to build a support system for firefighters by applying engineering techniques such as information technology and radar technology. In rescue operations, one of the biggest problems is that firefighters' view is obstructed by dense smoke. One current countermeasure is the use of search sticks, much as a blind person uses a cane when walking in town. The most important task for firefighters is to understand the situation inside a space filled with dense smoke. Therefore, our system supports firefighters' activity by visualizing such a space. First, we scan the target space using a millimeter-wave radar combined with a gyro sensor. This yields scan data in multiple directions, from which we construct a 3D map of the high-reflection points using 3D image processing techniques (3D grouping and labeling). In this paper, we introduce our system and report the results of experiments in a real smoke-filled space, along with practical achievements.
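The "3D grouping and labeling" step is not detailed in the abstract. One common way to group points is connected-component labeling on a voxel grid, sketched below under the assumptions that high-reflection points have already been quantized to integer voxel coordinates and that 26-neighbor connectivity is used; `label_voxels` is a hypothetical name.

```python
from collections import deque
from itertools import product


def label_voxels(occupied):
    """Assign a component label to each occupied voxel.

    occupied: set of (x, y, z) integer voxel coordinates.
    Returns a dict mapping each voxel to a label id (0, 1, 2, ...).
    Voxels are connected if they touch by face, edge, or corner.
    """
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    labels = {}
    next_label = 0
    for seed in sorted(occupied):
        if seed in labels:
            continue
        # Breadth-first flood fill from an unlabeled voxel.
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in occupied and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels


points = {(0, 0, 0), (1, 0, 0), (1, 1, 1), (5, 5, 5)}
print(label_voxels(points))
```

Each resulting label corresponds to one spatially contiguous group of reflections, which could then be rendered as a separate object in the 3D map.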
KEYWORDS: 3D modeling, Teeth, 3D image processing, Data modeling, 3D displays, Image display, 3D acquisition, Surgery, Image segmentation, Computed tomography
In orthognathic surgery, 3D surgical planning that considers the balance between the front and back positions and the symmetry of the jawbone, as well as the dental occlusion, is essential. In this study, we developed a support system for orthodontic surgery that visualizes changes in the mandible and the occlusal condition and determines the optimum position in mandibular osteotomy. The system integrates an operating portion, in which the optimum occlusal position is determined by manipulating a physical (entity) tooth model, with a 3D image display portion that simultaneously shows 3D-CT skeletal images in real time. This makes it possible to determine a mandibular position and posture that takes into account improvements in both skeletal morphology and occlusal condition. The realistic operation of the entity model combined with the virtual 3D image display enabled the construction of a surgical simulation system that involves augmented reality.
The aim of this study is to develop a system that recognizes both the macro- and microscopic configurations of nerve cells and automatically performs the necessary 3-D measurements and functional classification of spines.
The acquisition of 3-D images of cranial nerves has been enabled by the use of a confocal laser scanning microscope, but highly accurate 3-D measurement of the microscopic structures of cranial nerves and their classification based on their configurations have not yet been accomplished. In this study, to obtain highly accurate measurements of these microscopic structures, the positions where spines exist were first predicted by 2-D image processing of tomographic images. Next, based on the positions predicted on the 2-D images, the positions and configurations of the spines were determined more accurately by 3-D image processing of the volume data. We report the successful construction of an automatic analysis system that uses a coarse-to-fine technique, combining 2-D and 3-D image analyses, to analyze the microscopic structures of cranial nerves with high speed and accuracy.
In recent years, crisis management in response to terrorist attacks and natural disasters, as well as the acceleration of rescue operations, has become an important issue. Rescue operations greatly influence human lives and require the ability to communicate information accurately and swiftly as well as to assess the status of the site. Currently, a considerable amount of research is being conducted on assisting rescue operations by applying engineering techniques such as information technology and radar technology.
In the present research, we believe that assessing the status of the site is most crucial in rescue and firefighting operations at a fire disaster site, and aim to visualize the space that is smothered with dense smoke. In a space filled with dense smoke, where visual or infrared sensing techniques are not feasible, three-dimensional measurements can be realized using a compact millimeter wave radar device combined with directional information from a gyro sensor. Using these techniques, we construct a system that can build and visualize a three-dimensional geometric model of the space. The final objective is to implement such a system on a wearable computer, which will improve the firefighters' spatial perception, assisting them in the baseline assessment and the decision-making process. In the present paper, we report the results of the basic experiments on three-dimensional measurement and visualization of a space that is smoke free, using a millimeter wave radar.
A person with an asymmetric maxillofacial skeleton morphology reportedly has asymmetric jaw function and a high risk of developing temporomandibular disorder. A comprehensive analysis from the viewpoints of both morphology and function, covering maxillofacial and temporomandibular joint morphology, dental occlusion, and features of mandibular movement pathways, is essential.
In this study, a 4D jaw movement visualization system was developed to visually understand a patient's characteristic jaw movement, 3D maxillofacial skeleton structure, and the alignment of the upper and lower teeth. For this purpose, 3D reconstructed images of the cranial and mandibular bones obtained by computed tomography and morphological images of a teeth model measured using a non-contact 3D measuring device were integrated and animated with 6-DOF jaw movement data. The system was experimentally applied to a patient with jaw deformity, and its usability as a clinical diagnostic support system was verified.
Content-based scene indexing is an important technique for effective handling of video content, such as scene retrieval and editing. The standard multimedia content descriptor (MPEG-7) has been proposed for key scene indexing. For automatic scene indexing, audio-visual features are the most important clues, and many methods have been proposed for effective scene indexing based on those features. In this paper, we propose an automatic key scene detection method for baseball video content using video features. We regard pitching scenes as key scenes because they are the starting points of all baseball play scenes. Once pitching scenes are detected, they can serve as effective hints for detecting other scenes. In addition, a pitching scene digest video can easily be edited by gathering the automatically extracted scenes, and such a digest can be useful data for pitching analysis. We extract pitching scenes using color, domain, and motion templates created from manually selected pitching scene samples. These templates contain image features unique to pitching scenes. Template matching is applied to the video stream, so that target scenes can be detected by judging the calculated matching rate. We experimentally test our method on actual baseball video content. The results can be useful for pitching analysis and for editing digest news broadcasts. We are developing a video indexing support system in which users can give text annotations to indexed scenes using MPEG-7 format descriptors.
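Detection by matching rate can be sketched as follows. The abstract does not define the matching rate or the decision rule, so this example assumes the rate is the fraction of pixels whose value lies within a tolerance of the template and that a frame is detected when the rate reaches a minimum. The function names, tolerance, and threshold are illustrative assumptions.

```python
def matching_rate(frame, template, tolerance):
    """Fraction of pixels whose value is within tolerance of the template."""
    total = 0
    matched = 0
    for frow, trow in zip(frame, template):
        for f, t in zip(frow, trow):
            total += 1
            if abs(f - t) <= tolerance:
                matched += 1
    return matched / total


def detect_scenes(frames, template, tolerance, min_rate):
    """Indices of frames whose matching rate reaches min_rate."""
    return [i for i, frame in enumerate(frames)
            if matching_rate(frame, template, tolerance) >= min_rate]


template = [[100, 100], [30, 30]]          # toy template: bright top, dark bottom
frames = [
    [[102, 98], [31, 29]],                 # close to the template
    [[10, 200], [90, 90]],                 # unrelated scene
    [[100, 100], [30, 120]],               # three of four pixels match
]
print(detect_scenes(frames, template, tolerance=5, min_rate=0.75))
```

In practice one rate would be computed per template type (color, domain, motion) and combined, but how the paper combines them is not stated.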