22 May 2017 Automatic similarity detection and clustering of data
Craig Einstein, Peter Chin
Abstract
An algorithm was created that identifies the number of unique clusters in a dataset and assigns each datum to a cluster. A cluster is defined as a group of data that share similar characteristics. The data are input as vectors, and similarity is measured using the dot product between two vectors. Unlike other clustering algorithms such as K-means, the algorithm requires no prior knowledge of the number of clusters, which allows for an unbiased analysis of the data. The automatic cluster detection (ACD) algorithm is executed in two phases: an averaging phase and a clustering phase. In the averaging phase, the number of unique clusters is detected. In the clustering phase, data are matched to the cluster to which they are most similar. The ACD algorithm takes a matrix of vectors as input and outputs a 2D array of the clustered data. Each index of the output corresponds to a cluster, and the elements in each cluster correspond to the positions of the data in the dataset. Clusters are vectors in N-dimensional space, where N is the length of the input vectors that make up the matrix. The algorithm is distributed, which increases computational efficiency.
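The two phases described in the abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation. The abstract does not specify how the averaging phase detects clusters, so the sketch assumes a greedy scheme: each vector is folded into the first running average whose normalized dot product with it is at or above a hypothetical `threshold`, and otherwise it starts a new cluster. The unit-length normalization, the threshold value, and all function names are assumptions, and the distributed execution is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumption: similarity is the normalized dot product)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def averaging_phase(data, threshold):
    """Assumed averaging phase: detect clusters by merging similar vectors into running sums."""
    sums = []  # one accumulated sum of unit vectors per detected cluster
    for v in data:
        u = normalize(v)
        for s in sums:
            if dot(u, normalize(s)) >= threshold:
                for i, x in enumerate(u):
                    s[i] += x  # fold the vector into this cluster's running sum
                break
        else:
            sums.append(list(u))  # no similar cluster found: start a new one
    # Each cluster is represented by a unit vector in N-dimensional space.
    return [normalize(s) for s in sums]


def clustering_phase(data, centers):
    """Match each datum to the cluster it is most similar to; store its position in the dataset."""
    clusters = [[] for _ in centers]
    for idx, v in enumerate(data):
        u = normalize(v)
        best = max(range(len(centers)), key=lambda k: dot(u, centers[k]))
        clusters[best].append(idx)
    return clusters


def acd_sketch(data, threshold=0.9):
    """Run both phases; returns a 2D list indexed by cluster, holding dataset positions."""
    centers = averaging_phase(data, threshold)
    return clustering_phase(data, centers)


if __name__ == "__main__":
    data = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]
    print(acd_sketch(data))  # prints [[0, 1, 4], [2, 3]]
```

In this example the number of clusters (two) is never supplied; it emerges from the assumed threshold. A greedy pass of this kind depends on the order of the input and on the threshold chosen, so it should be read only as one plausible reading of the abstract.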
© (2017) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Craig Einstein and Peter Chin "Automatic similarity detection and clustering of data", Proc. SPIE 10185, Cyber Sensing 2017, 101850K (22 May 2017); https://doi.org/10.1117/12.2267844
KEYWORDS
Associative arrays
Detection and tracking algorithms
Binary data
Data processing
Web 2.0 technologies
Phase measurement
Classification systems