Clustering method: description, basic concepts, application features

Clustering is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups. It is a core task of data mining and a common technique of statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, data compression, and computer graphics.

Optimization task


Clustering itself is not one specific algorithm but a general problem to be solved. It can be tackled by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances between members, dense regions of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.

The appropriate algorithm and parameter settings (including the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result has the desired properties.
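As a rough illustration of this iterative tuning (a minimal sketch assuming scikit-learn is available; the toy data and parameter values are invented for demonstration), the same algorithm can produce quite different groupings before and after a preprocessing step such as feature scaling:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# synthetic data whose two features live on very different scales
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X[:, 1] *= 100  # exaggerate the scale of the second feature

# same algorithm, same k, with and without a preprocessing step
raw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

# the cluster sizes generally disagree, illustrating why preprocessing
# and parameter choices have to be revisited iteratively
print("sizes without scaling:", np.bincount(raw_labels))
print("sizes with scaling:   ", np.bincount(scaled_labels))
```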

In addition to the term “clustering,” there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology, and typological analysis. The subtle differences often lie in how the results are used: in data mining the resulting groups themselves are of interest, while in automatic classification it is the resulting discriminative power that matters.

Cluster analysis originated in anthropology with the work of Driver and Kroeber in 1932 and was introduced into psychology by Zubin in 1938 and Robert Tryon in 1939. It has been used by Cattell since 1943 for trait classification in personality theory.

Term


The concept of a “cluster” cannot be precisely defined, which is one of the reasons there are so many clustering methods. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these models different algorithms can be given. The clusters found by different algorithms vary significantly in their properties.

Understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include the following (a short sketch after the list illustrates a few of them):

  • Centroid models: for example, k-means clustering represents each cluster by a single mean vector.
  • Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
  • Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
  • Density models: for example, DBSCAN (density-based spatial clustering of applications with noise) and OPTICS (ordering points to identify the clustering structure) define clusters as connected dense regions in the data space.
  • Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and the relevant attributes.
  • Group models: some algorithms do not provide a refined model for their results and simply deliver a grouping of the information.
  • Graph-based models: a clique, that is, a subset of nodes such that every two nodes are connected by an edge, can be considered a prototypical form of a cluster. Relaxations of this complete-connectivity requirement are known as quasi-cliques, as used in the HCS clustering algorithm.
  • Neural models: the best-known unsupervised neural network is the self-organizing map, and these models can usually be characterized as similar to one or more of the above models; they include subspace models when the network implements a form of principal or independent component analysis.
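As a hedged illustration of how these different cluster models can give different answers on the same data (a sketch assuming scikit-learn; the toy data and parameter values are invented for demonstration):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# toy data: three blobs of unequal spread
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0],
                  random_state=42)

# centroid model: each cluster is represented by one mean vector
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# connectivity model: clusters built from distance connectivity
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# distribution model: clusters as multivariate normal components (EM)
gm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# density model: clusters as connected dense regions, -1 marks noise
db_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

for name, labels in [("k-means", km_labels), ("hierarchical", hc_labels),
                     ("Gaussian mixture", gm_labels), ("DBSCAN", db_labels)]:
    print(name, "found", len(set(labels) - {-1}), "clusters")
```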

A “clustering” is essentially the set of such clusters, usually containing all the objects in the data set. In addition, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:

  • Hard clustering: each object either belongs to a cluster or does not.
  • Soft (fuzzy) clustering: each object belongs to each cluster to a certain degree, as in fuzzy c-means (a minimal sketch of this idea follows right after this list).
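The following is a minimal, illustrative implementation of the fuzzy c-means idea in NumPy (the function name, parameters, and toy points are invented for demonstration and make no claim to match any particular library):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centers and a membership matrix
    whose rows sum to 1 (each object belongs to every cluster to some degree)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial memberships, normalized so each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        # distances of every object to every center (epsilon avoids division by 0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))                   # closer centers get more weight
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# tiny usage example on made-up 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))  # soft memberships rather than a hard assignment
```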

Finer distinctions are also possible, for example:

  • Strict partitioning clustering: each object belongs to exactly one cluster.
  • Strict partitioning clustering with outliers: objects may also belong to no cluster, in which case they are considered outliers.
  • Overlapping clustering (also alternative clustering or multi-view clustering): objects may belong to more than one cluster, usually involving hard clusters.
  • Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster.
  • Subspace clustering: while similar to overlapping clustering, within a uniquely defined subspace the clusters are not expected to overlap.

Algorithms


As listed above, clustering algorithms can be categorized based on their cluster model. The following overview lists only the most prominent examples. Since there are possibly over 100 published clustering algorithms, and not all of them provide models for their clusters, not all can be easily categorized.

There is no objectively “correct” clustering algorithm; as noted above, clustering is in the eye of the beholder. The most appropriate algorithm for a particular problem often has to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm designed for one kind of model will generally fail on a data set that contains a radically different kind of structure. For example, k-means cannot find non-convex clusters.
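A hedged sketch of that last point (assuming scikit-learn; the two-moons data set and parameter values are chosen purely for illustration): a centroid model such as k-means splits the non-convex moons incorrectly, whereas a density model can recover them.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two interleaved half-moons: non-convex clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# agreement with the generating labels (1.0 = perfect recovery)
print("k-means:", round(adjusted_rand_score(y_true, km), 2))
print("DBSCAN: ", round(adjusted_rand_score(y_true, db), 2))
```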

Connectivity-based clustering


Connectivity-based clustering is also known as hierarchical clustering. It is based on the core idea that objects are more related to nearby objects than to objects farther away. These algorithms connect objects to form clusters based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented with a dendrogram; this explains where the common name “hierarchical clustering” comes from. These algorithms do not provide a single partitioning of the data set, but instead an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which clusters merge, while the objects are placed along the x-axis so that the clusters do not mix.

Connectivity-based clustering is a whole family of methods that differ in how distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on a linkage criterion: since a cluster consists of several objects, there are multiple candidates for computing the distance between clusters. Popular choices are single-linkage clustering (the minimum of object distances), complete-linkage clustering (the maximum of object distances), and UPGMA or WPGMA (unweighted or weighted pair group method with arithmetic mean, also known as average-linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).
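A minimal sketch of agglomerative clustering with different linkage criteria (assuming SciPy is available; the toy points and the cut level are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# small made-up 2-D data set: two loose groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.1, 4.2], [3.9, 4.1]])

# build the hierarchy with two different linkage criteria
Z_single = linkage(X, method="single")    # minimum pairwise distance
Z_average = linkage(X, method="average")  # UPGMA-style average distance

# cut the average-linkage tree into two flat clusters
labels = fcluster(Z_average, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z_average) would draw the merge
# hierarchy with merge distances on the y-axis, as described above
```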

Distribution-based clustering


These models are most closely related to statistics, being based on distribution models. Clusters can simply be defined as objects most likely belonging to the same distribution. A convenient property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.
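As a small illustration of that resemblance (a sketch using NumPy; the means, covariances, and sample sizes are invented), an artificial data set can be generated by sampling from two normal distributions and then handed to any clustering algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# two made-up Gaussian components with different means and spreads
cluster_a = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=200)
cluster_b = rng.multivariate_normal([5, 5], [[0.5, 0.0], [0.0, 0.5]], size=100)

# the "ground truth" here is which distribution each point was drawn from
X = np.vstack([cluster_a, cluster_b])
y_true = np.array([0] * 200 + [1] * 100)
print(X.shape, y_true.shape)
```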

While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are placed on the model complexity. A more complex model will usually explain the data better, which makes choosing the appropriate model complexity inherently difficult.

Gaussian mixture model

One prominent method is the Gaussian mixture model, fitted with the expectation-maximization algorithm. Here, the data set is usually modeled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. This converges to a local optimum, so multiple runs may produce different results. To obtain a hard clustering, objects are often assigned to the Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.
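A hedged sketch of this (assuming scikit-learn; the synthetic data and the number of components echo the sampling example above and are illustrative only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# artificial data drawn from two made-up Gaussians, as above
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=200),
    rng.multivariate_normal([5, 5], [[0.5, 0.0], [0.0, 0.5]], size=100),
])

# fit a two-component mixture by expectation-maximization; different
# random initializations may converge to different local optima
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignment: most likely component
soft_labels = gmm.predict_proba(X)  # soft assignment: membership probabilities
print(hard_labels[:5], soft_labels[:5].round(3))
```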

Distribution-based clustering produces complex models that can capture correlation and dependence between attributes. However, these algorithms place an extra burden on the user: for many real data sets there may be no concisely defined mathematical model (for example, assuming Gaussian distributions is a rather strong assumption about the data).

Density-based clustering


In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise and border points.

The most popular density-based clustering method is DBSCAN (density-based spatial clustering of applications with noise). In contrast to many newer methods, it features a well-defined cluster model called “density reachability.” Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds; however, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within a given radius. A cluster consists of all density-connected objects (which can form a cluster of arbitrary shape, in contrast to many other methods) plus all objects that are within the range of these objects.
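A minimal sketch of these two parameters, the radius and the minimum number of neighbors (assuming scikit-learn; the toy data and the values of eps and min_samples are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# two dense blobs plus a few scattered points that should end up as noise
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.4, random_state=3)
X = np.vstack([X, [[10, 10], [-10, 8], [9, -9]]])

# eps is the radius, min_samples the minimum number of neighbors
# required for a point to count as a core (dense) point
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                 # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
print("core points:", len(db.core_sample_indices_))
```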

Another interesting property of DBSCAN is that its complexity is fairly low (it requires a linear number of range queries on the database) and that it discovers essentially the same results in each run (it is deterministic for core and noise points, but not for border points), so there is no need to run it multiple times.

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. On data sets with, for example, overlapping Gaussian distributions, a common use case for artificial data, the cluster borders produced by these algorithms often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are almost always outperformed by methods such as EM clustering, which are able to precisely model this kind of data.

Mean shift is a clustering approach in which each object is moved to the densest area in its vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these “density attractors” can serve as representatives for the data set, but mean shift can detect arbitrarily shaped clusters similar to DBSCAN. Due to the expensive iterative procedure and density estimation, mean shift is usually slower than DBSCAN or k-means. Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.
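A hedged sketch (assuming scikit-learn; the bandwidth quantile and toy data are illustrative): the kernel bandwidth plays the role of the neighborhood size within which points are shifted toward denser regions.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)

# the bandwidth controls the kernel density estimate; here it is
# estimated from the data rather than set by hand
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print("estimated clusters:", len(ms.cluster_centers_))
print("density attractors (cluster centers):")
print(ms.cluster_centers_.round(2))
```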

Evaluation


Evaluating clustering results is as difficult as the clustering itself. Popular approaches involve “internal” evaluation (where the clustering is summarized as a single quality score), “external” evaluation (where the clustering is compared to an existing “ground truth” classification), manual evaluation by a human expert, and indirect evaluation by assessing the utility of the clustering in its intended application.

Internal evaluation measures suffer from the problem that they represent functions which can themselves be viewed as clustering objectives. For example, one could cluster the data by the Silhouette coefficient, except that no efficient algorithm is known for doing so. By using such an internal measure for evaluation, one rather compares how similar the optimization problems are.
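A minimal sketch of internal evaluation with the Silhouette coefficient (assuming scikit-learn; the data and candidate values of k are illustrative), for instance to compare candidate numbers of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

# higher silhouette = tighter clusters and better separation,
# judged only from the data itself (no ground-truth labels used)
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```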

External evaluation has similar problems: if we have such “ground truth” labels, then we would not need to cluster, and in practical applications we usually do not have such labels. On the other hand, the labels reflect only one possible partitioning of the data set, which does not imply that a different (perhaps even better) clustering does not exist.
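A minimal sketch of external evaluation against known labels (assuming scikit-learn; the data and algorithm are illustrative), useful mainly on benchmark data where a reference classification exists:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# y_true plays the role of a "ground truth" classification
X, y_true = make_blobs(n_samples=300, centers=3, random_state=5)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 means perfect agreement with the reference labels,
# values near 0 mean agreement no better than chance
print(round(adjusted_rand_score(y_true, labels), 3))
```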

Neither of these approaches can therefore ultimately judge the actual quality of a clustering; that requires human evaluation, which is highly subjective. Nevertheless, such statistics can be quite informative for identifying bad clusterings, and the subjective assessment of a human should not be dismissed either.

Internal evaluation

When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores do not necessarily lead to effective information-retrieval applications. Additionally, this evaluation is biased towards algorithms that use the same cluster model: for example, k-means clustering naturally optimizes object distances, and a distance-based internal criterion is likely to overrate the resulting clustering.

Therefore, internal evaluation measures are best suited for getting some insight into situations where one algorithm performs better than another, but this does not imply that one algorithm produces more valid results than another. Validity as measured by such an index depends on the claim that this kind of structure actually exists in the data set.


This overview should make clear which models are used for clustering and what does and does not fall under the term.

