The nearest neighbor method is the simplest metric classifier: it is based on assessing the similarity of objects. The analyzed object is assigned to the class of the training-sample objects that lie closest to it. Let us look at what the nearest neighbor method is and walk through its main variants with examples.
Method hypothesis
The nearest neighbor method is probably the most widely used classification algorithm. The object to be classified is assigned the class y_i of the closest training object x_i.
Specifics of the k nearest neighbors method
The k nearest neighbors method increases the reliability of classification: the analyzed object is assigned to the class of the majority of its k nearest neighbors in the training sample. In two-class problems, k is chosen odd to rule out the ambiguous situation in which equal numbers of neighbors belong to different classes.
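The majority-vote rule above can be sketched as follows. This is a minimal illustration, not a production implementation: the sample, the Euclidean distance choice, and the function name `knn_classify` are all assumptions for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training objects.

    `train` is a list of ((features...), label) pairs; Euclidean distance
    is used as the similarity model.
    """
    # Sort training objects by distance to the query and take the k nearest.
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two-class toy sample; k is odd to avoid ties, as noted above.
sample = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
          ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B")]
print(knn_classify(sample, (0.15, 0.2), k=3))  # -> A
```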
Weighted Neighbor Technique
The weighted nearest neighbors method is used when there are at least three classes, so that choosing an odd k no longer rules out ties. Even then, ambiguity can arise. In such cases the i-th neighbor is given a weight w_i that decreases as the neighbor's rank i grows. The object is assigned to the class with the largest total weight among its nearest neighbors.
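A common choice of rank-decreasing weights is the geometric sequence w_i = q^i with 0 < q < 1; the sketch below uses it. The sample, the value q = 0.5, and the name `weighted_knn` are assumptions for illustration.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k, q=0.5):
    """Weighted kNN: the i-th nearest neighbor (rank i = 1..k) gets weight
    q**i, so weights decrease with rank; the class with the largest total
    weight among the k neighbors wins."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    scores = defaultdict(float)
    for i, (_, label) in enumerate(neighbors, start=1):
        scores[label] += q ** i
    return max(scores, key=scores.get)

# Three classes: a plain vote over the 3 nearest neighbors (A, B, B) would
# pick B, but the weighted vote favors the much closer A.
sample = [((0.0, 0.0), "A"), ((0.4, 0.0), "B"),
          ((0.0, 0.45), "B"), ((2.0, 2.0), "C")]
print(weighted_knn(sample, (0.05, 0.0), k=3))  # -> A
```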
Compactness conjecture
All of the above methods rest on the compactness hypothesis, which posits a connection between the similarity of objects and their membership in the same class. Under this hypothesis the boundary between classes has a simple shape, and classes form compactly localized regions in object space. In mathematical analysis such regions are usually understood to be closed bounded sets. Note that the hypothesis has nothing to do with the everyday meaning of the word "compact".
Basic formula
Let us examine the nearest neighbor method in more detail. Suppose a training sample of "object-response" pairs X^m = \{(x_1, y_1), \dots, (x_m, y_m)\} is given, and a distance function \rho(x, x') is defined on the set of objects as an adequate model of their similarity: the larger its value, the less similar the objects x and x'.
For an arbitrary object u, order the objects of the training sample x_i by increasing distance to u:
\rho(u, x_{1;u}) \leq \rho(u, x_{2;u}) \leq \cdots \leq \rho(u, x_{m;u}),
where x_{i;u} denotes the training object that is the i-th neighbor of u; the corresponding response is denoted y_{i;u}. Thus every arbitrary object u induces its own renumbering of the sample.
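The renumbering is just a sort of the sample by distance to u. A minimal sketch, with a made-up one-dimensional sample for illustration:

```python
import math

def order_by_distance(sample, u):
    """Renumber the training pairs so the first is x_{1;u} (nearest to u),
    the second is x_{2;u}, and so on."""
    return sorted(sample, key=lambda xy: math.dist(xy[0], u))

sample = [((3.0,), "B"), ((1.0,), "A"), ((2.0,), "A")]
# The neighbors' responses y_{1;u}, y_{2;u}, y_{3;u} for u = 0:
print([y for _, y in order_by_distance(sample, (0.0,))])  # -> ['A', 'A', 'B']
```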
Determining the number of neighbors k
With k = 1 the nearest neighbor method can misclassify not only outlier objects but also ordinary objects of other classes that happen to lie nearby. With k = m the algorithm is maximally stable but degenerates into a constant. For reliability, therefore, extreme values of k must be avoided.
In practice, the optimal k is chosen by the leave-one-out (sliding control) criterion.
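The leave-one-out criterion classifies each training object by its neighbors among the remaining objects and counts the mistakes; the k with the fewest mistakes wins. A sketch under assumed toy data (the sample and function names are illustrative):

```python
import math
from collections import Counter

def loo_error(sample, k):
    """Leave-one-out error: classify each object by its k nearest neighbors
    among the *remaining* objects and return the fraction of mistakes."""
    errors = 0
    for i, (x, y) in enumerate(sample):
        rest = sample[:i] + sample[i + 1:]
        neighbors = sorted(rest, key=lambda xy: math.dist(xy[0], x))[:k]
        votes = Counter(label for _, label in neighbors)
        if votes.most_common(1)[0][0] != y:
            errors += 1
    return errors / len(sample)

sample = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
          ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B")]
# Pick the k with the smallest leave-one-out error (ties go to the smaller k).
best_k = min(range(1, len(sample)), key=lambda k: loo_error(sample, k))
print(best_k, loo_error(sample, best_k))
```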
Filtering out outliers
Training objects are generally not equally valuable. Some carry the characteristic features of a class and are called standards (prototypes); if an object lies close to a standard, the probability that it belongs to that class is high.
How effective is the nearest neighbor method? Consider peripheral and uninformative objects: an object densely surrounded by other representatives of its own class can be removed from the sample without harming the quality of classification. A sample may also contain noise outliers that lie "in the thick" of another class; removing them, on the contrary, improves the quality of classification.
Eliminating uninformative and noise objects from the sample yields several benefits at once: the quality of classification improves, the amount of data to store shrinks, and the classification time spent searching for the nearest standards decreases.
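One deliberately simple screening rule along these lines: keep only the objects whose own leave-one-out k-NN vote agrees with their label, so that points "in the thick" of another class are dropped as noise. This is a sketch, not a full standard-selection algorithm; the sample and function name are assumptions.

```python
import math
from collections import Counter

def filter_noise(sample, k=3):
    """Keep objects whose leave-one-out k-NN vote matches their own label;
    objects surrounded by another class are discarded as noise outliers."""
    kept = []
    for i, (x, y) in enumerate(sample):
        rest = sample[:i] + sample[i + 1:]
        neighbors = sorted(rest, key=lambda xy: math.dist(xy[0], x))[:k]
        if Counter(l for _, l in neighbors).most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept

sample = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
          ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B"),
          ((1.05,), "A")]  # noise outlier sitting inside the B region
filtered = filter_noise(sample)
print(len(filtered))  # -> 6: only the outlier is removed
```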
Working with very large samples
The nearest neighbors method stores the training objects themselves, so very large samples pose technical problems: one must not only store a large amount of data but also find the k nearest neighbors of an arbitrary object u within a minimal time.
In order to cope with the task, two methods are used:
- thin out the sample by discarding uninformative objects;
- apply efficient data structures and indices for fast nearest-neighbor search.
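Even without a spatial index, the k nearest neighbors can be selected faster than by a full sort: a bounded heap does it in O(m log k) rather than O(m log m). This sketch shows only that partial-selection idea; real large-scale systems typically use spatial indices such as k-d trees or ball trees (the sample here is made up).

```python
import heapq
import math

def k_nearest(train, query, k):
    """Select the k nearest training objects without sorting the whole
    sample: heapq.nsmallest keeps a heap of size k while scanning once."""
    return heapq.nsmallest(k, train, key=lambda xy: math.dist(xy[0], query))

# 100 one-dimensional objects; only the 3 nearest to the query are returned.
train = [((float(i),), "A" if i < 50 else "B") for i in range(100)]
print([x for x, _ in k_nearest(train, (48.6,), k=3)])  # -> nearest first
```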
Choosing the distance function
The nearest neighbor method is applied to practical problems in which the distance function \rho(x, x') is known in advance. When objects are described by numerical vectors, the Euclidean metric is commonly used. This choice has no special justification, but it implicitly assumes that all features are measured "on the same scale"; if this is ignored, the feature with the largest numerical values dominates the metric.
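A standard remedy is to rescale each feature to zero mean and unit variance before computing Euclidean distances. A minimal sketch with hypothetical data (age in years versus income in dollars, where income would otherwise swamp age):

```python
import statistics

def standardize(vectors):
    """Rescale each feature to zero mean and unit variance so that no
    feature dominates the Euclidean metric merely because of its units."""
    cols = list(zip(*vectors))                     # columns = features
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant cols
    return [tuple((v - m) / s for v, m, s in zip(vec, means, stds))
            for vec in vectors]

# Hypothetical data: feature 2 (income) dwarfs feature 1 (age) numerically.
data = [(25.0, 40000.0), (30.0, 42000.0), (45.0, 90000.0)]
scaled = standardize(data)
print(scaled)
```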
When there are many features and the distance is computed as a sum of per-feature deviations, a serious dimensionality problem arises: in a high-dimensional space all objects end up far from one another, so the set of k nearest neighbors of the object under study becomes essentially arbitrary. To counter this, a small number of informative features are selected. Estimate-calculation algorithms are built on different subsets of attributes, constructing a separate proximity function for each subset.
Conclusion
Mathematical practice often involves a variety of techniques, each with its own characteristics, advantages, and disadvantages. The nearest neighbors method makes it possible to solve serious problems related to the classification of mathematical objects, and systems based on it are actively used in artificial intelligence.
In expert systems it is necessary not only to classify objects but also to explain the classification to the user. In the nearest neighbor method the explanation is natural: an object belongs to a class because of its position relative to the sample. Specialists in law, geology, and medicine readily accept this "precedent" logic and actively use it in their work.
For the method to be reliable, effective, and to give the desired result, an appropriate (not extreme) value of k must be chosen, and outliers must be kept out of the analyzed sample. This is why standard-selection techniques are applied and the metric is optimized.