In data mining, what is an outlier?
A data point that differs noticeably from the other data points in a dataset is an outlier. It may be the result of mistakes, such as measurement or data-entry errors, or it may be a valid observation that is simply rare. Outliers can greatly affect the outcomes of an analysis or model in data mining. If, for instance, a dataset used for regression analysis contains an outlier, it can significantly shift the regression line and its coefficients, producing inaccurate or misleading results. Outliers can also distort the data's distribution and degrade the performance of clustering algorithms, again producing inaccurate results.
To ensure accurate results in data mining, outliers must be properly identified and handled, such as by removing or transforming the data.
Outlier analysis examines data objects that behave differently from, and deviate significantly from, the rest of the data. An object that differs significantly from the other objects is said to be an outlier. Outliers may be caused by errors in measurement or execution. The process of analyzing them is called outlier analysis or outlier mining.
Outliers cannot simply be treated as noise or errors. Instead, they are assumed to have been generated by a different process than the other data objects.
There are three categories of outliers:
- Global (or Point) Outliers
- Collective Outliers
- Contextual (or Conditional) Outliers
Global Outliers
Also called point outliers, these are outliers in their most basic form. A data point is considered a global outlier if it deviates significantly from every other data point in the dataset. Most outlier detection techniques focus on finding global outliers.
For instance, in an intrusion detection system, if a large number of packets is broadcast in a short period of time, this may be regarded as a global outlier, and we can infer that the system in question has possibly been compromised.
Collective Outliers
As the name implies, if a group of data points in a dataset deviates significantly from the rest of the dataset, the points are said to be collective outliers. The individual data objects in this case might not be outliers on their own, but the group as a whole is. Detecting these outliers may require background knowledge about the relationship between the data objects exhibiting the outlying behavior.
A denial-of-service (DoS) packet sent from one computer to another, for instance, might be regarded as normal behavior by an intrusion detection system. However, if this occurs on many computers at once, it may be regarded as abnormal behavior, and all of the computers involved may be referred to as collective outliers.
Contextual Outliers
Also known as conditional outliers, these are data objects that deviate significantly from the other data points only under a specific context or condition. A data point might behave normally in one context but be an outlier in another, so a context must be specified as part of the problem statement in order to identify contextual outliers. Contextual outlier analysis gives users the flexibility to examine outliers in different contexts, which can be very useful in many applications. Each data point is described by contextual attributes, which define the context (such as time and location), and behavioral attributes, which define the object's characteristics (such as temperature).
In the context of a “winter season,” for instance, a temperature reading of 40°C might behave as an outlier, but in the context of a “summer season,” it would behave normally.
How Does Data Mining Find Outliers?
In data mining, there are several techniques for finding outliers, such as:
Z-Score Method
This method calculates a Z-score for every data point: the number of standard deviations the point lies from the mean. Data points whose Z-scores lie beyond a specific threshold, usually 3 or -3, are classified as outliers.
- Calculate the dataset's mean and standard deviation.
- For each data point, subtract the mean and divide the result by the standard deviation to obtain its Z-score.
- Flag data points whose Z-score is at or beyond the predetermined limit, typically 3 or -3, as outliers. For datasets with non-normal distributions, this method may not be appropriate.
For instance, if the dataset's mean were 100 and the standard deviation were 10, a data point with a value of 140 would have a Z-score of (140 - 100)/10 = 4. A data point with this Z-score lies 4 standard deviations above the mean and would be regarded as an outlier.
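The calculation above can be sketched in a few lines of NumPy. The dataset below is an assumed toy sample; note that with so few points, the outlier itself inflates the sample standard deviation, so a lower threshold of 2 is used here to flag it:

```python
import numpy as np

# worked example: mean 100, standard deviation 10, value 140
z = (140 - 100) / 10   # 4.0: four standard deviations above the mean

# flagging outliers in an assumed toy dataset; with a small sample the
# outlier distorts the statistics, so a threshold of 2 is used instead of 3
data = np.array([90.0, 95.0, 100.0, 100.0, 105.0, 110.0, 140.0])
scores = (data - data.mean()) / data.std()
outliers = data[np.abs(scores) >= 2.0]
```

On larger, roughly normal datasets the usual threshold of 3 works as described in the text.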
Method of Interquartile Range (IQR)
By dividing a dataset into quartiles and calculating the variability of the middle 50% of the data, the Interquartile Range (IQR) method can identify outliers in the dataset.
- Determine the first quartile (Q1) and the third quartile (Q3).
- Calculate the IQR by subtracting Q1 from Q3.
- Treat data points outside the range Q1 - 1.5 * IQR to Q3 + 1.5 * IQR as outliers.
- This approach is straightforward and frequently used, but it relies on a symmetrical distribution of the data and might not successfully spot outliers in non-symmetrical datasets.
Consider a dataset with the values 2, 3, 4, 5, 6, 7, 8, and 9. Under one common quartile convention, Q1 would be 4 and Q3 would be 8, so the IQR would be 8 - 4 = 4. Data points below Q1 - 1.5 * IQR = 4 - 6 = -2 or above Q3 + 1.5 * IQR = 8 + 6 = 14 would be outliers; this particular dataset contains none.
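A minimal NumPy sketch of the IQR method, using the example values plus an assumed extreme value of 100 so that something actually gets flagged (NumPy's default percentile interpolation happens to give Q1 = 4 and Q3 = 8 for this data):

```python
import numpy as np

# the example values from the text, plus an assumed extreme value (100)
data = np.array([2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])            # 4.0 and 8.0 here
iqr = q3 - q1                                     # 4.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # -2.0 and 14.0
outliers = data[(data < lower) | (data > upper)]  # only 100 falls outside
```

Be aware that different quartile conventions (NumPy's `method` parameter) can shift Q1 and Q3 slightly on small datasets.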
Method of Mahalanobis Distance
This method uses multivariate statistics to measure the distance between each data point and the dataset's mean while accounting for the correlations between variables. Data points whose distance exceeds a cutoff are flagged as outliers.
- Compute the dataset's mean vector and covariance matrix.
- Calculate each data point's Mahalanobis distance and identify outliers based on a cutoff value established by statistical tests.
- Although this approach works well for datasets with normal distributions, it may not fit non-normal datasets well.
Consider a dataset of people's heights and weights in which we are looking for individuals who differ significantly from the group's averages. We first determine the group's average height and weight, then calculate each individual's deviation from that mean on both variables, taking into account how the two are related. Individuals who deviate significantly from the norm are outliers. If, for instance, most of the group is about average in height and weight but one individual is significantly taller and heavier, the Mahalanobis distance method would identify that person as an outlier.
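The height-and-weight scenario can be sketched as follows; the data values are assumptions made for illustration, with the last row playing the role of the unusually tall, heavy individual:

```python
import numpy as np

# assumed toy height (cm) / weight (kg) data; last row is the tall, heavy person
X = np.array([
    [170.0, 70.0], [172.0, 72.0], [168.0, 65.0], [175.0, 78.0],
    [169.0, 68.0], [173.0, 74.0], [171.0, 71.0], [210.0, 130.0],
])

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)        # 2x2 covariance matrix of height and weight
cov_inv = np.linalg.inv(cov)

diff = X - mean
# Mahalanobis distance of each point from the mean: sqrt(diff . cov_inv . diff)
md = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
```

The last row ends up with the largest distance, so a cutoff between it and the rest would flag exactly that individual.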
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
This clustering algorithm treats data points that do not belong to any dense cluster as outliers (noise).
- The minimum number of points (MinPts) necessary to form a cluster and the radius (eps) around each point that defines its neighborhood are the two most important parameters in DBSCAN. The size and number of clusters the algorithm generates can be controlled by changing these parameters.
- DBSCAN has the advantage of being able to find clusters of any shape, as opposed to other algorithms like k-means which can only find spherical clusters.
- DBSCAN, however, can be sensitive to parameter selection and may not perform well for datasets with a lot of overlapping clusters. Other clustering algorithms or a combination of different algorithms may be required in these circumstances.
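A minimal, simplified DBSCAN sketch in pure NumPy (real projects would typically reach for a library implementation such as scikit-learn's): points that are never absorbed into a dense cluster keep the label -1 and are treated as outliers. The dataset is an assumed toy example with two dense clusters and one isolated point:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Simplified DBSCAN: returns a cluster label per point; -1 means noise/outlier."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nbrs = [np.flatnonzero(row <= eps) for row in dist]            # eps-neighborhoods (incl. self)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(nbrs[i]) < min_pts:
            continue                       # not a core point; stays noise unless reached later
        labels[i] = cluster                # start a new cluster and expand it
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(nbrs[j]) >= min_pts:
                    queue.extend(nbrs[j])  # core point: keep expanding through its neighborhood
        cluster += 1
    return labels

# two dense clusters plus one isolated point (assumed toy data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [50, 50]], dtype=float)
labels = dbscan(X, eps=2.0, min_pts=3)    # the last point never joins a cluster
```

The isolated point at (50, 50) has no neighbors within eps, so it remains labeled -1, which is exactly how the method flags outliers.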
Local Outlier Factor (LOF)
This algorithm calculates the local density of each data point; outliers are data points with low local density compared to their neighbors. The LOF algorithm has the advantage of being able to identify both global and local outliers. Global outliers are data points that are very different from the majority of the other data points in the dataset. Local outliers, on the other hand, are data points that deviate significantly from their immediate neighborhood, even though they may not look extreme with respect to the dataset as a whole.
- Let’s say we have a dataset with the heights and weights of 100 individuals. The LOF algorithm will be used to find any outliers in this dataset.
- To determine the local density score, we would first choose the number of nearest neighbors to consider. Let's use the five nearest neighbors for this example.
- We would then use each individual’s height, weight and the heights and weights of their five closest neighbors to calculate the local density score for each person in the dataset.
- To identify any outliers, we would then compare the local density scores. An individual is categorized as an outlier if their local density score is significantly lower than the scores of their five closest neighbors.
For instance, if person A is 6 feet tall and weighs 200 pounds, while their five closest neighbors are all between 5 and 6 feet tall and weigh between 150 and 170 pounds, person A would be regarded as an outlier because their height and weight differ significantly from those of their closest neighbors.
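A compact NumPy sketch of the LOF computation, following the standard definitions (k-distance, reachability distance, local reachability density); the dataset is an assumed toy example with a tight cluster and one distant point, and k = 2 is used because the sample is tiny:

```python
import numpy as np

def lof_scores(X, k):
    """Local Outlier Factor per point: scores near 1 are inliers, much larger means outlier."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]          # indices of each point's k nearest neighbors
    k_dist = d[np.arange(n), knn[:, -1]]        # distance to the k-th nearest neighbor
    # reachability distance from p to neighbor o: max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                 # local reachability density
    return lrd[knn].mean(axis=1) / lrd          # avg neighbor density relative to own density

# four clustered points plus one distant point (assumed toy data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]], dtype=float)
scores = lof_scores(X, k=2)
```

The clustered points score about 1 (their density matches their neighbors'), while the distant point's score is several times larger, marking it as the outlier.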
To find outliers in a dataset, these methods are used singly or in combination. The goals of the analysis and the nature of the data will determine the types of methods to be used.
In conclusion, outlier analysis is a critical step in data mining, as it helps to identify and deal with anomalies in the data. This guide has provided an overview of the various techniques used in outlier analysis, ranging from simple statistical methods to more advanced machine learning algorithms. It is essential to choose the appropriate method depending on the type and size of the dataset, as well as the research objectives. By detecting and handling outliers, data analysts can improve the accuracy and reliability of their models, leading to better decisions and insights. However, outlier analysis is not a one-time process but a continuous effort, as new data may introduce new outliers.