Five methods for detecting outliers in your dataset


Outlier detection is often an important part of exploratory data analysis. Real-world data is messy, and many different factors can affect the underlying data, so it is important to know several methods for identifying outliers within it.

The first question to ask, then, is: what is an outlier? An outlier can be defined as a data point, or several data points, that do not fit the pattern or structure of the data, or that fall outside the bounds of what we would normally expect. This is a subjective definition: whether a data point counts as an outlier depends heavily on the context in which you are examining the data.
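To make "normal bounds" concrete, a common rule of thumb (a standard illustration, not this article's own method) flags points more than three standard deviations from the mean. The data below is invented for the example:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# A hypothetical sample: mostly values near 11, plus one extreme point.
data = [10, 12, 11, 13] * 5 + [95]
print(zscore_outliers(data))  # → [95]
```

Note that what counts as "extreme" is relative to the rest of the sample, which is exactly why the definition is context-dependent.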

An outlier can arise from many different sources:

  • Human error
  • Instrument error
  • Experimental error
  • Intentional creation
  • Data processing error
  • Sampling error
  • Natural outlier

The purpose of identifying outliers can also differ. In some cases, an outlier indicates that something has changed in the process that produces the data, which is useful in:

  • Fraud detection
  • Intrusion detection
  • Fault diagnostics
  • Time series monitoring
  • Health monitoring

Here an outlier suggests that something has gone wrong in the process, or that the nature of the process generating the data has changed. This entails identifying outliers against a baseline of data accepted as normal.

In other cases, it is useful to remove outliers from existing data to ensure a model works well. For example:

  • Recommendation engines
  • Time series forecasting
  • Scientific experimentation
  • Model building

Here, outliers in the existing data would affect the model itself, such as in linear regression or classification tasks. It is therefore important to be able to identify outliers in existing data, where we are not yet sure what normal behaviour looks like.

These two tasks can thus be separated: the first is novelty detection in newly generated data, and the second is anomaly detection in existing data. It is primarily to this second domain that we turn our attention, although the methods outlined below can be applied to the first scenario as well.
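The distinction can be sketched in code: a novelty detector is fitted only on data accepted as normal and scores new observations against that fit, while an anomaly detector is fitted on the full, possibly contaminated dataset and scores its own points. The simple Gaussian model and the data below are illustrative assumptions, not methods from this article:

```python
import statistics

def fit(values):
    """Fit a simple Gaussian model: mean and population standard deviation."""
    return statistics.mean(values), statistics.pstdev(values)

def is_outlier(value, mean, stdev, threshold=3.0):
    """Score a single point against the fitted model."""
    return abs(value - mean) / stdev > threshold

# Novelty detection: fit on data accepted as normal, score new points.
normal = [10, 12, 11, 13] * 5
mean, stdev = fit(normal)
print(is_outlier(95, mean, stdev))  # → True

# Anomaly detection: fit on the full dataset, score its own points.
contaminated = normal + [95]
mean, stdev = fit(contaminated)
print([v for v in contaminated if is_outlier(v, mean, stdev)])  # → [95]
```

The only difference between the two settings is which data the model is fitted on; the scoring step is identical.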

The dataset I use for this is a collection of Pokémon spanning the first seven generations, containing a total of 801 Pokémon. Specifically, I will focus on analysing the distribution of the HP…
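As a preview of the shape such an analysis takes, the sketch below applies Tukey's IQR fences to a hypothetical list of HP-like values. The real dataset is not reproduced here; the numbers, including the Blissey-like 255, are stand-ins:

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Tukey's fences: points outside [Q1 - k*IQR, Q3 + k*IQR] are outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical HP values standing in for the real column.
hp = [45, 60, 65, 50, 78, 55, 70, 62, 58, 255]
low, high = iqr_bounds(hp)
print([v for v in hp if v < low or v > high])  # → [255]
```

Unlike the z-score rule, this approach uses quartiles rather than the mean and standard deviation, so it is less distorted by the extreme values it is trying to find.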
