Practising Unsupervised Learning: DBScan

Prof. Dr. Mirco Schoenfeld

DBScan

DBScan stands for

Density-based spatial clustering of applications with noise

(Ester et al. 1996)

DBScan

DBScan is a density-based clustering algorithm.

It groups together points that have many nearby neighbors and marks points in low-density regions as noise.

Merits

In 2014, the algorithm received the Test of Time Award at the ACM SIGKDD conference.

It is one of the most common clustering algorithms.

(Schubert et al. 2017)

abstract formulation

The algorithm (abstracted):

  1. Find the points within a fixed radius ε (eps) of every point, and identify the core points: those with at least minPts points in their ε-neighborhood, counting the point itself.
  2. Find the connected components of core points on the neighbor graph, ignoring all non-core points.
  3. Assign each non-core point to a cluster if it lies within ε of one of that cluster's core points, otherwise label it as noise.
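
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the names `dbscan_abstract` and `region_query`, the toy points, the Euclidean distance, and the values eps=1.1 and min_pts=3 are all assumptions made for this example. A point counts as core if its eps-neighborhood, including the point itself, holds at least min_pts points.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan_abstract(points, eps, min_pts):
    n = len(points)

    # Step 1: eps-neighborhood of every point, then mark the core points.
    neighbors = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: connected components of core points on the neighbor graph.
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if is_core[q] and labels[q] == NOISE:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1

    # Step 3: attach each non-core point to the cluster of a neighboring
    # core point (here simply the first one found), otherwise keep NOISE.
    for i in range(n):
        if is_core[i]:
            continue
        for q in neighbors[i]:
            if is_core[q]:
                labels[i] = labels[q]
                break
    return labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),  # dense group
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),        # second group
    (1.9, 1.0),                                      # border of first group
    (50.0, 50.0),                                    # isolated point
]
labels = dbscan_abstract(points, eps=1.1, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 0, -1]
```

The isolated point keeps the label -1 (noise), and the number of clusters falls out of the data rather than being passed in. The all-pairs neighborhood search is quadratic and only suitable for toy data.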

visualization

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

advantages

DBScan has a few advantages:

  • it does not require the number of clusters a priori
  • it can find arbitrarily-shaped clusters
  • it has a notion of noise and is robust to outliers
  • its parameters can be set by domain experts who understand the data

disadvantages

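DBScan also has a few disadvantages:

  • it is not entirely deterministic: a border point within reach of more than one cluster is assigned to whichever cluster is processed first
  • its quality largely depends on the distance measure
  • it struggles to cluster datasets with large differences in densities
  • if the data and its scale are not well understood, it is difficult to choose suitable parameters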
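
The first disadvantage can be made concrete with a small sketch in plain Python. It assumes a simplified DBScan in which a cluster, while it is being expanded, claims every still-unlabeled point in the neighborhood of its core points. The function names `dbscan_labels` and `border_side`, the one-dimensional toy data, and the values eps=1.1 and min_pts=4 are hypothetical choices for this illustration. The point at 2.0 is not a core point but lies within eps of a core point on either side.

```python
from math import dist


def dbscan_labels(points, eps, min_pts):
    """Compact DBScan sketch; points left as None are noise."""
    nbrs = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
            for p in points]
    core = [len(nb) >= min_pts for nb in nbrs]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if not core[i] or labels[i] is not None:
            continue
        labels[i] = cluster
        stack = [i]  # only core points are ever expanded
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] is None:
                    labels[q] = cluster  # first cluster to arrive wins
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels


left = [(0.0,), (0.4,), (0.8,), (1.0,)]
right = [(3.0,), (3.2,), (3.6,), (4.0,)]
border = (2.0,)  # within eps of 1.0 and of 3.0, but not a core point
points = left + [border] + right


def border_side(ordered_points):
    """Report which group the border point ends up in for a given order."""
    labels = dict(zip(ordered_points,
                      dbscan_labels(ordered_points, eps=1.1, min_pts=4)))
    return "left" if labels[border] == labels[left[0]] else "right"


print(border_side(points))        # left
print(border_side(points[::-1]))  # right
```

The core points are grouped identically in both runs. Only the border point changes sides, which is the sense in which the result depends on processing order.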

It’s your turn

  1. Download the task sheet
  2. Open the task sheet in RStudio
  3. Fill in the gaps to apply a dbscan clustering
    If you want to read what a function (e.g. dbscan) does, use ? to access its documentation (e.g. ?dbscan)

References

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.
Schubert, Erich, Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. 2017. “DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN.” ACM Transactions on Database Systems 42 (3): 1–21. https://doi.org/10.1145/3068335.