| Title: | Resistant Clustering via Chopping Up Mutual Reachability Minimum Spanning Trees |
|---|---|
| Description: | Implements a fast and resistant divisive clustering algorithm which identifies a specified number of clusters: 'lumbermark' iteratively chops off sizeable limbs that are joined by protruding segments of a dataset's mutual reachability minimum spanning tree (Gagolewski, 2026 <DOI:10.48550/arXiv.2604.07143>). The use of a mutual reachability distance pulls peripheral points farther away from each other. It is a viable alternative to the 'HDBSCAN*'algorithm. When combined with the 'deadwood' package, it can act as an outlier detector. The 'Python' version of 'lumbermark' is available via 'PyPI'. |
| Authors: | Marek Gagolewski [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-0637-6028>) |
| Maintainer: | Marek Gagolewski <[email protected]> |
| License: | AGPL-3 |
| Version: | 0.9.0.9003 |
| Built: | 2026-07-01 15:24:38 UTC |
| Source: | https://github.com/gagolews/lumbermark |
Lumbermark is a fast and robust divisive clustering algorithm which identifies a specified number of clusters.
It iteratively chops off sizeable limbs that are joined by protruding segments of a dataset's mutual reachability minimum spanning tree.
The use of a mutual reachability distance (; Campello et al., 2013)
pulls peripheral points farther away from each other.
lumbermark(d, ...) ## Default S3 method: lumbermark( d, k, min_cluster_factor = 0.25, M = 5L, nested = FALSE, skip_leaves = (M > 0L), min_cluster_size = 10, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), verbose = FALSE, ... ) ## S3 method for class 'dist' lumbermark( d, k, min_cluster_factor = 0.25, M = 5L, nested = FALSE, skip_leaves = (M > 0L), min_cluster_size = 10, verbose = FALSE, ... ) ## S3 method for class 'mst' lumbermark( d, k, min_cluster_factor = 0.25, nested = FALSE, skip_leaves = TRUE, min_cluster_size = 10, verbose = FALSE, ... )lumbermark(d, ...) ## Default S3 method: lumbermark( d, k, min_cluster_factor = 0.25, M = 5L, nested = FALSE, skip_leaves = (M > 0L), min_cluster_size = 10, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), verbose = FALSE, ... ) ## S3 method for class 'dist' lumbermark( d, k, min_cluster_factor = 0.25, M = 5L, nested = FALSE, skip_leaves = (M > 0L), min_cluster_size = 10, verbose = FALSE, ... ) ## S3 method for class 'mst' lumbermark( d, k, min_cluster_factor = 0.25, nested = FALSE, skip_leaves = TRUE, min_cluster_size = 10, verbose = FALSE, ... )
d |
a numeric matrix (or an object coercible to one,
e.g., a data frame with numeric-like columns) or an
object of class |
... |
further arguments passed to |
k |
integer; the desired number of clusters to detect |
min_cluster_factor |
numeric value in (0,1); output cluster sizes will
not be smaller than |
M |
integer; smoothing factor; |
nested |
if TRUE, generating |
skip_leaves |
logical; whether the MST leaves should be omitted from cluster size counting |
min_cluster_size |
integer; minimal cluster size; used as a safeguard
for small |
distance |
metric used to compute the linkage, one of:
|
verbose |
logical; whether to print diagnostic messages and progress information |
Lumbermark is a viable alternative to the HDBSCAN* algorithm that is able to detect a predefined number of clusters and indicate outliers (via deadwood; see Gagolewski, 2026).
As with all distance-based methods (this includes k-means and DBSCAN as well), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
If d is a numeric matrix or an object of class dist,
mst() will be called to compute an MST, which generally
takes at most time. However, by default, a faster algorithm
based on K-d trees is selected automatically for low-dimensional Euclidean
spaces; see mst_euclid from
the quitefastmst package.
Once a minimum spanning tree with increasingly sorted edge weights
is determined, the Lumbermark algorithm runs in time.
If you want to test different parameters or s,
it is best to compute the MST explicitly beforehand.
lumbermark() returns an object of class mstclust, which defines
a -partition of a given dataset, i.e., a vector whose -th
element denotes the -th input point's cluster label between 1 and .
The mst attribute gives the computed minimum
spanning tree which can be reused in further calls to the functions
from genieclust, lumbermark, and deadwood.
The cut_edges attribute gives the
indexes of the MST edges whose omission leads to the requested
-partition (connected components of the resulting spanning forest).
M. Gagolewski, Lumbermark: Resistant clustering by chopping up mutual reachability minimum spanning trees, 2026, doi:10.48550/arXiv.2604.07143
R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14
M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1
M. Gagolewski, deadwood, in preparation, 2026
M. Gagolewski, quitefastmst, in preparation, 2026
The official online manual of lumbermark at https://lumbermark.gagolewski.com/
mst() for the minimum spanning tree routines
library("datasets") data("iris") X <- jitter(as.matrix(iris[3:4])) y_pred <- lumbermark(X, k=3) y_test <- as.integer(iris[,5]) plot(X, col=y_pred, pch=y_test, asp=1, las=1) # detect three clusters and find outliers with Deadwood library("deadwood") y_pred2 <- lumbermark(X, k=3) plot(X, col=y_pred2, asp=1, las=1) is_outlier <- deadwood(y_pred2) points(X[!is_outlier, ], col=y_pred2[!is_outlier], pch=16)library("datasets") data("iris") X <- jitter(as.matrix(iris[3:4])) y_pred <- lumbermark(X, k=3) y_test <- as.integer(iris[,5]) plot(X, col=y_pred, pch=y_test, asp=1, las=1) # detect three clusters and find outliers with Deadwood library("deadwood") y_pred2 <- lumbermark(X, k=3) plot(X, col=y_pred2, asp=1, las=1) is_outlier <- deadwood(y_pred2) points(X[!is_outlier, ], col=y_pred2[!is_outlier], pch=16)