gagolews

stringi - Fast and Portable Character String Processing Facilities

A collection of character string/text/natural language processing tools for pattern searching (e.g., with 'Java'-like regular expressions or the 'Unicode' collation algorithm), random string generation, case mapping, string transliteration, concatenation, sorting, padding, wrapping, Unicode normalisation, date-time formatting and parsing, and many more. They are fast, consistent, convenient, and - thanks to 'ICU' (International Components for Unicode) - portable across all locales and platforms. Documentation about 'stringi' is provided via its website at <https://stringi.gagolewski.com/> and the paper by Gagolewski (2022, <doi:10.18637/jss.v103.i02>).

icu, icu4c, natural-language-processing, nlp, regex, regexp, string-manipulation, stringi, stringr, text, text-processing, tidy-data, unicode, cpp

18.40 score 317 stars 9.7k dependents 14k scripts 1.1M downloads

genieclust - Genie: Fast and Robust Hierarchical Clustering

Genie is a robust hierarchical clustering algorithm (Gagolewski, Bartoszuk, Cena, 2016 <DOI:10.1016/j.ins.2016.05.003>). 'genieclust' is its faster, more capable implementation (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>). It enables clustering with respect to mutual reachability distances, allowing it to act as an alternative to 'HDBSCAN*' that can identify any number of clusters or their entire hierarchy. When combined with the 'deadwood' package, it can act as an outlier detector. Additional package features include the Gini and Bonferroni inequality indices, external cluster validity measures (e.g., the normalised clustering accuracy, the adjusted Rand index, the Fowlkes-Mallows index, and normalised mutual information), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices). The 'Python' version of 'genieclust' is available via 'PyPI'.

cluster-analysis, clustering, clustering-algorithm, data-analysis, data-mining, data-science, genie, hdbscan, hierarchical-clustering, hierarchical-clustering-algorithm, machine-learning, machine-learning-algorithms, mlpack, nmslib, python, python3, sparse, cpp

7.90 score 73 stars 6 dependents 14 scripts 577 downloads

FuzzyNumbers - Tools to Deal with Fuzzy Numbers

S4 classes and methods to deal with fuzzy numbers. They support arbitrary arithmetic operations (e.g., by using the Zadeh extension principle), approximation of arbitrary fuzzy numbers by trapezoidal and piecewise linear ones, preparation of plots for publications, computation of possibility and necessity values for comparisons, etc.

7.42 score 13 stars 13 dependents 104 scripts 460 downloads

TurtleGraphics - Turtle Graphics

An implementation of turtle graphics <http://en.wikipedia.org/wiki/Turtle_graphics>. Turtle graphics comes from Papert's language Logo and has been used to teach concepts of computer programming.

7.16 score 23 stars 2 dependents 208 scripts 221 downloads

quitefastmst - Euclidean and Mutual Reachability Minimum Spanning Trees

Functions to compute Euclidean minimum spanning trees using single-, sesqui-, and dual-tree Boruvka algorithms. Thanks to K-d trees, they are fast in spaces of low intrinsic dimensionality. Mutual reachability distances (used in the definition of the 'HDBSCAN*' algorithm) are supported too. The package also includes relatively fast fallback minimum spanning tree and nearest-neighbours algorithms for spaces of higher dimensionality. The 'Python' version of 'quitefastmst' is available via 'PyPI'.
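The mutual reachability distance mentioned above can be illustrated with a small pure-Python sketch. It is not the package's implementation: it uses a naive quadratic Prim algorithm rather than the Boruvka and K-d tree methods described, and all function names are hypothetical. It also assumes one common convention, namely that a point's core distance is the distance to its M-th nearest neighbour, not counting the point itself; other conventions exist.

```python
import math

def core_distances(points, M):
    """Distance from each point to its M-th nearest neighbour
    (assumed convention: the point itself is not counted)."""
    cores = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(d[M - 1])
    return cores

def mutual_reachability(points, M):
    """Matrix with entries max(d(i, j), core(i), core(j)) and a zero diagonal."""
    cores = core_distances(points, M)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(math.dist(points[i], points[j]), cores[i], cores[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

def prim_mst(dist):
    """Naive O(n^2) Prim algorithm on a full distance matrix.
    Returns a list of (i, j, weight) tree edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

points = [(0, 0), (1, 0), (2, 0), (10, 0)]
dm = mutual_reachability(points, M=2)
tree = prim_mst(dm)
```

In this toy example, the isolated point at (10, 0) has a large core distance, so every mutual reachability distance involving it is inflated accordingly.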

cluster-analysis, clustering, clustering-evaluation, euclidean-distances, genie, hdbscan, hdbscan-clustering-algorithm, machine-learning, machine-learning-algorithms, minimum-spanning-tree, mst, mutual-reachability-distance, neighbor-search, outlier-detection, cpp, openmp

5.31 score 1 star 9 dependents 402 downloads

deadwood - Outlier Detection via Pruning Mutual Reachability Minimum Spanning Trees

Implements an anomaly detection algorithm based on a dataset's mutual reachability minimum spanning tree: 'deadwood' chops protruding tree segments and marks small debris as outliers; see Gagolewski (2026) <https://deadwood.gagolewski.com/>. More precisely, the use of a mutual reachability distance pulls peripheral points farther away from each other. Tree edges with weights beyond the detected elbow point are removed. All the resulting connected components whose sizes are smaller than a given threshold are deemed anomalous. The 'Python' version of 'deadwood' is available via 'PyPI'.
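The pruning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 'deadwood' code: the function name and parameters are hypothetical, the spanning tree is passed in as a list of (i, j, weight) edges, and the weight cutoff is supplied by hand, because the description does not say how the elbow point is detected.

```python
def flag_small_components(n, edges, weight_cutoff, min_size):
    """Drop tree edges heavier than weight_cutoff, then mark every point
    that ends up in a connected component smaller than min_size.

    n             -- number of points, labelled 0..n-1
    edges         -- spanning tree edges as (i, j, weight) tuples
    weight_cutoff -- hypothetical stand-in for the detected elbow point
    min_size      -- components smaller than this are deemed anomalous
    """
    parent = list(range(n))  # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Keep only the edges at or below the cutoff.
    for i, j, w in edges:
        if w <= weight_cutoff:
            parent[find(i)] = find(j)

    # Count the size of each remaining component.
    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1

    return [sizes[find(v)] < min_size for v in range(n)]

tree_edges = [(0, 1, 1.0), (1, 2, 1.2), (2, 3, 5.0), (3, 4, 0.9)]
is_outlier = flag_small_components(5, tree_edges, weight_cutoff=2.0, min_size=3)
```

Here the heavy edge between points 2 and 3 is removed, which leaves a component of three points and a component of two; with a minimum size of three, only the latter is flagged.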

anomaly-detection, data-science, machine-learning, machine-learning-algorithms, minimum-spanning-tree, minimum-spanning-trees, mst, noise-detection, outlier-detection, outliers, cpp, openmp

5.19 score 1 star 8 dependents 387 downloads

agop - Aggregation Operators and Preordered Sets

Tools supporting multi-criteria and group decision making, including with a variable number of criteria, by means of aggregation operators, spread measures, fuzzy logic connectives, fusion functions, and preordered sets. Possible applications include, but are not limited to, quality management, scientometrics, and software engineering.

aggregation, cpp

5.06 score 5 stars 2 dependents 77 scripts 316 downloads

genie - Fast, Robust, and Outlier Resistant Hierarchical Clustering

Includes the reference implementation of Genie - a hierarchical clustering algorithm that links two point groups in such a way that an inequity measure (namely, the Gini index) of the cluster sizes does not significantly increase above a given threshold. On a wide range of benchmark datasets, this method often outperforms many other data segmentation approaches in terms of clustering quality. At the same time, Genie retains the high speed of the single linkage approach, so it is also suitable for analysing larger data sets. For more details, see Gagolewski et al. (2016, <DOI:10.1016/j.ins.2016.05.003>). For an even faster and more feature-rich implementation, including, amongst others, noise point detection, see the 'genieclust' package (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>).
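To make the inequity measure concrete, here is a minimal Python sketch of the normalised Gini index of a vector of cluster sizes, computed as the sum of absolute pairwise differences divided by (n - 1) times the total. The function name is hypothetical and this is not code from the package; it only shows how the index reacts to balanced and unbalanced cluster sizes.

```python
def gini_index(sizes):
    """Normalised Gini index of a list of non-negative cluster sizes.

    Returns 0 when all sizes are equal and approaches 1 when a single
    cluster holds almost all of the points.
    """
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0  # nothing to compare
    pairwise = sum(
        abs(sizes[i] - sizes[j])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    return pairwise / ((n - 1) * total)

balanced = gini_index([5, 5, 5])    # equal cluster sizes
skewed = gini_index([1, 1, 8])      # one dominant cluster
```

A merge that would push this value above the chosen threshold is the situation the algorithm is designed to avoid, which keeps a single cluster from swallowing the others too early.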

cluster, cluster-analysis, clustering, data-analysis, data-mining, data-science, datascience, genie, hierarchical-clustering-algorithm, machine-learning, machine-learning-algorithms, outliers, cpp, openmp

4.64 score 22 stars 20 scripts 372 downloads

stringx - Replacements for Base String Functions Powered by 'stringi'

English is the native language of only 5% of the world's population. Also, only 17% of us can understand this text. Moreover, the Latin alphabet is the main one for merely 36% of the total. The early computer era, now a very long time ago, was dominated by the US. Due to the proliferation of the internet, smartphones, social media, and other technologies and communication platforms, this is no longer the case. This package replaces base R string functions (such as grep(), tolower(), sprintf(), and strptime()) with ones that fully support the Unicode standards related to natural language and date-time processing. It also fixes some long-standing inconsistencies and introduces some new, useful features. Thanks to 'ICU' (International Components for Unicode) and 'stringi', they are fast, reliable, and portable across different platforms.

icu, icu4c, natural-language-processing, nlp, regex, regexp, string-manipulation, stringi, text, text-processing, unicode

4.15 score 28 stars 1 script 228 downloads

lumbermark - Resistant Clustering via Chopping Up Mutual Reachability Minimum Spanning Trees

Implements a fast and resistant divisive clustering algorithm which identifies a specified number of clusters: 'lumbermark' iteratively chops off sizeable limbs that are joined by protruding segments of a dataset's mutual reachability minimum spanning tree (Gagolewski, 2026 <DOI:10.48550/arXiv.2604.07143>). The use of a mutual reachability distance pulls peripheral points farther away from each other. When combined with the 'deadwood' package, it can act as an outlier detector. The 'Python' version of 'lumbermark' is available via 'PyPI'.

anomaly-detection, cluster-analysis, clustering, clustering-algorithm, hdbscan, machine-learning, machine-learning-algorithms, minimum-spanning-tree, minimum-spanning-trees, outlier-detection, outliers, cpp

4.00 score 489 downloads

CITAN - CITation ANalysis Toolpack

Supports quantitative research in scientometrics and bibliometrics. Provides various tools for preprocessing bibliographic data retrieved, e.g., from Elsevier's Scopus, computing bibliometric impact of individuals, or modelling phenomena encountered in the social sciences. This package is deprecated; see 'agop' instead.

3.82 score 6 stars 22 scripts 305 downloads

realtest - Where Expectations Meet Reality: Realistic Unit Testing

A framework for unit testing for realistic minimalists, where we distinguish between expected, acceptable, current, fallback, ideal, or regressive behaviour. It can also be used for monitoring third-party software projects for changes.

continuous-testing, testing-tools, unit-testing

3.78 score 12 stars 184 downloads