Title: | A Partial Clustering Algorithm |
---|---|
Description: | Provides the 'CrossClustering' algorithm (Tellaroli et al. (2016) <doi:10.1371/journal.pone.0152333>), a partial clustering algorithm that combines Ward's minimum variance and Complete Linkage algorithms, providing automatic estimation of a suitable number of clusters and identification of outlier elements. |
Authors: | Paola Tellaroli [cre, aut], Marco Bazzi [aut], Michele Donato [aut], Livio Finos [aut], Philippe Courcoux [aut], Corrado Lanera [aut] |
Maintainer: | Paola Tellaroli <[email protected]> |
License: | GPL-3 |
Version: | 4.1.2 |
Built: | 2024-11-04 05:19:57 UTC |
Source: | https://github.com/corradolanera/crossclustering |
Computes the adjusted Rand index and the confidence interval, comparing two classifications from a contingency table.
print method for ari class
ari(mat, alpha = 0.05, digits = 2)

## S3 method for class 'ari'
print(x, ...)
mat |
A matrix of integers representing the contingency table of reference |
alpha |
A single number strictly between 0 and 1 representing the significance level of interest (default is 0.05) |
digits |
An integer giving the number of significant digits to return (default is 2) |
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |
The adjusted Rand Index (ARI) should be interpreted as follows:
ARI >= 0.90 excellent recovery; 0.80 <= ARI < 0.90 good recovery; 0.65 <= ARI < 0.80 moderate recovery; ARI < 0.65 poor recovery.
Because the confidence interval is based on a Normal approximation, it should be trusted only when the total number of clustered objects is greater than 100.
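As an illustration of the quantities discussed above, here is a minimal base R sketch of the Hubert and Arabie (1985) adjusted Rand index computed from a contingency table, together with the recovery labels listed above. The names `ari_from_table` and `ari_label` are hypothetical helpers written for this example; they are not part of the package, and the sketch does not compute the confidence interval.

```r
# Hypothetical helper: Hubert-Arabie adjusted Rand index from a contingency table.
ari_from_table <- function(mat) {
  mat <- as.matrix(mat)
  n <- sum(mat)
  sum_cells <- sum(choose(mat, 2))           # pairs placed together by both partitions
  sum_rows  <- sum(choose(rowSums(mat), 2))  # pairs placed together by the row partition
  sum_cols  <- sum(choose(colSums(mat), 2))  # pairs placed together by the column partition
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Hypothetical helper: map an ARI value to the recovery labels listed above.
ari_label <- function(ari) {
  if (ari >= 0.90) {
    "excellent recovery"
  } else if (ari >= 0.80) {
    "good recovery"
  } else if (ari >= 0.65) {
    "moderate recovery"
  } else {
    "poor recovery"
  }
}

# Ward's clustering of the iris measurements compared with the species labels.
clusters <- cutree(hclust(dist(iris[-5]), method = "ward.D"), k = 3)
tab <- table(iris[[5]], clusters)
ari_value <- ari_from_table(tab)
ari_label(ari_value)
```

Since iris has 150 observations, it also satisfies the "more than 100 objects" guideline for the Normal approximation.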
An object of class ari with the following elements:
AdjustedRandIndex |
The adjusted Rand Index |
CI |
The confidence interval |
print(ari)
:
Paola Tellaroli, <paola dot tellaroli at unipd dot it>
L. Hubert and P. Arabie (1985) Comparing partitions, Journal of Classification, 2, 193-218.
E.M. Qannari, P. Courcoux and P. Faye (2014) Significance test of the adjusted Rand index. Application to the free sorting task, Food Quality and Preference, 32, 93-97
M.H. Samuh, F. Leisch, and L. Finos (2014), Tests for Random Agreement in Cluster Analysis, Statistica Applicata-Italian Journal of Applied Statistics, vol. 26, no. 3, pp. 219-234.
D. Steinley (2004) Properties of the Hubert-Arabie Adjusted Rand Index, Psychological Methods, 9(3), 386-396
D. Steinley, M.J. Brusco, L. Hubert (2016) The Variance of the Adjusted Rand Index, Psychological Methods, 21(2), 261-272
#### This example compares the adjusted Rand Index as computed on the
### partitions given by Ward's algorithm with the ground truth on the
### famous Iris data set by the adjustedRandIndex function
### {mclust package} and by the ari function.

library(CrossClustering)
library(mclust)

clusters <- iris[-5] |>
  dist() |>
  hclust(method = 'ward.D') |>
  cutree(k = 3)

ground_truth <- iris[[5]] |>
  as.numeric()

mc_ari <- adjustedRandIndex(clusters, ground_truth)
mc_ari

ari_cc <- table(ground_truth, clusters) |>
  ari(digits = 7)
ari_cc

all.equal(mc_ari, unclass(ari_cc)[["ari"]], check.attributes = FALSE)
This function performs the CrossClustering algorithm. The method combines Ward's minimum variance algorithm with either the Complete-linkage algorithm (default, useful for finding spherical clusters) or the Single-linkage algorithm (useful for finding elongated clusters), providing automatic estimation of a suitable number of clusters and identification of outlier elements.
cc_crossclustering(
  dist,
  k_w_min = 2,
  k_w_max = attr(dist, "Size") - 2,
  k2_max = k_w_max + 1,
  out = TRUE,
  method = c("complete", "single")
)

## S3 method for class 'crossclustering'
print(x, ...)
dist |
A dissimilarity structure as produced by the function
|
k_w_min |
(int) Minimum number of clusters for the Ward's minimum variance method. Defaults to 2 |
k_w_max |
(int) Maximum number of clusters for the Ward's minimum variance method (see details) |
k2_max |
(int) Maximum number of clusters for the Complete/Single-linkage method. It cannot be equal to or greater than the number of elements to cluster (see details) |
out |
(lgl) If |
method |
(chr) "complete" (default) or "single". CrossClustering combines Ward's algorithm with Complete-linkage if method is set to "complete", otherwise (if method is set to 'single') Single-linkage will be used. |
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |
See cited document for more details.
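To make the default bounds in the usage signature concrete, the following base R sketch shows how `k_w_max` and `k2_max` follow from the `"Size"` attribute of a `dist` object. It only reproduces the arithmetic of the defaults shown above and does not call the package; the 10-row subset of iris is an arbitrary choice for illustration.

```r
# A dist object stores the number of clustered elements in its "Size" attribute.
d <- dist(iris[1:10, -5])
n <- attr(d, "Size")

# Defaults as written in the usage signature above.
k_w_max <- n - 2        # maximum number of clusters for Ward
k2_max  <- k_w_max + 1  # maximum number of clusters for Complete/Single-linkage

c(n = n, k_w_max = k_w_max, k2_max = k2_max)
```

With these defaults `k2_max` equals `n - 1`, which respects the constraint that it must stay below the number of elements to cluster.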
A list of objects describing characteristics of the partitioning as follows:
Optimal_cluster |
number of clusters |
cluster_list_elements |
a list of clusters; each element of this list contains the indices of the elements belonging to the cluster |
Silhouette |
the average silhouette width over all the clusters |
n_total |
total number of input elements |
n_clustered |
number of input elements that have actually been clustered |
print(crossclustering)
:
Paola Tellaroli, <paola dot tellaroli at unipd dot it>;
Marco Bazzi, <bazzi at stat dot unipd dot it>;
Michele Donato, <mdonato at stanford dot edu>
Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2016). Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLoS ONE 11(3): e0152333. doi:10.1371/journal.pone.0152333
Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2017). E1829: Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. CMStatistics 2017, London 16-18 December, Book of Abstracts (ISBN 978-9963-2227-4-2)
library(CrossClustering)

#### Example of Cross-Clustering as in reference paper
#### method = "complete"

data(toy)

### toy is transposed as we want to cluster samples (columns of the
### original matrix)
toy_dist <- t(toy) |>
  dist(method = "euclidean")

### Run CrossClustering
cc_crossclustering(
  toy_dist,
  k_w_min = 2,
  k_w_max = 5,
  k2_max = 6,
  out = TRUE
)

#### Simulated data as in reference paper
#### method = "complete"
set.seed(10)
sg <- c(500, 250, 700, 300, 100)

# 5 clusters
t <- matrix(0, nrow = 5, ncol = 5)
t[1, ] <- rep(6, 5)
t[2, ] <- c( 0,  5, 12, 13, 15)
t[3, ] <- c(15, 11,  9,  5,  0)
t[4, ] <- c( 6, 12, 15, 10,  5)
t[5, ] <- c(12, 17,  3,  7, 10)

t_mat <- NULL
for (i in seq_len(nrow(t))) {
  t_mat <- rbind(
    t_mat,
    matrix(rep(t[i, ], sg[i]), nrow = sg[i], byrow = TRUE)
  )
}

data_15 <- matrix(NA, nrow = 2000, ncol = 5)
data_15[1:1850, ] <- matrix(
  abs(rnorm(sum(sg) * 5, sd = 1.5)),
  nrow = sum(sg),
  ncol = 5
) + t_mat

set.seed(100)
# simulate outliers
data_15[1851:2000, ] <- matrix(
  runif(n = 150 * 5, min = 0, max = max(data_15, na.rm = TRUE)),
  nrow = 150,
  ncol = 5
)

### Run CrossClustering
cc_crossclustering(
  dist(data_15),
  k_w_min = 2,
  k_w_max = 19,
  k2_max = 20,
  out = TRUE
)

#### Correlation-based distance is often used in gene expression time-series
### data analysis. Here is an example, using the "complete" method.
data(nb_data)
nb_dist <- as.dist(1 - abs(cor(t(nb_data))))
cc_crossclustering(dist = nb_dist, k_w_max = 20, k2_max = 19)

#### method = "single"
### Example on a famous shape data set
### Two moons data
data(twomoons)

moons_dist <- twomoons[, 1:2] |>
  dist(method = "euclidean")

cc_moons <- cc_crossclustering(
  moons_dist,
  k_w_max = 9,
  k2_max = 10,
  method = 'single'
)

moons_col <- cc_get_cluster(cc_moons)
plot(
  twomoons[, 1:2],
  col = moons_col,
  pch = 19,
  xlab = "",
  ylab = "",
  main = "CrossClustering-Single"
)

### Worms data
data(worms)

worms_dist <- worms[, 1:2] |>
  dist(method = "euclidean")

cc_worms <- cc_crossclustering(
  worms_dist,
  k_w_max = 9,
  k2_max = 10,
  method = "single"
)

worms_col <- cc_get_cluster(cc_worms)
plot(
  worms[, 1:2],
  col = worms_col,
  pch = 19,
  xlab = "",
  ylab = "",
  main = "CrossClustering-Single"
)

### CrossClustering-Single is not affected by the chain-effect problem
data(chain_effect)

chain_dist <- chain_effect |>
  dist(method = "euclidean")

cc_chain <- cc_crossclustering(
  chain_dist,
  k_w_max = 9,
  k2_max = 10,
  method = "single"
)

chain_col <- cc_get_cluster(cc_chain)
plot(
  chain_effect,
  col = chain_col,
  pch = 19,
  xlab = "",
  ylab = "",
  main = "CrossClustering-Single"
)
Provides the vector of cluster IDs to which each element belongs.
cc_get_cluster(x, n_elem)

## Default S3 method:
cc_get_cluster(x, n_elem)

## S3 method for class 'crossclustering'
cc_get_cluster(x, n_elem)
x |
list of clustered elements or a |
n_elem |
total number of elements clustered (ignored if x
is of class |
An integer vector of clusters to which the elements belong (1 for the outliers, ID + 1 for the others).
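The encoding described above (1 for outliers, cluster ID + 1 otherwise) can be sketched in base R as follows. `membership_from_clusters` is a hypothetical helper written for this illustration, and the list of clusters is made up; the package's own method may be implemented differently.

```r
# Hypothetical helper: turn a list of clusters (each holding element indices)
# into a membership vector, using 1 for elements that belong to no cluster.
membership_from_clusters <- function(cluster_list, n_elem) {
  membership <- rep(1L, n_elem)                # start with everything as outlier
  for (id in seq_along(cluster_list)) {
    membership[cluster_list[[id]]] <- id + 1L  # cluster ID shifted by one
  }
  membership
}

# Made-up input: 7 elements, two clusters, elements 4 and 6 left unclustered.
clusters <- list(c(1, 2, 5), c(3, 7))
membership_from_clusters(clusters, n_elem = 7)
```

Starting the cluster codes at 2 keeps code 1 free for outliers, which is convenient when the vector is passed directly as a `col` argument to `plot()`.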
cc_get_cluster(default): default method for cc_get_cluster.
cc_get_cluster(crossclustering): automatically extract inputs from a crossclustering object
Paola Tellaroli, <paola dot tellaroli at unipd dot it>;
Marco Bazzi, <bazzi at stat dot unipd dot it>;
Michele Donato, <mdonato at stanford dot edu>.
Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2016). Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLoS ONE 11(3): e0152333. doi:10.1371/journal.pone.0152333
library(CrossClustering)

data(toy)

### toy is transposed as we want to cluster samples (columns of the
### original matrix)
toy_dist <- t(toy) |>
  dist(method = "euclidean")

### Run CrossClustering
toyres <- cc_crossclustering(
  toy_dist,
  k_w_min = 2,
  k_w_max = 5,
  k2_max = 6,
  out = TRUE
)

### cc_get_cluster (default method, on the list of clustered elements)
cc_get_cluster(toyres[[2]], 7)

### cc_get_cluster directly from a crossclustering object
cc_get_cluster(toyres)
Tests the null hypothesis of random agreement (i.e., adjusted Rand Index equal to 0) between two partitions.
cc_test_ari(ground_truth, partition)
ground_truth |
(int) A vector of the actual membership of elements in clusters |
partition |
The partition coming from a clustering algorithm |
A list with six elements:
Rand |
the Rand Index |
ExpectedRand |
expected value of Rand Index |
AdjustedRand |
Adjusted Rand Index |
var_ari |
variance of Rand Index |
nari |
nari |
p-value |
the p-value of the test |
Paola Tellaroli, <paola dot tellaroli at unipd dot it>;
Philippe Courcoux, <philippe dot courcoux at oniris-nantes dot fr>
E.M. Qannari, P. Courcoux and P. Faye (2014) Significance test of the adjusted Rand index. Application to the free sorting task, Food Quality and Preference, 32, 93-97
L. Hubert and P. Arabie (1985) Comparing partitions, Journal of Classification, 2, 193-218.
library(CrossClustering)

clusters <- iris[-5] |>
  dist() |>
  hclust(method = 'ward.D') |>
  cutree(k = 3)

ground_truth <- iris[[5]] |>
  as.numeric()

cc_test_ari(ground_truth, clusters)
A permutation test of the null hypothesis of random agreement (i.e., adjusted Rand Index equal to 0) between two partitions.
cc_test_ari_permutation(ground_truth, partition)
ground_truth |
(int) A vector of the actual membership of elements in clusters |
partition |
The partition coming from a clustering algorithm |
A data frame with two columns:
ari |
the adjusted Rand Index |
p_value |
the p-value of the test |
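The idea of a permutation test for random agreement can be sketched in base R as below. This is a generic illustration under stated assumptions, not the package's procedure: `ari_from_labels`, `perm_test_ari`, the default `n_perm = 999`, and the one-sided p-value with the usual +1 correction are all choices made for this example, and the method of Samuh, Leisch and Finos (2014) may differ in its details.

```r
# Hypothetical helper: Hubert-Arabie adjusted Rand index from two label vectors.
ari_from_labels <- function(a, b) {
  tab <- table(a, b)
  sum_cells <- sum(choose(tab, 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(length(a), 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Hypothetical helper: shuffle one partition to break any association with the
# other, and compare the observed ARI with the shuffled ones.
perm_test_ari <- function(ground_truth, partition, n_perm = 999) {
  observed <- ari_from_labels(ground_truth, partition)
  permuted <- replicate(
    n_perm,
    ari_from_labels(ground_truth, sample(partition))
  )
  # One-sided p-value; the observed value counts as one of the permutations.
  p_value <- (1 + sum(permuted >= observed)) / (n_perm + 1)
  data.frame(ari = observed, p_value = p_value)
}

set.seed(42)  # for reproducible shuffles
clusters <- cutree(hclust(dist(iris[-5]), method = "ward.D"), k = 3)
ground_truth <- as.numeric(iris[[5]])
res <- perm_test_ari(ground_truth, clusters)
res
```

The smallest p-value this sketch can return is `1 / (n_perm + 1)`, so `n_perm` bounds the resolution of the test.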
Paola Tellaroli, <paola dot tellaroli at unipd dot it>;
Livio Finos, <livio dot finos at unipd dot it>
Samuh M. H., Leisch F., and Finos L. (2014), Tests for Random Agreement in Cluster Analysis, Statistica Applicata-Italian Journal of Applied Statistics, vol. 26, no. 3, pp. 219-234.
L. Hubert and P. Arabie (1985) Comparing partitions, Journal of Classification, 2, 193-218.
library(CrossClustering)

clusters <- iris[-5] |>
  dist() |>
  hclust(method = 'ward.D') |>
  cutree(k = 3)

ground_truth <- iris[[5]] |>
  as.numeric()

cc_test_ari_permutation(ground_truth, clusters)
A toy dataset for illustrating the chain effect.
chain_effect
A data frame with 28 rows and 2 variables:
X
num
x coordinates 0 is negative.
Y
num
y coordinates.
Computes the consensus between Ward's minimum variance and Complete-linkage (or Single-linkage) algorithms (i.e., the number of elements classified together by both algorithms).
consensus_cluster(k, cluster_ward, cluster_other)
k |
(int) a vector containing the number of clusters for Ward and for Complete-linkage (or Single-linkage) algorithms, respectively |
cluster_ward |
an object of class hclust for the Ward algorithm |
cluster_other |
an object of class hclust for the Complete-linkage (or Single-linkage) algorithm |
An object of class consensus_cluster with the following elements:
elements |
list of the elements belonging to each cluster |
a_star |
contingency table of the clustering |
max_consensus |
maximum clustering consensus |
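As a rough illustration of what "elements classified together by both algorithms" can mean, the base R sketch below cuts two hierarchical trees, cross-tabulates the memberships, and greedily matches clusters by largest overlap, using each row and column at most once. This greedy matching is only one plausible reading, stated here as an assumption; `greedy_consensus` is a hypothetical helper, and the package's consensus_cluster may compute its result differently (see the cited paper).

```r
# Hypothetical helper: greedy one-to-one matching of clusters by overlap size.
greedy_consensus <- function(k, cluster_ward, cluster_other) {
  labels_ward  <- cutree(cluster_ward,  k = k[1])
  labels_other <- cutree(cluster_other, k = k[2])
  tab <- table(labels_ward, labels_other)  # overlap counts between the partitions

  remaining <- unclass(tab)
  total <- 0
  while (length(remaining) > 0 && max(remaining) > 0) {
    # Pick the largest remaining overlap, then drop its row and column.
    pos <- which(remaining == max(remaining), arr.ind = TRUE)[1, ]
    total <- total + remaining[pos[1], pos[2]]
    remaining <- remaining[-pos[1], -pos[2], drop = FALSE]
  }
  list(table = tab, consensus = total)
}

d <- dist(iris[1:30, -5])
tree_ward  <- hclust(d, method = "ward.D")
tree_other <- hclust(d, method = "complete")
res <- greedy_consensus(c(3, 4), tree_ward, tree_other)
res$table
res$consensus
```

Elements that fall outside the matched cells are the ones the two algorithms disagree on, which is the intuition behind treating them as candidates for outliers.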
Paola Tellaroli, <paola dot tellaroli at unipd dot it>;
Marco Bazzi, <bazzi at stat dot unipd dot it>;
Michele Donato, <mdonato at stanford dot edu>.
Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2016). Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLoS ONE 11(3): e0152333. doi:10.1371/journal.pone.0152333
library(CrossClustering)

data(toy)

### toy is transposed as we want to cluster samples (columns of the
### original matrix)
toy_dist <- t(toy) |>
  dist(method = "euclidean")

### Hierarchical clustering
cluster_ward <- toy_dist |>
  hclust(method = "ward.D")

cluster_other <- toy_dist |>
  hclust(method = "complete")

### consensus_cluster
consensus_cluster(
  c(3, 4),
  cluster_ward,
  cluster_other
)
Checks whether a single given number is 0.
is_zero(num)
num |
a numerical vector of length one |
A logical value: TRUE if num is 0, FALSE otherwise.
is_zero(1)
is_zero(0)
nb_data contains a subset of a bigger normalized negative binomial simulated dataset.
nb_data
A data frame with 100 observations on 36 numeric variables.
This dataset is part of a larger simulated and normalized dataset with 2 experimental groups, 6 time-points and 3 replicates. Simulation has been done by using a negative binomial distribution. The first 20 genes are simulated with changes among time.
Data included in the Bioconductor package maSigPro.
https://doi.org/doi:10.18129/B9.bioc.maSigPro
Given a diagonal matrix in which every diagonal entry after the first zero (if any) is also zero, the function returns the diagonal (sub-)matrix without the rows and columns corresponding to the zero entries in the diagonal (if any).
prune_zero_tail(diag_mat)
diag_mat |
a diagonal matrix which must satisfy the following property: in the diagonal, every element after a zero is a zero. |
A diagonal matrix without zeros in the diagonal, made of the leading rows and columns of the original matrix, i.e. those with a non-zero diagonal entry.
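The pruning described above can be sketched in a few lines of base R. `prune_tail_sketch` is a hypothetical stand-in written for this illustration, not the package's implementation; it drops every row and column with a zero diagonal entry, which coincides with removing the zero tail when the input satisfies the stated property.

```r
# Hypothetical helper: keep only rows/columns whose diagonal entry is non-zero.
prune_tail_sketch <- function(diag_mat) {
  keep <- diag(diag_mat) != 0
  diag_mat[keep, keep, drop = FALSE]  # drop = FALSE keeps a matrix even for 1 x 1
}

diag_mat <- diag(c(1, 2, 3, 0, 0, 0, 0))
pruned <- prune_tail_sketch(diag_mat)
dim(pruned)
```

Using `drop = FALSE` matters here: without it, a result with a single non-zero entry would collapse from a matrix to a plain number.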
diag_mat <- diag(c(1, 2, 3, 0, 0, 0, 0))
prune_zero_tail(diag_mat)
Reverses the process of creating a contingency table.
reverse_table(x)
x |
a contingency table |
A list of 2 vectors corresponding to the unrolled table.
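To show what "unrolling" a contingency table means, the base R sketch below expands each cell into as many label pairs as its count, so that tabulating the two resulting vectors gives back the original counts. `unroll_table` is a hypothetical helper for illustration only; the package function may return the vectors in a different order or type (this sketch returns the labels as character vectors).

```r
# Hypothetical helper: repeat each (row label, column label) pair Freq times.
unroll_table <- function(x) {
  cells <- as.data.frame(x, stringsAsFactors = FALSE)   # one row per cell, with Freq
  idx <- rep(seq_len(nrow(cells)), times = cells$Freq)  # repeat rows by their count
  list(cells[[1]][idx], cells[[2]][idx])
}

# Made-up partitions of six elements.
a <- c(1, 1, 2, 2, 2, 3)
b <- c(1, 2, 2, 2, 1, 1)
tab <- table(a, b)

unrolled <- unroll_table(tab)
rebuilt <- table(unrolled[[1]], unrolled[[2]])
rebuilt
```

The original ordering of the elements cannot be recovered from a table, so only the counts, not the positions, are guaranteed to round-trip.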
clust_1 <- iris[, 1:4] |>
  dist() |>
  hclust() |>
  cutree(k = 3)

clust_2 <- iris[, 1:4] |>
  dist() |>
  hclust() |>
  cutree(k = 4)

cont_table <- table(clust_1, clust_2)
reverse_table(cont_table)
A toy example matrix
toy
A matrix of 10 rows and 7 columns
A famous shape data set containing two clusters with two moons shapes and outliers
twomoons
A data frame with 52 rows and 3 variables:
x
num
x coordinates
y
num
y coordinates.
clusters
integer
cluster membership (outliers classified as 3rd cluster).
A famous shape data set containing two clusters with two worms shapes and outliers
worms
A data frame with 87 rows and 3 variables:
x
num
x coordinates
y
num
y coordinates.
cluster
integer
cluster membership (outliers classified as 3rd cluster).