Title: Diverse Cluster Ensemble in R
Description: Performs cluster analysis using an ensemble clustering framework (Chiu & Talhouk, 2018) <doi:10.1186/s12859-017-1996-y>. Results from a diverse set of algorithms are pooled together using methods such as majority voting, K-Modes, LinkCluE, and CSPA. There are options to compare cluster assignments across algorithms using internal and external indices, visualizations such as heatmaps, and significance testing for the existence of clusters.
Authors: Derek Chiu [aut, cre], Aline Talhouk [aut], Johnson Liu [ctb, com]
Maintainer: Derek Chiu <[email protected]>
License: MIT + file LICENSE
Version: 2.2.0
Built: 2024-11-17 04:56:58 UTC
Source: https://github.com/alinetalhouk/dicer
Compute the compactness validity index for a clustering result.
compactness(data, labels)
data: a dataset with rows as observations, columns as variables
labels: a vector of cluster labels from a clustering result
This index is agnostic to any reference clustering results, calculating cluster performance on the basis of compactness and separability. Smaller values indicate a better clustering structure.
the compactness score
Derek Chiu
MATLAB function valid_compactness by Simon Garrett in LinkCluE
set.seed(1)
E <- matrix(rep(sample(1:4, 1000, replace = TRUE)), nrow = 100, byrow = FALSE)
set.seed(1)
dat <- as.data.frame(matrix(runif(1000, -10, 10), nrow = 100, byrow = FALSE))
compactness(dat, E[, 1])
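As a quick check on the direction of the index ("smaller is better", per the Details), a minimal sketch on simulated data; the two-group design here is illustrative, not part of the package:

# Two well-separated groups: labels matching the structure should score lower
set.seed(1)
dat2 <- as.data.frame(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                            matrix(rnorm(100, mean = 10), ncol = 2)))
lab.good <- rep(1:2, each = 50)
lab.rand <- sample(1:2, 100, replace = TRUE)
compactness(dat2, lab.good)  # smaller: labels follow the structure
compactness(dat2, lab.rand)  # larger: ambiguous clustering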
Runs consensus clustering across subsamples of the data, clustering algorithms, and cluster sizes.
consensus_cluster(
  data,
  nk = 2:4,
  p.item = 0.8,
  reps = 1000,
  algorithms = NULL,
  nmf.method = c("brunet", "lee"),
  hc.method = "average",
  xdim = NULL,
  ydim = NULL,
  rlen = 200,
  alpha = c(0.05, 0.01),
  minPts = 5,
  distance = "euclidean",
  abs = TRUE,
  prep.data = c("none", "full", "sampled"),
  scale = TRUE,
  type = c("conventional", "robust", "tsne"),
  min.var = 1,
  progress = TRUE,
  seed.nmf = 123456,
  seed.data = 1,
  file.name = NULL,
  time.saved = FALSE
)
data: data matrix with rows as samples and columns as variables
nk: number of clusters (k) requested; can specify a single integer or a range of integers to compute multiple k
p.item: proportion of items to be used in subsampling within an algorithm
reps: number of subsamples
algorithms: vector of clustering algorithms for performing consensus clustering. Must be any number of the following: "nmf", "hc", "diana", "km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A custom clustering algorithm can be used.
nmf.method: specify NMF-based algorithms to run. By default the "brunet" and "lee" algorithms are called. See NMF::nmf() for details.
hc.method: agglomeration method for hierarchical clustering. The "average" method is used by default. See stats::hclust() for details.
xdim: x dimension of the SOM grid
ydim: y dimension of the SOM grid
rlen: the number of times the complete data set will be presented to the SOM network
alpha: SOM learning rate, a vector of two numbers indicating the amount of change. Default is to decline linearly from 0.05 to 0.01 over rlen updates.
minPts: minimum size of clusters for HDBSCAN. Default is 5.
distance: a vector of distance functions. Defaults to "euclidean". Other options are given in stats::dist(). A custom distance function can be used.
abs: only used for the "spearman" distance option
prep.data: prepare the data on the "full" dataset, the "sampled" dataset, or "none" (default)
scale: logical; should the data be centered and scaled?
type: if we use "conventional" measures (default), then the mean and standard deviation are used for centering and scaling, respectively. If "robust" measures are specified, the median and median absolute deviation (MAD) are used. Alternatively, we can apply "tsne" for dimension reduction.
min.var: minimum variability measure threshold used to filter the feature space for only highly variable features. Only features with a minimum variability measure across all samples greater than min.var will be used.
progress: logical; should a progress bar be displayed?
seed.nmf: random seed to use for NMF-based algorithms
seed.data: seed to use to ensure each algorithm operates on the same set of subsamples
file.name: if not NULL, the returned array will be saved at each iteration as well as at the end of the run, to an rds object with file.name as the file name
time.saved: logical; if TRUE, the date and time are appended to file.name
See examples for how to use custom algorithms and distance functions. The default clustering algorithms provided are:
"nmf": Nonnegative Matrix Factorization (using Kullback-Leibler Divergence or Euclidean distance; See Note for specifications.)
"hc": Hierarchical Clustering
"diana": DIvisive ANAlysis Clustering
"km": K-Means Clustering
"pam": Partition Around Medoids
"ap": Affinity Propagation
"sc": Spectral Clustering using Radial-Basis kernel function
"gmm": Gaussian Mixture Model using Bayesian Information Criterion on EM algorithm
"block": Biclustering using a latent block model
"som": Self-Organizing Map (SOM) with Hierarchical Clustering
"cmeans": Fuzzy C-Means Clustering
"hdbscan": Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN)
The progress bar increments on every unit of reps.
An array of dimension nrow(x) by reps by length(algorithms) by length(nk). Each cube of the array represents a different k. Each slice of a cube is a matrix showing consensus clustering results for algorithms. The matrices have a row for each sample, and a column for each subsample. Each entry represents a class membership.
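For orientation, a minimal sketch of navigating the returned array; numeric indexing is used here to avoid assumptions about the dimension names:

data(hgsc)
dat <- hgsc[1:100, 1:50]
cc <- consensus_cluster(dat, nk = 3:4, reps = 5, algorithms = c("hc", "pam"),
                        progress = FALSE)
dim(cc)          # samples x reps x algorithms x k
cc[1:5, 1, 2, 2] # first five samples, subsample 1, algorithm 2, k = 4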
When "hdbscan" is part of algorithms
, we do not include its clustering
array in the consensus result. Instead, we report two summary statistics as
attributes: the proportion of outliers and the number of clusters.
The nmf.method options are "brunet" (Kullback-Leibler divergence) and "lee" (Euclidean distance). When "hdbscan" is chosen as an algorithm to use, its results are excluded from the rest of the consensus clusters. This is because there is no guarantee that the cluster assignment will have every sample clustered; more often than not there will be noise points or outliers. In addition, the number of distinct clusters may not even be equal to nk.
Derek Chiu, Aline Talhouk
data(hgsc)
dat <- hgsc[1:100, 1:50]

# Custom distance function
manh <- function(x) {
  stats::dist(x, method = "manhattan")
}

# Custom clustering algorithm
agnes <- function(d, k) {
  return(as.integer(stats::cutree(cluster::agnes(d, diss = TRUE), k)))
}
assign("agnes", agnes, 1)

cc <- consensus_cluster(dat, reps = 6, algorithms = c("pam", "agnes"),
                        distance = c("euclidean", "manh"), progress = FALSE)
str(cc)
Combines results for multiple objects from consensus_cluster() and outputs either the consensus matrices or consensus classes for all algorithms.
consensus_combine(..., element = c("matrix", "class"))
...: any number of objects outputted from consensus_cluster()
element: either "matrix" or "class" to extract the consensus matrix or consensus class, respectively
This function is useful for collecting summaries because the original results from consensus_cluster are combined into a single object. For example, setting element = "class" returns a matrix of consensus cluster assignments, which can be visualized as a consensus matrix heatmap.
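A minimal sketch of that heatmap use; indexing the class element by k as a character string follows the sigclust() example later in this reference, and stats::heatmap() here is just one illustrative way to plot:

set.seed(911)
x <- matrix(rnorm(500), ncol = 10)
CC <- consensus_cluster(x, nk = 4, reps = 10, algorithms = "km", progress = FALSE)
cl.mat <- consensus_combine(CC, element = "class")
cm <- consensus_matrix(cl.mat$`4`)  # labels matrix -> consensus matrix
stats::heatmap(cm, symm = TRUE)     # visualize as a consensus matrix heatmap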
consensus_combine returns either a list of all consensus matrices or a data frame showing all the consensus classes.
Derek Chiu
# Consensus clustering for multiple algorithms
set.seed(911)
x <- matrix(rnorm(500), ncol = 10)
CC1 <- consensus_cluster(x, nk = 3:4, reps = 10, algorithms = "ap", progress = FALSE)
CC2 <- consensus_cluster(x, nk = 3:4, reps = 10, algorithms = "km", progress = FALSE)

# Combine and return either matrices or classes
y1 <- consensus_combine(CC1, CC2, element = "matrix")
str(y1)
y2 <- consensus_combine(CC1, CC2, element = "class")
str(y2)
Evaluates algorithms on internal/external validation indices. Poor performing algorithms can be trimmed from the ensemble. The remaining algorithms can be given weights before use in consensus functions.
consensus_evaluate(
  data,
  ...,
  cons.cl = NULL,
  ref.cl = NULL,
  k.method = NULL,
  plot = FALSE,
  trim = FALSE,
  reweigh = FALSE,
  n = 5,
  lower = 0,
  upper = 1
)
data: data matrix with rows as samples and columns as variables
...: any number of objects outputted from consensus_cluster()
cons.cl: matrix of cluster assignments from consensus functions such as k_modes() and majority_voting()
ref.cl: reference class
k.method: determines the method to choose k when no reference class is given. When NULL (default), the PAC is used to choose the optimal k.
plot: logical; if TRUE, graph_all() is called and all plots are displayed
trim: logical; if TRUE, algorithms that score low on internal indices are trimmed from the ensemble
reweigh: logical; if TRUE, the algorithms that remain after trimming are reweighed based on their internal index performance
n: an integer specifying the top n algorithms to keep after trimming
lower: the lower bound that determines what is ambiguous
upper: the upper bound that determines what is ambiguous
This function always returns internal indices. If ref.cl is not NULL, external indices are additionally shown. Relevant graphical displays are also outputted. Algorithms are ranked across internal indices using Rank Aggregation. Only the top n algorithms are kept; the rest are trimmed.
consensus_evaluate returns a list with the following elements:
k: if ref.cl is not NULL, this is the number of distinct classes in the reference; otherwise the chosen k is the one giving the smallest mean PAC across algorithms
pac: a data frame showing the PAC for each combination of algorithm and cluster size
ii: a list of data frames for all k showing internal evaluation indices
ei: a data frame showing external evaluation indices for k
trim.obj: a list with the following elements:
  alg.keep: algorithms kept
  alg.remove: algorithms removed
  rank.matrix: a matrix of ranked algorithms for every internal evaluation index
  top.list: final order of ranked algorithms
  E.new: a new version of a consensus_cluster data object
# Consensus clustering for multiple algorithms
set.seed(911)
x <- matrix(rnorm(500), ncol = 10)
CC <- consensus_cluster(x, nk = 3:4, reps = 10, algorithms = c("ap", "km"),
                        progress = FALSE)

# Evaluate algorithms on internal/external indices and trim algorithms:
# remove those ranking low on internal indices
set.seed(1)
ref.cl <- sample(1:4, 50, replace = TRUE)
z <- consensus_evaluate(x, CC, ref.cl = ref.cl, n = 1, trim = TRUE)
str(z, max.level = 2)
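A brief follow-up to the example, inspecting the trimming output using the element names documented above:

z$trim.obj$alg.keep    # algorithms kept after trimming
z$trim.obj$alg.remove  # algorithms removed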
Returns the (weighted) consensus matrix given a data matrix
consensus_matrix(data, weights = NULL)
data: data matrix with rows as samples, columns as replicates
weights: a vector of weights for each algorithm used in meta-consensus clustering. Must have length equal to the number of columns of data.
Given a vector of cluster assignments, we first calculate the connectivity matrix and indicator matrix. A connectivity matrix has a 1 if both samples are in the same cluster, and 0 otherwise. An indicator matrix has a 1 if both samples were selected to be used in a subsample of a consensus clustering algorithm, and 0 otherwise. Summation of connectivity matrices and indicator matrices is performed over different subsamples of the data. The consensus matrix is calculated by dividing the aggregated connectivity matrices by the aggregated indicator matrices.
If a meta-consensus matrix is desired, where consensus classes of different clustering algorithms are aggregated, we can construct a weighted meta-consensus matrix using weights.
a consensus matrix
When consensus is calculated over bootstrap samples, not every sample is used in each replication. Thus, there will be scenarios where two samples are never chosen together in any bootstrap sample. This typically happens when the number of replications is small. The coordinate in the consensus matrix for such pairs of samples is NaN from a 0 / 0 computation. These entries are coerced to 0.
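A minimal sketch of the computation described above, written out by hand on a small label matrix where NA marks unsampled cases; this is illustrative only, and consensus_matrix() is the supported interface:

set.seed(3)
x <- replicate(8, ifelse(runif(20) < 0.8, sample(1:3, 20, replace = TRUE), NA))
conn <- ind <- matrix(0, nrow(x), nrow(x))
for (r in seq_len(ncol(x))) {
  obs <- !is.na(x[, r])
  ind <- ind + outer(obs, obs)         # 1 when both samples were subsampled
  same <- outer(x[, r], x[, r], `==`)  # TRUE when both fall in the same cluster
  same[is.na(same)] <- FALSE           # pairs involving unsampled cases
  conn <- conn + same
}
cm <- conn / ind
cm[is.nan(cm)] <- 0                    # 0/0: pairs never co-sampled, coerced to 0
range(cm - consensus_matrix(x))        # expected ~0 if this mirrors the implementation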
Derek Chiu
set.seed(2)
x <- replicate(100, rbinom(100, 4, 0.2))
w <- rexp(100)
w <- w / sum(w)
cm1 <- consensus_matrix(x)
cm2 <- consensus_matrix(x, weights = w)
Performs hierarchical clustering on a stack of consensus matrices to obtain consensus class labels.
CSPA(E, k)
E: an array of clustering results
k: number of clusters
cluster assignments for the consensus class
Derek Chiu
Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec), 583-617.
Other consensus functions: LCA(), LCE(), k_modes(), majority_voting()
data(hgsc)
dat <- hgsc[1:100, 1:50]
x <- consensus_cluster(dat, nk = 4, reps = 4, algorithms = c("hc", "diana"),
                       progress = FALSE)
CSPA(x, k = 4)
Runs consensus clustering across subsamples, algorithms, and number of clusters (k).
dice(
  data,
  nk,
  p.item = 0.8,
  reps = 10,
  algorithms = NULL,
  k.method = NULL,
  nmf.method = c("brunet", "lee"),
  hc.method = "average",
  distance = "euclidean",
  cons.funs = c("kmodes", "majority", "CSPA", "LCE", "LCA"),
  sim.mat = c("cts", "srs", "asrs"),
  prep.data = c("none", "full", "sampled"),
  min.var = 1,
  seed = 1,
  seed.data = 1,
  trim = FALSE,
  reweigh = FALSE,
  n = 5,
  evaluate = TRUE,
  plot = FALSE,
  ref.cl = NULL,
  progress = TRUE
)
data: data matrix with rows as samples and columns as variables
nk: number of clusters (k) requested; can specify a single integer or a range of integers to compute multiple k
p.item: proportion of items to be used in subsampling within an algorithm
reps: number of subsamples
algorithms: vector of clustering algorithms for performing consensus clustering. Must be any number of the following: "nmf", "hc", "diana", "km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A custom clustering algorithm can be used.
k.method: determines the method to choose k when no reference class is given. When NULL (default), the PAC is used to choose the optimal k.
nmf.method: specify NMF-based algorithms to run. By default the "brunet" and "lee" algorithms are called. See NMF::nmf() for details.
hc.method: agglomeration method for hierarchical clustering. The "average" method is used by default. See stats::hclust() for details.
distance: a vector of distance functions. Defaults to "euclidean". Other options are given in stats::dist(). A custom distance function can be used.
cons.funs: consensus functions to use. Current options are "kmodes" (k-modes), "majority" (majority voting), "CSPA" (Cluster-based Similarity Partitioning Algorithm), "LCE" (linkage clustering ensemble), "LCA" (latent class analysis)
sim.mat: similarity matrix; choices are "cts", "srs", "asrs"
prep.data: prepare the data on the "full" dataset, the "sampled" dataset, or "none" (default)
min.var: minimum variability measure threshold used to filter the feature space for only highly variable features. Only features with a minimum variability measure across all samples greater than min.var will be used.
seed: random seed for knn imputation reproducibility
seed.data: seed to use to ensure each algorithm operates on the same set of subsamples
trim: logical; if TRUE, algorithms that score low on internal indices are trimmed from the ensemble
reweigh: logical; if TRUE, the algorithms that remain after trimming are reweighed based on their internal index performance
n: an integer specifying the top n algorithms to keep after trimming
evaluate: logical; if TRUE, validity indices are computed and returned
plot: logical; if TRUE, graph_all() is called and all plots are displayed
ref.cl: reference class
progress: logical; should a progress bar be displayed?
There are three ways to handle the input data before clustering via the argument prep.data. The default is to use the raw data as-is ("none"). Or, we can enact prepare_data() on the full dataset ("full"), or on the bootstrap sampled datasets ("sampled").
A list with the following elements:
E: raw clustering ensemble object
Eknn: clustering ensemble object with knn imputation used on E
Ecomp: flattened ensemble object with remaining missing entries imputed by majority voting
clusters: final clustering assignment from the diverse clustering ensemble method
indices: if evaluate = TRUE, shows cluster evaluation indices; otherwise NULL
Aline Talhouk, Derek Chiu
library(dplyr)
data(hgsc)
dat <- hgsc[1:100, 1:50]
ref.cl <- strsplit(rownames(dat), "_") %>%
  purrr::map_chr(2) %>%
  factor() %>%
  as.integer()
dice.obj <- dice(dat, nk = 4, reps = 5, algorithms = "hc", cons.funs = "kmodes",
                 ref.cl = ref.cl, progress = FALSE)
str(dice.obj, max.level = 2)
External validity indices compare a predicted clustering result with a reference class or gold standard.
ev_nmi(pred.lab, ref.lab, method = "emp")
ev_confmat(pred.lab, ref.lab)
pred.lab: predicted labels generated by classifier
ref.lab: reference labels for the observations
method: method of computing the entropy. Can be any one of "emp", "mm", "shrink", or "sg".
ev_nmi calculates the normalized mutual information. ev_confmat calculates a variety of statistics associated with confusion matrices. Accuracy, Cohen's kappa, and Matthews correlation coefficient have direct multiclass definitions, whereas all other metrics use macro-averaging.
ev_nmi returns the normalized mutual information.
ev_confmat returns a tibble of the following summary statistics computed using yardstick::summary.conf_mat():
accuracy: Accuracy
kap: Cohen's kappa
sens: Sensitivity
spec: Specificity
ppv: Positive predictive value
npv: Negative predictive value
mcc: Matthews correlation coefficient
j_index: Youden's J statistic
bal_accuracy: Balanced accuracy
detection_prevalence: Detection prevalence
precision: alias for ppv
recall: alias for sens
f_meas: F measure
ev_nmi is adapted from infotheo::mutinformation().
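A minimal sketch of pulling out a few of the metrics listed above; the .metric and .estimate columns follow yardstick's summary output:

set.seed(1)
x <- sample(1:4, 100, replace = TRUE)
y <- sample(1:4, 100, replace = TRUE)
cm <- ev_confmat(x, y)
subset(cm, .metric %in% c("accuracy", "kap", "mcc"))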
Johnson Liu, Derek Chiu
Strehl, A., & Ghosh, J. (2002). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec), 583-617.
set.seed(1)
x <- sample(1:4, 100, replace = TRUE)
y <- sample(1:4, 100, replace = TRUE)
ev_nmi(x, y)
ev_confmat(x, y)
Plots cumulative distribution function (CDF) curves, relative change in area under the CDF curve, heatmaps, and cluster assignment tracking plots.
graph_cdf(mat)
graph_delta_area(mat)
graph_heatmap(mat, main = NULL)
graph_tracking(cl)
graph_all(x)
mat: same as x
main: heatmap title. If NULL (default), the titles are taken from the names in mat.
cl: same as x
x: an object from consensus_cluster()
graph_cdf plots the CDF for consensus matrices from different algorithms. graph_delta_area calculates the relative change in area under the CDF curve between algorithms. graph_heatmap generates consensus matrix heatmaps for each algorithm in x. graph_tracking tracks how cluster assignments change between algorithms. graph_all is a wrapper that runs all graphing functions.
Various plots from the graph_*() functions. All plots are generated using ggplot, except for graph_heatmap, which uses NMF::aheatmap(). Colours used in graph_heatmap and graph_tracking utilize RColorBrewer::brewer.pal() palettes.
Derek Chiu
https://stackoverflow.com/questions/4954507/calculate-the-area-under-a-curve
# Consensus clustering for 3 algorithms
library(ggplot2)
set.seed(911)
x <- matrix(rnorm(80), ncol = 10)
CC1 <- consensus_cluster(x, nk = 2:4, reps = 3, algorithms = c("hc", "pam", "km"),
                         progress = FALSE)

# Plot CDF
p <- graph_cdf(CC1)

# Change y label and add colours
p + labs(y = "Probability") + stat_ecdf(aes(colour = k)) +
  scale_color_brewer(palette = "Set2")

# Delta Area
p <- graph_delta_area(CC1)

# Heatmaps with column side colours corresponding to clusters
CC2 <- consensus_cluster(x, nk = 3, reps = 3, algorithms = "hc", progress = FALSE)
graph_heatmap(CC2)

# Track how cluster assignments change between algorithms
p <- graph_tracking(CC1)
There are 489 samples measured on 321 genes. Sample IDs are in the row names and gene names are in the column names. This data set is used for clustering HGSC into subtypes with prognostic significance. The cluster assignments obtained by TCGA are indicated by the last six characters of each row name in hgsc: MES.C1, IMM.C2, DIF.C4, and PRO.C5.
hgsc
A data frame with 489 rows and 321 columns.
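A minimal sketch of recovering the TCGA subtype labels from the row names, using the suffix format described above; the same parsing idea appears in the dice() example:

data(hgsc)
subtype <- sub(".*_", "", rownames(hgsc))  # keep the six-character suffix
table(subtype)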
The non-missing cases indicate the training set, and missing cases indicate the test set.
impute_knn(x, data, seed = 123456)
x: clustering object
data: data matrix
seed: random seed for knn imputation reproducibility
An object with (potentially not all) missing values imputed with K-Nearest Neighbours.
We consider 5 nearest neighbours and the minimum vote for definite decision is 3.
Aline Talhouk
Other imputation functions: impute_missing()
data(hgsc)
dat <- hgsc[1:100, 1:50]
x <- consensus_cluster(dat, nk = 4, reps = 4, algorithms = c("km", "hc", "diana"),
                       progress = FALSE)
x <- apply(x, 2:4, impute_knn, data = dat, seed = 1)
Impute missing values from bootstrapped subsampling
impute_missing(E, data, nk)
E: 4D array of clusterings from consensus_cluster()
data: data matrix with samples as rows and genes/features as columns
nk: cluster size to extract data for (single value)
The default output from consensus_cluster will undoubtedly contain NA entries because each replicate chooses a random subset (with replacement) of all samples. Missing values should first be imputed using impute_knn(). Not all missing values are guaranteed to be imputed by KNN; see class::knn() for details. Thus, any remaining missing values are imputed using majority voting.
If the flattened matrix consists of more than one repetition, i.e. it isn't a column vector, then the function returns a matrix of clusterings with complete cases imputed using majority voting, and relabelled, for the chosen k.
Aline Talhouk
Other imputation functions: impute_knn()
data(hgsc)
dat <- hgsc[1:100, 1:50]
E <- consensus_cluster(dat, nk = 3:4, reps = 10, algorithms = c("hc", "km", "sc"),
                       progress = FALSE)
sum(is.na(E))
E_imputed <- impute_missing(E, dat, 4)
sum(is.na(E_imputed))
Combine clustering results using K-modes.
k_modes(E, is.relabelled = TRUE, seed = 1)
E: an array of clusterings with rows as the cases to be clustered, columns as the clusterings obtained from different resamplings of the data, and the third dimension as the different algorithms; the input may already be two-dimensional
is.relabelled: logical; if FALSE, the data will be relabelled first
seed: random seed for reproducibility
Combine clustering results generated using different algorithms and different data perturbations by k-modes. This method is the categorical data analog of k-means clustering. Complete cases are needed, i.e. no NAs. If the matrix contains NAs, those are imputed by majority voting (after class relabelling).
a vector of cluster assignments based on k-modes
Aline Talhouk
Luo, H., Kong, F., & Li, Y. (2006, August). Combining multiple clusterings via k-modes algorithm. In International Conference on Advanced Data Mining and Applications (pp. 308-315). Springer, Berlin, Heidelberg.
Other consensus functions: CSPA(), LCA(), LCE(), majority_voting()
data(hgsc)
dat <- hgsc[1:100, 1:50]
cc <- consensus_cluster(dat, nk = 4, reps = 6, algorithms = "pam", progress = FALSE)
table(k_modes(cc[, , 1, 1, drop = FALSE], is.relabelled = FALSE))
Combine clustering results using latent class analysis.
LCA(E, is.relabelled = TRUE, seed = 1)
E: an array of clusterings with rows as the cases to be clustered, columns as the clusterings obtained from different resamplings of the data, and the third dimension as the different algorithms; the input may already be two-dimensional
is.relabelled: logical; if FALSE, the data will be relabelled first
seed: random seed for reproducibility
a vector of cluster assignments based on LCA
Derek Chiu
Other consensus functions: CSPA(), LCE(), k_modes(), majority_voting()
data(hgsc)
dat <- hgsc[1:100, 1:50]
cc <- consensus_cluster(dat, nk = 4, reps = 6, algorithms = "pam", progress = FALSE)
table(LCA(cc[, , 1, 1, drop = FALSE], is.relabelled = FALSE))
Generate a cluster assignment from a CTS, SRS, or ASRS similarity matrix.
LCE(E, k, dc = 0.8, R = 10, sim.mat = c("cts", "srs", "asrs"))
E: an array of clustering results. An error is thrown if there are missing values; impute_missing() can be used beforehand.
k: requested number of clusters
dc: decay constant for the CTS, SRS, or ASRS matrix
R: number of repetitions for the SRS matrix
sim.mat: similarity matrix; choices are "cts", "srs", "asrs"
a vector containing the cluster assignment from either the CTS, SRS, or ASRS similarity matrices
Johnson Liu
Other consensus functions: CSPA(), LCA(), k_modes(), majority_voting()
data(hgsc)
dat <- hgsc[1:100, 1:50]
x <- consensus_cluster(dat, nk = 4, reps = 4, algorithms = c("km", "hc"),
                       progress = FALSE)
## Not run:
LCE(E = x, k = 4, sim.mat = "asrs")
## End(Not run)
x <- apply(x, 2:4, impute_knn, data = dat, seed = 1)
x_imputed <- impute_missing(x, dat, nk = 4)
LCE(E = x_imputed, k = 4, sim.mat = "cts")
Combine clustering results using majority voting.
majority_voting(E, is.relabelled = TRUE)
E: an array of clusterings with rows as the cases to be clustered, columns as the clusterings obtained from different resamplings of the data, and the third dimension as the different algorithms; the input may already be two-dimensional
is.relabelled: logical; if FALSE, the data will be relabelled first
Combine clustering results generated using different algorithms and different data perturbations by majority voting. The class of a sample is the cluster label which was selected most often across algorithms and subsamples.
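A minimal sketch of the voting rule on a toy label matrix; majority_voting() additionally handles relabelling, which is skipped here:

E <- cbind(c(1, 1, 2, 2), c(1, 1, 2, 1), c(1, 2, 2, 2))
# Per sample (row), pick the most frequent label across columns
apply(E, 1, function(r) as.integer(names(which.max(table(r)))))
# 1 1 2 2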
a vector of cluster assignments based on majority voting
Aline Talhouk
Other consensus functions: CSPA(), LCA(), LCE(), k_modes()
data(hgsc)
dat <- hgsc[1:100, 1:50]
cc <- consensus_cluster(dat, nk = 4, reps = 6, algorithms = "pam", progress = FALSE)
table(majority_voting(cc[, , 1, 1, drop = FALSE], is.relabelled = FALSE))
Finds a permutation of a matrix such that its Frobenius norm with another matrix is minimized.
min_fnorm(A, B = diag(nrow(A)))
A: data matrix we want to permute
B: matrix whose distance with the permuted A we want to minimize. By default, B is the identity matrix with the same dimension as A.
Finds the permutation P of A such that ||PA - B|| is minimized in the Frobenius norm. Uses the linear sum assignment problem (LSAP) solver in the package clue. The default B is the identity matrix of the same dimension, so that the permutation of A maximizes its trace. This procedure is useful for constructing a confusion matrix when we don't know the true class labels of a predicted class and want to compare to a reference class.
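A minimal sketch of that use case, permuting a confusion matrix so that matched classes land on the diagonal; the data here are illustrative:

set.seed(2)
pred <- sample(1:3, 60, replace = TRUE)
ref <- sample(1:3, 60, replace = TRUE)
cm <- unclass(table(pred, ref))  # square confusion matrix
min_fnorm(cm)                    # rows permuted to maximize the trace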
Permuted matrix such that it is the permutation of A closest to B
Ravi Varadhan: https://stat.ethz.ch/pipermail/r-help/2010-April/236664.html
set.seed(1)
A <- matrix(sample(1:25, size = 25, replace = FALSE), 5, 5)
min_fnorm(A)
Given a consensus matrix, returns the proportion of ambiguous clusters (PAC). This is a robust way to assess clustering performance.
PAC(cm, lower = 0, upper = 1)
cm: consensus matrix. Should be symmetric with values between 0 and 1.
lower: the lower bound that determines what is ambiguous
upper: the upper bound that determines what is ambiguous
Since a consensus matrix is symmetric, we only look at its lower (or upper) triangular matrix. The proportion of entries strictly between lower and upper is the PAC. In a perfect clustering, the consensus matrix would consist of only 0s and 1s, and the PAC assessed on the (0, 1) interval would have a perfect score of 0. Using a (0.1, 0.9) interval for defining ambiguity is common as well.
The PAC is not, strictly speaking, an internal validity index. Originally used to choose the optimal number of clusters, here we use it to assess cluster stability. However, PAC is still agnostic to any gold standard clustering result, so we use it like an internal validity index.
the PAC score. Lower is better, because we want minimal ambiguity amongst the consensus.
Derek Chiu
Senbabaoglu, Y., Michailidis, G., & Li, J. Z. (2014). Critical limitations of consensus clustering in class discovery. Scientific reports, 4.
set.seed(1)
x <- replicate(100, rbinom(100, 4, 0.2))
y <- consensus_matrix(x)
PAC(y, lower = 0.05, upper = 0.95)
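As a check on the Details, a minimal sketch recomputing the PAC by hand from the lower triangle, reusing y from the example; the strict inequalities follow the definition above:

pac_manual <- function(cm, lower = 0, upper = 1) {
  v <- cm[lower.tri(cm)]
  mean(v > lower & v < upper)  # proportion strictly between the bounds
}
pac_manual(y, 0.05, 0.95)  # should agree with PAC(y, 0.05, 0.95)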
Using a principal component constructed from the sample space, we simulate null distributions with univariate Normal distributions using pcn_simulate. Then a subset of these distributions is chosen using pcn_select.
pcn_simulate(data, n.sim = 50)
pcn_select(data.sim, cl, type = c("rep", "range"), int = 5)
data: data matrix with rows as samples, columns as features
n.sim: the number of null datasets to simulate
data.sim: an object from pcn_simulate()
cl: vector of cluster memberships
type: select either the representative dataset ("rep") or a range of datasets ("range")
int: every int-th dataset is selected when type = "range"
pcn_simulate returns a list of length n.sim. Each element is a simulated matrix using this "Principal Component Normal" (pcn) procedure.
pcn_select returns a list with elements:
ranks: when type = "range", the ranks of each extracted dataset
ind: index of the representative simulation
dat: the simulated dataset representative of all in the pcn_simulate output
Derek Chiu
set.seed(9)
A <- matrix(rnorm(300), nrow = 20)
pc.dat <- pcn_simulate(A, n.sim = 50)
cl <- sample(1:4, 20, replace = TRUE)
pc.select <- pcn_select(pc.dat, cl, "rep")
Perform feature selection or dimension reduction to remove noise variables.
prepare_data(
  data,
  scale = TRUE,
  type = c("conventional", "robust", "tsne"),
  min.var = 1
)
data: data matrix with rows as samples and columns as variables
scale: logical; should the data be centered and scaled?
type: if we use "conventional" measures (default), then the mean and standard deviation are used for centering and scaling, respectively. If "robust" measures are specified, the median and median absolute deviation (MAD) are used. Alternatively, we can apply "tsne" for dimension reduction.
min.var: minimum variability measure threshold used to filter the feature space for only highly variable features. Only features with a minimum variability measure across all samples greater than min.var will be used.
We can apply a basic filtering method of feature selection that removes variables with low signal and (optionally) scales before consensus clustering. Or, we can use t-SNE dimension reduction to transform the data to just two variables. This lower-dimensional embedding allows algorithms such as hierarchical clustering to achieve greater performance.
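A minimal sketch of the robust option combined with the variability filter; the parameter values are illustrative:

set.seed(2)
x <- replicate(10, rnorm(100))
# Robust centering/scaling; features with MAD at or below 0.5 would be dropped
x.rob <- prepare_data(x, type = "robust", min.var = 0.5)
dim(x.rob)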
dataset prepared for usage in consensus_cluster
Derek Chiu
set.seed(2)
x <- replicate(10, rnorm(100))
x.prep <- prepare_data(x)
dim(x)
dim(x.prep)
Relabel clustering categories to match to a standard by minimizing the Frobenius norm between the two labels.
relabel_class(pred.cl, ref.cl)
pred.cl: vector of predicted cluster assignments
ref.cl: vector of reference labels to match to
A vector of relabeled cluster assignments
Aline Talhouk
set.seed(2)
pred <- sample(1:4, 100, replace = TRUE)
true <- sample(1:4, 100, replace = TRUE)
relabel_class(pred, true)
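A brief follow-up: comparing cross-tabulations before and after relabelling shows the diagonal agreement improving (or at worst staying the same):

table(pred, true)                       # before relabelling
table(relabel_class(pred, true), true)  # after: heavier diagonal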
Uses the SigClust K-Means algorithm to assess significance of clustering results.
sigclust(x, k, nsim, nrep = 1, labflag = 0, label = 0, icovest = 2)
x: data matrix, samples are rows and features are columns
k: cluster size to test against
nsim: number of simulations
nrep: see sigclust::sigclust() for details
labflag: see sigclust::sigclust() for details
label: true class label. See sigclust::sigclust() for details.
icovest: type of covariance matrix estimation
This function is a wrapper for the original sigclust::sigclust(), except that an additional parameter k allows testing against any number of clusters. In addition, the default type of covariance estimation is also different.
An object of class sigclust. See sigclust::sigclust() for details.
Hanwen Huang: [email protected]; Yufeng Liu: [email protected]; J. S. Marron: [email protected]
Liu, Y., Hayes, D. N., Nobel, A., & Marron, J. S. (2008). Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association, 103(483), 1281-1293.
data(hgsc)
dat <- hgsc[1:100, 1:50]
nk <- 4
cc <- consensus_cluster(dat, nk = nk, reps = 5, algorithms = "pam", progress = FALSE)
cl.mat <- consensus_combine(cc, element = "class")
lab <- cl.mat$`4`[, 1]
set.seed(1)
str(sigclust(x = dat, k = nk, nsim = 50, labflag = 1, label = lab))
cts computes the connected triple based similarity matrix, srs computes the simrank based similarity matrix, and asrs computes the approximated simrank based similarity matrix.
cts(E, dc)
srs(E, dc, R)
asrs(E, dc)
E: an N by M matrix of cluster ensembles
dc: decay factor, ranges from 0 to 1 inclusive
R: number of iterations for the SRS algorithm
an N by N CTS, SRS, or ASRS matrix
Johnson Liu, Derek Chiu
MATLAB functions cts, srs, asrs in package LinkCluE by Simon Garrett
set.seed(1)
E <- matrix(rep(sample(1:4, 800, replace = TRUE)), nrow = 100)
CTS <- cts(E = E, dc = 0.8)
SRS <- srs(E = E, dc = 0.8, R = 3)
ASRS <- asrs(E = E, dc = 0.8)
purrr::walk(list(CTS, SRS, ASRS), str)