Title: | Linear Optimal Low-Rank Projection |
---|---|
Description: | Supervised learning techniques designed for the situation when the dimensionality exceeds the sample size have a tendency to overfit as the dimensionality of the data increases. To remedy this high-dimensionality, low sample size (HDLSS) situation, we attempt to learn a lower-dimensional representation of the data before learning a classifier. That is, we project the data to a space of more manageable dimensionality, where standard classification or clustering techniques can be better applied since there are fewer dimensions to overfit. A number of previous works have focused on how to strategically reduce dimensionality in the unsupervised case, yet in the supervised HDLSS regime, few works have attempted to devise dimensionality reduction techniques that leverage the labels associated with the data. In this package and the associated manuscript Vogelstein et al. (2017) <arXiv:1709.01233>, we provide several methods for feature extraction, some utilizing labels and some not, along with easily extensible utilities to simplify cross-validative efforts to identify the best feature extraction method. Additionally, we include a series of adaptable benchmark simulations to serve as a standard for future investigative efforts into supervised HDLSS. Finally, we produce a comprehensive comparison of the included algorithms across a range of benchmark simulations and real data applications. |
Authors: | Eric Bridgeford [aut, cre], Minh Tang [ctb], Jason Yim [ctb], Joshua Vogelstein [ths] |
Maintainer: | Eric Bridgeford <[email protected]> |
License: | GPL-2 |
Version: | 2.1 |
Built: | 2025-02-09 05:01:37 UTC |
Source: | https://github.com/neurodata/lol |
A function that trains a classifier based on the nearest centroid.
lol.classify.nearestCentroid(X, Y, ...)
X |
|
Y |
|
... |
optional args. |
A list of class nearestCentroid
, with the following attributes:
centroids |
|
ylabs |
|
priors |
|
For more details see the help vignette:
vignette("centroid", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.nearestCentroid(X, Y)
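The fitted model can then be used for classification. The sketch below assumes the package's S3 `predict` method for `nearestCentroid` objects follows the standard R prediction interface mentioned above; the in-sample accuracy computation is illustrative:

```r
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)
X <- data$X; Y <- data$Y
model <- lol.classify.nearestCentroid(X, Y)
# predict() is assumed to follow the standard S3 interface,
# assigning each sample to the class of its nearest centroid
Yhat <- predict(model, X)
mean(Yhat == Y)  # in-sample accuracy
```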
A function for random classifiers.
lol.classify.rand(X, Y, ...)
X |
|
Y |
|
... |
optional args. |
A structure, with the following attributes:
ylabs |
|
priors |
|
Eric Bridgeford
A function that predicts the maximally present class in the dataset. Functionality consistent with the standard R prediction interface so that one can compute the "chance" accuracy with minimal modification of other classification scripts.
lol.classify.randomChance(X, Y, ...)
X |
|
Y |
|
... |
optional args. |
A list of class randomGuess
, with the following attributes:
ylabs |
|
priors |
|
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomChance(X, Y)
A function that predicts by randomly guessing based on the pmf of the class priors. Functionality consistent with the standard R prediction interface so that one can compute the "guess" accuracy with minimal modification of other classification scripts.
lol.classify.randomGuess(X, Y, ...)
X |
|
Y |
|
... |
optional args. |
A list of class randomGuess
, with the following attributes:
ylabs |
|
priors |
|
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomGuess(X, Y)
A function that embeds points in high dimensions to a lower dimensionality.
lol.embed(X, A, ...)
X |
|
A |
|
... |
optional args. |
an array [n, r]
the original n
points embedded into r
dimensions.
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lol(X=X, Y=Y, r=5)  # use lol to project into 5 dimensions
Xr <- lol.embed(X, model$A)
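Because lol.embed simply applies the learned projection matrix to new points, it can embed held-out samples that were not used to fit the projection. A sketch of that out-of-sample use (the 75/25 split here is illustrative):

```r
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)
X <- data$X; Y <- data$Y
train <- sample(nrow(X), 150)  # illustrative 75/25 train/test split
# fit the projection on the training samples only
model <- lol.project.lol(X=X[train, ], Y=Y[train], r=5)
# embed the held-out samples with the trained projection matrix
Xr.test <- lol.embed(X[-train, ], model$A)
dim(Xr.test)  # 50 held-out points in 5 dimensions
```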
A function for recovering the Bayes Optimal Projection, which optimizes Bayes classification.
lol.project.bayes_optimal(X, Y, mus, Sigmas, priors, ...)
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
... |
optional args. |
A list of class embedding
containing the following:
A |
|
d |
the eigen values associated with the eigendecomposition. |
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
# obtain bayes-optimal projection of the data
model <- lol.project.bayes_optimal(X=X, Y=Y, mus=data$mus, S=data$Sigmas,
                                   priors=data$priors)
A function for implementing the Maximal Data Piling (MDP) Algorithm.
lol.project.dp(X, Y, ...)
X |
|
Y |
|
... |
optional args. |
A list containing the following:
A |
|
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
For more details see the help vignette:
vignette("dp", package = "lolR")
Minh Tang and Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.dp(X=X, Y=Y)  # use mdp to project into maximal data piling
A function for implementing the Linear Optimal Low-Rank Projection (LOL) Algorithm. This algorithm allows users to find an optimal projection from 'd' to 'r' dimensions, where 'r << d', by combining information from the first and second moments of the data.
lol.project.lol(X, Y, r, second.moment.xfm = FALSE, second.moment.xfm.opts = list(),
                first.moment = "delta", second.moment = "linear", orthogonalize = FALSE,
                robust.first = TRUE, robust.second = FALSE, ...)
X |
|
Y |
|
r |
the rank of the projection. Note that |
second.moment.xfm |
whether to use extraneous options in estimation of the second moment component. The transforms specified should be a numbered list of transforms you wish to apply, and will be applied in accordance with |
second.moment.xfm.opts |
optional arguments to pass to the |
first.moment |
the function to capture the first moment. Defaults to
|
second.moment |
the function to capture the second moment. Defaults to
|
orthogonalize |
whether to orthogonalize the projection matrix. Defaults to |
robust.first |
whether to perform PCA on a robust estimate of the first moment component or not. A robust estimate corresponds to usage of medians. Defaults to |
robust.second |
whether to perform PCA on a robust estimate of the second moment component or not. A robust estimate corresponds to usage of a robust covariance matrix, which requires |
... |
trailing args. |
A list containing the following:
A |
|
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
second.moment |
the method used to estimate the second moment. |
first.moment |
the method used to estimate the first moment. |
For more details see the help vignette:
vignette("lol", package = "lolR")
Eric Bridgeford
Joshua T. Vogelstein et al., "Supervised Dimensionality Reduction for Big Data," arXiv (2020).
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lol(X=X, Y=Y, r=5)  # use lol to project into 5 dimensions

# use lol to project into 5 dimensions, and produce an orthogonal basis
# for the projection matrix
model <- lol.project.lol(X=X, Y=Y, r=5, orthogonalize=TRUE)

# use LRQDA to estimate the second moment by performing PCA on each class
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='quadratic')

# use PLS to estimate the second moment
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='pls')

# use LRLDA to estimate the second moment, and apply a unit transformation
# (according to scale function) with no centering
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='linear',
                         second.moment.xfm='unit',
                         second.moment.xfm.opts=list(center=FALSE))
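A common follow-on step is to classify in the embedded space. A sketch using MASS::lda on the embedded data returned in the model's Xr attribute (MASS is assumed to be installed; any classifier accepting a numeric matrix would work equally well):

```r
library(lolR)
library(MASS)  # assumed available; provides lda()
data <- lol.sims.rtrunk(n=200, d=30)
X <- data$X; Y <- data$Y
model <- lol.project.lol(X=X, Y=Y, r=5)
# fit LDA on the 5-dimensional embedded training data
fit <- lda(model$Xr, grouping=Y)
Yhat <- predict(fit, model$Xr)$class
mean(Yhat == Y)  # in-sample accuracy
```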
A function for implementing the Low-rank Canonical Correlation Analysis (LR-CCA) Algorithm.
lol.project.lrcca(X, Y, r, ...)
X |
[n, d] the data with |
Y |
[n] the labels of the samples with |
r |
the rank of the projection. |
... |
trailing args. |
A list containing the following:
A |
|
d |
the eigen values associated with the eigendecomposition. |
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
For more details see the help vignette:
vignette("lrcca", package = "lolR")
Eric Bridgeford and Minh Tang
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lrcca(X=X, Y=Y, r=5)  # use lrcca to project into 5 dimensions
A function that performs LRLDA on the class-centered data. Same as class-conditional PCA.
lol.project.lrlda(X, Y, r, xfm = FALSE, xfm.opts = list(), robust = FALSE, ...)
X |
|
Y |
|
r |
the rank of the projection. |
xfm |
whether to transform the variables before taking the SVD.
|
xfm.opts |
optional arguments to pass to the |
robust |
whether to use a robust estimate of the covariance matrix when taking PCA. Defaults to |
... |
trailing args. |
A list containing the following:
A |
|
d |
the eigen values associated with the eigendecomposition. |
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
For more details see the help vignette:
vignette("lrlda", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lrlda(X=X, Y=Y, r=2)  # use lrlda to project into 2 dimensions
A function that performs PCA on data.
lol.project.pca(X, r, xfm = FALSE, xfm.opts = list(), robust = FALSE, ...)
X |
|
r |
the rank of the projection. |
xfm |
whether to transform the variables before taking the SVD.
|
xfm.opts |
optional arguments to pass to the |
robust |
whether to perform PCA on a robust estimate of the covariance matrix or not. Defaults to |
... |
trailing args. |
A list containing the following:
A |
|
d |
the eigen values associated with the eigendecomposition. |
Xr |
|
For more details see the help vignette:
vignette("pca", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.pca(X=X, r=2)  # use pca to project into 2 dimensions
A function for implementing the Partial Least-Squares (PLS) Algorithm.
lol.project.pls(X, Y, r, ...)
X |
[n, d] the data with |
Y |
[n] the labels of the samples with |
r |
the rank of the projection. |
... |
trailing args. |
A list containing the following:
A |
|
ylabs |
|
centroids |
|
priors |
|
Xr |
|
cr |
|
For more details see the help vignette:
vignette("pls", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.pls(X=X, Y=Y, r=5)  # use pls to project into 5 dimensions
A function for implementing gaussian random projections (rp).
lol.project.rp(X, r, scale = TRUE, ...)
X |
|
r |
the rank of the projection. Note that |
scale |
whether to scale the random projection by the sqrt(1/d). Defaults to |
... |
trailing args. |
A list containing the following:
A |
|
Xr |
|
For more details see the help vignette:
vignette("rp", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.rp(X=X, r=5)  # use rp to project into 5 dimensions
A simulation for the stacked cigar experiment.
lol.sims.cigar(n, d, rotate = FALSE, priors = NULL, a = 0.15, b = 4)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
a |
scalar for all of the mu1 but 2nd dimension. Defaults to |
b |
scalar for 2nd dimension value of mu2 and the 2nd variance term of S. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.cigar(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A simulation for the cross experiment, in which the two classes have orthogonal covariant dimensions and the same means.
lol.sims.cross(n, d, rotate = FALSE, priors = NULL, a = 1, b = 0.25, K = 2)
n |
the number of samples of simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
a |
scalar for the magnitude of the variance that is high within the particular class. Defaults to |
b |
scalar for the magnitude of the variance that is not high within the particular class. Defaults to |
K |
the number of classes. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.cross(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A function for simulating from 2 classes with differing means each with 2 sub-clusters, where one sub-cluster has a narrow tail and the other sub-cluster has a fat tail.
lol.sims.fat_tails(n, d, rotate = FALSE, f = 15, s0 = 10, rho = 0.2, t = 0.8, priors = NULL)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
f |
the fatness scaling of the tail. S2 = f*S1, where S1_ij = rho if i != j, and 1 if i == j. Defaults to |
s0 |
the number of dimensions with a difference in the means. s0 should be < d. Defaults to |
rho |
the scaling of the off-diagonal covariance terms, should be < 1. Defaults to |
t |
the fraction of each class from the narrower-tailed distribution. Defaults to |
priors |
the priors for each class. If |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.fat_tails(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A simulation for the multiclass hump experiment, in which each class has a unique hump which distinguishes its mean.
lol.sims.khump(n, d, rotate = FALSE, priors = NULL, b = 4, K = 4, var.dim = 100)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
b |
scalar for mu scaling. Default to |
K |
the number of classes. Should be an even number. Defaults to |
var.dim |
the variance for each dimension. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
robust |
If robust is not false, a list containing |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A simulation for the multiclass hump experiment, in which each class has a unique hump which distinguishes its mean.
lol.sims.kident(n, d, rotate = FALSE, priors = NULL, b = 4, K = 4, maxvar = 25)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
b |
scalar for mu scaling. Default to |
K |
the number of classes. Should be an even number. Defaults to |
maxvar |
the maximum covariance between the two classes. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
robust |
If robust is not false, a list containing |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A simulation for the multiclass trunk experiment, in which the maximal covariant dimensions are the reverse of the maximal mean differences.
lol.sims.ktrunk(n, d, rotate = FALSE, priors = NULL, b = 4, K = 4, maxvar = 100)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
b |
scalar for mu scaling. Default to |
K |
the number of classes. Should be an even number. Defaults to |
maxvar |
the maximum covariance between the two classes. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
robust |
If robust is not false, a list containing |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A function for simulating data in which a difference in the means is present only in a subset of dimensions, and equal covariance.
lol.sims.mean_diff(n, d, rotate = FALSE, priors = NULL, K = 2, md = 1, subset = c(1), offdiag = 0, s = 1)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
K |
the number of classes. Defaults to |
md |
the magnitude of the difference in the means in the specified subset of dimensions. Defaults to |
subset |
the dimensions to have a difference in the means. Defaults to only the first dimension. |
offdiag |
the off-diagonal elements of the covariance matrix. Should be < 1. |
s |
the scaling parameter of the covariance matrix. S_ij = scaling*1 if i == j, or scaling*offdiag if i != j. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.mean_diff(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A function for simulating data generalizing the Toeplitz setting, where each class has a different covariance matrix. This results in a Quadratic Discriminant.
lol.sims.qdtoep(n, d, rotate = FALSE, priors = NULL, D1 = 10, b = 0.4, rho = 0.5)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
D1 |
the dimensionality for the non-equal covariance terms. Defaults to |
b |
a scaling parameter for the means. Defaults to |
rho |
the scaling of the covariance terms, should be < 1. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.qdtoep(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A helper function for applying a random rotation to gaussian parameter set.
lol.sims.random_rotate(mus, Sigmas, Q = NULL)
mus |
means per class. |
Sigmas |
covariances per class. |
Q |
rotation to use, if any |
Eric Bridgeford
A simulation for the reversed random trunk experiment, in which the maximal covariant directions are the same as the directions with the maximal mean difference.
lol.sims.rev_rtrunk(n, d, robust = FALSE, rotate = FALSE, priors = NULL, b = 4, K = 2, maxvar = b^3, maxvar.outlier = maxvar^3)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
robust |
the number of outlier points to add, where outliers have opposite covariance of inliers. Defaults to |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
b |
scalar for mu scaling. Default to |
K |
number of classes, should be <4. Defaults to |
maxvar |
the maximum covariance between the two classes. Defaults to |
maxvar.outlier |
the maximum covariance for the outlier points. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
robust |
If robust is not false, a list containing |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A helper function for estimating a random rotation matrix.
lol.sims.rotation(d)
d |
dimensions to generate a rotation matrix for. |
the rotation matrix
Eric Bridgeford
A simulation for the random trunk experiment, in which the maximal covariant dimensions are the reverse of the maximal mean differences.
lol.sims.rtrunk(n, d, rotate = FALSE, priors = NULL, b = 4, K = 2, maxvar = 100)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
b |
scalar for mu scaling. Default to |
K |
number of classes, should be <=4. Defaults to |
maxvar |
the maximum covariance between the two classes. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
robust |
If robust is not false, a list containing |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A helper function for simulating from Gaussian Mixture.
lol.sims.sim_gmm(mus, Sigmas, n, priors)
mus |
|
Sigmas |
|
n |
the number of examples. |
priors |
|
A list with the following:
X |
|
Y |
|
priors |
|
Eric Bridgeford
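The expected argument shapes are not spelled out above; following the conventions used elsewhere in this package, mus is presumably a [d, K] matrix of class means and Sigmas a [d, d, K] array of class covariances. A sketch under that assumption:

```r
library(lolR)
d <- 3; K <- 2
# presumed [d, K] class means: class 1 at the origin, class 2 at (2, 2, 2)
mus <- cbind(rep(0, d), rep(2, d))
# presumed [d, d, K] covariances: identity for each class
Sigmas <- array(0, dim=c(d, d, K))
for (k in 1:K) Sigmas[, , k] <- diag(d)
sim <- lol.sims.sim_gmm(mus=mus, Sigmas=Sigmas, n=100, priors=c(0.5, 0.5))
dim(sim$X)   # the simulated data
table(sim$Y) # the simulated labels
```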
A function for simulating data in which the covariance is a non-symmetric toeplitz matrix.
lol.sims.toep(n, d, rotate = FALSE, priors = NULL, D1 = 10, b = 0.4, rho = 0.5)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
rotate |
whether to apply a random rotation to the mean and covariance. With random rotation matrix |
priors |
the priors for each class. If |
D1 |
the dimensionality for the non-equal covariance terms. Defaults to |
b |
a scaling parameter for the means. Defaults to |
rho |
the scaling of the covariance terms, should be < 1. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.toep(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A function to simulate from the 2-class xor problem.
lol.sims.xor2(n, d, priors = NULL, fall = 100)
n |
the number of samples of the simulated data. |
d |
the dimensionality of the simulated data. |
priors |
the priors for each class. If |
fall |
the falloff for the covariance structuring. Sigma declines by ndim/fall across the variance terms. Defaults to |
A list of class simulation
with the following:
X |
|
Y |
|
mus |
|
Sigmas |
|
priors |
|
simtype |
The name of the simulation. |
params |
Any extraneous parameters the simulation was created with. |
For more details see the help vignette:
vignette("sims", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.xor2(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
A utility that switches to irlba for the SVD when only a small fraction of the singular vectors is required.
lol.utils.decomp(X, xfm = FALSE, xfm.opts = list(), ncomp = 0, t = 0.05, robust = FALSE)
X |
the data to compute the svd of. |
xfm |
whether to transform the variables before taking the SVD.
|
xfm.opts |
optional arguments to pass to the |
ncomp |
the number of left singular vectors to retain. |
t |
the threshold of percent of singular vals/vecs to use irlba. |
robust |
whether to use a robust estimate of the covariance matrix when taking PCA. Defaults to |
the svd of X.
Eric Bridgeford
A function that performs a utility computation of information about the differences of the classes.
lol.utils.deltas(centroids, priors, ...)
centroids |
|
priors |
|
... |
optional args. |
deltas [d, K]
the K difference vectors.
Eric Bridgeford
A function that performs basic utilities about the data.
lol.utils.info(X, Y, robust = FALSE, ...)
X |
|
Y |
|
robust |
whether to perform PCA on a robust estimate of the covariance matrix or not. Defaults to FALSE. |
... |
optional args. |
n
the number of samples.
d
the number of dimensions.
ylabs [K]
vector containing the unique, ordered class labels.
priors [K]
vector containing prior probability for the unique, ordered classes.
Eric Bridgeford
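The original reference provides no example for this utility. A minimal usage sketch, assuming lolR is installed and that the documented values (n, d, ylabs, priors) are returned together as a named list:

```r
library(lolR)

# simulate a labeled dataset with n=100 samples in d=20 dimensions
data <- lol.sims.rtrunk(n=100, d=20)
info <- lol.utils.info(data$X, data$Y)

info$n       # number of samples
info$d       # number of dimensions
info$ylabs   # unique, ordered class labels
info$priors  # empirical prior probability of each class
```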
A function for one-hot encoding categorical response vectors.
lol.utils.ohe(Y)
Y |
[n] a vector of the categorical responses, with |
a list containing the following:
Yh |
[n, K] the one-hot encoded Y response variable. |
ylabs |
[K] a vector of the y names corresponding to each response column. |
Eric Bridgeford
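The original reference provides no example for this utility. A minimal usage sketch, assuming lolR is installed and that the return is a list with the Yh and ylabs fields described above:

```r
library(lolR)

# a small categorical response vector: n=6 samples, K=3 classes
Y <- c("a", "b", "a", "c", "b", "a")
enc <- lol.utils.ohe(Y)

dim(enc$Yh)  # [n, K] = 6 x 3: one indicator column per class
enc$ylabs    # the class label corresponding to each column of Yh
```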
A function for performing cross-validation (leave-one-out or k-fold) for a given embedding model. This function produces fold-wise
cross-validated misclassification rates for standard embedding techniques. Users can optionally specify custom embedding techniques
with proper configuration of alg.*
parameters and hyperparameters. Optional classifiers implementing the S3 predict
function can be used
for classification, with hyperparameters to classifiers for determining misclassification rate specified in classifier.*
parameters and hyperparameters.
lol.xval.eval( X, Y, r, alg, sets = NULL, alg.dimname = "r", alg.opts = list(), alg.embedding = "A", classifier = lda, classifier.opts = list(), classifier.return = "class", k = "loo", rank.low = FALSE, ... )
X |
|
Y |
|
r |
the number of embedding dimensions desired, where |
alg |
the algorithm to use for embedding. Should be a function that accepts inputs |
sets |
a user-defined cross-validation set. Defaults to
|
alg.dimname |
the name of the parameter accepted by |
alg.opts |
the hyper-parameter options you want to pass into your algorithm, as a keyworded list. Defaults to an empty list. |
alg.embedding |
the attribute returned by
|
classifier |
the classifier to use for assessing performance. The classifier should accept |
classifier.opts |
any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list. |
classifier.return |
if the return type is a list,
|
k |
the cross-validated method to perform. Defaults to
|
rank.low |
whether to force the training set to low-rank. Defaults to
|
... |
trailing args. |
Returns a list containing:
lhat |
the mean cross-validated error. |
model |
The model returned by |
classifier |
The classifier trained on all of the embedded data. |
lhats |
the cross-validated error for each of the |
For more details see the help vignette:
vignette("xval", package = "lolR")
For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette:
vignette("extend_embedding", package = "lolR")
For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette:
vignette("extend_classification", package = "lolR")
Eric Bridgeford
# train model and analyze with loo validation using lda classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
r <- 5  # embed into r=5 dimensions

# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, sets=sets)
A function for performing cross-validation (leave-one-out or k-fold) for a given embedding model that allows users to determine the optimal number of embedding dimensions for
their algorithm-of-choice. This function produces fold-wise cross-validated misclassification rates for standard embedding techniques across a specified selection of
embedding dimensions. Optimal embedding dimension is selected as the dimension with the lowest average misclassification rate across all folds.
Users can optionally specify custom embedding techniques with proper configuration of alg.*
parameters and hyperparameters.
Optional classifiers implementing the S3 predict
function can be used for classification, with hyperparameters to classifiers for
determining misclassification rate specified in classifier.*
parameters and hyperparameters.
lol.xval.optimal_dimselect( X, Y, rs, alg, sets = NULL, alg.dimname = "r", alg.opts = list(), alg.embedding = "A", alg.structured = TRUE, classifier = lda, classifier.opts = list(), classifier.return = "class", k = "loo", rank.low = FALSE, ... )
X |
|
Y |
|
rs |
|
alg |
the algorithm to use for embedding. Should be a function that accepts inputs |
sets |
a user-defined cross-validation set. Defaults to
|
alg.dimname |
the name of the parameter accepted by |
alg.opts |
the hyper-parameter options to pass to your algorithm as a keyworded list. Defaults to an empty list. |
alg.embedding |
the attribute returned by
|
alg.structured |
a boolean to indicate whether the embedding matrix is structured. Provides performance increase by not having to compute the embedding matrix
|
classifier |
the classifier to use for assessing performance. The classifier should accept |
classifier.opts |
any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list. |
classifier.return |
if the return type is a list,
|
k |
the cross-validated method to perform. Defaults to
|
rank.low |
whether to force the training set to low-rank. Defaults to
|
... |
trailing args. |
Returns a list containing:
folds.data |
the results, as a data-frame, of the per-fold classification accuracy. |
foldmeans.data |
the results, as a data-frame, of the average classification accuracy for each |
optimal.lhat |
the classification error of the optimal r. |
optimal.r |
the optimal number of embedding dimensions from rs. |
model |
the model trained on all of the data at the optimal number of embedding dimensions. |
classifier |
the classifier trained on all of the data at the optimal number of embedding dimensions. |
For more details see the help vignette:
vignette("xval", package = "lolR")
For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette:
vignette("extend_embedding", package = "lolR")
For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette:
vignette("extend_classification", package = "lolR")
Eric Bridgeford
# train model and analyze with loo validation using lda classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol,
                                       classifier=lol.classify.nearestCentroid,
                                       classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, sets=sets)
A function to split a dataset into training and testing sets for cross validation. The procedure for cross-validation
is to split the data into k-folds. The k-folds are then rotated individually to form a single held-out testing set the model will be validated on,
and the remaining (k-1) folds are used for training the developed model. Note that this cross-validation function includes functionality to be used for
low-rank cross-validation. In that case, instead of using the full (k-1) folds for training, we subset min((k-1)/k*n, d)
samples to ensure that
the resulting training sets are all low-rank. We still rotate properly over the held-out fold to ensure that the resulting testing sets
do not have any shared examples, which would add a complicated dependence structure to any inference we attempt on the testing sets.
lol.xval.split(X, Y, k = "loo", rank.low = FALSE, ...)
X |
|
Y |
|
k |
the cross-validated method to perform. Defaults to
|
rank.low |
whether to force the training set to low-rank. Defaults to
|
... |
optional args. |
sets the cross-validation sets as an object of class "XV"
containing the following:
train |
length |
test |
length |
Eric Bridgeford
# prepare data for 10-fold validation
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
sets.xval.10fold <- lol.xval.split(X, Y, k=10)

# prepare data for loo validation
sets.xval.loo <- lol.xval.split(X, Y, k='loo')
A function that predicts the class of points based on the nearest centroid.
## S3 method for class 'nearestCentroid' predict(object, X, ...)
object |
An object of class
|
X |
|
... |
optional args. |
Yhat [n]
the predicted class of each of the n data points in X.
For more details see the help vignette:
vignette("centroid", package = "lolR")
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.nearestCentroid(X, Y)
Yh <- predict(model, X)
A function that always predicts the most frequently occurring class in the training data. Functionality is consistent with the standard R prediction interface, so that one can compute the "chance" accuracy with minimal modification of other classification scripts.
## S3 method for class 'randomChance' predict(object, X, ...)
object |
An object of class
|
X |
|
... |
optional args. |
Yhat [n]
the predicted class of each of the n data points in X.
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomChance(X, Y)
Yh <- predict(model, X)
A function that predicts by guessing randomly according to the pmf of the class priors. Functionality is consistent with the standard R prediction interface, so that one can compute the "guess" accuracy with minimal modification of other classification scripts.
## S3 method for class 'randomGuess' predict(object, X, ...)
object |
An object of class
|
X |
|
... |
optional args. |
Yhat [n]
the predicted class of each of the n data points in X.
Eric Bridgeford
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomGuess(X, Y)
Yh <- predict(model, X)