This page shows examples of conditional knockoffs for the multivariate Gaussian model in Section 3.1 of the accompanying paper, which allows for an arbitrary mean vector and covariance matrix.

```
library("lattice")
library(knockoff)
source('src/knockoff_measure.R')
source('src/util.R')
source('src/cknock_ldg.R')
```

In this scenario we assume that the covariates are drawn from a multivariate Gaussian distribution with unknown mean and unknown covariance matrix. We will separately consider the case where the number of observations \(n\) exceeds twice the dimension (\(n>2p\)) and the case where \(n<2p\) but unlabeled data are available.

We first set up the experiment with \(n>2p\).

```
set.seed(2019)
# Problem parameters
p = 400  # number of covariates
n = 3*p  # number of observations
k = 60   # number of non-zero regression coefficients
A = 5    # signal amplitude
nonzero = sample(p, k)
beta = (1:p %in% nonzero) * sample(c(-1, 1), p, replace = TRUE)
# Generate the covariates from a multivariate normal distribution
rho = 0.3
Sigma = rho^abs(outer(1:p, 1:p, '-'))  # AR(1)-type covariance matrix
Sigma.chol = chol(Sigma)
# Generate the covariates
X = matrix(rnorm(n * p), n) %*% Sigma.chol
# Generate the response from a linear model
Y = X %*% beta * A / sqrt(n) + rnorm(n)
```

The conditional knockoffs can be generated by calling the function `cknockoff.ldg`, where the `method` option determines how \(\mathbf{s}\) is computed; the value `mix` is the one used in the simulations of Section 3.1.2 and discussed in Appendix B.1.1.

```
Xk.cond = cknockoff.ldg(X, method = 'mix')
```

We can look at the histogram of the original-knockoff correlations for all the \(X_j\) to make sure they are not too high.

```
cor.ok = diag(cor(X, Xk.cond))
hist(cor.ok, main = '', xlab = 'Original-Knockoff Correlation')
```

Once we have generated conditional knockoffs, all the usual knockoffs machinery applies.

```
fdr.level = 0.2
knockoff.stat.fun = mod.stat.lasso_coefdiff
filter.cond = knockoff.filter(X, Y,
                              knockoffs = function(x) Xk.cond,
                              statistic = knockoff.stat.fun,
                              fdr = fdr.level, offset = 1)
# false discovery and false negative proportions
c(fp(filter.cond$selected, beta),
  fn(filter.cond$selected, beta))
```

`## [1] 0.1538462 0.9166667`
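The helpers `fp` and `fn` are loaded from `src/util.R`. As a rough guide to reading the output above, here is a minimal sketch of what such helpers plausibly compute; the exact conventions are an assumption on our part, not taken from the source files.

```
# Hypothetical sketch of the fp/fn helpers (assumed behavior, not the
# actual src/util.R code):
# fp = false discovery proportion among the selected variables
# fn = fraction of true signals that were missed
fp = function(selected, beta) {
  if (length(selected) == 0) return(0)
  mean(beta[selected] == 0)
}
fn = function(selected, beta) {
  signal = which(beta != 0)
  mean(!(signal %in% selected))
}
```

Under this reading, the first number is the realized false discovery proportion (controlled at the nominal level in expectation) and the second is the fraction of the \(k\) signals left undetected.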

As a post-hoc diagnostic, we can look at the distribution of the \(W\) statistics.

```
coloring = rep(1, p); coloring[nonzero] = 2  # true non-zero locations of beta in red
# Plot the statistics used by the knockoff filter
plot(filter.cond$statistic, ylab = 'Relative Importance Measure (W)',
     pch = 19, col = coloring, cex = 0.5)
abline(h = filter.cond$threshold, col = 'blue', lty = 3)  # selection threshold
Ws = filter.cond$statistic[-nonzero]; Ws = Ws[Ws != 0]
hist(Ws, main = '', xlab = 'Non-zero W for Null Variables')
```
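Because original and knockoff variables are exchangeable for null features, each null \(W_j\) is symmetric about zero, so the signs of the non-zero null \(W\)'s should behave like fair coin flips. A quick numerical check on the `Ws` vector computed above (this check is our addition, not part of the original analysis):

```
# Fraction of positive signs among the non-zero null W statistics;
# under exchangeability this should be close to 1/2.
mean(Ws > 0)
# A formal sign test against the fair-coin null:
binom.test(sum(Ws > 0), length(Ws), p = 0.5)$p.value
```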

In this second example, we set up an experiment with a small sample size (\(n<2p\)) but with additional unlabeled data, so that the total number of covariate rows \(n+n_u\) exceeds \(2p\).

```
# Problem parameters
n = 300    # number of labeled observations
n.u = 2*p  # number of unlabeled observations
# Generate labeled and unlabeled covariates
X = matrix(rnorm(n * p), n) %*% Sigma.chol
X.u = matrix(rnorm(n.u * p), n.u) %*% Sigma.chol
# Generate the response from a linear model (labeled data only)
Y = X %*% beta * A / sqrt(n) + rnorm(n)
```

We then generate low-dimensional conditional knockoffs for all the covariates but only keep the rows corresponding to the labeled data.

```
Xk.star = cknockoff.ldg(rbind(X, X.u), method = 'mix')
Xk.cond = Xk.star[1:n, ]
```

```
filter.cond = knockoff.filter(X, Y,
                              knockoffs = function(x) Xk.cond,
                              statistic = knockoff.stat.fun,
                              fdr = fdr.level, offset = 1)
# false discovery and false negative proportions
c(fp(filter.cond$selected, beta),
  fn(filter.cond$selected, beta))
```

`## [1] 0.1707317 0.5666667`