--- title: "CTMLE Vignette" author: "Cheng Ju" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{CTMLE Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Installation To install the CRAN release version of `ctmle`: ```R install.packages('ctmle') ``` To install the development version (requires the devtools package): ```R devtools::install_github('jucheng1992/ctmle') ``` # Collaborative Targeted Maximum Likelihood Estimation In this package, we implemented the general template of C-TMLE, for estimation of average additive treatment effect (ATE). The package also offers the functions for discrete C-TMLE, which could be used for variable selection, and C-TMLE for model selection of LASSO. ## C-TMLE for variable selection In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searhcing, and scalable discrete C-TMLE with pre-ordering option. ```{r,eval=TRUE} library(ctmle) library(dplyr) set.seed(123) N <- 1000 p = 5 Wmat <- matrix(rnorm(N * p), ncol = p) beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5] beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5] tau <- 2 gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1) W <- as.matrix(Wmat) g <- 1/(1+exp(W%*%gcoef /3)) A <- rbinom(N, 1, prob = g) epsilon <-rnorm(N, 0, 1) Y <- beta0 + tau * A + epsilon # With initial estimate of Q Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N)) time_greedy <- system.time( ctmle_discrete_fit1 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q, preOrder = FALSE, detailed = TRUE) ) ctmle_discrete_fit2 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), preOrder = FALSE, detailed = TRUE) time_preorder <- system.time( ctmle_discrete_fit3 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q, preOrder = TRUE, order = rev(1:p), detailed = TRUE) ) ``` Scalable (discrete) C-TMLE takes much less computation time: ```{r,eval = TRUE} time_greedy time_preorder ``` Show the brief results from greedy CTMLE: ```{r,eval = TRUE} ctmle_discrete_fit1 ``` Summary function offers detial information of which variable is selected. ```{r,eval = TRUE} summary(ctmle_discrete_fit1) ``` ## C-TMLE LASSO for model selection of LASSO In this section, we introduce the C-TMLE algorithms for model selection of LASSO in the estimation of propensity core, and for simplicity we call them LASSO C-TMLE algorithm. We have three variacions of C-TMLE LASSO algorithms, see technical details in the corresponding references. ```{r,eval = TRUE} # Generate high-dimensional data set.seed(123) N <- 1000 p = 100 Wmat <- matrix(rnorm(N * p), ncol = p) beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]+2*Wmat[,6]+2*Wmat[,8] beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]+2*Wmat[,6]+2*Wmat[,8] tau <- 2 gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1) W <- as.matrix(Wmat) g <- 1/(1+exp(W%*%gcoef /3)) A <- rbinom(N, 1, prob = g) epsilon <-rnorm(N, 0, 1) Y <- beta0 + tau * A + epsilon # With initial estimate of Q Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N)) glmnet_fit <- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20) ``` We suggest start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter. ```{r} lambdas <-glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)] ``` We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order: ```{r} time_ctmlelasso1 <- system.time( ctmle_fit1 <- ctmleGlmnet(Y = Y, A = A, W = data.frame(W = W), Q = Q, lambdas = lambdas, ctmletype=1, family="gaussian",gbound=0.025, V=5) ) ``` We fit C-TMLE2 algorithm ```{r} time_ctmlelasso2 <- system.time( ctmle_fit2 <- ctmleGlmnet(Y = Y, A = A, W = data.frame(W = W), Q = Q, lambdas = lambdas, ctmletype=2, family="gaussian",gbound=0.025, V=5) ) ``` For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda: ```{r} gcv <- stats::predict(glmnet_fit, newx=W, s="lambda.min",type="response") gcv <- bound(gcv,c(0.025,0.975)) s_prev <- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2) gcvPrev <- stats::predict(glmnet_fit,newx = W,s = s_prev,type="response") gcvPrev <- bound(gcvPrev,c(0.025,0.975)) time_ctmlelasso3 <- system.time( ctmle_fit3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q, ctmletype=3, g1W = gcv, g1WPrev = gcvPrev, family="gaussian", gbound=0.025, V = 5) ) ``` Les't compare the running time for each LASSO-C-TMLE ```{r,eval = TRUE} time_ctmlelasso1 time_ctmlelasso2 time_ctmlelasso3 ``` Finally, we compared three C-TMLE estimates: ```{r,eval = TRUE} ctmle_fit1 ctmle_fit2 ctmle_fit3 ``` Show which regularization parameter (lambda) is selected by C-TMLE1: ```{r,eval = TRUE} lambdas[ctmle_fit1$best_k] ``` In comparison, show which regularization parameter (lambda) is selected by cv.glmnet: ```{r,eval = TRUE} glmnet_fit$lambda.min ``` ## Advanced topic: the general template of C-TMLE In this section, we briefly introduce the general template of C-TMLE. In this function, the gn candidates could be a user-specified matrix, each column stand for the estimated PS for each unit. The estimators should be ordered by their empirical fit. As C-TMLE requires cross-validation, it needs two gn estimate: one from cross-validated prediction, one from a vanilla prediction. For example, consider 5-folds cross-validation, where argument `folds` is the list of indices for each folds, then the (i,j)-th element in input `gn_candidates_cv` should be the predicted value of i-th unit, predicted by j-th unit, trained by other 4 folds where all of them do not contain i-th unit. `gn_candidates` should be just the predicted PS for each estimator trained on the whole data. We could easily use `SuperLearner` package and `build_gn_seq` function to easily achieve this: ```{r,eval = TRUE} lasso_fit <- cv.glmnet(x = as.matrix(W), y = A, alpha = 1, nlambda = 100, nfolds = 10) lasso_lambdas <- lasso_fit$lambda[lasso_fit$lambda <= lasso_fit$lambda.min][1:5] # Build SL template for glmnet SL.glmnet_new <- function(Y, X, newX, family, obsWeights, id, alpha = 1, nlambda = 100, lambda = 0,...){ # browser() if (!is.matrix(X)) { X <- model.matrix(~-1 + ., X) newX <- model.matrix(~-1 + ., newX) } fit <- glmnet::glmnet(x = X, y = Y, lambda = lambda, family = family$family, alpha = alpha) pred <- predict(fit, newx = newX, type = "response") fit <- list(object = fit) class(fit) <- "SL.glmnet" out <- list(pred = pred, fit = fit) return(out) } # Use a sequence of estimator to build gn sequence: SL.cv1lasso <- function (... , alpha = 1, lambda = lasso_lambdas[1]){ SL.glmnet_new(... , alpha = alpha, lambda = lambda) } SL.cv2lasso <- function (... , alpha = 1, lambda = lasso_lambdas[2]){ SL.glmnet_new(... , alpha = alpha, lambda = lambda) } SL.cv3lasso <- function (... , alpha = 1, lambda = lasso_lambdas[3]){ SL.glmnet_new(... , alpha = alpha, lambda = lambda) } SL.cv4lasso <- function (... , alpha = 1, lambda = lasso_lambdas[4]){ SL.glmnet_new(... , alpha = alpha, lambda = lambda) } SL.library = c('SL.cv1lasso', 'SL.cv2lasso', 'SL.cv3lasso', 'SL.cv4lasso', 'SL.glm') ``` Construct the object `folds`, which is a list of indices for each fold ```{r} V = 5 folds <-by(sample(1:N,N), rep(1:V, length=N), list) ``` Use `folds` and SuperLearner template to compute `gn_candidates` and `gn_candidates_cv` ```{r} gn_seq <- build_gn_seq(A = A, W = W, SL.library = SL.library, folds = folds) ``` Lets look at the output of `build_gn_seq` ```{r,eval = TRUE} gn_seq$gn_candidates %>% dim gn_seq$gn_candidates_cv %>% dim gn_seq$folds %>% length ``` Then we could use `ctmleGeneral` algorithm. As input estimator is already trained, it is much faster than previous C-TMLE algorithms. *Note: we recommand use the same `folds` as `build_gn_seq` for `ctmleGeneral`, to make cross-validation objective.* ```{r,eval = TRUE} ctmle_general_fit1 <- ctmleGeneral(Y = Y, A = A, W = W, Q = Q, ctmletype = 1, gn_candidates = gn_seq$gn_candidates, gn_candidates_cv = gn_seq$gn_candidates_cv, folds = folds, V = 5) ctmle_general_fit1 ``` ## Citation If you used `ctmle` package in your research, please cite: >Ju, Cheng; Susan, Gruber; van der Laan, Mark J.; ctmle: Variable and Model Selection for Causal Inference with Collaborative Targeted Maximum Likelihood Estimation ## References ### C-TMLE LASSO and C-TMLE for Model Selection >Ju, Cheng; Benkeser, David; van der Laan, Mark; "Robust inference on the average treatment effect using the outcome highly adaptive lasso", Biometrics, https://doi.org/10.1111/biom.13121 #### Scalable Discrete C-TMLE with Pre-ordering >Ju, Cheng; Gruber, Susan; Lendle. S. D.; et al. Scalable collaborative targeted learning for high-dimensional data. Statistical methods in medical research, 2019, 28(2): 532-554. #### Discrete C-TMLE with Greedy Search >Susan, Gruber, and van der Laan, Mark J.. "An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics." The International Journal of Biostatistics 6.1 (2010): 1-31. #### General Template of C-TMLE >van der Laan, Mark J., and Susan Gruber. "Collaborative double robust targeted maximum likelihood estimation." The international journal of biostatistics 6.1 (2010): 1-71.