ebpm_demo

Last updated: 2019-09-30

Checks: 7 0

Knit directory: ebpmf_demo/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190923)

The command set.seed(20190923) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: c1e3544

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/ebpmf_demo.Rmd
    Untracked:  docs/figure/test.Rmd/

Unstaged changes:
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	5cec335	zihao12	2019-09-28	Build site.
Rmd	f915fec	zihao12	2019-09-28	fix bug in predicting posterior mean, and rerun analysis
html	a0f7bf7	zihao12	2019-09-26	Build site.
Rmd	989021e	zihao12	2019-09-26	fix ebpm bug in rmse computation
html	f854215	zihao12	2019-09-26	Build site.
Rmd	522f45b	zihao12	2019-09-26	fix a stupid bug in ebpm_demo.Rmd
html	6ae1cb4	zihao12	2019-09-26	Build site.
Rmd	32d1853	zihao12	2019-09-26	rewrite demo for ebpm
html	c7fda34	zihao12	2019-09-26	Build site.
Rmd	4f11a65	zihao12	2019-09-26	rewrite demo for ebpm
html	9a58a35	zihao12	2019-09-26	Build site.
Rmd	6d71395	zihao12	2019-09-26	rewrite demo for ebpm
html	a3da36f	zihao12	2019-09-24	Build site.
Rmd	731df2d	zihao12	2019-09-24	update deno
html	aa84714	zihao12	2019-09-24	Build site.
Rmd	c1164b4	zihao12	2019-09-24	demo ebpm update grid range
html	25f8b5a	zihao12	2019-09-24	Build site.
Rmd	a92d9ab	zihao12	2019-09-24	demo ebpm update grid range
html	06d1f33	zihao12	2019-09-24	Build site.
Rmd	5e5f539	zihao12	2019-09-24	demo ebpm update
html	9e28fe2	zihao12	2019-09-23	Build site.
Rmd	fdb63ad	zihao12	2019-09-23	demo ebpm

EBPM problem

\[ \begin{align} & x_i \sim Pois(s_i \lambda_i)\\ & \lambda_i \sim g(.)\\ & g \in \mathcal{G} \end{align} \] Our goal is to estimate \(\hat{g}\) (MLE), then compute posterior \(p(\lambda_i | x_i, \hat{g})\). Here I use mixture of exponential as prior family.

see detail in https://www.overleaf.com/project/5bd084d90a33772e7a7f99a2

library(mixsqp)
library(ggplot2)

Warning: package 'ggplot2' was built under R version 3.5.2

library(gtools)
require(gridExtra)

Loading required package: gridExtra

EBPM-exponential-mixture

For now, I use mixture of exponential as prior family.
Under the exponential case, we discuss how to select the range of the grid (of \(\mu\), exponential mean). For convenience we use \(exp(\mu)\) to denote exponential distribution with mean \(\mu\):
The goal is: for each observation, we want to include the range of \(\lambda\) “of interest” (i.e. \(log \ p(x | \lambda)\) is close to that of the MLE, like within \(log(0.1)\)).
If \(x = 0\), \(\ell(\lambda) = - \lambda s\), in order to have a good likelihood, we want the model to be able to choose \(\lambda \sim o(\frac{1}{s})\). Therefore, we want the smallest \(\mu\) to be in the order of \(o(\frac{1}{s})\) if there is 0 count.
If \(x > 0\), the MLE would be \(\frac{x}{s}\), and \(\lambda\) too small would have bad likelihood. So we want the biggest \(\mu\) to be of order \(O(max(\frac{x}{s}))\)

experiment setup

I simulate \(\lambda_i \sim \sum_k \pi_k exp(b_k), k = 1, ..., 50\), where \(b_k\) is the exponential rate, for \(i = 1, ..., 4000\).
Then I fit EBPM with mixture of exponential as prior (our model knows the grid for \(b_k\)). I compare \(\ell(\pi)\), which should be better than oracle; I also compare \(\ell(\lambda)\) (it is not clear whether the posterior will be better than oracle, but still we can use it to see how good the model is).
Then I fit with the same model, without knowing the grid for \(b_k\).

## ===========================================================================
## ==========================ebpm_exponential_mixture=========================
## ===========================================================================

## description
## this solves ebpm problem, with mixture of exponential distribution as prior:
## g(.) = \sum_k \pi_k exp(.;b_k), where b_k is rate of exponential

## generate a geometric sequence: x_n = low*m^{n-1} up to x_n < up
geom_seq <- function(low, up, m){
  N =  ceiling(log(up/low)/log(m)) + 1
  out  = low*m^(seq(1,N, by = 1)-1)
  return(out)
}

lin_seq <- function(low, up, m){
  out = seq(low, up, length.out = m)
  return(out)
}

## select grid for b_k
select_grid_exponential <- function(x, s, m = 2){
  ## mu_grid: mu =  1/b is the exponential mean
  xprime = x
  xprime[x == 0] = xprime[x == 0] + 1
  mu_grid_min =  0.05*min(xprime/s)
  mu_grid_max = 2*max(x/s)
  mu_grid = geom_seq(mu_grid_min, mu_grid_max, m)
  #mu_grid = lin_seq(mu_grid_min, mu_grid_max, m)
  b = 1/mu_grid
  a = rep(1, length(b))
  return(list(a= a, b = b))
}

## compute L matrix from data and selected grid
## L_ik = NB(x_i; a_k, b_k/b_k + s_i)
## but for computation in mixsqr, we can simplyfy it for numerical stability
compute_L <- function(x, s, a, b){
  prob = 1 - s/outer(s,b, "+")
  l = dnbinom(x,a,prob = prob, log = T) 
  l_rowmax  = apply(l,1,max)
  L = exp(l -  l_rowmax)
  return(list(L = L, l_rowmax = l_rowmax))
}

## compute ebpm_exponential_mixture problem
ebpm_exponential_mixture <- function(x,s,m = 2, grid = NULL, seed = 123){
  set.seed(seed)
  if(is.null(grid)){grid <- select_grid_exponential(x,s,m)}
  b = grid$b
  a = grid$a
  tmp <-  compute_L(x,s,a, b)
  L =  tmp$L
  l_rowmax = tmp$l_rowmax
  fit <- mixsqp(L, control = list(verbose = T))
  ll_pi = sum(log(exp(l_rowmax) * L %*%  fit$x))
  pi = fit$x
  cpm = outer(x,a,  "+")/outer(s, b, "+")
  Pi_tilde = t(t(L) * pi)
  Pi_tilde = Pi_tilde/rowSums(Pi_tilde)
  lam_pm = rowSums(Pi_tilde * cpm) 
  ll_lam = sum(dpois(x, s*lam_pm, log = T))
  return(list(pi = pi, lam_pm = lam_pm, ll_lam = ll_lam,ll_pi = ll_pi,L = L,grid = grid))
}

## ===========================================================================
## ==========================experiment setup=================================
## ===========================================================================
## sample from mixture of gamm distribution
sim_mgamma <- function(a,b,pi){
  idx = which(rmultinom(1,1,pi) == 1)
  return(rgamma(1, shape = a[idx], rate =  b[idx]))
}

## simulate a poisson mean problem
simulate_pm  <-  function(seed = 123){
  set.seed(seed)
  n = 4000 ## number of  data
  d = 50 ## number of mixture components in prior
  ## simulate grid
  a = replicate(d,1)
  b = 10*runif(d)
  grid  = list(a = a, b = b)
  pi <- rdirichlet(1,rep(1/d, d))
  lam_true = replicate(n, sim_mgamma(a,b,pi))
  s = replicate(length(lam_true), 10)
  #s = 2*runif(length(lam_true))
  x  = rpois(length(lam_true),s*lam_true)
  ll_lam = sum(dpois(x, s*lam_true, log = T))
  tmp =  compute_L(x,s,a,b)
  L =  tmp$L
  l_rowmax = tmp$l_rowmax
  ll_pi = sum(log(exp(l_rowmax) * L %*% matrix(pi, ncol = 1)))
  return(list(x =  x, s = s, lam_true = lam_true, pi = pi, grid = grid, ll_lam = ll_lam, ll_pi = ll_pi))
}

rmse <- function(x,y){
  return(sqrt(mean((x-y)^2)))
}

## test functions  above
main <- function(know_grid = F){
  m = 1.1
  sim = simulate_pm()
  x = sim$x 
  s = sim$s
  lam_true = sim$lam_true
  
  start = proc.time()
  if(!know_grid){
    fit = ebpm_exponential_mixture(x, s, m)
  }else{
    fit = ebpm_exponential_mixture(x, s, m, sim$grid)
  }
  
  runtime = proc.time() - start

  print(sprintf("fit with %d data points and %d grid points", length(sim$x),length(fit$grid$b)))
  print(sprintf("runtime: %f", runtime[[3]]))
   print("\n")
  print("log likelihood for pi:")
  print(sprintf("oracle ll_pi: %f", sim$ll_pi))
  print(sprintf("fitted ll_pi: %f", fit$ll_pi))
  print("\n")
  print("RMSE with lam_oracle:")
  print(sprintf("mle    : %f", rmse(x/s, sim$lam_true)))
  print(sprintf("fitted : %f", rmse(fit$lam_pm, sim$lam_true)))
  
  df = data.frame(n = 1:length(x), x = x, s =  s, lam_true = lam_true, lam_hat = fit$lam_pm)
  
  plot1 <- ggplot(df)  + geom_point(aes(x = x/s, y = lam_hat, color = "blue"), cex = 0.5) +
    labs(x = "x/s", y = "lam_hat", title = "EBPM") +
    guides(fill = "color")
  plot2 <-  ggplot(df)  + geom_point(aes(x = lam_true, y = lam_hat, color = "blue"), cex = 0.5) +
    labs(x = "lam_true", y = "lam_hat", title = "EBPM") +
    guides(fill = "color")
  grid.arrange(plot1, plot2, ncol=1)
  return(list(fit = fit, sim = sim))
}

fit (knowing grid)

out1 = main(know_grid = T)

Running mix-SQP algorithm 0.1-120 on 4000 x 50 matrix
convergence tol. (SQP):     1.0e-08
conv. tol. (active-set):    1.0e-10
zero threshold (solution):  1.0e-08
zero thresh. (search dir.): 1.0e-15
l.s. sufficient decrease:   1.0e-02
step size reduction factor: 7.5e-01
minimum step size:          1.0e-08
max. iter (SQP):            1000
max. iter (active-set):     51
number of EM iterations:    4
iter        objective max(rdual) nnz stepsize max.diff nqp nls
   1 +2.812590003e-01  -- EM --   50 1.00e+00 1.67e-02  --  --
   2 +2.533469855e-01  -- EM --   50 1.00e+00 5.11e-03  --  --
   3 +2.385549735e-01  -- EM --   50 1.00e+00 3.05e-03  --  --
   4 +2.293655351e-01  -- EM --   50 1.00e+00 2.50e-03  --  --
   1 +2.293655351e-01 +6.422e-02  50  ------   ------   --  --
   2 +2.172995460e-01 +2.068e-02  50 9.90e-01 9.43e-01  51   1
   3 +2.003965831e-01 +3.260e-03  49 9.90e-01 2.62e-02  51   1
   4 +2.000887113e-01 +3.433e-04  38 9.90e-01 1.15e-01  51   1
   5 +1.997620735e-01 +7.227e-06   4 9.90e-01 8.76e-01  44   1
   6 +1.997550286e-01 +7.278e-08   4 9.90e-01 8.76e-03   4   1
   7 +1.997549576e-01 +7.279e-10   4 9.90e-01 8.76e-05   4   1
Convergence criteria met---optimal solution found.
[1] "fit with 4000 data points and 50 grid points"
[1] "runtime: 0.159000"
[1] "\n"
[1] "log likelihood for pi:"
[1] "oracle ll_pi: -6154.499159"
[1] "fitted ll_pi: -6153.161922"
[1] "\n"
[1] "RMSE with lam_oracle:"
[1] "mle    : 0.112189"
[1] "fitted : 0.087142"

Version	Author	Date
5cec335	zihao12	2019-09-28
f854215	zihao12	2019-09-26
9a58a35	zihao12	2019-09-26
a3da36f	zihao12	2019-09-24
25f8b5a	zihao12	2019-09-24
9e28fe2	zihao12	2019-09-23

plot(log10(out1$sim$pi), log10(out1$fit$pi), ylab = "log10(pi_hat)",  xlab = "log10(pi_true)")

Version	Author	Date
9a58a35	zihao12	2019-09-26
a3da36f	zihao12	2019-09-24
25f8b5a	zihao12	2019-09-24

hist(out1$sim$grid$b, breaks = 100, xlab = "b_true", main =  "histogram of b_true")

Version	Author	Date
9a58a35	zihao12	2019-09-26

fit (not knowing grid)

out2 = main(know_grid = F)

Running mix-SQP algorithm 0.1-120 on 4000 x 67 matrix
convergence tol. (SQP):     1.0e-08
conv. tol. (active-set):    1.0e-10
zero threshold (solution):  1.0e-08
zero thresh. (search dir.): 1.0e-15
l.s. sufficient decrease:   1.0e-02
step size reduction factor: 7.5e-01
minimum step size:          1.0e-08
max. iter (SQP):            1000
max. iter (active-set):     68
number of EM iterations:    4
iter        objective max(rdual) nnz stepsize max.diff nqp nls
   1 +6.197650915e-01  -- EM --   67 1.00e+00 1.02e-02  --  --
   2 +5.834642849e-01  -- EM --   67 1.00e+00 4.13e-03  --  --
   3 +5.616717046e-01  -- EM --   67 1.00e+00 3.80e-03  --  --
   4 +5.474023293e-01  -- EM --   67 1.00e+00 3.48e-03  --  --
   1 +5.474023293e-01 +1.030e-01  67  ------   ------   --  --
   2 +5.480618226e-01 +1.390e-01  67 9.90e-01 8.51e-01  68   1
   3 +4.989059046e-01 +1.745e-02  67 9.90e-01 4.06e-01  68   1
   4 +4.900072839e-01 +8.985e-04  47 9.90e-01 8.04e-01  68   1
   5 +4.891817310e-01 +1.238e-05   6 9.90e-01 3.02e-01  55   1
   6 +4.891699740e-01 +1.245e-07   5 9.90e-01 2.87e-03   5   1
   7 +4.891698557e-01 +1.245e-09   4 9.90e-01 2.87e-05   4   1
Convergence criteria met---optimal solution found.
[1] "fit with 4000 data points and 67 grid points"
[1] "runtime: 0.203000"
[1] "\n"
[1] "log likelihood for pi:"
[1] "oracle ll_pi: -6154.499159"
[1] "fitted ll_pi: -6153.168083"
[1] "\n"
[1] "RMSE with lam_oracle:"
[1] "mle    : 0.112189"
[1] "fitted : 0.087132"

Version	Author	Date
5cec335	zihao12	2019-09-28
f854215	zihao12	2019-09-26
9a58a35	zihao12	2019-09-26
06d1f33	zihao12	2019-09-24

hist(out2$sim$grid$b, breaks = 100, xlab = "b_true", main =  "histogram of b_true")

Version	Author	Date
9a58a35	zihao12	2019-09-26

hist(out2$fit$grid$b, breaks = 100, xlab = "b_hat", main =  "histogram of b_hat")

Version	Author	Date
9a58a35	zihao12	2019-09-26

hist(out2$sim$pi, breaks = 100, xlab = "pi_true", main =  "histogram of pi_true")

Version	Author	Date
9a58a35	zihao12	2019-09-26

hist(out2$fit$pi, breaks = 100, xlab = "pi_hat", main =  "histogram of pi_hat")

Version	Author	Date
9a58a35	zihao12	2019-09-26

compare two experiment result

plot(out1$fit$lam_pm, out2$fit$lam_pm,  xlab  = "lam_post (use oracle grid)", ylab = "lam_post (use selected grid)", main = "posterior mean for lambda")

Version	Author	Date
5cec335	zihao12	2019-09-28
f854215	zihao12	2019-09-26
9a58a35	zihao12	2019-09-26
25f8b5a	zihao12	2019-09-24

Conclusion

Our grid selecting seems to work fine, as the posterior mean of \(\lambda\)s are almost the same between the two experiments (use oracle grid VS estimate grid)
\(\ell(\pi_{est})\) is slightly better than \(\ell(\pi_{true})\)

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gridExtra_2.3  gtools_3.8.1   ggplot2_3.2.1  mixsqp_0.1-120

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       compiler_3.5.1   pillar_1.4.2     git2r_0.25.2    
 [5] workflowr_1.4.0  tools_3.5.1      digest_0.6.21    evaluate_0.14   
 [9] tibble_2.1.3     gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.0     
[13] yaml_2.2.0       xfun_0.8         withr_2.1.2      stringr_1.4.0   
[17] dplyr_0.8.1      knitr_1.25       fs_1.3.1         rprojroot_1.3-2 
[21] grid_3.5.1       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0        
[25] rmarkdown_1.13   purrr_0.3.2      magrittr_1.5     whisker_0.3-2   
[29] backports_1.1.4  scales_1.0.0     htmltools_0.3.6  assertthat_0.2.1
[33] colorspace_1.4-1 labeling_0.3     stringi_1.4.3    lazyeval_0.2.2  
[37] munsell_0.5.0    crayon_1.3.4