HurdleDMR from R

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multiple Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package from R via the JuliaCall package, which is available on CRAN.

Setup

Install Julia

First, install Julia itself. The easiest way to do that is to download it from https://julialang.org/downloads/. An alternative is to install JuliaPro from https://juliacomputing.com.

Once installed, start julia in a terminal and add the following packages:

Pkg.add("RCall")
Pkg.clone("https://github.com/AsafManela/Lasso.jl")
Pkg.clone("https://github.com/AsafManela/HurdleDMR.jl")

The JuliaCall package for R

Now, back in R, install the JuliaCall package from CRAN:

In [ ]:
install.packages("JuliaCall")

Load the JuliaCall library and set up julia:

In [ ]:
library(JuliaCall)
j <- julia_setup()
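
If julia is not on your system PATH, julia_setup accepts its location explicitly. For example (the path here is hypothetical):

j <- julia_setup(JULIA_HOME = "/path/to/julia/bin")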

We can now use j$xx to call julia, as in:

In [2]:
j$eval("1+2")
3
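
Besides eval, the handle returned by julia_setup exposes a few more entry points, all of which we use below. A quick sketch (the variable x here is just a toy example):

j$command("x = randn(3)")   # run julia code for its side effect
j$assign("y", 1:3)          # copy an R object into the julia variable y
j$call("sqrt", 2)           # call a julia function on R arguments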

Example data

For this example, we will use data that ships with the fantastic textir package for R.

In [3]:
#install.packages("textir") 
library(textir)

data(we8there)

covars <- we8thereRatings[,'Overall',drop=FALSE]
counts <- we8thereCounts
Loading required package: distrom
Loading required package: Matrix
Loading required package: gamlr
Loading required package: parallel

Benchmark in R

To compare the two implementations, we first fit a dmr in R using textir (a wrapper for distrom).

Make a cluster of nprocs processors

In [4]:
nprocs <- as.integer(detectCores()-2)
cl <- makeCluster(nprocs,type=ifelse(.Platform$OS.type=="unix","FORK","PSOCK"))

Fit Distributed multinomial regression (DMR) in parallel

In [5]:
system.time(fits <- dmr(cl, covars, counts, verb=1))
fitting 6166 observations on 2640 categories, 1 covariates.
converting counts matrix to column list...
distributed run.
socket cluster with 18 nodes on host ‘localhost’
   user  system elapsed 
  0.400   0.072   6.806 

Good. Now stop the cluster to clean up.

In [6]:
stopCluster(cl)

Get AICc optimal coefficients

In [7]:
BR <- coef(fits)
dim(BR)
  1. 2
  2. 2640

Get SR projection onto factors

In [8]:
zR <- srproj(BR,counts) 
dim(zR)
  1. 6166
  2. 2
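
As a hedged aside on what srproj computes (following Taddy, 2015): the first column of zR is the projection of the row-normalized counts onto the estimated loadings, and the last column holds the row totals. Assuming the second row of BR holds the Overall loadings, we can check this directly:

# sketch only: z = (counts/mrow) %*% b, with mrow the row totals
mrow <- rowSums(counts)
z1 <- as.vector((counts %*% BR[2, ]) / mrow)
all.equal(z1, zR[, 1], check.attributes = FALSE)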

Multinomial inverse regression (MNIR)

The fitted object can be used to fit a forward regression that predicts a covariate from the low-dimensional SR projection of the counts:

In [9]:
X <- cbind(covars,zR)
colnames(X) <- c("Overall","zOverall","m")
fmR <- lm(Overall ~ zOverall + m, X)
summary(fmR)
Call:
lm(formula = Overall ~ zOverall + m, data = X)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5142 -0.5608  0.1370  0.6838  4.0842 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.402149   0.019292 176.348  < 2e-16 ***
zOverall    3.181332   0.041696  76.298  < 2e-16 ***
m           0.006737   0.001096   6.146 8.42e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9654 on 6163 degrees of freedom
Multiple R-squared:  0.4896,	Adjusted R-squared:  0.4894 
F-statistic:  2956 on 2 and 6163 DF,  p-value: < 2.2e-16

We can now predict Overall with new counts data

In [10]:
newX = as.data.frame(srproj(BR,counts[1:10,]))
colnames(newX) <-c("zOverall","m")
yhatdmrR <- predict(fmR, newX)
as.vector(yhatdmrR)
  1. 4.86419983555177
  2. 2.24484606535388
  3. 5.65307467024801
  4. 4.56844624548385
  5. 4.66848593149633
  6. 5.06290073040112
  7. 3.662787762179
  8. 4.47468035358465
  9. 4.00369363965061
  10. 7.35463703886009

Same model but in Julia

Now let's try that in julia.

We need to pass the data to julia

In [11]:
j$assign("covars",covars)
## there are probably more efficient ways to pass the sparse matrix, but JuliaCall doesn't do this automatically at the moment
j$assign("counts",as.matrix(counts))
j$command("counts = sparse(counts)")
6166×2640 SparseMatrixCSC{Float64,Int64} with 66459 stored entries:
  [11  ,    1]  =  1.0
  [20  ,    1]  =  1.0
  [43  ,    1]  =  1.0
  [63  ,    1]  =  1.0
  [80  ,    1]  =  1.0
  [87  ,    1]  =  1.0
  [88  ,    1]  =  1.0
  ⋮
  [1955, 2640]  =  1.0
  [2509, 2640]  =  1.0
  [2842, 2640]  =  1.0
  [3929, 2640]  =  1.0
  [4314, 2640]  =  1.0
  [4862, 2640]  =  1.0
  [5702, 2640]  =  1.0
  [6007, 2640]  =  1.0
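
A more memory-friendly alternative is to pass the triplet representation (row indices, column indices, values) and rebuild the sparse matrix on the julia side, avoiding the dense as.matrix copy. A hedged sketch, assuming counts can be coerced to the Matrix package's triplet class dgTMatrix:

trip <- as(counts, "dgTMatrix")   # triplet form; slots i and j are 0-based
j$assign("I", trip@i + 1L)        # julia indices are 1-based
j$assign("J", trip@j + 1L)
j$assign("V", trip@x)
j$assign("nr", nrow(counts))
j$assign("nc", ncol(counts))
j$command("counts = sparse(I, J, V, nr, nc)")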

Add parallel workers

In [12]:
j$call("addprocs",nprocs)
  1. 2
  2. 3
  3. 4
  4. 5
  5. 6
  6. 7
  7. 8
  8. 9
  9. 10
  10. 11
  11. 12
  12. 13
  13. 14
  14. 15
  15. 16
  16. 17
  17. 18
  18. 19

Make our package available to all workers

In [13]:
j$command("import HurdleDMR; @everywhere using HurdleDMR")

Fit Distributed multinomial regression (DMR) in Julia

In [14]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
 14.580   0.136  34.884 

The above returns a lightweight wrapper containing essentially just the coefficients. To get the entire regularization paths, try the following:

In [15]:
system.time(j$command("m = fit(DMRPaths,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  3.232   1.300   6.607 

Julia compiles each function on its first call, which makes one-off applications slower but repeated calls faster. To get a sense of runtime without that fixed compilation cost, you may wish to run it again.

In [16]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  0.020   0.024   2.959 

On our machine, julia fits the dmr in about half the time R takes (see the 'elapsed' entries above). The speed improvement is mostly due to memory being shared across the parallel workers.

Get AICc optimal coefficients

In [17]:
Bjulia <- j$eval("coef(m)")
dim(Bjulia)
  1. 2
  2. 2640

Get SR projection onto factors

In [18]:
zjulia <- j$eval("srproj(m,counts)")
dim(zjulia)
  1. 6166
  2. 2

Comparing zR to zjulia, we see that the estimates are about the same.

In [19]:
all.equal(zR, zjulia, check.attributes = FALSE)
'Mean relative difference: 0.0250051'

The differences are mostly due to differences in the default regularization paths.
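
The same kind of check can be run on the coefficient matrices themselves (coercing BR to a dense matrix in case it is stored sparsely):

all.equal(as.matrix(BR), Bjulia, check.attributes = FALSE)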

Multinomial inverse regression (MNIR)

The HurdleDMR package provides a general method for fitting Counts Inverse Regressions (CIR), fit(CIR...), which fits both the backward and forward parts of the MNIR. For example:

In [20]:
j$command("using GLM")
system.time(j$command("mnir = fit(CIR{DMR,LinearModel},@model(c ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
  1.804   0.212   4.861 
In [21]:
j$eval("mnir.model.fwdm")
Julia Object of type GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}.
GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate  Std.Error t value Pr(>|t|)
x1      3.42207  0.0194634 175.821   <1e-99
x2      3.12546  0.0420257 74.3701   <1e-99
x3   0.00619096 0.00110878 5.58357    <1e-7

The fitted model can be used for prediction as follows:

In [22]:
yhatdmrJ <- j$eval("predict(mnir,covars[1:10,:],counts[1:10,:])")
yhatdmrJ
  1. 4.82560813985314
  2. 2.29188447231859
  3. 5.6407987244744
  4. 4.5472136166049
  5. 4.64900142586833
  6. 5.0105077560742
  7. 3.67264187569108
  8. 4.44109783266512
  9. 3.9790411085058
  10. 7.29252606765385

Comparing the R and julia versions of the predicted values, they appear to be quite similar:

In [23]:
all.equal(yhatdmrR, yhatdmrJ, check.attributes = FALSE)
'Mean relative difference: 0.006899308'

Hurdle Distributed Multiple Regression (HDMR)

Another advantage of the julia package is that it allows for text selection via HDMR. Here we specify the two parts of the model with two formulas:

In [24]:
system.time(j$command("m = fit(HDMR,@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  1.456   0.088  13.446 

A fitted HDMR involves two coefficient arrays: one for the positives model (c ~ ...) and one for the hurdle crossing (zeros) model (h ~ ...).

In [25]:
Cjulia <- j$eval("coef(m)")
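
Assuming coef(m) returns the pair (coefspos, coefszero) in that order, the two arrays can also be pulled into R separately; a hedged sketch:

Cpos  <- j$eval("coef(m)[1]")   # positives model coefficients (c ~ ...)
Czero <- j$eval("coef(m)[2]")   # zeros (hurdle) model coefficients (h ~ ...)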

The projection onto factors now gives [zpos, zzero, m] instead of [z, m] as before:

In [26]:
Zjulia <- j$eval("srproj(m,counts)")
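
Given that column order, the three projections can be separated in R:

zpos  <- Zjulia[, 1]   # projection from the positives model
zzero <- Zjulia[, 2]   # projection from the zeros (hurdle) model
mtot  <- Zjulia[, 3]   # row totals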

If we wish to run a CIR with HDMR instead of DMR, all we need to do is specify it in a very similar call to fit(CIR...):

In [27]:
system.time(j$command("cir = fit(CIR{HDMR,LinearModel},@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
  0.764   0.080   8.537 
In [28]:
j$eval("cir.model.fwdm")
Julia Object of type GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}.
GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate  Std.Error t value Pr(>|t|)
x1      3.42148  0.0195684 174.847   <1e-99
x2      0.51192   0.128349  3.9885    <1e-4
x3      3.10925  0.0422365 73.6151   <1e-99
x4   0.00643716 0.00110663 5.81691    <1e-8

The HDMR-based CIR model can be used to predict with new data

In [29]:
yhathdmr <- j$eval("predict(cir,covars[1:10,:],counts[1:10,:])")
yhathdmr
  1. 4.82171723027122
  2. 2.13237953618111
  3. 5.63154454431334
  4. 4.53672805555296
  5. 4.63218562649017
  6. 5.00968332886402
  7. 3.63877245260486
  8. 4.54135039036171
  9. 3.98486915243136
  10. 7.27183150579398

Comparing the three predicted values shows only minor differences in this toy example.

In [30]:
cbind(yhatdmrR,yhatdmrJ,yhathdmr)
   yhatdmrR yhatdmrJ yhathdmr
1  4.864200 4.825608 4.821717
2  2.244846 2.291884 2.132380
5  5.653075 5.640799 5.631545
11 4.568446 4.547214 4.536728
12 4.668486 4.649001 4.632186
13 5.062901 5.010508 5.009683
14 3.662788 3.672642 3.638772
15 4.474680 4.441098 4.541350
17 4.003694 3.979041 3.984869
18 7.354637 7.292526 7.271832

Kelly, Manela, and Moreira (2018) show, however, that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.

Please see the paper for additional details and example applications.