We use the following packages:

library(mice)
library(dplyr)
library(magrittr)
library(DAAG)

The following table shows numbers of occasions when inhibition (i.e., no flow of current across a membrane) occurred within 120 s, for different concentrations of the protein peptide-C. The outcome yes implies that inhibition has occurred.

conc 0.1 0.5  1 10 20 30 50 70 80 100 150 
no     7   1 10  9  2  9 13  1  1   4   3 
yes    0   0  3  4  0  6  7  0  0   1   7

  1. Create this data in R
data <- data.frame(conc = c(.1, .5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
                   no = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
                   yes = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1 ,7)) 
data
##     conc no yes
## 1    0.1  7   0
## 2    0.5  1   0
## 3    1.0 10   3
## 4   10.0  9   4
## 5   20.0  2   0
## 6   30.0  9   6
## 7   50.0 13   7
## 8   70.0  1   0
## 9   80.0  1   0
## 10 100.0  4   1
## 11 150.0  3   7

  1. Add two new variables (columns) to the data
data <- 
  data %>% 
  mutate(margin = no+yes,
         prop = yes / margin)

  1. Fit the logistic model with margin as the weights
fit <- 
  data %$%
  glm(prop ~ conc, family=binomial, weights=margin)

  1. Look at the summary of the fitted object
summary(fit)
## 
## Call:
## glm(formula = prop ~ conc, family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8159  -1.0552  -0.6878   0.3667   1.0315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32701    0.33837  -3.922 8.79e-05 ***
## conc         0.01215    0.00496   2.450   0.0143 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.683  on 10  degrees of freedom
## Residual deviance: 10.389  on  9  degrees of freedom
## AIC: 30.988
## 
## Number of Fisher Scoring iterations: 4

  1. Inspect the plots number 1 and 5 for fit
plot(fit, which = c(1, 5))

The data set is small, but case 11 stands out in the Residuals vs. Leverage plot


  1. conc is somewhat skewed. Run the model again with a log-transformation for conc.

To apply this in the model directly:

fit.log <- 
  data %$%
  glm(prop ~ I(log(conc)), family=binomial, weights=margin)

  1. Look at the summary of the fitted object again
summary(fit.log)
## 
## Call:
## glm(formula = prop ~ I(log(conc)), family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2510  -1.0599  -0.5029   0.3152   1.3513  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.7659     0.5209  -3.390 0.000699 ***
## I(log(conc))   0.3437     0.1440   2.387 0.016975 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.6834  on 10  degrees of freedom
## Residual deviance:  9.3947  on  9  degrees of freedom
## AIC: 29.994
## 
## Number of Fisher Scoring iterations: 4

The logodds now depict the unit increase in log(conc), instead of conc.


  1. Inspects the plots number 1 and 5 of the fitted object based on log(conc).
plot(fit.log, which = c(1, 5))

Outliers are now less of an issue.


  1. The moths data frame in the DAAG package contains data from a study of the effect of habitat on the densities of two species of moth (A and P). The quasipoisson distribution is a reasonable model for these data. Try modeling the variable A using this distribution. Is there a better link function than the standard?
A.glm <- glm(formula = A ~ log(meters) + factor(habitat), family =
quasipoisson, data = moths)
summary(A.glm)
## 
## Call:
## glm(formula = A ~ log(meters) + factor(habitat), family = quasipoisson, 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5562  -1.4635  -0.0007   0.9248   3.0822  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -15.6958  2096.5884  -0.007    0.994
## log(meters)                 0.1292     0.1528   0.845    0.404
## factor(habitat)Disturbed   15.6224  2096.5885   0.007    0.994
## factor(habitat)Lowerside   16.9057  2096.5884   0.008    0.994
## factor(habitat)NEsoak      16.0840  2096.5884   0.008    0.994
## factor(habitat)NWsoak      18.4681  2096.5884   0.009    0.993
## factor(habitat)SEsoak      16.9677  2096.5884   0.008    0.994
## factor(habitat)SWsoak      17.1371  2096.5884   0.008    0.994
## factor(habitat)Upperside   16.7433  2096.5884   0.008    0.994
## 
## (Dispersion parameter for quasipoisson family taken to be 2.700801)
## 
##     Null deviance: 257.108  on 40  degrees of freedom
## Residual deviance:  93.991  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 13
# Note the huge standard errors

## Consider the square root link as another possibility
A2.glm <- glm(formula = A ~ sqrt(meters) + factor(habitat), family = quasipoisson(link=sqrt), data = moths)
summary(A2.glm)
## 
## Call:
## glm(formula = A ~ sqrt(meters) + factor(habitat), family = quasipoisson(link = sqrt), 
##     data = moths)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4926  -1.4380  -0.0179   0.8694   3.0841  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.26974    0.88037  -0.306   0.7613    
## sqrt(meters)              0.05893    0.06886   0.856   0.3985    
## factor(habitat)Disturbed  1.18451    0.89161   1.329   0.1934    
## factor(habitat)Lowerside  2.18034    0.86816   2.511   0.0173 *  
## factor(habitat)NEsoak     1.41339    0.89452   1.580   0.1239    
## factor(habitat)NWsoak     4.86203    0.94902   5.123 1.39e-05 ***
## factor(habitat)SEsoak     2.27990    0.87891   2.594   0.0142 *  
## factor(habitat)SWsoak     2.49185    0.95407   2.612   0.0136 *  
## factor(habitat)Upperside  1.83905    1.09824   1.675   0.1038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.701906)
## 
##     Null deviance: 257.108  on 40  degrees of freedom
## Residual deviance:  93.741  on 32  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 10

  1. Let’s try to specify our own link function: \(f(x) = \log(\exp(x)-1)\). Generate data like this:
n <- 1000                       
x <- runif(n)
sh <- 2                        
y <- rgamma(n,scale=log(2+3*x)/sh,shape=sh)

Specify the link function and run a glm regressing y on x. Hint: Look at the help file for family. You need to make an object of type “link-glm”, containing the items needed for glm to run: a link function, an inverse link, the derivative of the inverse link, and a function that describes the domain of the inverse link.

 ## link
 linkfun <- function(y) log(exp(y)-1)
 ## inverse link
 linkinv <- function(eta)  log(exp(eta)+1)
 ## derivative of invlink wrt eta
 mu.eta <- function(eta) { 1/(exp(-eta) + 1) }
 valideta <- function(eta) TRUE
 link <- "log(exp(y)-1)"
 vv <- structure(list(linkfun = linkfun, linkinv = linkinv,
                 mu.eta = mu.eta, valideta = valideta, 
                 name = link),
                 class = "link-glm")

glm(y~x,family=Gamma(link=vv)) 
## 
## Call:  glm(formula = y ~ x, family = Gamma(link = vv))
## 
## Coefficients:
## (Intercept)            x  
##      0.2316       1.2625  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  998 Residual
## Null Deviance:       573.3 
## Residual Deviance: 533.4     AIC: 2145

End of Practical