---
title: "Generalized Integration Model"
author: "Han Zhang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Generalized Integration Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `gim` package is used to integrate summary data reported to literature (**external studies**) with your own raw data (**internal study**). In this short tutorial, I will use examples to show various applications with `gim`. To use `gim`, you will have to find out what models are fitted for those external summary data, while it is up to you to chose the model to be fitted on your own internal data. Keep in mind that the parameters in the model fitted on internal data are of interest in `gim` analysis.  

## Flexible Input

The `gim` package supports flexible ways to specify models fitted for the internal data and those of external studies. We use `formula` to specify a linear or logistic regression model for the internal raw data, and **consider this model as the true underlying model**. This is important as `gim` will calculate the standard error of its estimations under this model. Also, only the estimations of the coefficients in this model are of your interest. Currently, `formula` accept arithmetic expressions, e.g., `log(x)`, or other operators like `*` for interactions. For example, `y~as.factor(x1) * I(x2 >0)-1` is a valid formula for `gim`. 

Three options are supported for `family`: `gaussian` for linear regression; `binomial` for logistic regression fitted on random sample; `case-control` for logistic regression fitted on case-control data. Note that the case/control ratio can vary among case-control datasets. It is always benefitial to use `case-control` over `binomial` even if your data for binary outcome is indeed a random sample from the population. 

Your individual-level internal data should be passed through `data`. Note that every variable in `formula` should be present in `data`. `gim` discards incomplete samples from analysis. 

`gim` uses `model`, a list of list, to organize summary information from external studies. Please use `?gim` for more details of `model`. A rule of thumb is that if a set of coefficients are estimated from a fitted model, then one should pack them as an element into `model`. Two fitted external models could use the same or partially overlapping datasets, however, estimations from these two models should be stored as two separate elements in `model` as you do have fitted two models. Please refer to examples below. 

External studies may or may not share data when models being fitted. If summary data from `k` external models are stored in `model`, then `nsample` should be a `k` by `k` matrix, with its `(i,j)` element representing the numbers of shared samples being used when fitting the two models. If models are fitted on independent datasets, then the corresponding elements in `nsample` should be zeros. The diagonal elements are numbers of samples used in fitting each of the external models. When analyzing case-control data, we have to specify `ncase` and `nctrl` instead, and leave `nsample` to be `NULL`, the default value. 

`gim` is able to handle the senario when internal and external studies are conducted on different populations. In that case, an extra reference set is required for making a valid inference on model parameters, otherwise, as shown in our paper, the estimation could be biased and the standard error of estimation may be underestimated. The reference set `ref` is a data frame consisting variables used in `formula` and `model[[i]]$formula`. Unused variables in `ref` would be discarded automatically. `gim` does not assume an outcome variable is present in `ref`, that is what a *reference* means. Such a reference set is used to estimate the distribution of covariates in the external population. 

It is tricky to determine if `data` or `ref` contain **ALL** necessary variables when invoking `gim`. If you encounter error message when calling `gim`, try 
```{r, eval = FALSE}
glm(formula, family, data)
```
and 
```{r, eval = FALSE}
for(m in model){
  glm(m$formula, family, data)
}
```
to find out a full list of those variables. You may also add an artificial outcome into `ref` to replace `data` in codes above. If you still encounter any error message when running those codes, then very likely some variables are missing in `data` or `ref`. 

`?gim` provides a good example showing how to use `gim`. Here we give some more specified examples that could be of interest in practice. 

## Example 1: Integrating Population Characteristics

Many research articles in epidemiology usually provide a table in which two-way contingency tables are listed for outcome versus several population characteristics. For example, assume that in a paper about cancer, it may gives the following table

```{r}
N <- 800
set.seed(0)
sex <- sample(c('F', 'M'), N, TRUE, c(.4, .6))
snp <- rbinom(N, 2, c(.4, .3, .3))
age <- runif(N, 20, 60)
pr <- plogis(.1 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(N, 1, pr)
ext <- data.frame(cancer, sex, snp, age, stringsAsFactors = FALSE)
```

|      |        | Case    | Control |
|-----:|:-------|---------|:-------:|
| Sex  |  M     | `r sum(cancer & sex == "M")` | `r sum(!cancer & sex == "M")` |
|      |  F     | `r sum(cancer & sex == "F")` | `r sum(!cancer & sex == "F")` |
| Age  | < 30   | `r sum(cancer & age < 30)`   | `r sum(!cancer & age < 30)`   |
|      | [30,50]| `r sum(cancer & age >= 30 & age <=50)` | `r sum(!cancer & age>=30 & age<=50)` |
|      | > 50  | `r sum(cancer & age > 50)`  | `r sum(!cancer & age > 50)`  |
| SNP  |  0     | `r sum(cancer & snp == 0)`   | `r sum(!cancer & snp == 0)`   |
|      |  1 or 2   | `r sum(cancer & snp != 0)`   | `r sum(!cancer & snp != 0)`   |

Note that the raw dataset `ext` could not be obtained from literature. Instead, one can only calculate univariate odds ratio of sex based on numbers in the table. For age, the odds ratios are subject to indicators `I(age < 30)` and `I(age > 50)`. For SNP, the best we can have is the odds ratio in a dominant model. Statistically, these are equivalent to fitting univariate models on `ext` as follows

```{r}
m1 <- glm(cancer ~ sex, data = ext, family = "binomial")
m2 <- glm(cancer ~ I(age < 30) + I(age > 50), data = ext, family = "binomial")
m3 <- glm(cancer ~ I(snp == 0), data = ext, family = "binomial")
```
Sometimes even the contingency tables are also inaccessible, but only the odds ratios are mentioned in text of articles. We consider a more complicated senario in this example. 
```{r}
summary(m2)$coef
```
We can see that only `I(age > 50)` shows significant association with cancer, researchers may only report its log odds ratio `r round(coef(m2)[3], 2)` in their articles. We thus illustrate `gim` by ignoring estimation of intercept in all the three models above. We also assume that only the log odds ratio of `I(age > 50)` is given to `gim`, while that of `I(age < 30)` is not. We create `model` as
```{r}
model1 <- list(formula = 'cancer ~sex', 
               info = data.frame(var = names(coef(m1))[-1], 
                                 bet = coef(m1)[-1], stringsAsFactors = FALSE))
model2 <- list(formula = 'cancer~I(age< 30) + I(age > 50)', 
               info = data.frame(var = names(coef(m2))[3], 
                                 bet = coef(m2)[3], stringsAsFactors = FALSE))
model3 <- list(formula = 'cancer ~ I(snp == 0)', 
               info = data.frame(var = names(coef(m3))[-1], 
                                 bet = coef(m3)[-1], stringsAsFactors = FALSE))
model <- list(model1, model2, model3)
model2
```
For each external model, `gim` needs to know the model used to fit external data (`model[[i]]$formula`), as well as the estimation of coefficients (`model[[i]]$info`). In this example, some coefficients (e.g. intercept and `I(age < 30)`) are estimated but not given to `gim`. This is a sweet spot of `gim` that can work with partial summary information quite well. Note that it is Since all three external models are fitted on the same dataset, we specify `nsample` for the three fitted models as 
```{r}
nsample <- matrix(N, 3, 3)
```
If we would consider external studies as case-control studies, which is what we recommend, we could instead specify
```{r}
ncase <- matrix(sum(cancer), 3, 3)
nctrl <- matrix(sum(!cancer), 3, 3)
```
and leave `nsample` to be `NULL`.

Now we collect samples from an internal study. Note that we modify the model by using a differnt intercept 0.3 while keeping all other coefficients remain the same as before, because we want a case-control data of a different case/control ratio. 
```{r}
n <- 300
set.seed(1)
sex <- sample(c('F', 'M'), n, TRUE, c(.4, .6))
snp <- rbinom(n, 2, c(.4, .3, .3))
age <- runif(n, 20, 60)
pr <- plogis(.3 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(n, 1, pr)
smoking <- sample(c(TRUE, FALSE), n, TRUE, c(.3, .7))
int <- data.frame(cancer, sex, snp, age, smoking, stringsAsFactors = FALSE)
```
In this internal study, we also collect a binary `smoking`, although it does not affect the cancer risk in our setting. Now we can integrate internal data `int` with external `model`. Let's assume that, based on some knowledges, we believe that a single cutoff for age at 40 should be used in the underlying model instead 30 and 50 used in external studies, and an additive model for SNP would be more of interest. We also want to investigate potential effect of smoking status. So we assume the underlying true model would be `cancer ~ I(sex == "F") + I(age > 40) + snp + smoking`, and invoke `gim` as follows
```{r}
library(gim)
fit1 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking, 
            "case-control", int, model, 
            ncase = ncase, nctrl = nctrl)
summary(fit1)
```
One can compare the result with the one from `glm` fitted on `int`, which has higher standard errors on its estimations. 
```{r}
fit0 <- glm(cancer ~ I(sex =="F") + I(age>40) +snp + smoking, 
            "binomial", int)
summary(fit0)$coef
```

The following table compares models fitted on internal and external datasets: 

|             |                  |                  |    External   |               |
|-------------|------------------|------------------|---------------|---------------|
|             |   **Internal**   |    **model 1**   | **model 2**   |  **model 3**  |
| **Sex**     |  `I(sex == 'F')` | `I(sex == 'M')`  |               |               |
| **Age**     |  `I(age > 40)`   |                  | `I(age < 30)` |               |
|             |                  |                  | `I(age > 50)` |               |
| **SNP**     |  `snp`           |                  |               | `I(snp == 0)` |
| **Smoking** |  `smoking`       |                  |               |               |

In this example, we used the feature in `gim` that supports flexible formula. One can, however, create variable instead using such a feature. For example, we can add columns to `int` as follows. Exactly the same result would be returned. 
```{r}
int$sex_F <- ifelse(int$sex == "F", 1, 0)
int$age_gt_40 <- ifelse(int$age > 40, 1, 0)
fit2 <- gim(cancer ~ sex_F + age_gt_40 + snp + smoking, 
            "case-control", int, model, 
            ncase = ncase, nctrl = nctrl)
summary(fit2)
```

The list `model` may looks complicated. For example, 
```{r}
model[[1]]
```
So where does the `sexM` come from? It is because that `gim` parses `model[[1]]$formula` and automatically creates a dummy variable for `I(sex == "M")`. `R` names such a dummy variable to be `sexM`, and in this example, we put this name to `model[[1]]$info$var`. As long as `sex` could be found in `int`, `gim` is able to define a dummy variable `sexM` itself and move everything forward smoothly. We can, however, use a more straight way to invoke `gim`. For example, we add a column `dummy_sex` to `int`
```{r}
int$dummy_sex <- ifelse(int$sex == "M", 1, 0)
model1 <- list(formula = 'cancer ~ dummy_sex', 
               info = data.frame(var = 'dummy_sex', 
                                 bet = coef(m1)[-1], stringsAsFactors = FALSE))
model1
model <- list(model1, model2, model3)
fit3 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking, 
            "case-control", int, model, 
            ncase = ncase, nctrl = nctrl)
summary(fit3)
```
The result is still the same. Note that row name in `model[[1]]$info` does not matter. 

This example illustrates several features of `gim`: 

- flexible support on formula
- capability to work with partial extenrnal information
- account for variation due to sample overlapping through `nsample` or `ncase`, `nctrl`

`gim` also provides generic accessor functions `coef`, `confint` and `vcov` for extract information from its returned object. 
```{r}
coef(fit1)
confint(fit1, level = 0.9)
all(diag(vcov(fit1))^.5 == summary(fit1)$coef[, 2])
```