The gim
package is
used to integrate summary data reported to literature (external
studies) with your own raw data (internal
study). In this short tutorial, I will use examples to show
various applications with gim
. To use gim
, you
will have to find out what models are fitted for those external summary
data, while it is up to you to chose the model to be fitted on your own
internal data. Keep in mind that the parameters in the model fitted on
internal data are of interest in gim
analysis.
The gim
package supports flexible ways to specify models
fitted for the internal data and those of external studies. We use
formula
to specify a linear or logistic regression model
for the internal raw data, and consider this model as the true
underlying model. This is important as gim
will
calculate the standard error of its estimations under this model. Also,
only the estimations of the coefficients in this model are of your
interest. Currently, formula
accept arithmetic expressions,
e.g., log(x)
, or other operators like *
for
interactions. For example, y~as.factor(x1) * I(x2 >0)-1
is a valid formula for gim
.
Three options are supported for family
:
gaussian
for linear regression; binomial
for
logistic regression fitted on random sample; case-control
for logistic regression fitted on case-control data. Note that the
case/control ratio can vary among case-control datasets. It is always
benefitial to use case-control
over binomial
even if your data for binary outcome is indeed a random sample from the
population.
Your individual-level internal data should be passed through
data
. Note that every variable in formula
should be present in data
. gim
discards
incomplete samples from analysis.
gim
uses model
, a list of list, to organize
summary information from external studies. Please use ?gim
for more details of model
. A rule of thumb is that if a set
of coefficients are estimated from a fitted model, then one should pack
them as an element into model
. Two fitted external models
could use the same or partially overlapping datasets, however,
estimations from these two models should be stored as two separate
elements in model
as you do have fitted two models. Please
refer to examples below.
External studies may or may not share data when models being fitted.
If summary data from k
external models are stored in
model
, then nsample
should be a k
by k
matrix, with its (i,j)
element
representing the numbers of shared samples being used when fitting the
two models. If models are fitted on independent datasets, then the
corresponding elements in nsample
should be zeros. The
diagonal elements are numbers of samples used in fitting each of the
external models. When analyzing case-control data, we have to specify
ncase
and nctrl
instead, and leave
nsample
to be NULL
, the default value.
gim
is able to handle the senario when internal and
external studies are conducted on different populations. In that case,
an extra reference set is required for making a valid inference on model
parameters, otherwise, as shown in our paper, the estimation could be
biased and the standard error of estimation may be underestimated. The
reference set ref
is a data frame consisting variables used
in formula
and model[[i]]$formula
. Unused
variables in ref
would be discarded automatically.
gim
does not assume an outcome variable is present in
ref
, that is what a reference means. Such a
reference set is used to estimate the distribution of covariates in the
external population.
It is tricky to determine if data
or ref
contain ALL necessary variables when invoking
gim
. If you encounter error message when calling
gim
, try
and
to find out a full list of those variables. You may also add an
artificial outcome into ref
to replace data
in
codes above. If you still encounter any error message when running those
codes, then very likely some variables are missing in data
or ref
.
?gim
provides a good example showing how to use
gim
. Here we give some more specified examples that could
be of interest in practice.
Many research articles in epidemiology usually provide a table in which two-way contingency tables are listed for outcome versus several population characteristics. For example, assume that in a paper about cancer, it may gives the following table
N <- 800
set.seed(0)
sex <- sample(c('F', 'M'), N, TRUE, c(.4, .6))
snp <- rbinom(N, 2, c(.4, .3, .3))
age <- runif(N, 20, 60)
pr <- plogis(.1 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(N, 1, pr)
ext <- data.frame(cancer, sex, snp, age, stringsAsFactors = FALSE)
Case | Control | ||
---|---|---|---|
Sex | M | 261 | 222 |
F | 216 | 101 | |
Age | < 30 | 127 | 92 |
[30,50] | 214 | 159 | |
> 50 | 136 | 72 | |
SNP | 0 | 241 | 133 |
1 or 2 | 236 | 190 |
Note that the raw dataset ext
could not be obtained from
literature. Instead, one can only calculate univariate odds ratio of sex
based on numbers in the table. For age, the odds ratios are subject to
indicators I(age < 30)
and I(age > 50)
.
For SNP, the best we can have is the odds ratio in a dominant model.
Statistically, these are equivalent to fitting univariate models on
ext
as follows
m1 <- glm(cancer ~ sex, data = ext, family = "binomial")
m2 <- glm(cancer ~ I(age < 30) + I(age > 50), data = ext, family = "binomial")
m3 <- glm(cancer ~ I(snp == 0), data = ext, family = "binomial")
Sometimes even the contingency tables are also inaccessible, but only the odds ratios are mentioned in text of articles. We consider a more complicated senario in this example.
summary(m2)$coef
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.2970718 0.1047006 2.8373472 0.004549012
#> I(age < 30)TRUE 0.0253267 0.1723537 0.1469461 0.883174576
#> I(age > 50)TRUE 0.3389170 0.1794548 1.8885921 0.058946501
We can see that only I(age > 50)
shows significant
association with cancer, researchers may only report its log odds ratio
0.34 in their articles. We thus illustrate gim
by ignoring
estimation of intercept in all the three models above. We also assume
that only the log odds ratio of I(age > 50)
is given to
gim
, while that of I(age < 30)
is not. We
create model
as
model1 <- list(formula = 'cancer ~sex',
info = data.frame(var = names(coef(m1))[-1],
bet = coef(m1)[-1], stringsAsFactors = FALSE))
model2 <- list(formula = 'cancer~I(age< 30) + I(age > 50)',
info = data.frame(var = names(coef(m2))[3],
bet = coef(m2)[3], stringsAsFactors = FALSE))
model3 <- list(formula = 'cancer ~ I(snp == 0)',
info = data.frame(var = names(coef(m3))[-1],
bet = coef(m3)[-1], stringsAsFactors = FALSE))
model <- list(model1, model2, model3)
model2
#> $formula
#> [1] "cancer~I(age< 30) + I(age > 50)"
#>
#> $info
#> var bet
#> I(age > 50)TRUE I(age > 50)TRUE 0.338917
For each external model, gim
needs to know the model
used to fit external data (model[[i]]$formula
), as well as
the estimation of coefficients (model[[i]]$info
). In this
example, some coefficients (e.g. intercept and
I(age < 30)
) are estimated but not given to
gim
. This is a sweet spot of gim
that can work
with partial summary information quite well. Note that it is Since all
three external models are fitted on the same dataset, we specify
nsample
for the three fitted models as
If we would consider external studies as case-control studies, which is what we recommend, we could instead specify
and leave nsample
to be NULL
.
Now we collect samples from an internal study. Note that we modify the model by using a differnt intercept 0.3 while keeping all other coefficients remain the same as before, because we want a case-control data of a different case/control ratio.
n <- 300
set.seed(1)
sex <- sample(c('F', 'M'), n, TRUE, c(.4, .6))
snp <- rbinom(n, 2, c(.4, .3, .3))
age <- runif(n, 20, 60)
pr <- plogis(.3 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(n, 1, pr)
smoking <- sample(c(TRUE, FALSE), n, TRUE, c(.3, .7))
int <- data.frame(cancer, sex, snp, age, smoking, stringsAsFactors = FALSE)
In this internal study, we also collect a binary
smoking
, although it does not affect the cancer risk in our
setting. Now we can integrate internal data int
with
external model
. Let’s assume that, based on some
knowledges, we believe that a single cutoff for age at 40 should be used
in the underlying model instead 30 and 50 used in external studies, and
an additive model for SNP would be more of interest. We also want to
investigate potential effect of smoking status. So we assume the
underlying true model would be
cancer ~ I(sex == "F") + I(age > 40) + snp + smoking
,
and invoke gim
as follows
library(gim)
fit1 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking,
"case-control", int, model,
ncase = ncase, nctrl = nctrl)
summary(fit1)
#> Call:
#> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) +
#> snp + smoking, family = "case-control", data = int, model = model,
#> ncase = ncase, nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23734 0.15384 -1.5428 0.1228737
#> I(sex == "F")TRUE 0.55807 0.14132 3.9489 7.85e-05 ***
#> I(age > 40)TRUE 0.67308 0.20339 3.3093 0.0009352 ***
#> snp -0.31241 0.10480 -2.9809 0.0028743 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724477
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One can compare the result with the one from glm
fitted
on int
, which has higher standard errors on its
estimations.
fit0 <- glm(cancer ~ I(sex =="F") + I(age>40) +snp + smoking,
"binomial", int)
summary(fit0)$coef
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.4469066 0.2417339 1.8487546 0.064493257
#> I(sex == "F")TRUE 0.4094388 0.2636631 1.5528865 0.120450258
#> I(age > 40)TRUE 0.6681712 0.2526914 2.6442180 0.008187991
#> snp -0.2303803 0.1784850 -1.2907543 0.196788903
#> smokingTRUE -0.1550470 0.2727146 -0.5685323 0.569673562
The following table compares models fitted on internal and external datasets:
External | ||||
---|---|---|---|---|
Internal | model 1 | model 2 | model 3 | |
Sex | I(sex == 'F') |
I(sex == 'M') |
||
Age | I(age > 40) |
I(age < 30) |
||
I(age > 50) |
||||
SNP | snp |
I(snp == 0) |
||
Smoking | smoking |
In this example, we used the feature in gim
that
supports flexible formula. One can, however, create variable instead
using such a feature. For example, we can add columns to
int
as follows. Exactly the same result would be
returned.
int$sex_F <- ifelse(int$sex == "F", 1, 0)
int$age_gt_40 <- ifelse(int$age > 40, 1, 0)
fit2 <- gim(cancer ~ sex_F + age_gt_40 + snp + smoking,
"case-control", int, model,
ncase = ncase, nctrl = nctrl)
summary(fit2)
#> Call:
#> gim.default(formula = cancer ~ sex_F + age_gt_40 + snp + smoking,
#> family = "case-control", data = int, model = model, ncase = ncase,
#> nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23734 0.15384 -1.5428 0.1228737
#> sex_F 0.55807 0.14132 3.9489 7.85e-05 ***
#> age_gt_40 0.67308 0.20339 3.3093 0.0009352 ***
#> snp -0.31241 0.10480 -2.9809 0.0028743 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724477
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The list model
may looks complicated. For example,
So where does the sexM
come from? It is because that
gim
parses model[[1]]$formula
and
automatically creates a dummy variable for I(sex == "M")
.
R
names such a dummy variable to be sexM
, and
in this example, we put this name to model[[1]]$info$var
.
As long as sex
could be found in int
,
gim
is able to define a dummy variable sexM
itself and move everything forward smoothly. We can, however, use a more
straight way to invoke gim
. For example, we add a column
dummy_sex
to int
int$dummy_sex <- ifelse(int$sex == "M", 1, 0)
model1 <- list(formula = 'cancer ~ dummy_sex',
info = data.frame(var = 'dummy_sex',
bet = coef(m1)[-1], stringsAsFactors = FALSE))
model1
#> $formula
#> [1] "cancer ~ dummy_sex"
#>
#> $info
#> var bet
#> sexM dummy_sex -0.5983149
model <- list(model1, model2, model3)
fit3 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking,
"case-control", int, model,
ncase = ncase, nctrl = nctrl)
summary(fit3)
#> Call:
#> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) +
#> snp + smoking, family = "case-control", data = int, model = model,
#> ncase = ncase, nctrl = nctrl)
#>
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.23734 0.15384 -1.5428 0.1228737
#> I(sex == "F")TRUE 0.55807 0.14132 3.9489 7.85e-05 ***
#> I(age > 40)TRUE 0.67308 0.20339 3.3093 0.0009352 ***
#> snp -0.31241 0.10480 -2.9809 0.0028743 **
#> smokingTRUE -0.15390 0.27265 -0.5645 0.5724477
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result is still the same. Note that row name in
model[[1]]$info
does not matter.
This example illustrates several features of gim
:
nsample
or ncase
, nctrl
gim
also provides generic accessor functions
coef
, confint
and vcov
for
extract information from its returned object.
coef(fit1)
#> (Intercept) I(sex == "F")TRUE I(age > 40)TRUE snp
#> -0.2373434 0.5580730 0.6730805 -0.3124079
#> smokingTRUE
#> -0.1538999
confint(fit1, level = 0.9)
#> 5 % 95 %
#> (Intercept) -0.4903828 0.01569604
#> I(sex == "F")TRUE 0.3256179 0.79052811
#> I(age > 40)TRUE 0.3385342 1.00762688
#> snp -0.4847954 -0.14002039
#> smokingTRUE -0.6023767 0.29457684
all(diag(vcov(fit1))^.5 == summary(fit1)$coef[, 2])
#> [1] TRUE