Sample Selection Models with a Common Dummy Endogenous Regressor in Simultaneous Equations

[Michael Brottrager]

2018-08-30

Introduction

The ssdeR package provides an estimation function for sample selection models in which a common dummy endogenous regressor appears in both the selection equation and the censored equation. This model is analyzed in the framework of an endogenous switching model. Following Kim (2006), a simple two-step estimator is used for this model, which is easy to implement and numerically robust compared to other methods.

For an in-depth derivation of the statistical framework, readers are referred to Kim (2006); this vignette focuses mainly on applying the ssdeR package to the study of causal linkages between climate, conflict and asylum-seeking flows presented in Abel et al. (2018).


Implementation

As in many other regression packages for R, the main model-fitting function ssdeR() uses a formula-based interface and returns an (S3) object of class ssdeR:

ssdeR(formula, treatment, selection, data, subset,
      na.action = FALSE, weights, cluster = NULL,
      print.level = 0, control = ssdeR.control(...),
      model = TRUE, x = FALSE, y = FALSE, ...)

A number of standard S3 methods are provided:

| Method | Description |
|---|---|
| print() | Simple printed display with coefficients |
| summary() | Standard regression summary; returns summary.ssdeR object (with print() method) |
| vcov() | Associated covariance matrix |
| predict() | (Different types of) predictions for new data |
| fitted() | Fitted values for observed data |
| terms() | Extract terms |
| model.matrix() | Extract model matrix (or matrices) |
| nobs() | Extract number of observations |
| logLik() | Extract fitted log-likelihood |
| estfun() | Extract estimating functions (= gradient contributions) for sandwich covariances |

Due to these methods a number of useful utilities work automatically, e.g., AIC(), BIC(), coeftest() (lmtest), etc.
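As a minimal sketch of why this works, the following base-R example defines a toy S3 class with its own logLik() method; AIC() and BIC() then work without any further code. The class toyfit and all numbers in it are hypothetical and are not part of ssdeR.

```r
# Hypothetical toy fit object (not an ssdeR object): it only stores a
# log-likelihood, the number of parameters and the sample size.
fit <- structure(list(loglik = -100, df = 3, n = 50), class = "toyfit")

# S3 method: return a "logLik" object carrying the df and nobs attributes
# that stats::AIC() and stats::BIC() look for.
logLik.toyfit <- function(object, ...) {
  structure(object$loglik, df = object$df, nobs = object$n, class = "logLik")
}

# The generic utilities now dispatch through logLik() automatically.
aic <- AIC(fit)  # -2 * loglik + 2 * df
bic <- BIC(fit)  # -2 * loglik + log(n) * df
print(c(AIC = aic, BIC = bic))
```

The same mechanism is what lets generic helpers operate on a fitted ssdeR object once the methods listed above are available.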


Illustration

To illustrate the package's use in practice, ssdeR is applied to dyadic migration data in the context of Abel et al. (2018). As the paper is currently under revision, readers are encouraged to contact michael.brottrager@jku.at directly for a current version of the paper, including the detailed data description.

data(ConflictMigration, package="ssdeR")
library(ssdeR)

This data.frame contains cross-sectional information about 24336 country-pairs capturing the period 2011-2015.

| Variable Name | Description |
|---|---|
| iso_i | ISO code of origin. |
| iso_j | ISO code of destination. |
| asylum_seekers_ij | Log-transformed number of asylum seekers from origin i in destination j. |
| conflict_i | Conflict in origin i, indicated by any reported battle-related deaths in that country. |
| isflow_ij | Non-zero flows between origin i and destination j. |
| stock_ij | Log-transformed stock of origin natives in destination j before the observational period. (t-1) |
| dist_ij | Metric distance. (t-1) |
| comlang_ij | Common language in both origin and destination (indicator). (t-1) |
| colony_ij | Colonial relationship (indicator). (t-1) |
| polity_i | Normalized (0-1) PolityIV score. (t-1) |
| polity_j | Normalized (0-1) PolityIV score. (t-1) |
| pop_i | Log-transformed origin population. (t-1) |
| pop_j | Log-transformed destination population. (t-1) |
| gdp_j | Log-transformed GDP in destination. (t-1) |
| diaspora_i | Origin diaspora outside. (t-1) |
| ethMRQ_i | Ethnic fractionalization measurement. (t-1) |
| outmigration_i | Log-transformed total outmigration of origin i. (t-1) |
| inmigration_j | Log-transformed total inmigration into destination j. (t-1) |
| spei_i | 12-month average SPEI index. (t-1) |
| battledeaths_i | Log-transformed battle deaths in i. (t-1) |

Our modelling framework aims at assessing quantitatively the determinants of asylum-seeking flows using a gravity equation setting similar to that proposed for bilateral migration data (Cohen et al., 2008), while explicitly addressing the statistical problems caused by endogenous selection in origin-destination pairs and non-random treatments. In this sense, our statistical problem is similar to those often encountered in health care studies, where, for example, enrollment in a healthcare maintenance organisation (treatment) affects a person's decision on both whether to use healthcare at all (extensive margin) and how much to spend on healthcare (intensive margin), given a positive decision. In our setting, however, conflict (treatment) itself is not randomly 'assigned' across our population of origin countries; that is, we have to consider the treatment itself to be endogenous as well. As in the healthcare example above, this treatment (conflict) potentially affects the probability that we observe non-zero flows between some origin-destination country pairs (extensive margin). In other words, we have to account for the selection of countries that send migrants to a given country of destination. Furthermore, conflict potentially affects the number of migrants seeking asylum in some destination country. These figures, however, are only observed in the case of actual flows and thus have to be considered as potentially (non-randomly) censored.

This setting leaves us with three simultaneous equations, two of which contain our common endogenous binary regressor (i.e. conflict onset). In order to estimate this framework of simultaneous equations, we apply a simple two-step estimation technique proposed by Kim (2006). Translated to our context, we are interested in the following sample selection model,

\[ \begin{aligned} c_i^* & = Z_{c,i}^{\prime} \gamma_1 + \epsilon_{c,i}, \quad c_i = I(c_i^*>0) \\ s_{ij}^* & = Z_{s,ij}^{\prime} \gamma_2 + c_i \beta_2 + \epsilon_{s,ij} , \quad s_{ij} = I(s_{ij}^*>0) \\ a_{ij}^* & = Z_{a,ij}^{\prime} \gamma_3 + c_i \beta_3 + \epsilon_{a,ij} , \quad a_{ij} = a_{ij}^*s_{ij} \\ \end{aligned} \]

where the first equation specifies the occurrence of conflict (\(c_i = 1\)) in country \(i\), the second equation addresses whether a non-zero flow of asylum-seeking applications takes place from country \(i\) to country \(j\) (\(s_{ij}=1\)), and the last equation models the size of the flow of applications in logs, \(a_{ij}\), to destination country \(j\) for origin-destination pairs with non-trivial flows. \(I(x)\) is an indicator function taking the value one if \(x\) is true and zero otherwise, and the exogenous controls for each of the equations in the model are summarized in the vectors \(Z_{c,i}\), \(Z_{s,ij}\) and \(Z_{a,ij}\), respectively. The error terms \(\epsilon_{c,i}\), \(\epsilon_{s,ij}\) and \(\epsilon_{a,ij}\) are assumed to be jointly multivariate normal and potentially correlated, thus capturing the endogenous selection of origin countries that present non-zero asylum applications to destination countries. Following Kim (2006), this sample selection model with a common endogenous regressor in the selection equation and the censored outcome equation is estimated as a hybrid of the bivariate probit and the type-II Tobit model containing the common endogenous binary conflict indicator. This implies that we have to control for the endogeneity caused by \(c_i\) and the selection bias caused by the censoring indicator \(s_{ij}\) at the same time.
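To make the structure of the three equations concrete, the following base-R sketch simulates data from a simplified version of the model. It is an illustration under stated assumptions only: all coefficient values, the error correlation matrix Sigma and the variable names (z_c, z_s, z_a) are hypothetical, and each observation is treated as its own unit rather than as an origin-destination pair.

```r
set.seed(1)
n <- 5000

# Hypothetical correlation matrix of (eps_c, eps_s, eps_a); off-diagonal
# entries are what make treatment and selection endogenous.
Sigma <- matrix(c(1.0, 0.3, 0.2,
                  0.3, 1.0, 0.4,
                  0.2, 0.4, 1.0), nrow = 3, byrow = TRUE)

# Draw jointly normal errors: rows of eps have covariance Sigma.
eps <- matrix(rnorm(n * 3), ncol = 3) %*% chol(Sigma)

# One exogenous control per equation (hypothetical).
z_c <- rnorm(n); z_s <- rnorm(n); z_a <- rnorm(n)

# Treatment equation: c = I(c* > 0)
c_star <- -0.5 + 1.0 * z_c + eps[, 1]
c_i    <- as.integer(c_star > 0)

# Selection equation: s = I(s* > 0), with the treatment dummy as regressor
s_star <- -0.2 + 0.8 * z_s + 0.5 * c_i + eps[, 2]
s_ij   <- as.integer(s_star > 0)

# Outcome equation: a* is only observed when s = 1, so a = a* * s
a_star <- 1.0 + 0.6 * z_a + 1.2 * c_i + eps[, 3]
a_ij   <- a_star * s_ij

sim <- data.frame(c_i, s_ij, a_ij, z_c, z_s, z_a)
head(sim)
```

Because the errors are correlated, a plain regression of a_ij on c_i using only the rows with s_ij = 1 would generally not recover the coefficient on c_i used in the simulation, which is the problem the estimator is designed to address.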

Instead of a simulation-assisted Full Information Maximum Likelihood (FIML) approach, we follow Kim (2006) and employ a simple two-step estimation technique: we first estimate the bivariate probit model with structural shift and then use the results of this first stage as control functions in the censored outcome equation, which is estimated with a simple Generalized Method of Moments (GMM) estimator. In this way, the model can be interpreted as a type-V Tobit model with bivariate selection and parameter restrictions. This approach has the advantage of being numerically robust and easy to implement, since it relaxes the strong normality assumptions imposed by the FIML approach.
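The control-function idea behind such two-step procedures can be sketched with the classic Heckman (1976) two-step estimator, a simpler relative with a single probit selection equation. This is not the estimator of Kim (2006) implemented in ssdeR; it only illustrates how a first-stage probit yields a correction term for the second-stage regression. All data and names below are simulated and hypothetical.

```r
set.seed(42)
n <- 4000

x <- rnorm(n)  # regressor in the outcome equation
w <- rnorm(n)  # exclusion restriction: enters the selection equation only

# Correlated errors: corr(u, e) = rho induces selection bias.
rho <- 0.6
u <- rnorm(n)
e <- rho * u + sqrt(1 - rho^2) * rnorm(n)

s <- as.integer(0.5 + 1.0 * w + 0.5 * x + u > 0)  # selection indicator
y <- ifelse(s == 1, 1 + 2 * x + e, NA)            # outcome, observed if s = 1

# Step 1: probit for selection, then the inverse Mills ratio as the
# control function.
probit <- glm(s ~ w + x, family = binomial(link = "probit"))
index  <- predict(probit, type = "link")
imr    <- dnorm(index) / pnorm(index)

# Step 2: OLS on the selected sample, augmented with the control function.
step2 <- lm(y ~ x + imr, subset = s == 1)

# For contrast: the same regression without the correction term.
naive <- lm(y ~ x, subset = s == 1)

print(rbind(corrected = coef(step2)[c("(Intercept)", "x")],
            naive     = coef(naive)))
```

Note that the standard errors reported by lm() in the second step ignore the fact that imr is itself estimated, so they should not be used for inference without a correction.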

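# Outcome (formula), treatment and selection equations; cluster names the
# origin and destination identifiers. The repeated polity_i and pop_i terms
# are harmless, since R formulas drop duplicated terms.
Results <- ssdeR(formula = asylum_seekers_ij ~ stock_ij + dist_ij + I(dist_ij^2) +
                           comlang_ij + colony_ij +
                           polity_i + pop_i + polity_i + pop_i +
                           gdp_j,
                 treatment = conflict_i ~ battledeaths_i + spei_i +
                             polity_i + I(polity_i^2) +
                             diaspora_i + ethMRQ_i,
                 selection = isflow_ij ~ dist_ij + I(dist_ij^2) +
                             outmigration_i + inmigration_j,
                 cluster = c("iso_i", "iso_j"),
                 data = ConflictMigration)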
## conflict_i
 summary(Results)
## 
## Call:
## ssdeR(formula = asylum_seekers_ij ~ stock_ij + dist_ij + I(dist_ij^2) + 
##     comlang_ij + colony_ij + polity_i + pop_i + polity_i + pop_i + 
##     gdp_j, treatment = conflict_i ~ battledeaths_i + spei_i + polity_i + 
##     I(polity_i^2) + diaspora_i + ethMRQ_i, selection = isflow_ij ~ 
##     dist_ij + I(dist_ij^2) + outmigration_i + inmigration_j, data = ConflictMigration, 
##     cluster = c("iso_i", "iso_j"))
## 
## 
## successive function values within tolerance limit
## Standardized residuals:
##        Min         1Q     Median         3Q        Max 
## -5.9093355 -1.3562449 -0.1294025  1.2351413  8.0190794 
## 
## Coefficients (treatment model):
##                   Estimate  Std. Error  z value   Pr(>|z|)    
## (Intercept)    -1.66131253  0.75088761 -2.21246   0.026935 *  
## battledeaths_i  0.33939266  0.04993654  6.79648 1.0721e-11 ***
## spei_i         -1.00959371  0.52210474 -1.93370   0.053150 .  
## polity_i        3.68067113  3.02927898  1.21503   0.224354    
## I(polity_i^2)  -3.59877565  2.85551627 -1.26029   0.207565    
## diaspora_i     -3.14820563  3.54748134 -0.88745   0.374838    
## ethMRQ_i       -0.14824035  0.75184182 -0.19717   0.843695    
## 
## Coefficients (selection model):
##                   Estimate  Std. Error   z value   Pr(>|z|)    
## (Intercept)    -3.24573825  0.19527169 -16.62165 < 2.22e-16 ***
## dist_ij        -0.24301450  0.03285985  -7.39548 1.4089e-13 ***
## I(dist_ij^2)   -0.05142557  0.02208049  -2.32900   0.019859 *  
## outmigration_i  0.19248499  0.02081563   9.24714 < 2.22e-16 ***
## inmigration_j   0.27055934  0.02292991  11.79941 < 2.22e-16 ***
## conflict_i      0.53659403  0.12314817   4.35730 1.3167e-05 ***
## 
## Coefficients (outcome model):
##                 Estimate  Std. Error   z value   Pr(>|z|)    
## (Intercept)   3.49365517  0.45900372   7.61139 2.7117e-14 ***
## stock_ij      0.20887866  0.53123694   0.39319 0.69417692    
## dist_ij       0.38972670  0.18044433   2.15982 0.03078685 *  
## I(dist_ij^2) -0.09448760  0.35260753  -0.26797 0.78872381    
## comlang_ij    0.24992492  0.06903778   3.62012 0.00029447 ***
## colony_ij     0.35339862  0.15037328   2.35014 0.01876623 *  
## polity_i     -1.13785543  0.10389902 -10.95155 < 2.22e-16 ***
## pop_i        -0.11696954  1.73621136  -0.06737 0.94628670    
## gdp_j         0.19566249  1.02550189   0.19080 0.84868478    
## y1            1.34985209  0.13074581  10.32425 < 2.22e-16 ***
## 
## Auxiliary Parameters:
##           Estimate  Std. Error   z value   Pr(>|z|)    
## rho120 -0.38884784  0.09008341  -4.31653 1.5850e-05 ***
## rho121 -0.14721524  0.10280387  -1.43200    0.15214    
## m_11   -0.12410240  0.08514262  -1.45758    0.14496    
## m_12   -2.52134623  0.10830152 -23.28080 < 2.22e-16 ***
## m_01   -0.49479919  0.08701260  -5.68652 1.2965e-08 ***
## m_02   -1.82254938  0.09737346 -18.71711 < 2.22e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## df.residual: 24322
## Log-likelihood: -33754.63 on 14 Df
## 
## AIC: 67537.27
## BIC: 69693.27
## --- 
## Log-likelihood (Bivariate Probit): -17229.64 on 15 Df
## Number of iterations in First Stage BHHH maximisation : 24

To compute the marginal effects of both the treatment and selection models, the ssdeR package provides the marginal.effects.ssdeR() method. This method computes the marginal effects following Mullahy (2017).

If the option model = "selection" is chosen, marginal.effects.ssdeR() returns the marginal effects in the bivariate probit model. With model = "treatment", the computation reduces to simple probit marginal effects, and with model = "outcome", the parameter estimates of the outcome equation are returned as they are.

As ssdeR estimates a bivariate probit model with structural shift, selection model indirect effects are just the treatment model’s direct effects.

Standard errors are computed using the delta method.

marginal.effects.ssdeR(Results, "treatment")
##                direct.effect     std.err
## battledeaths_i    0.12702036 0.006342958
## spei_i           -0.37784835 0.197276417
## polity_i          1.37752001 4.172892411
## I(polity_i^2)    -1.34686998 3.846009158
## diaspora_i       -1.17824062 4.179786627
## ethMRQ_i         -0.05548011 0.041712269
marginal.effects.ssdeR(Results, "selection")
##                direct.effect      std.err
## dist_ij          -0.05667176 0.0018622256
## I(dist_ij^2)     -0.01199261 0.0002648027
## outmigration_i    0.04488811 0.0009343745
## inmigration_j     0.06309530 0.0014467696
## conflict_i        0.01898079 0.0023374493
marginal.effects.ssdeR(Results, "outcome")
##              direct.effect    std.err
## stock_ij         0.2088787 0.53123694
## dist_ij          0.3897267 0.18044433
## I(dist_ij^2)    -0.0944876 0.35260753
## comlang_ij       0.2499249 0.06903778
## colony_ij        0.3533986 0.15037328
## polity_i        -1.1378554 0.10389902
## pop_i           -0.1169695 1.73621136
## gdp_j            0.1956625 1.02550189
## y1               1.3498521 0.13074581
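The delta-method standard errors mentioned above can be illustrated for the simplest case, a univariate probit. The sketch below uses simulated data and base R only; it computes an average marginal effect, which is an assumption made here for illustration (the vignette does not state whether ssdeR averages over observations or evaluates at a reference point), and it does not reproduce the package's internal code.

```r
set.seed(7)
n <- 2000

# Simulated probit data (all values hypothetical).
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- as.integer(-0.3 + 0.8 * x1 - 0.5 * x2 + rnorm(n) > 0)

fit <- glm(d ~ x1 + x2, family = binomial(link = "probit"))
X <- model.matrix(fit)
b <- coef(fit)
V <- vcov(fit)

# Average marginal effect of x1: mean over i of phi(x_i' beta) * beta_x1.
ame_x1 <- function(beta) mean(dnorm(X %*% beta)) * beta[["x1"]]

# Gradient of the marginal effect with respect to beta, by central
# finite differences.
grad <- sapply(seq_along(b), function(k) {
  h  <- 1e-6
  bp <- b; bm <- b
  bp[k] <- bp[k] + h
  bm[k] <- bm[k] - h
  (ame_x1(bp) - ame_x1(bm)) / (2 * h)
})

# Delta method: Var(g(b)) is approximated by grad' V grad.
se_x1 <- sqrt(drop(t(grad) %*% V %*% grad))

print(c(direct.effect = ame_x1(b), std.err = se_x1))
```

The same recipe (effect as a function of the parameters, its gradient, and the estimated covariance matrix) extends to the bivariate probit case, where the effect function involves bivariate normal probabilities as described in Mullahy (2017).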

References

Cameron, A. C. and P. K. Trivedi (2005) Microeconometrics: Methods and Applications, Cambridge University Press.

Greene, W. H. (2003) Econometric Analysis, Prentice Hall.

Heckman, J. (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models, Annals of Economic and Social Measurement, 5(4), p. 475-492.

Johnston, J. and J. DiNardo (1997) Econometric Methods, McGraw-Hill.

Kim, K. I. (2006) Sample selection models with a common dummy endogenous regressor in simultaneous equations: A simple two-step estimation, Economics Letters, 91(2), p. 280-286.

Lee, L., G. Maddala and R. Trost (1980) Asymptotic covariance matrices of two-stage probit and two-stage tobit methods for simultaneous equations models with selectivity, Econometrica, 48, p. 491-503.

Mullahy, J. (2017) Marginal effects in multivariate probit models, Empirical Economics, 52, p. 447.

Petersen, S., G. Henningsen and A. Henningsen (2017) Unpublished manuscript, Department of Management Engineering, Technical University of Denmark.

Toomet, O. and A. Henningsen (2008) Sample Selection Models in R: Package sampleSelection, Journal of Statistical Software, 27(7).

Wooldridge, J. M. (2003) Introductory Econometrics: A Modern Approach, Thomson South-Western.