Running a multistate population projection

Marcus Wurzer

2018-03-09

Introduction

This vignette is about running a multistate population projection using the msproj()-function and reading the input data files required to do so using function fread.msproj(). Details about the creation of state spaces can be found in the “Creating and filling the state space” vignette.

Reading the data

We assume that a state space has already been created and the corresponding .csv-Files have been filled. Function msproj() that runs the simulations requires that the data are passed as a list, with each list element corresponding to one of the .csv’s that have been saved when creating the state space using the state.space() function. Users of the package may read all the required components (state.space, variable.definitions and possibly migration) separately and create the required list manually, but MSDem also provides a convenience function called fread.msproj() that automates this process (based on the fread() function that comes with the data.table-package, hence its name). As already mentioned in the “Creating and filling the state space” vignette, this only works as expected if the user does not change names and structure of the files that have been created using the state.space() function.

fread.msproj() has two arguments:

args(fread.msproj)
## function (data.dir = "input_data/", model.patt = NULL) 
## NULL

Consider that we created a state space for India using the following code:

reg <- c("IN.AN", "IN.AP", "IN.AR", "IN.AS", "IN.BR", "IN.CH", "IN.CT", "IN.DD", "IN.DL", 
         "IN.DN", "IN.GA", "IN.GJ", "IN.HP", "IN.HR", "IN.JH", "IN.JK", "IN.KA", "IN.KL", 
         "IN.LD", "IN.MH", "IN.ML", "IN.MN", "IN.MP", "IN.MZ", "IN.NL", "IN.OR", "IN.PB", 
         "IN.PY", "IN.RJ", "IN.SK", "IN.TN", "IN.TR", "IN.UP", "IN.UT", "IN.WB")
st.sp3 <- state.space(region = reg, edu = 6, country = "India", scen = "AGESR_Const", migration = "bilateral")

This is similar to the second state space created in the “Creating and filling the state space” vignette, apart from the migration that now is bilateral, which means that the exact origins and destinations of the migrants are known. We would read the (filled) data files as follows (assuming that the data is saved in the default directory as given above):

datasets = fread.msproj(model.patt = "India_AGESR_Const")

datasets then is a list having three components:

names(datasets)
## [1] "India_AGESR_Const_mig"         "India_AGESR_Const_state_space"
## [3] "India_AGESR_Const_var_def"

The list elements have been named after the csv’s that were read.

Subsequently, a dataset for a fantasy India consisting of just two states will be used to demonstrate how the data looks like and how a projection is run. The dataset comes with the package and can be loaded using data(fantasy_india). For further information, see the help file (?fantasy_india). The dataset has the following 3 components:

data(fantasy_india)
names(fantasy_india)
## [1] "India_AGESR_Const_mig"         "India_AGESR_Const_state_space"
## [3] "India_AGESR_Const_var_def"

Let’s have a look at the two components that contain the data in turn.

The migration component

The first few lines of the filled migration file look like the following (note that only the first 8 of the 76 columns are presented):

head(fantasy_india$India_AGESR_Const_mig)
##    period  sex age edu      origin   var IN.AN_urban IN.AP_urban
## 1:   2010 male   0  e1 IN.AN_urban mrate  9.27508402   7.2903537
## 2:   2010 male   0  e1 IN.AP_urban mrate  0.05483115  32.2050180
## 3:   2010 male   0  e1 IN.AN_rural mrate 10.83174894   0.6661628
## 4:   2010 male   0  e1 IN.AP_rural mrate  0.02752167  12.7962720
## 5:   2010 male   0  e1       World mrate  0.00000000   0.0000000
## 6:   2010 male   5  e1 IN.AN_urban mrate  7.34492364   5.7863424
##    IN.AN_rural IN.AP_rural World
## 1: 49.02572547   2.1861285     0
## 2:  0.01377913  30.1385492     0
## 3: 81.70289641   0.6099003     0
## 4:  0.01517805  46.5792811     0
## 5:  0.00000000   0.0000000     0
## 6: 33.71858924   1.8075536     0

The first thing we notice is that there is an origin column, but no destination column. The reason for this was already explained “Creating and filling the state space” vignette: Choosing the wide format saves rows, which helps us to avoid problems if users want to use Microsoft Excel to fill the file. Thus, each region/residence combination (IN.AN_urban etc.) can be viewed as a separate destination column.

The state space component

Also for this component we want to have a look at the first 6 lines and the first 8 columns:

head(fantasy_india$India_AGESR_Const_state_space)
##    period  sex age edu var IN.AN_urban IN.AP_urban IN.AN_rural IN.AP_rural
## 1:   2010 male   0  e1 pop    5178.348     1084307    9476.573     2181065
## 2:   2010 male   0  e2 pop       0.000           0       0.000           0
## 3:   2010 male   0  e3 pop       0.000           0       0.000           0
## 4:   2010 male   0  e4 pop       0.000           0       0.000           0
## 5:   2010 male   0  e5 pop       0.000           0       0.000           0
## 6:   2010 male   0  e6 pop       0.000           0       0.000           0

There are 5178.348167 males aged zero and belonging to the lowest educational group in the rural parts of the Andaman and Nicobar Islands region, 1084306.682 show the same pattern in Andhra Pradesh etc. Since there are no persons with an educational level above e1 in this age group, all the other entries are zeroes.

The last 6 lines give us the following information:

tail(fantasy_india$India_AGESR_Const_state_space)
##    period    sex age edu  var IN.AN_urban IN.AP_urban IN.AN_rural
## 1:   2105 female  20 e46 eapr   0.0000000   0.0000000   0.0000000
## 2:   2105 female  20 e56 eapr   0.0000000   0.0000000   0.0000000
## 3:   2105 female  25 e45 eapr   0.9751637   0.9947443   0.9861845
## 4:   2105 female  25 e46 eapr   0.0000000   0.0000000   0.0000000
## 5:   2105 female  25 e56 eapr   0.0000000   0.0000000   0.0000000
## 6:   2105 female  30 e56 eapr   0.8971341   0.8971341   0.8366134
##    IN.AP_rural
## 1:   0.0000000
## 2:   0.0000000
## 3:   0.9968825
## 4:   0.0000000
## 5:   0.0000000
## 6:   0.8366134

Here we see the educational attainment progression ratios for females in 2105, with zeroes for those transitions that are not possible.

Running a projection

Simulation parameters

Once the data have been read correctly, msproj() is used to run the population projection. This function has the following arguments (as you will note, some of them are marked as “not used at the moment”, i.e., it is planned to include them in future versions of the package):

args(msproj)
## function (data = NULL, patt = NULL, country = NULL, SSP = "SSP2", 
##     fert = NULL, mort = NULL, int.mig = NULL, recl = NULL, edu = NULL, 
##     detail.out = FALSE, nSx = "axmx", iPr = 18, reclass.model = NULL, 
##     reclass.period = 5, maxeapr5 = c(urban = 0.7, rural = 0.5, 
##         total = 0.6)) 
## NULL

A projection for India

With the fantasy_india-dataset loaded above, we now run a simulation. We only specify three of the function’s arguments: data is straightforward, reclass.period is set to 10 years because that is the number of years it takes to observe the levels of reclassification given in the input data, and patt is used to create meaningful path and file names:

res <- msproj(data = fantasy_india, reclass.period = 10, patt = "FantasyIndia_AGESR_Const")
## [1] "Period is 1"
## [1] "Period is 2"
## [1] "Period is 3"
## [1] "Period is 4"
## [1] "Period is 5"
## [1] "Period is 6"
## [1] "Period is 7"
## [1] "Period is 8"
## [1] "Period is 9"
## [1] "Period is 10"
## [1] "Period is 11"
## [1] "Period is 12"
## [1] "Period is 13"
## [1] "Period is 14"
## [1] "Period is 15"
## [1] "Period is 16"
## [1] "Period is 17"
## [1] "Period is 18"

While the projection is running, the user gets constantly informed about the current projection period.

Results

Results are available in R and in the form of tables tables and graphs saved to the output_data folder.

Results in R

Let’s have a look at the components of the R object res:

names(res)
##  [1] "pop"          "births"       "deaths"       "net.mig"     
##  [5] "prop.pop"     "prop.reclass" "fert"         "mort"        
##  [9] "int.mig"      "recl"         "edu"

We get a list containing the following 11 components:

  • pop A contingency table in array representation of S3 class c("xtabs", "table") containing the total population numbers. The dimensions are period, sex and (if included in the model) region and/or residence.
  • births A contingency table in array representation of S3 class c("xtabs", "table") containing the number of births. The dimensions are period, sex and (if included in the model) region and/or residence.
  • deaths A contingency table in array representation of S3 class c("xtabs", "table") containing the number of deaths. The dimensions are period, sex and (if included in the model) region and/or residence.
  • net.mig A contingency table in array representation of S3 class c("xtabs", "table") containing the net migration numbers, i.e., the difference between in-migration and out-migration. Thus, the numbers are negative if the out-migration is higher. The dimensions are period, sex and (if included in the model) region and/or residence.
  • prop.pop A table containing the proportion of urban population for each year of the simulation period and (if included in the model) subdivided by region.
  • prop.reclass A table containing the proportion of reclassified population (rural to urban) for each year of the simulation period and (if included in the model) subdivided by region.
  • fert Relative changes of fertility in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).
  • mort Absolute changes of mortality in comparison to the SSP2 assumptions (see above) throughout the simulation horizon (see above for details).
  • int.mig Relative changes of internal migration in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).
  • recl Relative changes of reclassification in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).
  • edu Relative changes of educational transition in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).

We now have a look on some of the results in turn (Note: Since the projection starts in 2010, the results for Andhra Pradesh do not take the effects of the Andhra Pradesh Reorganisation Act from 2014 into account). As outlined above, the total population numbers are saved in the pop component, subdivided according to the dimensions included in the model. Function dimnames() can be utilized to find out about the dimensions:

dimnames(res$pop)
## $period
##  [1] "2010" "2015" "2020" "2025" "2030" "2035" "2040" "2045" "2050" "2055"
## [11] "2060" "2065" "2070" "2075" "2080" "2085" "2090" "2095" "2100"
## 
## $sex
## [1] "female" "male"  
## 
## $region
## [1] "IN.AN" "IN.AP"
## 
## $residence
## [1] "rural" "urban"

Thus, period is dimension no. 1, sex dimension no. 2 etc. If we are interested in the original table entries, we use the following code:

res$pop["2055", "male", "IN.AN", "rural"]
## [1] 96776

We specified the year we are interested in, the sex, the region, and the residence, and got the corresponding number: In 2055, the projected number of males living in the rural areas of the Andaman and Nicobar Islands region is 96776.

If we are not only interested in these projected population numbers for region IN.AN, but all the Indian regions, we just do not provide a value for the corresponding dimension:

res$pop["2055", "male", , "rural"]
##    IN.AN    IN.AP 
##    96776 22916113

We can aggregate the data using the margin.table()-function to calculate the numbers we are interested in. E.g., if we wanted to know the total population by period, not taking into account sex, region, and residence, we would use the following code:

margin.table(res$pop, margin = 1)
## period
##     2010     2015     2020     2025     2030     2035     2040     2045 
## 84953634 87530604 90098762 92493982 94433814 95973608 97198402 98078271 
##     2050     2055     2060     2065     2070     2075     2080     2085 
## 98536345 98479002 97960151 97070900 95893683 94511341 92930255 91211303 
##     2090     2095     2100 
## 89490574 87858038 86402123

The margin argument specifies the dimension we are interested in (1 stands for the first dimension, which is period, as described above). If we wanted to know the population for each period, but separately for each region, we would just specify the vector of the dimensions of interest:

margin.table(res$pop, margin = c(1, 3))
##       region
## period    IN.AN    IN.AP
##   2010   380572 84573062
##   2015   396459 87134145
##   2020   410679 89688083
##   2025   423333 92070649
##   2030   434200 93999614
##   2035   442786 95530822
##   2040   448603 96749799
##   2045   451134 97627137
##   2050   450504 98085841
##   2055   446867 98032135
##   2060   440574 97519577
##   2065   432026 96638874
##   2070   421722 95471961
##   2075   410413 94100928
##   2080   398654 92531601
##   2085   387474 90823829
##   2090   377534 89113040
##   2095   368198 87489840
##   2100   359579 86042544

Births, deaths, and net migration can be analyzed the same way (see component description of the output above).

When looking at the prop.pop component, we observe a steadily increasing urban population in both states, with Andhra Pradesh getting ahead of the Andaman and Nicobar Islands region in the 2030s:

res$prop.pop
##       region
## period  IN.AN  IN.AP
##   2010 0.3795 0.3416
##   2015 0.4059 0.3693
##   2020 0.4302 0.3941
##   2025 0.4530 0.4171
##   2030 0.4747 0.4387
##   2035 0.4954 0.4590
##   2040 0.5149 0.4779
##   2045 0.5335 0.4953
##   2050 0.5508 0.5115
##   2055 0.5668 0.5265
##   2060 0.5816 0.5405
##   2065 0.5948 0.5536
##   2070 0.6066 0.5657
##   2075 0.6170 0.5769
##   2080 0.6264 0.5872
##   2085 0.6355 0.5968
##   2090 0.6439 0.6057
##   2095 0.6517 0.6141

Since we didn’t specify a reclassification model, the observed proportion of population reclassified from rural to urban areas stays the same for the whole projection period (the small differences for IN.AN result from rounding).

res$prop.reclass
##       region
## period  IN.AN  IN.AP
##   2010 0.0405 0.0334
##   2015 0.0405 0.0334
##   2020 0.0405 0.0334
##   2025 0.0405 0.0334
##   2030 0.0405 0.0334
##   2035 0.0405 0.0334
##   2040 0.0405 0.0334
##   2045 0.0405 0.0334
##   2050 0.0405 0.0334
##   2055 0.0405 0.0334
##   2060 0.0404 0.0334
##   2065 0.0405 0.0334
##   2070 0.0404 0.0334
##   2075 0.0405 0.0334
##   2080 0.0405 0.0334
##   2085 0.0405 0.0334
##   2090 0.0405 0.0334
##   2095 0.0405 0.0334

Results saved in the “output data” folder

Apart from the R object that contains the results, the “output_data” folder in the working directory now contains a new subfolder:

Output subfolder containing the results

Output subfolder containing the results

Since we didn’t set detail.out = TRUE, only two files are included in this folder:

Output files

Output files

The file with the “.rds” extension contains the list of results as specified above. R’s readRDS() function can be used to restore the original R object. The second file is a .csv that holds the simulation results, too, but much more detailed. For every possible population pattern, all the relevant numbers are given for the whole simulation horizon:

.csv containing the results in detail

.csv containing the results in detail

Not all the columns can be presented in this screenshot, but the following list provides a short description of them:

  • pattern A character string summarizing some information about the subgroup of the population in the corresponding line. Used internally, can be ignored by the users.
  • sex Sex of the subgroup.
  • age Age of the subgroup.
  • edu Level of education of the subgroup.
  • period Beginning year of the period.
  • region Region of the subgroup.
  • residence Residence of the subgroup.
  • pop Total population at the beginning of the period: How many people belong to the subgroup?
  • nSx Survival ratio.
  • pop1 Total population after applying the mortality component = Number of people surviving (pop1 = pop * nSx).
  • deaths Number of deaths in the total population (deaths = pop - pop1).
  • areasex Used internally, can be ignored by the users.
  • pop2 Total population after applying the education component = People are redistributed among the education categories.
  • outmig Internal out-migration: Number of people migrating to other regions and/or residences within the country.
  • inmig Internal in-migration: Number of people migrating from other regions and/or residences within the country.
  • pop3 Total population after applying the migration component (pop3 = pop2 - outmig + inmig).
  • asfr Age-specific fertility rate of the subgroup (number of births per 1000 women in one year).
  • pop3.shift Total population aged by the length of one simulation period (e.g., for five-yearly simulation steps, peopled age 20 become 25, people aged 25 become 30 etc.).
  • births Number of children born for a certain group of women in the childbearing years. Applied to the populations before and after aging and then halved, i.e., averaged (births = asfr * (pop3 + pop3.shift) / 2)
  • area Used internally, can be ignored by the users.
  • deaths.nb Number of deaths among the newborns.
  • pop3.shift.total Sum of rural and urban population for the current pattern (i.e., pop3 is aggregated for patterns that only differ in their residence values).
  • perural Percentage of rural or urban population, relating to the value of residence in the corresponding line.
  • reclasstr Percentage of population reclassified.
  • pop4 Total population after applying the reclassification component.
    for rural areas: pop4 = (pop3.shift.total * perural) - (pop3.shift.total * perural * reclasstr)
    for urban areas: pop4 = (pop3.shift.total * perural) + (pop3.shift.total * (1 - perural) * reclasstr)

For instance, in line no. 3, we see that there are 2061096 females aged 0-4 that have no education and live in the rural areas of Andhra Pradesh at the beginning of the 2010-2014 period. The nSx-value is about 0.984, thus 32905 of them die and 2028191 survive etc.

Let’s now run the simulation again, but this time we set detail.out = TRUE:

res <- msproj(data = fantasy_india, reclass.period = 10, patt = "FantasyIndia_AGESR_Const", detail.out = TRUE)
## [1] "Period is 1"
## [1] "Period is 2"
## [1] "Period is 3"
## [1] "Period is 4"
## [1] "Period is 5"
## [1] "Period is 6"
## [1] "Period is 7"
## [1] "Period is 8"
## [1] "Period is 9"
## [1] "Period is 10"
## [1] "Period is 11"
## [1] "Period is 12"
## [1] "Period is 13"
## [1] "Period is 14"
## [1] "Period is 15"
## [1] "Period is 16"
## [1] "Period is 17"
## [1] "Period is 18"

This doesn’t change the results given in R, but additional files are saved into the subfolder containing the results:

Output files if the detail.out-argument is set to TRUE

Output files if the detail.out-argument is set to TRUE

 

These additional files hold aggregated information about births, deaths etc. and some projection statistics in tabular and graphical form (saved as .csv’s and pdf’s, respectively). For example, the .csv-file having “e4plus” in its name contains the proportion of people having completed educational level 4 or above, separately for each period, region, and sex:

Percentage of people having completed e4 or above

Percentage of people having completed e4 or above

 

Two figures (one for each region) are auomatically produced using this information, the one for Andhra Pradesh is presented here:

Percentage of people having completed e4 or above in Andhra Pradesh

Percentage of people having completed e4 or above in Andhra Pradesh

 

The upward trend for both sexes can be seen, with the females surpassing the males in the 2040s.