This vignette is about running a multistate population projection using the msproj()
-function and reading the input data files required to do so using function fread.msproj()
. Details about the creation of state spaces can be found in the “Creating and filling the state space” vignette.
We assume that a state space has already been created and the corresponding .csv-Files have been filled. Function msproj()
that runs the simulations requires that the data are passed as a list, with each list element corresponding to one of the .csv’s that have been saved when creating the state space using the state.space()
function. Users of the package may read all the required components (state.space
, variable.definitions
and possibly migration
) separately and create the required list manually, but MSDem
also provides a convenience function called fread.msproj()
that automates this process (based on the fread()
function that comes with the data.table
-package, hence its name). As already mentioned in the “Creating and filling the state space” vignette, this only works as expected if the user does not change names and structure of the files that have been created using the state.space()
function.
fread.msproj()
has two arguments:
args(fread.msproj)
## function (data.dir = "input_data/", model.patt = NULL)
## NULL
data.dir
the path to the directory that contains the data sets. Defaults to the subdirectory “input_data/” of the current working directory. If users want to use a different directory, they have to specify the whole path instead.model.patt
the name of the scenario as contained in the names of all relevant data sets. This pattern is automatically generated when setting up the state space, made up of the country
and/or scen
arguments of the state.space()
-function.Consider that we created a state space for India using the following code:
reg <- c("IN.AN", "IN.AP", "IN.AR", "IN.AS", "IN.BR", "IN.CH", "IN.CT", "IN.DD", "IN.DL",
"IN.DN", "IN.GA", "IN.GJ", "IN.HP", "IN.HR", "IN.JH", "IN.JK", "IN.KA", "IN.KL",
"IN.LD", "IN.MH", "IN.ML", "IN.MN", "IN.MP", "IN.MZ", "IN.NL", "IN.OR", "IN.PB",
"IN.PY", "IN.RJ", "IN.SK", "IN.TN", "IN.TR", "IN.UP", "IN.UT", "IN.WB")
st.sp3 <- state.space(region = reg, edu = 6, country = "India", scen = "AGESR_Const", migration = "bilateral")
This is similar to the second state space created in the “Creating and filling the state space” vignette, apart from the migration that now is bilateral, which means that the exact origins and destinations of the migrants are known. We would read the (filled) data files as follows (assuming that the data is saved in the default directory as given above):
datasets = fread.msproj(model.patt = "India_AGESR_Const")
datasets
then is a list having three components:
names(datasets)
## [1] "India_AGESR_Const_mig" "India_AGESR_Const_state_space"
## [3] "India_AGESR_Const_var_def"
The list elements have been named after the csv’s that were read.
Subsequently, a dataset for a fantasy India consisting of just two states will be used to demonstrate how the data looks like and how a projection is run. The dataset comes with the package and can be loaded using data(fantasy_india)
. For further information, see the help file (?fantasy_india
). The dataset has the following 3 components:
data(fantasy_india)
names(fantasy_india)
## [1] "India_AGESR_Const_mig" "India_AGESR_Const_state_space"
## [3] "India_AGESR_Const_var_def"
Let’s have a look at the two components that contain the data in turn.
The first few lines of the filled migration file look like the following (note that only the first 8 of the 76 columns are presented):
head(fantasy_india$India_AGESR_Const_mig)
## period sex age edu origin var IN.AN_urban IN.AP_urban
## 1: 2010 male 0 e1 IN.AN_urban mrate 9.27508402 7.2903537
## 2: 2010 male 0 e1 IN.AP_urban mrate 0.05483115 32.2050180
## 3: 2010 male 0 e1 IN.AN_rural mrate 10.83174894 0.6661628
## 4: 2010 male 0 e1 IN.AP_rural mrate 0.02752167 12.7962720
## 5: 2010 male 0 e1 World mrate 0.00000000 0.0000000
## 6: 2010 male 5 e1 IN.AN_urban mrate 7.34492364 5.7863424
## IN.AN_rural IN.AP_rural World
## 1: 49.02572547 2.1861285 0
## 2: 0.01377913 30.1385492 0
## 3: 81.70289641 0.6099003 0
## 4: 0.01517805 46.5792811 0
## 5: 0.00000000 0.0000000 0
## 6: 33.71858924 1.8075536 0
The first thing we notice is that there is an origin
column, but no destination
column. The reason for this was already explained “Creating and filling the state space” vignette: Choosing the wide format saves rows, which helps us to avoid problems if users want to use Microsoft Excel to fill the file. Thus, each region/residence combination (IN.AN_urban
etc.) can be viewed as a separate destination
column.
Also for this component we want to have a look at the first 6 lines and the first 8 columns:
head(fantasy_india$India_AGESR_Const_state_space)
## period sex age edu var IN.AN_urban IN.AP_urban IN.AN_rural IN.AP_rural
## 1: 2010 male 0 e1 pop 5178.348 1084307 9476.573 2181065
## 2: 2010 male 0 e2 pop 0.000 0 0.000 0
## 3: 2010 male 0 e3 pop 0.000 0 0.000 0
## 4: 2010 male 0 e4 pop 0.000 0 0.000 0
## 5: 2010 male 0 e5 pop 0.000 0 0.000 0
## 6: 2010 male 0 e6 pop 0.000 0 0.000 0
There are 5178.348167 males aged zero and belonging to the lowest educational group in the rural parts of the Andaman and Nicobar Islands region, 1084306.682 show the same pattern in Andhra Pradesh etc. Since there are no persons with an educational level above e1
in this age group, all the other entries are zeroes.
The last 6 lines give us the following information:
tail(fantasy_india$India_AGESR_Const_state_space)
## period sex age edu var IN.AN_urban IN.AP_urban IN.AN_rural
## 1: 2105 female 20 e46 eapr 0.0000000 0.0000000 0.0000000
## 2: 2105 female 20 e56 eapr 0.0000000 0.0000000 0.0000000
## 3: 2105 female 25 e45 eapr 0.9751637 0.9947443 0.9861845
## 4: 2105 female 25 e46 eapr 0.0000000 0.0000000 0.0000000
## 5: 2105 female 25 e56 eapr 0.0000000 0.0000000 0.0000000
## 6: 2105 female 30 e56 eapr 0.8971341 0.8971341 0.8366134
## IN.AP_rural
## 1: 0.0000000
## 2: 0.0000000
## 3: 0.9968825
## 4: 0.0000000
## 5: 0.0000000
## 6: 0.8366134
Here we see the educational attainment progression ratios for females in 2105, with zeroes for those transitions that are not possible.
Once the data have been read correctly, msproj()
is used to run the population projection. This function has the following arguments (as you will note, some of them are marked as “not used at the moment”, i.e., it is planned to include them in future versions of the package):
args(msproj)
## function (data = NULL, patt = NULL, country = NULL, SSP = "SSP2",
## fert = NULL, mort = NULL, int.mig = NULL, recl = NULL, edu = NULL,
## detail.out = FALSE, nSx = "axmx", iPr = 18, reclass.model = NULL,
## reclass.period = 5, maxeapr5 = c(urban = 0.7, rural = 0.5,
## total = 0.6))
## NULL
data
a list containing the data needed to run the simulation. It has to include sublists containing the state space, the variable definitions, and (optionally) the flows of migration as defined by the state.space
function. Defaults to NULL.patt
the pattern of the model to be run. Used to create meaningful file names (especially when detailed output is generated). Defaults to NULL.country
the name of the country (or more general, the highest level in the hierarchy of geographic units) the projections are done for. Used to annotate plots and tables that are created automatically when argument detail.out
is set to TRUE
. Defaults to NULL. If no country is specified, the variable definition file is checked for an entry. If no country is given there, too, the generic name country
will be used.SSP
the Shared Socioeconomic Pathway that forms the basis for the simulation assumptions concerning fertility, mortality, internal migration, reclassification, and education. Defaults to SSP2 (middle of the road scenario). Not used at the moment.fert
an optional vector specifying changes in the fertility assumptions for the whole simulation horizon. Should be NULL
(the default) if the middle of the road fertility assumptions of SSP2 are used, a vector of length iPr
, i.e., corresponding to the number of simulation periods (e.g., years or 5-year-periods) or a single number that is than replicated iPr
times. The values of these vectors represent the relative changes in comparison to the SSP2 assumptions (see examples). Not used at the moment.mort
an optional vector specifying absolute changes in comparison to the base mortality assumptions. In contrast to fert
, the vector values specify absolute changes in comparison to SSP2 for the life expectancy at age 0. Not used at the moment.int.mig
an optional vector specifying relative changes in comparison to the base internal migration assumptions, comparable to fert
. Not used at the moment.recl
an optional vector specifying relative changes in comparison to the base reclassification assumptions, comparable to fert
. Not used at the moment.edu
an optional vector specifying relative changes in comparison to the base educational transition assumptions, comparable to fert
. Not used at the moment.detail.out
logical. Should detailed output be written to the “output_data” directory, which includes frequency tables and pdf’s? Defaults to FALSE
.nSx
the information used to compute the nSx’s. Defaults to "axmx"
, i.e., the nSx’s are computed using ax and mx. Alternatively, the computation can be based on "le0"
.iPr
the number of simulation periods. Must be smaller or equal to the number of periods in the simulation horizon. Defaults to 18 in compliance with the default simulation horizon (90 years = 18 five-year periods).reclass.model
an R object of class “glm” containing the reclassification model. Needed to estimate the proportion of the rural population that is reclassified depending on the current population distribution between urban and rural areas. Defaults to NULL.reclass.period
an integer number specifying the space of time for the reclassification data, i.e., the number of years it takes to observe the specified proportion of reclassification. Defaults to 5.mig.adj
logical. Should the internal migration rates be adjusted to account for the changing proportions of rural and urban population that occur when the simulation is run? Defaults to FALSE
.maxeapr5
a vector specifying the upper limits for the education transition rate to post-secondary attainment among upper-secondary attained. The total value is used if there is no differentiation between urban and rural areas in the model. In contrast, if residence is considered, separate values may be used for rural and urban areas. The possible values are bounded between 0 and 1.With the fantasy_india
-dataset loaded above, we now run a simulation. We only specify three of the function’s arguments: data
is straightforward, reclass.period
is set to 10 years because that is the number of years it takes to observe the levels of reclassification given in the input data, and patt
is used to create meaningful path and file names:
res <- msproj(data = fantasy_india, reclass.period = 10, patt = "FantasyIndia_AGESR_Const")
## [1] "Period is 1"
## [1] "Period is 2"
## [1] "Period is 3"
## [1] "Period is 4"
## [1] "Period is 5"
## [1] "Period is 6"
## [1] "Period is 7"
## [1] "Period is 8"
## [1] "Period is 9"
## [1] "Period is 10"
## [1] "Period is 11"
## [1] "Period is 12"
## [1] "Period is 13"
## [1] "Period is 14"
## [1] "Period is 15"
## [1] "Period is 16"
## [1] "Period is 17"
## [1] "Period is 18"
While the projection is running, the user gets constantly informed about the current projection period.
Results are available in R and in the form of tables tables and graphs saved to the output_data folder.
Let’s have a look at the components of the R object res
:
names(res)
## [1] "pop" "births" "deaths" "net.mig"
## [5] "prop.pop" "prop.reclass" "fert" "mort"
## [9] "int.mig" "recl" "edu"
We get a list containing the following 11 components:
pop
A contingency table in array representation of S3 class c("xtabs", "table")
containing the total population numbers. The dimensions are period, sex and (if included in the model) region and/or residence.births
A contingency table in array representation of S3 class c("xtabs", "table")
containing the number of births. The dimensions are period, sex and (if included in the model) region and/or residence.deaths
A contingency table in array representation of S3 class c("xtabs", "table")
containing the number of deaths. The dimensions are period, sex and (if included in the model) region and/or residence.net.mig
A contingency table in array representation of S3 class c("xtabs", "table")
containing the net migration numbers, i.e., the difference between in-migration and out-migration. Thus, the numbers are negative if the out-migration is higher. The dimensions are period, sex and (if included in the model) region and/or residence.prop.pop
A table containing the proportion of urban population for each year of the simulation period and (if included in the model) subdivided by region.prop.reclass
A table containing the proportion of reclassified population (rural to urban) for each year of the simulation period and (if included in the model) subdivided by region.fert
Relative changes of fertility in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).mort
Absolute changes of mortality in comparison to the SSP2 assumptions (see above) throughout the simulation horizon (see above for details).int.mig
Relative changes of internal migration in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).recl
Relative changes of reclassification in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).edu
Relative changes of educational transition in comparison to the SSP2 assumptions throughout the simulation horizon (see above for details).We now have a look on some of the results in turn (Note: Since the projection starts in 2010, the results for Andhra Pradesh do not take the effects of the Andhra Pradesh Reorganisation Act from 2014 into account). As outlined above, the total population numbers are saved in the pop
component, subdivided according to the dimensions included in the model. Function dimnames()
can be utilized to find out about the dimensions:
dimnames(res$pop)
## $period
## [1] "2010" "2015" "2020" "2025" "2030" "2035" "2040" "2045" "2050" "2055"
## [11] "2060" "2065" "2070" "2075" "2080" "2085" "2090" "2095" "2100"
##
## $sex
## [1] "female" "male"
##
## $region
## [1] "IN.AN" "IN.AP"
##
## $residence
## [1] "rural" "urban"
Thus, period
is dimension no. 1, sex
dimension no. 2 etc. If we are interested in the original table entries, we use the following code:
res$pop["2055", "male", "IN.AN", "rural"]
## [1] 96776
We specified the year we are interested in, the sex, the region, and the residence, and got the corresponding number: In 2055, the projected number of males living in the rural areas of the Andaman and Nicobar Islands region is 96776.
If we are not only interested in these projected population numbers for region IN.AN
, but all the Indian regions, we just do not provide a value for the corresponding dimension:
res$pop["2055", "male", , "rural"]
## IN.AN IN.AP
## 96776 22916113
We can aggregate the data using the margin.table()
-function to calculate the numbers we are interested in. E.g., if we wanted to know the total population by period, not taking into account sex, region, and residence, we would use the following code:
margin.table(res$pop, margin = 1)
## period
## 2010 2015 2020 2025 2030 2035 2040 2045
## 84953634 87530604 90098762 92493982 94433814 95973608 97198402 98078271
## 2050 2055 2060 2065 2070 2075 2080 2085
## 98536345 98479002 97960151 97070900 95893683 94511341 92930255 91211303
## 2090 2095 2100
## 89490574 87858038 86402123
The margin
argument specifies the dimension we are interested in (1
stands for the first dimension, which is period
, as described above). If we wanted to know the population for each period, but separately for each region, we would just specify the vector of the dimensions of interest:
margin.table(res$pop, margin = c(1, 3))
## region
## period IN.AN IN.AP
## 2010 380572 84573062
## 2015 396459 87134145
## 2020 410679 89688083
## 2025 423333 92070649
## 2030 434200 93999614
## 2035 442786 95530822
## 2040 448603 96749799
## 2045 451134 97627137
## 2050 450504 98085841
## 2055 446867 98032135
## 2060 440574 97519577
## 2065 432026 96638874
## 2070 421722 95471961
## 2075 410413 94100928
## 2080 398654 92531601
## 2085 387474 90823829
## 2090 377534 89113040
## 2095 368198 87489840
## 2100 359579 86042544
Births, deaths, and net migration can be analyzed the same way (see component description of the output above).
When looking at the prop.pop
component, we observe a steadily increasing urban population in both states, with Andhra Pradesh getting ahead of the Andaman and Nicobar Islands region in the 2030s:
res$prop.pop
## region
## period IN.AN IN.AP
## 2010 0.3795 0.3416
## 2015 0.4059 0.3693
## 2020 0.4302 0.3941
## 2025 0.4530 0.4171
## 2030 0.4747 0.4387
## 2035 0.4954 0.4590
## 2040 0.5149 0.4779
## 2045 0.5335 0.4953
## 2050 0.5508 0.5115
## 2055 0.5668 0.5265
## 2060 0.5816 0.5405
## 2065 0.5948 0.5536
## 2070 0.6066 0.5657
## 2075 0.6170 0.5769
## 2080 0.6264 0.5872
## 2085 0.6355 0.5968
## 2090 0.6439 0.6057
## 2095 0.6517 0.6141
Since we didn’t specify a reclassification model, the observed proportion of population reclassified from rural to urban areas stays the same for the whole projection period (the small differences for IN.AN
result from rounding).
res$prop.reclass
## region
## period IN.AN IN.AP
## 2010 0.0405 0.0334
## 2015 0.0405 0.0334
## 2020 0.0405 0.0334
## 2025 0.0405 0.0334
## 2030 0.0405 0.0334
## 2035 0.0405 0.0334
## 2040 0.0405 0.0334
## 2045 0.0405 0.0334
## 2050 0.0405 0.0334
## 2055 0.0405 0.0334
## 2060 0.0404 0.0334
## 2065 0.0405 0.0334
## 2070 0.0404 0.0334
## 2075 0.0405 0.0334
## 2080 0.0405 0.0334
## 2085 0.0405 0.0334
## 2090 0.0405 0.0334
## 2095 0.0405 0.0334
Apart from the R object that contains the results, the “output_data” folder in the working directory now contains a new subfolder:
Output subfolder containing the results
Since we didn’t set detail.out = TRUE
, only two files are included in this folder:
Output files
The file with the “.rds” extension contains the list of results as specified above. R’s readRDS()
function can be used to restore the original R object. The second file is a .csv that holds the simulation results, too, but much more detailed. For every possible population pattern, all the relevant numbers are given for the whole simulation horizon:
.csv containing the results in detail
Not all the columns can be presented in this screenshot, but the following list provides a short description of them:
pattern
A character string summarizing some information about the subgroup of the population in the corresponding line. Used internally, can be ignored by the users.sex
Sex of the subgroup.age
Age of the subgroup.edu
Level of education of the subgroup.period
Beginning year of the period.region
Region of the subgroup.residence
Residence of the subgroup.pop
Total population at the beginning of the period: How many people belong to the subgroup?nSx
Survival ratio.pop1
Total population after applying the mortality component = Number of people surviving (pop1 = pop * nSx
).deaths
Number of deaths in the total population (deaths = pop - pop1
).areasex
Used internally, can be ignored by the users.pop2
Total population after applying the education component = People are redistributed among the education categories.outmig
Internal out-migration: Number of people migrating to other regions and/or residences within the country.inmig
Internal in-migration: Number of people migrating from other regions and/or residences within the country.pop3
Total population after applying the migration component (pop3 = pop2 - outmig + inmig
).asfr
Age-specific fertility rate of the subgroup (number of births per 1000 women in one year).pop3.shift
Total population aged by the length of one simulation period (e.g., for five-yearly simulation steps, peopled age 20 become 25, people aged 25 become 30 etc.).births
Number of children born for a certain group of women in the childbearing years. Applied to the populations before and after aging and then halved, i.e., averaged (births = asfr * (pop3 + pop3.shift) / 2
)area
Used internally, can be ignored by the users.deaths.nb
Number of deaths among the newborns.pop3.shift.total
Sum of rural and urban population for the current pattern (i.e., pop3
is aggregated for patterns that only differ in their residence values).perural
Percentage of rural or urban population, relating to the value of residence in the corresponding line.reclasstr
Percentage of population reclassified.pop4
Total population after applying the reclassification component.pop4 = (pop3.shift.total * perural) - (pop3.shift.total * perural * reclasstr)
pop4 = (pop3.shift.total * perural) + (pop3.shift.total * (1 - perural) * reclasstr)
For instance, in line no. 3, we see that there are 2061096 females aged 0-4 that have no education and live in the rural areas of Andhra Pradesh at the beginning of the 2010-2014 period. The nSx-value is about 0.984, thus 32905 of them die and 2028191 survive etc.
Let’s now run the simulation again, but this time we set detail.out = TRUE
:
res <- msproj(data = fantasy_india, reclass.period = 10, patt = "FantasyIndia_AGESR_Const", detail.out = TRUE)
## [1] "Period is 1"
## [1] "Period is 2"
## [1] "Period is 3"
## [1] "Period is 4"
## [1] "Period is 5"
## [1] "Period is 6"
## [1] "Period is 7"
## [1] "Period is 8"
## [1] "Period is 9"
## [1] "Period is 10"
## [1] "Period is 11"
## [1] "Period is 12"
## [1] "Period is 13"
## [1] "Period is 14"
## [1] "Period is 15"
## [1] "Period is 16"
## [1] "Period is 17"
## [1] "Period is 18"
This doesn’t change the results given in R, but additional files are saved into the subfolder containing the results:
Output files if the detail.out-argument is set to TRUE
These additional files hold aggregated information about births, deaths etc. and some projection statistics in tabular and graphical form (saved as .csv’s and pdf’s, respectively). For example, the .csv-file having “e4plus” in its name contains the proportion of people having completed educational level 4 or above, separately for each period, region, and sex:
Percentage of people having completed e4 or above
Two figures (one for each region) are auomatically produced using this information, the one for Andhra Pradesh is presented here:
Percentage of people having completed e4 or above in Andhra Pradesh
The upward trend for both sexes can be seen, with the females surpassing the males in the 2040s.