# Population synthesis {#smsimr}
```{r, include=FALSE}
# packages to allow 'knitting' the chapter to html
library(png)
library(grid)
source("code/functions.R")
source("code/SimpleWorld.R")
```
In this chapter we progress from loading and preparing the input data to
running a spatial microsimulation model. The focus is on the
Iterative Proportional Fitting (IPF) procedure, a simple, fast and widely used
method to allocate individuals to zones. Other methods can achieve the same
result, as discussed in the first section of this chapter and as demonstrated
in Section \@ref(alternative-reweighting).
This chapter moves from theory to practice. It describes the process
underlying the method of population synthesis (Section \@ref(ipftheory))
before implementing the IPF algorithm in base R (Section \@ref(ipfinr)).
In the following sections two packages that automate IPF, **ipfp** and
**mipfp**, are introduced (Sections \@ref(ipfp) and \@ref(mipfp),
respectively). The chapter also introduces the important concepts of
*integerisation* and *expansion*, towards the end of each package-specific
section. The final section (Section \@ref(compareipf)) compares the relative
merits of **ipfp** and **mipfp**.
The SimpleWorld data, loaded in the previous chapter,
is used as the basis of this chapter.
Being small, simple and hopefully by now familiar,
SimpleWorld facilitates understanding of
the process on a 'human scale'. As always, we encourage
experimentation with small datasets to avoid
the worry of overloading your computer, before
applying the same methods to larger and more complex data.
Learning the basic principles and methods of spatial microsimulation
in R is the focus of this chapter.
Time spent mastering these basics will make subsequent steps
much easier.
```{r, echo=FALSE}
# How representative each individual is of each zone is determined by their
# *weight* for that zone. If we have `nrow(cons)` zones and `nrow(ind)`
# individuals (3 and 5, respectively, in SimpleWorld) we will create
# 15 weights. Real world datasets (e.g. that presented in chapter xxx)
# could contain 10,000 individuals
# to be allocated to 500 zones, resulting in an unwieldy 5 million element
# weight matrix but we'll stick with the SimpleWorld dataset here for simplicity.
load("cache-data-prep.RData")
```
How representative each individual is of each zone is represented by their
*weight* for that zone. Each weight links an individual to a zone.
The number
of weights is therefore equal to the number of zones multiplied
by the number of individuals
in the microdata.
In terms of the SimpleWorld data loaded in the previous chapter we have, in
R syntax, `nrow(cons)` zones and `nrow(ind)` individuals
(typing those commands with the data loaded should confirm that there are
3 zones and 5 individuals in the input data for the SimpleWorld example).
This means that `nrow(cons) * nrow(ind)` weights will be estimated
(that is $3 * 5 = 15$ in SimpleWorld). The weights must begin
with an initial value. We will create a matrix of
ones. Thus, every
individual is seen as equally representative of every
zone:^[Any
initial weight values could be used here: initial weights do not affect
the final weights after many iterations. The value of one is
used as a sensible
convention.]
```{r}
# Create the weight matrix.
# Note: relies on data from previous chapter.
weights <- matrix(data = 1, nrow = nrow(ind), ncol = nrow(cons))
dim(weights) # dimension of weight matrix: 5 rows by 3 columns
```
The weight matrix links individual level data to aggregate level data.
A weight matrix value of 0 in cell `[i,j]`, for example, suggests that
the individual `i` is not representative of
zone `j`.
During the IPF procedure these weights are iteratively updated until
they *converge* towards a single result: the final weights which create
a representative population for each zone.^[Fienberg (1970) provides
a geometrical proof that IPF converges to a single result, in the absence
of *empty cells*.]
## Weighting algorithms {#weighting}
\index{reweighting algorithms}
A wide range of methods can be used to allocate individuals
to zones in spatial microsimulation. As with the majority of procedures for statistical
analysis, there are *deterministic* and *stochastic* methods.
The results of *deterministic* methods, such as IPF, never vary: no random
or probabilistic numbers are used so the resulting weights will be the same
every time. *Stochastic* methods such as *simulated annealing*,
on the other hand, use some random numbers.
In the literature, the divide between stochastic and deterministic approaches
is usually mapped onto a wider distinction between *reweighting*
and *combinatorial optimisation* methods. Reweighting methods generally calculate
non-integer weights for every individual-zone combination.
Combinatorial optimisation
methods generally work by randomly allocating individuals from the microdata
survey into each zone, one-by-one, and re-calculating the goodness-of-fit after each change.
If the fit between observed and simulated results improves after an individual
has been 'imported' into the zone in question, the individual will stay.
If the fit deteriorates, the individual will be removed from the zone and
the impact of switching a different individual into the zone is tested.
This distinction between reweighting of fractional weights
and combinatorial optimisation algorithms is important: combinatorial
optimisation methods result in whole individuals being allocated to each zone
whereas reweighting strategies result in fractions of individuals being allocated
to each zone. The latter method means that individual $i$ could have a weight of
0.223 (or any other positive real number) for zone $j$. Of course, this sounds
irrational: a fraction of a person cannot possibly exist: either she is in the zone
or she is not!
However, the distinction between combinatorial optimisation and reweighting
approaches to creating spatial microdata is not as clear cut as it may at first
seem. As illustrated in Figure 5.1, fractional weights generated by reweighting
algorithms such as IPF can be converted into integer weights (via *integerisation*).
Through the process of *expansion*, the integer weight matrix
produced by integerisation can be converted into the final spatial microdata
output stored in 'long' format represented in the right-hand box of Figure 5.1.
Thus the combined processes of integerisation
and expansion allow weight matrices to be translated into the same output
format that combinatorial optimisation algorithms produce directly.
In other words fractional weighting is interchangeable with combinatorial
optimisation approaches for population synthesis.
The reverse process is also possible:
synthetic spatial microdata generated by combinatorial
optimisation algorithms can be converted back into a more compact weight
matrix in a step that we call *compression*. This integer weight matrix has the same
dimensions as the one generated through integerisation described above.
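To make the idea of *expansion* concrete, the following minimal sketch (not part of the
book's code: the names `int_weights` and `expand_weights` are purely illustrative)
converts a small integer weight matrix into 'long' spatial microdata, with one row per
simulated person:

```{r, eval=FALSE}
# Illustrative sketch of 'expansion': an integer weight matrix with
# rows = individuals and columns = zones is turned into long-format
# spatial microdata (one row per simulated person)
int_weights <- matrix(c(1, 2, 0, 3, 1, 2), nrow = 3, ncol = 2)
expand_weights <- function(w) {
  do.call(rbind, lapply(seq_len(ncol(w)), function(zone) {
    # repeat each individual's id as many times as its integer weight
    data.frame(id = rep(seq_len(nrow(w)), times = w[, zone]), zone = zone)
  }))
}
expand_weights(int_weights)
```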
```{r, echo=FALSE}
# is the function tidyr:@expand the same as expand here? will test.
```
Integerisation, expansion and compression procedures allow fractional
weighting and combinatorial optimisation approaches for population synthesis
to be seen as essentially the same thing.
This equivalence between different methods of population synthesis is
the reason we have labelled this section
*weighting algorithms*: combinatorial optimisation approaches to population synthesis can
be seen as a special case of fractional weighting and vice versa.
Thus all deterministic and stochastic
(or weighting and combinatorial optimisation) approaches to the generation of
spatial microdata can be seen as different methods, algorithms, for allocating
weights to individuals. Individuals representative of a zone will be given a
high weight (which is equivalent to being replicated many times in
combinatorial optimisation). Individuals who are rare in a zone will
be given a low weight (or not appear at all, equivalent to a weight of zero).
Later in this chapter we demonstrate functions to translate between the
'weight matrix' and 'long spatial microdata' formats generated by each approach.
\index{spatial microsimulation!schematic}
```{r, fig.cap="Schematic of different approaches for the creation of spatial microdata encapsulating stochastic combinatorial optimisation and deterministic reweighting algorithms such as IPF. Note that integerisation and 'compression' steps make the results of the two approaches interchangeable, hence our use of the term 'reweighting algorithm' to cover all methods for generating spatial microdata.", fig.scap="Schematic of microsimulation processes (IPF vs CO)", echo=FALSE}
img <- readPNG("figures/co-vs-ipf-schema.png")
grid.raster(img)
```
The concept of weights is critical to understanding how
population synthesis generates spatial microdata.
To illustrate the point imagine a parallel SimpleWorld, in which we have
no information about the characteristics of its inhabitants, only the
total population of each zone. In this case we could only assume that
the distribution of characteristics found in the sample is representative
of the distribution of the whole population. Under this scenario, individuals
would be chosen at random from the sample and allocated to zones at random
and the distribution of characteristics of individuals in each zone would be
asymptotically the same as (tending towards) the microdata.
In R, this case can
be achieved using the `sample()` command. We could use this method to
randomly allocate the 5 individuals of the microdata to zone 1 (which has
a population of 12) with the following code:
\index{sampling}
```{r}
# set the seed for reproducibility
set.seed(1)
# create selection
sel <- sample(x = 5, size = 12, replace = T)
ind_z1 <- ind_orig[sel, ]
head(ind_z1, 3)
```
Note the use of `set.seed()` in the above code to ensure the results are reproducible.
It is important to emphasise that without 'setting the seed' (determining the starting
point of the random number sequence) the results will change each time the code is
run. This is because `sample()` (along with other stochastic functions such as `runif()`
and `rnorm()`) is probabilistic and its output therefore depends on a random number
generator
(RNG).^[Without
setting the seed, the results will change each time. In addition, changing
the value inside the brackets of `set.seed()`
will result in a different combination of individuals being selected for each new number --- test this
out in your code. This happens because the method relies on *pseudo random numbers* to select values probabilistically and `set.seed()` specifies where the
random number sequence should begin, ensuring reproducibility. We must therefore trust the random number generator being used. Note that it is impossible for a computer to
choose *entirely* random numbers, so algorithms to generate *pseudo-random numbers* have been developed. See the documentation provided by `?set.seed`
for more information.
]
\index{random number generation}
*Reweighting* methods consist in assigning a weight to each individual for each zone. This works well if we have
a representative sample of the zone that includes the minority groups of the population. Conversely, if the
individual level data contain only the majority of the population and, for example, no old man who is still working, this
kind of individual will never appear in the final data. One way to avoid this is to modify the sample to include such outliers; another is to use a different approach such as a *genetic algorithm*,
which allows individual level variables to be mixed to create new individuals (mutation).
We are not aware of the latter technique being used at present,
although it could be used in future microsimulation work. Another useful alternative to solve this issue is the *tabu-search algorithm*, a heuristic for integer programming (Glover 1986, 1989, 1990; Glover and Laguna, 1997). This heuristic can be used to improve an initial solution produced by IPF.
```{r, echo=FALSE}
# TODO (Refs tabu):
# Glover F (1986) Future paths for integer programming and links to artificial intelligence. Comput. Oper. Res. 13:533–549.
# Glover F (1989) Tabu search—Part I. ORSA J. Comput. 1:190–206.
# Glover F (1990) Tabu search—Part II. ORSA J. Comput. 2:4–32.
# Glover F, Laguna M (1997) Tabu Search (Kluwer Academic Publishers, Boston).
```
In the rest of this chapter, we develop a widely used
reweighting algorithm: Iterative Proportional Fitting (IPF).
When reweighting, three steps are always necessary. First, a
weight matrix containing fractional numbers is generated; then it is integerised.
Once the number of times each individual needs to be replicated is known,
the expansion step creates the final dataset,
which contains one row per individual in the whole
population and a zone for each person. The process
is explained in theory and then demonstrated with
example code in R.
## Iterative Proportional Fitting
\index{iterative proportional fitting (IPF)}
This section begins with the theory (and intuition) behind the IPF
algorithm. Then, the R code to run the method is explained. However,
in R it is often unnecessary to write all of this code yourself:
optimised packages containing functions to perform IPF exist.
The **ipfp** and **mipfp** packages are the subjects of the final two
sub-sections of this section.
### IPF in theory {#ipftheory}
The most widely used and mature *deterministic* method to allocate individuals
to zones is iterative proportional fitting (IPF).
IPF is fast and has a long history: it was demonstrated by
Deming and Stephan (1940)
for estimating internal cells based on known marginals.
IPF involves calculating a series of
non-integer weights that represent how representative each individual is of
each zone. This is *reweighting*.
\index{maximum likelihood}
\index{entropy maximisation}
Regardless of implementation method,
IPF can be used to allocate individuals to zones through the calculation
of *maximum likelihood* values for each zone-individual combination represented
in the weight matrix. This is considered the most probable configuration of
individuals in these zones. IPF is a method of *entropy maximisation*
(Cleave et al. 1995).
The *entropy* is the number of configurations of the underlying spatial microdata
that could result in the same marginal counts.
For in-depth treatment of the mathematics and theory underlying IPF,
interested readers are directed towards Fienberg (1979), Cleave et al. (1995) and an excellent recent review
of methods for maximum likelihood estimation (Fienberg and Rinaldo 2007). For the
purposes of this book, we provide a basic overview of the method as it applies
to spatial microsimulation.
```{r, echo=FALSE}
# TODO (RL): add Lovelace et al (2015) example when published above
```
In spatial microsimulation, IPF is used to allocate individuals to zones.
The subsequent section implements IPF to create *spatial microdata* for
SimpleWorld, using the data loaded in the previous chapter as a basis.
An overview of the method from the
perspective of transport modelling is provided by @Pritchard2012.
```{r, echo=FALSE}
# TODO (MD): add reference above
#(Dumont,2014) Jojo thesis and article given by Eric
```
As with the SimpleWorld example, each application has a matrix
`ind` containing the categorical values of each individual. `ind` is a two
dimensional array (a matrix) in which each row represents an individual and
each column a variable. The value of the cell `ind(i,j)` is therefore the
category of individual `i` for variable `j`.
A second array containing
the constraining count data, `cons`, can, for the purpose of explaining the
theory, be expressed in three dimensions, which we will label `cons_t`:
`cons_t(i,j,k)` is the marginal count
for zone 'i', variable 'j' and category 'k'. For example,
'i' could be a municipality, 'j' the gender and 'k' the category 'female'.
Element '(i,j,k)' is then the total number of women in this municipality according
to the constraint data.
The IPF algorithm proceeds zone by zone. For each zone, each individual
has a weight expressing how representative they are of that zone. The weight matrix
therefore has dimensions 'number of individuals x number of zones'.
'w(i,j,t)' corresponds to the weight of individual 'i' in
zone 'j' during step 't'. For zone 'z', we adapt the weight matrix to
each constraint 'c'. The matrix is initialised with ones
and then, for each step 't', the update can be expressed as:
$$ w(i,z,t+1) = w(i,z,t) * \displaystyle\frac{cons \_ t(z,c,ind(i,c))}{ \displaystyle\sum_{j=1}^{n_{ind}} w(j,z,t) * I(ind(j,c)=ind(i,c))} $$
where 'I(x)' is the indicator function, whose value is 1
if x is true and 0 otherwise. Recall that 'ind(i,c)' is the category
of individual 'i' for variable 'c'. The denominator corresponds
to the sum of the current weights of all individuals having the same category
as 'i' for this variable. We simply redistribute the weights so that the
data follow the constraint concerning this variable.
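To make the formula more concrete, the following minimal sketch (the names
`ipf_update`, `w`, `ind_c` and `cons_c` are illustrative, not objects used elsewhere in
the book) applies one proportional update for a single zone and a single constraint,
using values matching the SimpleWorld age constraint for zone 1:

```{r, eval=FALSE}
# One proportional update for a single zone and a single constraint:
# w      current weights of the individuals for this zone
# ind_c  category of each individual for the constraint variable
# cons_c named vector of target counts per category
ipf_update <- function(w, ind_c, cons_c) {
  for (categ in names(cons_c)) {
    index <- ind_c == categ
    w[index] <- w[index] * cons_c[categ] / sum(w[index])
  }
  w
}
# 5 individuals (2 young, 3 old); zone target: 8 young, 4 old
ipf_update(w = rep(1, 5),
           ind_c = c("old", "old", "young", "old", "young"),
           cons_c = c(young = 8, old = 4))
# returns 1.33, 1.33, 4.00, 1.33, 4.00 -- summing to 12, the zone population
```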
### IPF in R {#ipfinr}
\index{IPF!in R}
In the subsequent examples, we use IPF to allocate individuals to zones
in SimpleWorld, using the data loaded in
the previous chapter as a basis. IPF is mature,
fast and has a long history.
The IPF algorithm can be written in
R from scratch, as illustrated in Lovelace (2014), and as taught in the
smsim-course on-line
[tutorial](https://github.com/Robinlovelace/spatial-microsim-book).
We will refer to this implementation of IPF as 'IPFinR'.
The code described in this tutorial 'hard-codes' the IPF algorithm
in R and must be adapted to each new application (unlike the generalized 'ipfp' approach,
which works unmodified on any reweighting problem).
IPFinR works by saving the weight
matrix after every constraint for each iteration. We first develop IPFinR here to
give you an idea of the algorithm. Without reading the *source code* of the
**ipfp** and **mipfp** packages, available from CRAN,
they can seem like "black boxes" which
magically generate answers without explanation of *how* the answer was created.
By the end of this section you will have generated a weight matrix representing how representative each individual
is of each zone.
The algorithm operates zone by zone and the weight matrix is filled in
column by column. For convenience, we begin by assigning some of the basic parameters
of the input data to intuitive names, for future reference.
```{r}
# Create intuitive names for the totals
n_zone <- nrow(cons) # number of zones
n_ind <- nrow(ind) # number of individuals
n_age <- ncol(con_age) # number of categories of "age"
n_sex <- ncol(con_sex) # number of categories of "sex"
```
The next code chunk creates the R objects needed to run IPFinR:
the weight matrix (`weights`) and the marginal distribution of individuals in each
zone (`ind_agg0`), an object in which rows are zones and
columns are the different categories of the variables. We also duplicate the
weight matrix so that each step is kept in memory.
```{r}
# Create initial matrix of categorical counts from ind
weights <- matrix(data = 1, nrow = nrow(ind), ncol = nrow(cons))
# create additional weight objects
weights1 <- weights2 <- weights
ind_agg0 <- t(apply(cons, 1, function(x) 1 * ind_agg))
colnames(ind_agg0) <- names(cons)
```
IPFinR begins with a
couple of nested 'for' loops, one to iterate through each zone
(hence `1:n_zone`, which means "from 1 to the number of zones in the
constraint data") and one to iterate through each
category within the constraints (0--49 and 50+ for the first constraint),
using the `cons` and `ind` objects loaded in the
previous chapter.
```{r, echo=FALSE}
# The long way of doing it
# # Create initial matrix of categorical counts from ind
# ind_agg0 <- matrix(rep(ind_agg, nrow(cons)), nrow = nrow(cons), byrow = T)
# weights1 <- weights2 <- weights # create addition weight objects for IPFinR
# # Assign values to the previously created weight matrix
# for(j in 1:nrow(cons)){
# for(i in 1:ncol(con_age)){
# weights1[ind_cat[ , i ] == 1, j] <- con_age[j , i] / ind_agg0[j , i]
# }
# }
```
```{r}
# Assign values to the previously created weight matrix
# to adapt to age constraint
for(j in 1:n_zone){
  for(i in 1:n_age){
    index <- ind_cat[, i] == 1
    weights1[index, j] <- weights[index, j] * con_age[j, i]
    weights1[index, j] <- weights1[index, j] / ind_agg0[j, i]
  }
  print(weights1)
}
```
The above code updates the weight matrix by multiplying each weight
by a coefficient to respect the desired proportions (age constraints).
These coefficients are the divisions of each cell in
the census constraint (`con_age`) by the equivalent cell
aggregated version of the individual level data. In the first step,
all weights were initialised to 1 and the new weights equal
the coefficients. Note that the two lines changing `weights1` could
be joined to form only one line (we split it into two lines
to respect the margins of the book).
The weight matrix is critical to the spatial microsimulation procedure because
it describes how representative each individual is of each zone. To see the
weights that have been allocated to individuals to populate, for example zone 2,
you would query the second column of the weights: `weights1[, 2]`. Conversely,
to see the weights allocated to individual 3 for each of the 3 zones, you
need to look at the 3rd row of the weight matrix: `weights1[3, ]`.
Note that we asked R to print the resulting matrix after the completion of each zone.
As said before, the algorithm proceeds zone by zone and each column of the matrix
corresponds to a zone. This explains why the matrix is filled in column by column. We can now
verify that the weights correspond to the application of the theory seen before.
For the first zone, the age constraint was to have 8 people under 50 years old and
4 over this age. The first individual is a 59 year old man, so over 50.
To determine the weight of this person in zone 1, we multiply the
current weight, 1, by a ratio whose numerator is the
number of persons in this age category according to the constraint, here 4,
and whose denominator is the sum of the weights of the individuals
having the same age category. Here, there are 3 individuals over
50 years old and all weights are 1 for the moment. The new weight
is
$$ 1 * \frac{4}{1+1+1}=1.33333$$
We can verify the other weights with a similar reasoning. Now that
we have explained the whole process underlying IPF, we can understand the origin
of the name 'Iterative Proportional Fitting'.
Thanks to the weight matrix, we can see that individual 3
(whose attributes can be viewed by entering `ind[3, ]`)
is young and has a comparatively low weight of 1 for zone two. Intuitively
this makes sense because zone 2 has only 2 young adult inhabitants (see
the result of `cons[2,]`) but 8 older inhabitants. The reweighting stage is
making sense. Note also that the weights generated are fractional;
see Section \@ref(sintegerisation) for methods of converting fractional weights into
integer weights to generate a synthetic small-area population ready for
agent-based modelling applications.
The next step in IPF, however, is to
re-aggregate the results from individual level data after they have been
reweighted. For the first zone, the weights of each individual are in the
first column of the weight matrix. Moreover, the characteristics of each
individual are inside the matrix `ind_cat`.
When multiplying `ind_cat` by
the first column of `weights1`, each individual's row is scaled by their weight;
summing each column of the result gives the number of people in each category for zone 1.
The following `for` loop
re-aggregates the individual level data, with the new weights for each zone:
```{r}
# Create additional ind_agg objects
ind_agg2 <- ind_agg1 <- ind_agg0 * NA
# Assign values to the aggregated data after con 1
for(i in 1:n_zone){
  ind_agg1[i, ] <- colSums(ind_cat * weights1[, i])
}
```
If you are new to IPF, congratulations: you
have just reweighted your first individual level dataset to a geographical
constraint (the age) and have aggregated the results.
At this early stage it is important to do some preliminary checks
to ensure that the code is working correctly. First, are the resulting
populations for each zone correct? We check this for the first constraint variable
(age) using the following code (to test your understanding,
try to check the populations of the unweighted
and weighted data for the second constraint --- sex):
```{r}
rowSums(ind_agg1[, 1:2]) # the simulated populations in each zone
rowSums(cons[, 1:2]) # the observed populations in each zone
```
The results of these tests show that the new populations are correct, verifying
the technique. But what about the fit between the observed and simulated
results after constraining by age? We will cover goodness of fit in more
detail in subsequent sections. For now, suffice to know that the simplest
way to test the fit is by using the `cor` function on a 1d representation of the
aggregate level data:
```{r}
vec <- function(x) as.numeric(as.matrix(x))
cor(vec(ind_agg0), vec(cons))
cor(vec(ind_agg1), vec(cons))
```
The point here is to calculate the correlation between the aggregated simulated data
and the constraints. This value lies between -1 and 1 and, in our case,
the best fit will be 1, meaning a perfect correlation between our data
and the constraints. Note that as well as creating the correct total population for each zone, the
new weights also lead to much better fit. To see how this has worked, let's
look at the weights generated for zone 1:
```{r}
weights1[, 1]
```
The results mean that individuals 3 and 5 have been allocated a weight of 4
whilst the rest have been given a weight of $4/3$. Note that the total of these
weights is 12, the population of zone 1. Note also that individuals 3 and 5 are
in the younger age group (verify this with `ind$age[c(3,5)]`) which are more
commonly observed in zone 1 than the older age group:
```{r}
cons[1, ]
```
Note there are 8 individuals under 50 years old in zone 1, but only 2 individuals
with this age in the individual level survey dataset. This explains why the
weight allocated to these individuals was 4: 8 divided by 2 = 4.
So far we have only constrained by age. This results in aggregate level results
that fit the age constraint but not the sex constraint (Figure 5.2). The reason
for this should be obvious: weights are selected such that
the aggregated individual level data fits the age constraint perfectly, but
no account is taken of the sex constraint. This is why IPF must constrain for
multiple constraint variables.
To constrain by sex, we simply repeat the
nested `for` loop demonstrated above for the sex constraint. This is implemented
in the code block below.
```{r}
for(j in 1:n_zone){
  # loop over the sex categories: columns (n_age + 1) to (n_age + n_sex) of ind_cat
  for(i in 1:n_sex + n_age){
    index <- ind_cat[, i] == 1
    weights2[index, j] <- weights1[index, j] * cons[j , i] /
      ind_agg1[j, i]
  }
}
```
Again, the aggregate values need to be calculated in a `for` loop over every
zone. After the first constraint fitting, the weights for zone 1 were:
$$(\frac{4}{3},\frac{4}{3},4,\frac{4}{3},4) $$
\pagebreak
We can explain theoretically the weights for zone 1 after the
second fitting. The first individual has a current weight of $\frac{4}{3}$
and he is male. In zone 1, the constraint is to have 6 men. The
first three individuals are men, so the new weight for this person in this
zone is $$weights2[1,1]=\frac{4}{3}*\frac{6}{\frac{4}{3}+\frac{4}{3}+4}=\frac{6}{5}=1.2 $$
With analogous reasoning, we can find all weights in *weights2*:
```{r}
weights2
```
Note that the final aggregated values are calculated by multiplying
`ind_cat` by `weights2`:
```{r}
for(i in 1:n_zone){
  ind_agg2[i, ] <- colSums(ind_cat * weights2[, i])
}
ind_agg2
```
Note that even after constraining by age and sex, there
is still not a perfect fit between observed and simulated cell values
(Figure 5.2). After constraining only by age, the simulated cell counts
for the sex categories are far from
the observed, whilst the cell counts for the age categories fit perfectly.
On the other hand, after constraining by age *and* sex, the fit is still not
perfect. This time the age categories do not fit perfectly, whilst the sex
categories fit perfectly. Inspect `ind_agg1` and `ind_agg2` and try to explain
why this is. The concept of IPF is to repeat this procedure several times:
each iteration re-constrains to each variable in turn.
In our case, the names iteration 1.1 and 1.2 describe the first iteration constrained
by age and then by sex, respectively.
The *overall fit*, combining age and sex categories,
has improved greatly from iteration 1.1 to 1.2 (from $r = 0.63$ to $r = 0.99$).
In iteration 2.1 (in which we constrain by age again) the fit improves again.
So, each time we reweight to a specific constraint, the fit to this constraint
is perfect, because, as seen in the theory, it is a proportional reallocation.
Then, we repeat for another constraint and the fit to the first one can diverge from this
perfect fit. However, by repeating this process several times, always
refitting to the next constraint, we converge towards a unique weight matrix.
```{r valit_plot1, echo=FALSE, fig.cap="Fit between observed and simulated values for age and sex categories (column facets) after constraining a first time by age and sex constraints (iterations 1.1 and 1.2, plot rows). The dotted line in each plot represents perfect fit between the simulated and observed cell values. The overall fit in each case would be found by combining the left and right-hand plots. Each symbol corresponds to a category and each category has a couple (observed, simulated) for each zone.", fig.scap="Fit between observed and simulated values after one iteration of IPF"}
# Make data long it with tidyr
library(tidyr)
library(dplyr) # a package for manipulating data
x <- gather(cons, Category, Observed, a0_49:f)
y <- gather(as.data.frame(ind_agg1), cat, Simulated, a0_49:f)
z <- gather(as.data.frame(ind_agg2), cat, Simulated, a0_49:f)
# y$cat <- x$cat # to make categories identical (not needed)
df1 <- cbind(x,y)
df1$Constraint <- "1.1: Age"
df2 <- cbind(x, z)
df2$Constraint <- "1.2: Sex"
df <- rbind(df1, df2)
df$Variable <- "Age"
df$Variable[grep("[mf]", df$Category)] <- "Sex"
library(ggplot2)
qplot(Observed, Simulated, data = df, shape = Category, alpha = 0.5) +
facet_grid(Constraint ~ Variable) +
geom_abline(slope = 1, linetype = 3) +
scale_alpha(guide = F) +
scale_shape_manual(values = c(1,11,2,12)) +
theme_bw()
# ggsave("figures/fit-obs-sim-simple-5.png")
# Attempt 2: do it with reshape2
# library(reshape2)
# melt(cons)
```
These results show that IPF requires multiple *iterations* before converging on
a single weight matrix. It is relatively simple to iterate the procedure
illustrated in this section multiple times, as described in the smsim-course
tutorial. However, for the purposes of this book, we will move on now to consider
implementations of IPF that automate the fitting procedure and iteration process.
The aim is to make spatial microsimulation as easy and accessible as possible.
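Before doing so, here is a rough sketch of what such an iteration loop could look like,
using the objects defined earlier in this section (the names `n_iter`, `weights_iter` and
`ind_agg_iter` are illustrative and not used elsewhere in the book):

```{r, eval=FALSE}
n_iter <- 10 # illustrative number of complete iterations
weights_iter <- matrix(1, nrow = n_ind, ncol = n_zone)
for(iter in 1:n_iter){
  # iteration x.1: re-fit to the age constraint
  ind_agg_iter <- t(apply(weights_iter, 2, function(w) colSums(ind_cat * w)))
  for(j in 1:n_zone){
    for(i in 1:n_age){
      index <- ind_cat[, i] == 1
      weights_iter[index, j] <- weights_iter[index, j] * con_age[j, i] /
        ind_agg_iter[j, i]
    }
  }
  # iteration x.2: re-fit to the sex constraint
  ind_agg_iter <- t(apply(weights_iter, 2, function(w) colSums(ind_cat * w)))
  for(j in 1:n_zone){
    for(i in 1:n_sex + n_age){
      index <- ind_cat[, i] == 1
      weights_iter[index, j] <- weights_iter[index, j] * cons[j, i] /
        ind_agg_iter[j, i]
    }
  }
}
```

After a few iterations the weights produced this way should closely match those generated
by the packages introduced below.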
\pagebreak
The advantage of hard-coding the IPF process, as illustrated above, is that
it helps understand how IPF works and aids the diagnosis of issues with
the reweighting process as the weight matrix is re-saved after every
constraint and iteration. However, there are more computationally efficient
approaches to IPF. To save computer and researcher time, we
use in the next sections
R packages which implement IPF without the user needing to *hard code*
each iteration in R:
**ipfp** and **mipfp**. We will use each of these methods to generate
fractional weight matrices allocating each individual
to zones. After following this reweighting process with **ipfp**, we will progress to
integerisation: the process of converting the fractional weights into integers.
This process creates a final contingency table, which is used to generate the
final population. This last step is called the expansion.
### IPF with **ipfp** {#ipfp}
\index{ipfp}
\index{IPF!with ipfp}
IPF runs much faster and with less code using the
**ipfp** package than in pure R. The `ipfp` function runs the IPF algorithm
in the C language, taking aggregate constraints, individual level
data and an initial weight vector (`x0`) as inputs:
```{r}
library(ipfp) # load ipfp library after install.packages("ipfp")
cons <- apply(cons, 2, as.numeric) # to 1d numeric data type
ipfp(cons[1,], t(ind_cat), x0 = rep(1, n_ind)) # run IPF
```
It is impressive that the entire IPF process, which takes dozens of lines of
code in base R (the IPFinR method), can be condensed into two lines: one to
convert the input constraint dataset to `numeric`^[The conversion is needed
because the underlying C implementation expects *floating point* (`numeric`)
data; integer input would fail.]
and one to perform the IPF operation itself. The whole procedure is hidden
behind a function written in C and optimised. So, it is like a
magic box into which you put your data and which returns the results. This is a good
way to execute the algorithm, but one must pay attention to
the format of the input arguments to use the function correctly. To check these,
type `?ipfp` in R.
Note that although
we did not specify how many iterations to run, the above command
ran the default of `maxit = 1000` iterations, despite convergence happening after
10 iterations. Note that 'convergence' in this context means that the norm of the
matrix containing the difference between two consecutive iterations reaches the
value of `tol`, which can be set manually in the function
(e.g. via `tol = 0.0000001`). The default value of `tol`
is `.Machine$double.eps`, which is a very small number indeed:
0.000000000000000222, to be precise.
```{r, echo=FALSE}
# Attention, the TAE is the current matrix vs the target matrix
# convergence is the current matrix vs the previous one
# if the constraints are not possible to respect, the convergence will be reach
# without fitting to the target (a little tol, but a big TAE)
```
The number of iterations can also
be set by specifying the `maxit` argument (e.g. `maxit = 5`).
The result after each iteration will be printed if we enable the verbose argument
with `verbose = TRUE`. Note also that these arguments can
be referred to lazily using only their first letter: `verbose` can
be referred to as `v`, for example,
as illustrated below (not all lines of R output are shown):
```{r, eval=FALSE}
ipfp(cons[1,], t(ind_cat), rep(1, n_ind), maxit = 20, v = T)
```
```
## iteration 0: 0.141421
## iteration 1: 0.00367328
## iteration 2: 9.54727e-05
## ...
## iteration 9: 4.96507e-16
## iteration 10: 4.96507e-16
```
To be clear, the `maxit` argument in the above code
specifies the maximum number of iterations and setting
`verbose` to `TRUE` (the default is `FALSE`) instructed R to print the result.
Being able to control R in this way is critical to mastering spatial
microsimulation with R.
The numbers printed in the output from R
correspond to the 'distance' between
the previous and current weight matrices. When the two matrices are equal, the algorithm has
converged and the distance approaches 0. If the distance falls below
the value of `tol`, the algorithm stops.
Note that when calculating, the computer makes numerical approximations
of real numbers. For example, when calculating the result of $\frac{4}{3}$, the computer cannot store the infinite number of decimals and truncates the number.
Because of this, we rarely reach a perfect 0, but we assume that a result
very close to zero is sufficient.
Usually, we use the precision of the computer, which is of order $10^{-16}$
(in R, you can display the machine precision
by typing `.Machine$double.eps`).
Notice also that for the function to work
a *transposed* (via the `t()` function) version of the individual level
data (`ind_cat`) was used. This differs from the
`ind_agg` object used in the pure R version. To prevent having to transpose
`ind_cat` every time `ipfp` is called, we save the transposed version:
```{r}
ind_catt <- t(ind_cat) # save transposed version of ind_cat
```
Another object that can be saved prior to running `ipfp` on all zones
(the rows of `cons`) is `rep(1, nrow(ind))`, simply a series of ones - one for each individual.
We will call this object `x0`, after the name of the `ipfp` argument representing
the starting point of the weight estimates:
```{r}
x0 <- rep(1, n_ind) # save the initial vector
```
To extend this process to all three zones we can wrap the line beginning
`ipfp(...)` inside a `for` loop, saving the results each time into a
weight variable we created earlier:
```{r}
weights_maxit_2 <- weights # create a copy of the weights object
for(i in 1:ncol(weights)){
  weights_maxit_2[,i] <- ipfp(cons[i,], ind_catt, x0, maxit = 2)
}
```
The above code uses `i` to iterate through the constraints, one row (zone) at
a time, saving the output vector of weights for the individuals into columns
of the weight matrix. To make this process even more concise (albeit
less clear to R beginners), we can use R's
`apply` function, which loops over the zones for us:
```{r}
weights <- apply(cons, MARGIN = 1, FUN =
function(x) ipfp(x, ind_catt, x0, maxit = 20))
```
In the above code R iterates through each row
(hence the second argument `MARGIN` being `1`, `MARGIN = 2`
would signify column-wise iteration).
Thus `ipfp` is applied to each zone in turn, as with the `for` loop implementation.
The speed savings of writing the function
with different configurations are benchmarked in
'parallel-ipfp.R' in the 'R' folder of the book project directory.
This shows that reducing the maximum iterations of `ipfp` from
the default 1000 to 20 has the greatest performance benefit.^[These
tests also show that any speed gains from using `apply` instead of `for` are negligible, so
whether to use `for` or `apply` can be decided by personal preference.]
To make the code run faster on large datasets, a parallel version of
`apply` called `parApply` can be used. This is also
tested in 'parallel-ipfp.R'.
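As a hedged sketch of the parallel approach mentioned above (the exact setup used in
'parallel-ipfp.R' may differ), `parApply` from the built-in **parallel** package can
replace `apply`:

```{r, eval=FALSE}
library(parallel)
cl <- makeCluster(detectCores() - 1)   # one worker per core, minus one
clusterEvalQ(cl, library(ipfp))        # load ipfp on each worker
clusterExport(cl, c("ind_catt", "x0")) # send the shared objects to the workers
weights_par <- parApply(cl, cons, MARGIN = 1, FUN =
  function(x) ipfp(x, ind_catt, x0, maxit = 20))
stopCluster(cl)
```

For small examples like SimpleWorld the parallel overhead will outweigh any gain; this
approach only pays off for large constraint tables.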
Take care when using `maxit` as the only stopping criterion, as it is
impossible to predict the number of iterations necessary
to achieve the desired level of fit for every application.
If the value is too big you will needlessly lose time, but if it is
too small, the result will not have converged.
Using `tol` alone is also hazardous. Indeed, if successive
matrices are essentially the same but numerical approximation keeps the distance above
your chosen value, the algorithm will continue indefinitely. That is why we use `tol`
and `maxit` in tandem. The default values for
`maxit` and `tol` are 1000 and a very small number defined in R as
`.Machine$double.eps`, respectively.
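For example (an illustrative sketch; the tolerance value here is arbitrary), both
criteria can be supplied together so the algorithm stops at whichever is reached first:

```{r, eval=FALSE}
# Stop after at most 100 iterations, or sooner if the distance between
# successive weight matrices falls below 1e-10
weights_both <- apply(cons, MARGIN = 1, FUN =
  function(x) ipfp(x, ind_catt, x0, maxit = 100, tol = 1e-10))
```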
```{r, echo=FALSE, eval=FALSE}
# Also discuss what happens when you get a huge dataset, from Stephen's dataset
```
It is important to check that the weights obtained from IPF make sense.
To do this, we multiply the weights of each individual by rows of
the `ind_cat` matrix, for each zone. Again, this can be done using
a for loop, but the apply method is more concise:
```{r}
ind_agg <- t(apply(weights, 2, function(x) colSums(x * ind_cat)))
colnames(ind_agg) <- colnames(cons) # make the column names equal
```
As a preliminary test of fit,
it makes sense to check a sample of the aggregated weighted data
(`ind_agg`) against the same sample of the constraints.
Let's look at the results (one would use a subset of the results,
e.g. `ind_agg[1:3, 1:5]` for the first five values of the first 3
zones for larger constraint tables found in the real world):
```{r}
ind_agg
cons
```
This is a good result: the constraints perfectly match the results
generated using IPF, at least for the sample. To check that this
is due to the `ipfp` algorithm improving the weights with each iteration,
let us analyse the aggregate results generated from the alternative
set of weights, generated with only 2 iterations of IPF:
```{r}
# Update ind_agg values, keeping col names (note '[]')
ind_agg[] <- t(apply(weights_maxit_2, MARGIN = 2,
FUN = function(x) colSums(x * ind_cat)))
ind_agg[1:2, 1:4]
```
Clearly the final weights after 2 iterations of IPF represent the constraint
variables well, but do not match perfectly except in the second constraint. This shows the importance of
considering number of iterations in the reweighting stage --- too many iterations
can be wasteful, too few may result in poor results. To reiterate,
20 iterations of IPF are sufficient in most cases for the results
to converge towards their final level of fit.
More sophisticated ways of evaluating model fit are
presented in Section \@ref(svalidation).
As mentioned at the beginning of this chapter there are alternative methods
for allocating individuals to zones. These are discussed in Section
\@ref(alternative-reweighting).
Note that the weights generated in this section are fractional.
For agent-based modelling
applications integer weights are
required, which can be generated through the process of integerisation.
Two methods of integerisation are explored:
the first randomly chooses individuals, with a probability proportional
to their weights;
the second defines a rounding method, a good example being
'Truncate Replicate Sample' [@Lovelace2013-trs], described in Section \@ref(sintegerisation).
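A minimal sketch of the first method (proportional probability sampling) for zone 1 is
shown below; the object names `w_z1`, `n_z1` and `int_ids` are illustrative and the seed
is arbitrary:

```{r, eval=FALSE}
set.seed(50)             # arbitrary seed for reproducibility
w_z1 <- weights[, 1]     # fractional ipfp weights for zone 1
n_z1 <- round(sum(w_z1)) # population of zone 1 (12 in SimpleWorld)
# sample individual ids with probability proportional to their weights
int_ids <- sample(length(w_z1), size = n_z1, replace = TRUE, prob = w_z1)
table(int_ids)           # integer counts of each individual in zone 1
```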
### IPF with **mipfp** {#mipfp}
\index{mipfp}
\index{IPF!with mipfp}
The R package **mipfp** is a more generalized implementation of the IPF algorithm
than **ipfp**, and was designed for population synthesis [@barthelemy_mipfp:_2016].
**ipfp** generates a two-dimensional weight matrix based on mutually
exclusive (non cross-tabulated) constraint tables.
This is useful for applications where the constraints are simple marginal totals
that are not cross-tabulated.
**mipfp** is more flexible, allowing multiple cross-tabulations
in the constraint variables, such as age/sex and age/class combinations.
```{r, echo=FALSE}
# we have a new column with, at most, as many categories as the product of the number of categories for the variable age
# and 2 (Male or Female). However, in this case, we need to have the cross table of the age and the sex available to
# proceed this way.
```
**mipfp** is therefore a multidimensional implementation of
IPF, which can
update an initial $N$-dimensional array with respect to given
target marginal distributions. These, in turn, can also be multidimensional.
In this sense **mipfp** is more advanced than
**ipfp** which solves only the 2-dimensional case.
The main function of **mipfp** is `Ipfp()`, which
fills a $N$-dimensional weight
matrix based on a range of aggregate level constraint table options.
Let's test the package on some example data.
The first step is to load the package into the workspace:
```{r, message=F}
library(mipfp) # after install.packages("mipfp")
```
To illustrate the use of `Ipfp`, we will create a fictitious example.
The case of SimpleWorld is too simple to really illustrate the power
of this package; solving SimpleWorld with `Ipfp` is included in
the next section, which compares the two packages.
The example case is as follows: determine the contingency table
of a population characterised by categorical variables for age (0-17, Workage, 50+),
gender (male, female) and educational level (level 1 to level 4).
We consider a zone with 50 inhabitants. The classic spatial microsimulation
problem consists in having all marginal distributions, and the cross-tabulated
result (age/gender/education in this case) only for a non-geographical sample.
We consider the variables in the following
order: sex (1), age (2) and diploma (3).
Our constraints could for example be:
```{r}
sex <- c(Male = 23, Female = 27) # n. in each sex category
age <- c(Less18 = 16, Workage = 20, Senior = 14) # age bands
diploma <- c(Level1 = 20, Level2 = 18, Level3 = 6, Level4 = 6)
```
Note the population is equal in each constraint (50 people). Note also
the order in which each category is encoded --- a common source of
error in population synthesis.
For this reason we have labelled each category of each constraint.
The constraints are the target margins and need to be stored as a list.
To tell the algorithm
which elements of the list correspond to which constraint,
a second list with the description
of the target must be created. We print the *target* before running the algorithm
to check the inputs make sense.
```{r}
target <- list (sex, age, diploma)
descript <- list (1, 2, 3)
target
```
Now that all constraint variables have been encoded, let us define
the initial array to be updated, also referred to as the seed or the weight matrix. The dimension
of this matrix must be identical to that of the constraint tables:
$(2 \times3 \times 4)$. Each cell of the array represents
a combination of the attributes' values, and thus defines a particular
category of individuals. In our case, we will consider that the weight
matrix contains 0 when the category is not possible and 1 otherwise. In this example we
will assume that it is impossible for an individual being less than 18
years old to hold a diploma level higher than 2.
The corresponding cells are then set to 0, while the cells of the
feasible categories are set to 1.
```{r}
names <- list (names(sex), names(age), names(diploma))
weight_init <- array (1, c(2,3,4), dimnames = names)
weight_init[, c("Less18"), c("Level3","Level4")] <- 0
```
With everything well defined, we can execute
`Ipfp()`. As with the **ipfp** package, we can choose a stopping criterion defined by `tol` and/or a maximum
number of iterations. Finally, setting the argument `print` to `TRUE`
ensures we will see the evolution of the procedure.
After reaching either the maximum number of iterations or convergence
(whichever comes first), the function will return a list containing the updated
array, as well as other information about the convergence of the algorithm.
Note that if the target margins are not consistent, the input data is then
normalised by considering probabilities instead of frequencies.
```{r}
result <- Ipfp(weight_init, descript, target, iter = 50,
               print = TRUE, tol = 1e-5)
```
Note that the fit improves rapidly, reaching the *tol* threshold after 8
iterations. The `result` object contains the final weight matrix
and some information about the convergence of the algorithm, which can be inspected
as sketched below.
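A short sketch of how the output might be inspected (the component name `x.hat` is taken
from the package documentation, `?Ipfp`; check `str(result)` if in doubt):

```{r, eval=FALSE}
str(result, max.level = 1)   # overview of the components returned by Ipfp
result$x.hat                 # the updated (fitted) array of counts
apply(result$x.hat, 1, sum)  # should match the sex margin (23, 27)
apply(result$x.hat, 2, sum)  # should match the age margin (16, 20, 14)
```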