
Input Validation / Testing: Multiprocessing #212

@nick-fournier-rsg

Description


I was toying with the multiprocessing configuration in the example_test, trying to better understand the multiprocessing. Notably, when I change the number of processes in settings.yaml from 2 to something else (4, 8, or 12), test_steps_mp.py fails.

  • Investigation reveals the culprit is, again, the expanded_hh_ids.equals(expected_hh_ids) assertion. Inspecting the two dataframes during debugging shows their values are identical across all columns, regardless of the number of processors used. However, the row index does change, and df1.equals(df2) checks the index for equivalence as well as the values of all columns. I didn't identify the exact mechanism within the multiprocessing code that causes this difference in the expanded_hh_ids index. The test failing because of this isn't that important in itself, but if other processes rely on the expanded_hh_ids index being informative or following a certain specification/sequence, it may be introducing bugs elsewhere.

  • In reference to the final statement in the last bullet: I checked the script that writes the synthetic population, and it explicitly resets the expanded_hh_ids table index and uses that reset index (+1) to create the synthetic hh IDs. So the synthetic hh ID will be the same regardless of the number of processors used.

  • Okay, what I think is happening is that, due to multiprocessing, the expanded weights may be generated in different orders. To account for this, the final manipulation in the expand_households() function sorts the expanded_weights by the geographic columns and household ID. This sorting is what produces the different indices between multiprocessing runs. A comment states the sorting is done specifically to keep results consistent regardless of single processing, multiprocessing, or a handful of other settings.

  • Conclusion: I believe this is handled correctly, but if someone were to change the MP settings for test_steps_mp.py, it would cause a failed test even though the number of processors doesn't actually change the results.
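The index sensitivity described above is easy to reproduce with a minimal pandas sketch (column names here are illustrative, not the actual PopulationSim schema): two frames with identical values fail DataFrame.equals() when their row indices differ, and pass once the index is dropped.

```python
import pandas as pd

# Two frames with identical values but different row indices,
# mimicking expanded_hh_ids from runs with different process counts.
expected = pd.DataFrame({"hh_id": [10, 11, 12]}, index=[0, 1, 2])
expanded = pd.DataFrame({"hh_id": [10, 11, 12]}, index=[5, 3, 7])

# DataFrame.equals compares the index as well as the values...
print(expected.equals(expanded))  # False

# ...while an index-agnostic comparison passes.
print(expected.reset_index(drop=True).equals(expanded.reset_index(drop=True)))  # True
```

Resetting (or otherwise ignoring) the index before the assertion is one way the test could be made robust to this.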

(Note from @nick-fournier-rsg: Agreed, we should expand the mp test to cover an array of num_processes values.)
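A parametrized test along these lines could exercise several process counts at once. This is only a sketch: run_mp_pipeline is a hypothetical stand-in for the real MP pipeline run (here it just shuffles row order to mimic nondeterministic indices), and the comparison sorts and drops the index so only values are checked.

```python
import pandas as pd
import pytest

def run_mp_pipeline(num_processes: int) -> pd.DataFrame:
    # Hypothetical helper, NOT the actual test harness: returns the same
    # values in a process-count-dependent row order.
    df = pd.DataFrame({"hh_id": [10, 11, 12], "tract": [1, 1, 2]})
    return df.sample(frac=1, random_state=num_processes)

@pytest.mark.parametrize("num_processes", [2, 4, 8, 12])
def test_expanded_hh_ids_mp(num_processes):
    expected = run_mp_pipeline(2)
    actual = run_mp_pipeline(num_processes)
    # Compare values only: sort by the key columns, then drop the index.
    cols = ["tract", "hh_id"]
    exp = expected.sort_values(cols).reset_index(drop=True)
    act = actual.sort_values(cols).reset_index(drop=True)
    assert exp.equals(act)
```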
