- 
                Notifications
    You must be signed in to change notification settings 
- Fork 82
BUG: Queen and Rook from_dataframe do not match docs #184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I was wondering about the Rook.from_dataframe and Queen.from_dataframe not matching.
As a side note, would it be possible to source the internals for the Rook and Queen classes from one set of methods? As is stands, aren't the Rook and Queen class methods identical?
| 
 We need to hear from @ljwolf before doing much more as he was the original author and I'm not sure the changes reflect his original intent. | 
| Doesn't this fail if idVariable is none but ids is passed? | 
| In general, this works better. The behavior I was going for was (0) use the index if nothing else is provided, (1) use idvariable to build IDs if it was provided, (2) use IDs if provided and, (3) keep the order of the now-resolved IDs if id_order was not provided This was awkward because of the id_order behavior and pysal's silent sorting of the input weights if you don't force their order. Id_order basically always needs to be set to either range(0,n) (to avoid silent lexicographic sorting of the IDs) or the user-specified order that's different from their input dataframe. Splitting id_order from the order in which the inputs are provided adds so much complexity... Sorry I missed that twice... | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| Here's a bit of a test matrix. I still think the main issue is that PySAL separates the ids from how the ids are ordered. We talked about this back in GSOC 2017. It should be the responsibility of stuff outside the class to make sure the input data is ordered correctly, because the worst surprise is that your weights are silently incorrect w.r.t. the ordering of your input data. By default, PySAL does this because (at the lowest level) weight keys are sorted lexicographically unless forced to retain the original ordering (docstring, location in code) Sorting weight keys by default will make the weights out of alignment with the input dataframe unless the user knows this and presorts their data or happens to have it sorted by default. This will be a silent error, since the weights will still be of the right shape, just not representing the right linkages. I raised this back in my GSOC, but we decided to keep it around, rather than remove it. I understand how this arose, but I still think the "right" fix is to just stop sorting things on input. Things become dramatically simpler, then: 
 Quite a few PRs have attempted to address this inconsistency (#137, pysal#988, pysal#951 (refresh of pysal#922), pysal#946), and this solution, while a lot better than current, is still inconsistent, just within the Rook class: import geopandas, numpy
from libpysal import weights, examples
full = geopandas.read_file(examples.get_path('south.shp'))
numpy.random.seed(11121)
part = full.sample(frac=.1)
nothing = weights.Rook.from_dataframe(part)
nothing.id_order == part.index # makes sense
ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                          ids=part.index.tolist())
print((ids_from_seq.full()[0] == nothing.full()[0]).all())
# False, because of sorting in seq, surprising.
fips_ids_from_df = weights.Rook.from_dataframe(part, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq.id_order # is a list of strings, makes sense
fips_ids_from_df.id_order # is part.index, surprising if you assume these will be *exactly* the input ids, rather than a sanitized id
(fips_ids_from_seq.full()[0] == fips_ids_from_df.full()[0]).all() 
# False, because of sorting in seq, surprising.
fips_ido_from_df = weights.Rook.from_dataframe(part, 
                                               id_order=part.FIPS.tolist())
fips_from_reindex = weights.Rook.from_dataframe(part.set_index('FIPS'))
fips_idvar_from_df = weights.Rook.from_dataframe(part,
                                                 idVariable='FIPS')
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               id_order=part.FIPS.tolist())
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                              ids=part.FIPS.tolist(), 
#                                              id_order = list(range(part.shape[0])))
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               ids=part.FIPS.tolist(), 
#                                               id_order = part.index.tolist())
# fails! 
#fips_idvar_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                                 idVariable='FIPS')
fips_idvar_from_seq = weights.Rook.from_iterable(part.set_index('FIPS').geometry)
fips_idvar_from_seq.id_order # is a list of integers, makes sense
fips_ido_from_df.id_order # is a list of integers, surprising
fips_idvar_from_df.id_order # integers in the order of the sorted idVariable!
## We never set id_order, so all string-keyed weights would be lexsorted according to docs
(fips_idvar_from_seq.full()[0] == fips_ido_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_idvar_from_seq.full()[0] == fips_idvar_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_from_reindex.full()[0] == fips_ido_from_df.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_ido_from_df.full()[0] == nothing.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!The issue is the surprise caused by sorting that forces the complexity up. now, you've always gotta set  | 
| I still think the "correct" solution here is to stop sorting string indices by default. When provided, take the index in the input order. | 
| Stale pull request message | 
This address #183