bin edges must be unique #23

GriffinRidgeback opened this issue Apr 11, 2019 · 7 comments

Comments

@GriffinRidgeback

Hello - I am trying to use this package to provide predictions for my Data Science Capstone project. When I run against my training data, I get the following exception/error:

Traceback (most recent call last): | 0/20 [00:00<?, ?epoch/s]
  File "model.py", line 63, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\deliak\Documents\Jupyter Notebooks\edX\DAT102x -Microsoft Professional Capstone Data Science\automl_train\pipeline.py", line 903, in model_train
    X, y = process_data(df, encoders)
  File "C:\Users\deliak\Documents\Jupyter Notebooks\edX\DAT102x -Microsoft Professional Capstone Data Science\automl_train\pipeline.py", line 758, in process_data
    df['msa_md'].values, encoders['msa_md_bins'], labels=False, include_lowest=True)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 234, in cut
    duplicates=duplicates)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 332, in _bins_to_cuts
    "the 'duplicates' kwarg".format(bins=bins))
ValueError: Bin edges must be unique: array([ -1., -1., 18., 63., 118., 192., 247., 305., 329., 371., 408.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Traceback (most recent call last): | 0/20 [00:00<?, ?epoch/s]
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\Scripts\automl_gs.exe\__main__.py", line 9, in <module>
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\automl_gs\automl_gs.py", line 175, in cmd
    tpu_address=args.tpu_address)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\automl_gs\automl_gs.py", line 87, in automl_grid_search
    "metadata", "results.csv"))
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'automl_train\metadata\results.csv' does not exist

@tresoldi

I am running into the same issue. The edges problem can be solved by instructing pandas to drop duplicates (add argument duplicates="drop" to the pd.cut call in templates/processors/numeric), but of course it probably means that the problem is in the data itself.

Not sure what the developers could do to automate this case -- maybe call sklearn's Imputer or (in my case) just fill the NAs?
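
For anyone landing here, a minimal sketch of the workaround described above. This is not the package's actual template code. It reuses the edges from the traceback and shows pd.cut failing without duplicates="drop" and succeeding with it. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges copied from the ValueError in the original report.
bins = np.array([-1., -1., 18., 63., 118., 192., 247., 305., 329., 371., 408.])

# Hypothetical sample values, chosen only to exercise a few bins.
values = np.array([-1., 18., 100., 408.])

# Without the 'duplicates' kwarg, the repeated -1 edge raises ValueError.
error_message = None
try:
    pd.cut(values, bins, labels=False, include_lowest=True)
except ValueError as exc:
    error_message = str(exc)

# With duplicates="drop", pandas removes the repeated edge before binning.
codes = pd.cut(values, bins, labels=False, include_lowest=True, duplicates="drop")
print(error_message)
print(codes)
```

Dropping the duplicate edge only hides the symptom. If the repeated -1 comes from missing or placeholder values, imputing or filling them before the edges are computed is probably the cleaner fix.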

@GriffinRidgeback
Author

GriffinRidgeback commented Apr 11, 2019 via email

@GriffinRidgeback
Author

Well that fixed it but now I get this error:

ValueError: Error when checking input: expected input_loan_type to have shape (1,) but got array with shape (2,)

When I check this attribute, I get this:

train_data.loan_type.unique()

array([3, 1, 2, 4], dtype=int64)

Should I open a separate ticket for this?

And thank you for getting me a little bit further

@avinregmi

I'm having the same issue. Did you solve it?

@GriffinRidgeback
Author

I did not. I used the xgboost algorithm instead. That ran to completion, but I didn't get the output I expected. I thought I would get 1's and 0's but got probabilities instead, which wasn't acceptable for what I had to submit for my course project.

Good luck!
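
On the probabilities point: one common way to turn predicted probabilities into 1's and 0's is to apply a cutoff. The sketch below assumes a 0.5 threshold and uses made-up probabilities. It is not tied to xgboost or to the code used in this thread.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class.
probabilities = np.array([0.08, 0.51, 0.93, 0.49])

# Assumed cutoff; 0.5 is a common default, not something stated in this thread.
threshold = 0.5

# Values at or above the threshold become 1, everything else 0.
labels = (probabilities >= threshold).astype(int)
print(labels)
```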

@germanjoey

@avinregmi Sounds similar to my problem here: #25.

@gagandeep44489

Possible Causes and Solutions

Duplicate Values in Data:
Cause: If the bin edges are derived from the data itself (for example, from quantiles) and one value occurs very often, several edges can land on that same value. The edges in the error above begin with -1., -1., which is consistent with this.
Solution: Deduplicate the edges before binning, or pass duplicates="drop" to pd.cut. If the repeated value is a placeholder for missing data, handle the missing values first.

Bin Edges Overlap or Too Close:
Cause: Edges that are computed to be extremely close together can collapse to the same floating-point value and so become non-unique.
Solution: Increase the distance between bin edges or use fewer bins.

Incorrect Bin Edge Calculation:
Cause: If you're manually calculating bin edges and there's a mistake in the logic, it can result in duplicate edges.
Solution: Double-check the logic used to generate bin edges. Functions like numpy.linspace() produce evenly spaced edges without duplicates, provided the start and stop values differ.

Floating-Point Precision Issues:
Cause: When bin edges are calculated using floating-point arithmetic, very small differences might not be distinguishable, leading to apparent duplicates.
Solution: Round your bin edges to a fixed number of decimal places and then deduplicate them, or use integer-based binning if applicable.
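
To illustrate the first cause, here is a rough sketch of how percentile-based edges can repeat when one value dominates a column, and how the edges can be deduplicated before binning. The data and the choice of np.percentile are assumptions for illustration. This thread does not show how the package actually computes its edges.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which a placeholder value (-1) dominates.
values = np.array([-1.0] * 6 + [18.0, 63.0, 118.0, 192.0])

# Edges taken at evenly spaced percentiles; a heavily repeated data value
# makes several percentiles fall on that same value.
edges = np.percentile(values, np.linspace(0, 100, 6))
has_duplicates = len(np.unique(edges)) < len(edges)

# Deduplicate (np.unique also sorts) before handing the edges to pd.cut.
unique_edges = np.unique(edges)
codes = pd.cut(values, unique_edges, labels=False, include_lowest=True)
print(edges, has_duplicates)
print(unique_edges, codes)
```

Deduplicating leaves fewer bins than requested, so anything downstream that expects a fixed number of bins may need adjusting.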
Example Soluti
