BTrDB duplicate datapoints with same index/MDAL issue ? #56

Closed
marcopritoni opened this issue Apr 13, 2018 · 7 comments

Comments

@marcopritoni

I just noticed that Green Button data from XBOS, downloaded as mdal.RAW, has multiple (all?) points with the same timestamp (and value). I'm not sure whether the issue is in BTrDB or MDAL.
I would imagine we do not want multiple points with the same timestamp.

Example:
{'4d95d5ce-de62-3449-bd58-4dcad75b526d':
2017-01-01 00:00:00-08:00 1.6395
2017-01-01 00:00:00-08:00 1.6395
2017-01-01 00:15:00-08:00 0.9959
2017-01-01 00:15:00-08:00 0.9959
2017-01-01 00:30:00-08:00 1.6222
2017-01-01 00:30:00-08:00 1.6222
2017-01-01 00:45:00-08:00 1.6374
2017-01-01 00:45:00-08:00 1.6374
... }
I need to download this as raw because it's energy (kWh), not power: the readings should be summed, and the existing stats aggregation functions (mean, max, min, count) do not include a sum.

@gtfierro
Member

I'm pretty sure this is because some of the data points were written twice while we were developing the Green Button data ingester. You should be able to get rid of the duplicates when you load the data into a DataFrame (in which case #55 should fix it).

@immesys
Member

immesys commented Apr 13, 2018

Sum is mean times count
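
A minimal sketch of this identity in pandas, assuming the raw readings sit in a Series indexed by timestamp with no duplicates. The values and the 60-minute window are made up for illustration and are not taken from any real stream:

```python
import pandas as pd

# Illustrative 15-minute energy readings (kWh); values are made up.
idx = pd.date_range("2017-01-01 00:00", periods=8, freq="15min")
energy = pd.Series([1.6395, 0.9959, 1.6222, 1.6374, 1.0, 2.0, 3.0, 4.0], index=idx)

windows = energy.resample("60min")
sum_from_stats = windows.mean() * windows.count()  # sum rebuilt from mean and count
direct_sum = windows.sum()                         # sum computed directly

print(pd.DataFrame({"mean*count": sum_from_stats, "sum": direct_sum}))
```

The two columns agree up to floating-point rounding, so a per-window sum can be recovered from the mean and count aggregates as long as each reading appears exactly once.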

@marcopritoni
Author

"Sum is mean times count"
Not if you have duplicated data points (same index, same value).
E.g.
2017-01-01 00:00:00-08:00 1120.0
2017-01-01 00:00:00-08:00 1120.0
... (100 times)

MDAL mean (15min): 1120.0 kWh
MDAL count (15min): 100
pandas sum (15min): 112,000 kWh (not correct)
There is no way of knowing from pandas which points are duplicated.
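
A small sketch that reproduces the arithmetic above in pandas, using a synthetic series with one reading repeated 100 times (the timezone is dropped for brevity, and the variable names are only illustrative). It also shows that mean times count lands on the same inflated total, because the count includes the duplicates:

```python
import pandas as pd

# One 15-minute reading written 100 times with the same timestamp (synthetic data).
ts = pd.Timestamp("2017-01-01 00:00:00")
dup = pd.Series([1120.0] * 100, index=pd.DatetimeIndex([ts] * 100))

window = dup.resample("15min")
mean_kwh = window.mean().iloc[0]   # unaffected by exact duplicates
count = window.count().iloc[0]     # counts every copy
total_kwh = window.sum().iloc[0]   # adds every copy

print(mean_kwh, count, total_kwh, mean_kwh * count)
```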

Downloading the raw data with MDAL and doing all this in pandas has produced another issue that we are looking into.

@gtfierro
Member

You can definitely drop duplicated rows with the same index in pandas. In our scenario, we aren't going to have different values for the same index (timestamp), so this strategy should work.
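
A minimal sketch of that approach, assuming the raw stream has been loaded into a pandas Series indexed by timestamp. The sample values are the ones from the example at the top of this issue, and the variable names are only illustrative:

```python
import pandas as pd

stamps = [
    "2017-01-01 00:00:00-08:00", "2017-01-01 00:00:00-08:00",
    "2017-01-01 00:15:00-08:00", "2017-01-01 00:15:00-08:00",
    "2017-01-01 00:30:00-08:00", "2017-01-01 00:30:00-08:00",
    "2017-01-01 00:45:00-08:00", "2017-01-01 00:45:00-08:00",
]
values = [1.6395, 1.6395, 0.9959, 0.9959, 1.6222, 1.6222, 1.6374, 1.6374]
raw = pd.Series(values, index=pd.to_datetime(stamps))

# Sanity check for the assumption that a timestamp never carries two different values.
assert (raw.groupby(level=0).nunique() <= 1).all()

# Keep the first row for each timestamp and drop the repeats.
deduped = raw[~raw.index.duplicated(keep="first")]

print(deduped)
print("15-minute readings summed:", deduped.sum())
```

If the sanity check ever fails, the duplicates are not exact copies and keeping the first row would silently discard data, so that case needs a closer look.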

@marcopritoni you should also make a note of which streams have duplicate points so we can clean them up later.
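
One way to build that list, sketched under the assumption that the downloaded data is a dict mapping stream UUID to a timestamp-indexed pandas Series, as in the example at the top of this issue. `streams_with_duplicates` is a hypothetical helper, not part of MDAL or BTrDB, and the second stream key is a placeholder:

```python
import pandas as pd

def streams_with_duplicates(streams):
    """Return {uuid: number of repeated timestamps} for streams that have any."""
    report = {}
    for uuid, series in streams.items():
        # duplicated(keep="first") flags every copy after the first occurrence.
        n_dupes = int(series.index.duplicated(keep="first").sum())
        if n_dupes:
            report[uuid] = n_dupes
    return report

t0 = pd.Timestamp("2017-01-01 00:00:00")
t1 = pd.Timestamp("2017-01-01 00:15:00")
streams = {
    "4d95d5ce-de62-3449-bd58-4dcad75b526d": pd.Series(
        [1.6395, 1.6395, 0.9959, 0.9959], index=[t0, t0, t1, t1]
    ),
    "clean-stream-placeholder": pd.Series([1.0, 2.0], index=[t0, t1]),
}

report = streams_with_duplicates(streams)
print(report)
```

The resulting dict can be written out with `pd.Series(report).to_csv(...)` to get a file that opens as a spreadsheet.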

@marcopritoni
Author

Sure, I can make a list. Do you want me to add it here or keep it offline?

@gtfierro
Member

Offline would be better; thanks! It's pretty easy to fix the streams to remove duplicates (I already have 90% of the script done). Maybe you could make a spreadsheet and add the streams there so I can mark them off when they're done? Shoot me an email.

@gtfierro
Member

#58 should help
