Duplicate Fires Crawled #23

Open
Yicong-Huang opened this issue Nov 1, 2019 · 13 comments
Labels: bug (Something isn't working)

Comments

@Yicong-Huang
Contributor

There are multiple entries with the same fire name in the database. This is related to the fire data runnable.

@ScarlettZ98, can you check this, please?

Yicong-Huang added the bug label on Nov 1, 2019

@ScarlettZ98
Contributor

What is the name of the fire please?

@Yicong-Huang
Contributor Author

From the fire table:

Kincade
Burris
Getty
Kincade
Kincade
Kincade
Getty
Kincade
Kincade
Kincade
Kincade
Kincade
Contempo
Kincade
Palisades
Kincade
Tick
Kincade
Saddle ridge
Palisades
Palisades
Paradise west
Palisades
Saddle ridge
Caples
Caples
Franklintrail

From the fire_history table:

Contempo
Burris
Getty
Getty
Tick
Real
Getty
Tick
Real
Paradise_West
Palisades
Kincade
Franklintrail
West
Wendy
Walker
W-1_Mcdonald
Ukonom
Taboose
Star
South
Schaeffer
Saddle_Ridge
Rosasco
Red_Bank
Red

From the fire_merged table:

Getty
Contempo
Palisades
Kincade
Tick
Tick
Paradise west
Paradise west
Paradise west
Palisades
Franklintrail
Real
Real
Wendy
Saddle ridge
Dehesa
Caples
Briceburg
West
Lopez
Schaeffer
Mcmurray
Bautista
Rosasco
Jakes
Kidder 2

@ScarlettZ98
Contributor

The fire table is expected to have records with the same name, since the id is the primary key.
The fire_history table only uses the name and year of the fire, so duplicates there don't matter.
The fire_merged table may also have records with the same name.
If the ids in the fire table and the fire_merged table don't match, then that is an issue. Names can be duplicated.
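
The id check described above could be sketched roughly as follows. This is only an illustration: it uses an in-memory `sqlite3` database as a stand-in for the project's real database, and the `id` and `name` columns, the sample rows, and the helper name `mismatched_ids` are all assumptions, not the actual schema.

```python
import sqlite3


def mismatched_ids(conn):
    """Return (ids only in fire, ids only in fire_merged), each sorted."""
    fire_ids = {row[0] for row in conn.execute("SELECT id FROM fire")}
    merged_ids = {row[0] for row in conn.execute("SELECT id FROM fire_merged")}
    return sorted(fire_ids - merged_ids), sorted(merged_ids - fire_ids)


# Hypothetical schema and made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fire (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fire_merged (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO fire VALUES (?, ?)",
    [(1, "Kincade"), (2, "Kincade"), (3, "Getty")],  # duplicate names are fine
)
conn.executemany(
    "INSERT INTO fire_merged VALUES (?, ?)",
    [(1, "Kincade"), (3, "Getty"), (4, "Tick")],
)

only_in_fire, only_in_merged = mismatched_ids(conn)
print("ids only in fire:", only_in_fire)
print("ids only in fire_merged:", only_in_merged)
```

Both lists being empty would mean the ids line up; repeated names alone would not be flagged.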

@Yicong-Huang
Contributor Author

Thanks.
What about the fire id in the fire_merged table, then? Shouldn't they be unique?

@ScarlettZ98
Contributor

I just checked the table, and there is an issue: Paradise west was created multiple times. I will look into it this weekend.
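
A quick way to spot repeated names like this is to count them. The sketch below runs on the fire_merged names pasted earlier in this thread; the helper name `duplicated_names` and the normalization (lowercasing, treating `_` as a space) are assumptions for illustration, not what the crawler does.

```python
from collections import Counter


def duplicated_names(names):
    """Return {normalized name: count} for names that occur more than once."""
    counts = Counter(name.strip().lower().replace("_", " ") for name in names)
    return {name: n for name, n in counts.items() if n > 1}


# Names copied from the fire_merged listing earlier in this thread.
fire_merged_names = [
    "Getty", "Contempo", "Palisades", "Kincade", "Tick", "Tick",
    "Paradise west", "Paradise west", "Paradise west", "Palisades",
    "Franklintrail", "Real", "Real", "Wendy", "Saddle ridge", "Dehesa",
    "Caples", "Briceburg", "West", "Lopez", "Schaeffer", "Mcmurray",
    "Bautista", "Rosasco", "Jakes", "Kidder 2",
]

for name, n in duplicated_names(fire_merged_names).items():
    print(f"{name}: {n}")
```

On that listing it reports Paradise west three times and Palisades, Tick, and Real twice each. Since names may legitimately repeat, a hit here is only a starting point for checking the ids.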

@Yicong-Huang
Contributor Author

Yicong-Huang commented Nov 1, 2019

Thanks.

Is it hard to clean the data that is corrupted (duplicated)?

I assume we can just delete the corresponding records and then rerun the fixed crawler?
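
A minimal sketch of the "delete the corresponding records" step, assuming the affected rows can be selected by fire name. The `sqlite3` in-memory database, the `id` and `name` columns, and the helper `delete_fire_by_name` are hypothetical stand-ins; the real tables and the recrawl step are not shown in this thread.

```python
import sqlite3


def delete_fire_by_name(conn, name, tables=("fire", "fire_merged")):
    """Delete every row whose name matches; return rows removed per table."""
    removed = {}
    with conn:  # commits on success, rolls back if any statement fails
        for table in tables:
            # Table names come from the fixed tuple above, never from user input.
            cur = conn.execute(f"DELETE FROM {table} WHERE name = ?", (name,))
            removed[table] = cur.rowcount
    return removed


# Hypothetical schema and made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
for table in ("fire", "fire_merged"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO fire VALUES (?, ?)", [(1, "Paradise west"), (3, "Tick")]
)
conn.executemany(
    "INSERT INTO fire_merged VALUES (?, ?)",
    [(1, "Paradise west"), (2, "Paradise west"), (3, "Tick")],
)

removed = delete_fire_by_name(conn, "Paradise west")
print(removed)
```

Running both deletes in one transaction means a failure leaves neither table half-cleaned before the crawler is rerun.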

@ScarlettZ98
Contributor

No. I will drop them and recrawl after fixing it.
But it is hard for me to test the daily use of the crawler. Some issues didn't appear before because, when I tested it, the time separation was not that long.

@Yicong-Huang
Contributor Author

Maybe we can discuss the details of the fire-merging strategy further? It seems that right now it uses a static threshold on the number of days separating fires?

@ScarlettZ98
Contributor

Right now it is not. Every page on the gov website is a merged fire; the crawler crawls the website and assigns each page an id, and the fire with that id is the merged fire.
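
The "one page, one id" idea above can be illustrated with a toy registry. The thread does not say how the crawler actually chooses ids, so this sketch assumes the id is keyed on the page URL; the class `MergedFireRegistry`, its method `id_for`, and the URLs are all hypothetical.

```python
class MergedFireRegistry:
    """Toy mapping from a crawled page URL to a merged-fire id."""

    def __init__(self):
        self._id_by_url = {}

    def id_for(self, page_url):
        # Assumption: the page URL identifies the merged fire, so crawling
        # the same page again reuses its id instead of minting a new one.
        if page_url not in self._id_by_url:
            self._id_by_url[page_url] = len(self._id_by_url) + 1
        return self._id_by_url[page_url]


registry = MergedFireRegistry()
# Placeholder URLs, not the real website.
first = registry.id_for("https://example.invalid/incident/paradise-west")
second = registry.id_for("https://example.invalid/incident/tick")
again = registry.id_for("https://example.invalid/incident/paradise-west")
print(first, second, again)
```

Under that assumption, a second daily crawl of the same page would not create another merged fire with the same name.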

@Yicong-Huang
Contributor Author

Maybe it's better to have a face-to-face discussion?

@ScarlettZ98
Contributor

Yes, but I don't have time today. I can do it tomorrow.

@Yicong-Huang
Contributor Author

Not urgent. Let's move the discussion to Slack, and please schedule a meeting with me if possible.

@Yicong-Huang
Contributor Author

Any updates?

2 participants