Initial family name classifier to detect nepotism #119
Made it without messing with the repo! I temporarily uploaded my trained model to my Dropbox, along with a list of ~370k different names scraped from UERJ's "Exame de qualificação" candidates over many years, using some Linux bash commands. This list is biased towards young people's names, but I think it is still okay to use. It is interesting because I noticed that the algorithm considers "Lourdes" a relatively rare surname, when in reality it is a common second name in composite names. This may be a result of the age bias, and I need to think about how to deal with it.

One important modification was an improvement in the evaluation method: it no longer uses the deputies' names to collect statistics. This matters because I was leaking test data into the training by doing this, so my reported precision was not that reliable. Using this whole name set, I could still get around 80% precision, but I noticed that it sometimes varies a lot (75%-90%), maybe because of the relatively small amount of positive data.

The only thing that I still don't like is the _get_relatives() function, but I think most of the code is good. https://dl.dropboxusercontent.com/u/48174591/family_classifier.zip |
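To illustrate the leakage point, here is a minimal sketch (not the code in this PR) of collecting surname statistics only from an external reference list, so the names being evaluated never contribute to the counts. The function names `surname_counts` and `surname_frequency`, the rule that every token after the first is a surname, and the sample names are all assumptions made for the example.

```python
from collections import Counter


def surname_counts(reference_names):
    """Count, per surname token, how many reference names contain it."""
    counts = Counter()
    for full_name in reference_names:
        tokens = full_name.lower().split()
        # Assumption: every token after the first is treated as a surname.
        # This is why "Lourdes" in "Maria de Lourdes ..." gets counted as one.
        counts.update(set(tokens[1:]))
    return counts


def surname_frequency(surname, counts, total_names):
    """Fraction of reference names containing the surname (0.0 if unseen)."""
    if total_names == 0:
        return 0.0
    return counts.get(surname.lower(), 0) / total_names


# Hypothetical reference list, standing in for the ~370k scraped names.
reference_names = [
    "Ana Maria Silva",
    "Carlos Silva",
    "Maria de Lourdes Souza",
    "Pedro Albuquerque",
]

# Surnames under evaluation are only scored, never added to the counts.
evaluated_surnames = ["silva", "albuquerque", "lourdes", "cunha"]

counts = surname_counts(reference_names)
scores = {
    surname: surname_frequency(surname, counts, len(reference_names))
    for surname in evaluated_surnames
}
```

In this toy list "lourdes" ends up exactly as rare as "albuquerque", which is the composite-name effect described above. A real version would probably need a rule for second given names and for particles such as "de".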
I just checked and I see some room for improvement. But I have some doubts: are you up for pairing so we can work on that function together? |
I should only be able to work on it on Friday or over the weekend. Do you have
On Nov 11, 2016, 6:22 PM, "Eduardo Cuducos" [email protected]
|
I mean, I believe I can help improving the If you can help me clarify what the expected output is, I can help you achieve that in a neat way. This kind of question is usually easier to address when working together, which is why I suggested pairing. |
Forgot to comment earlier but I think it is done. Any thoughts? |
Hi @gabriel-almeida! Many thanks for your comment. I'd say it's done too. Right now we're focused on some infrastructure to do the machine learning job and connect it to our web platform. When it's done and working, our focus will be on the Nepotism roadmap/hypothesis, so in a few days we'll get back to this PR. Is that ok? Once more, many many thanks. I hope our priorities don't let you down, it's a valuable contribution indeed ; ) |
Don't worry, take your time. Maybe this code makes more sense in that other repo.
--
Regards,
Gabriel Almeida
|
Hello @gabriel-almeida, do you have any updates on this PR? |
No, it has been some time since I last touched it. |
As far as I can remember, @gabriel-almeida made a good PR. We were supposed to work on a nepotism classifier, but we didn't: back then we had no information on the partners of companies (suppliers), so the classifier was impossible. However, in these 2 years the Federal Revenue has opened the dataset of company partners, so we can once more try to develop a classifier using @gabriel-almeida's (great) algorithm. |
I'm very sorry for the mess I made with PR #107, but I think this is a better way to clean things up.
I made most of the suggested refactors, but did not finish some of the features that I wanted.