Initial family name classifier to detect nepotism #119

Open
wants to merge 3 commits into master

Conversation

gabriel-almeida

I'm very sorry about the mess I made with PR #107, but I think this is a better way to clean things up.
I made most of the suggested refactors, but did not finish some of the features that I wanted.

@gabriel-almeida
Author

Made it without messing up the repo!

I temporarily uploaded my trained model to my Dropbox, together with a list of ~370k distinct names scraped, using some Linux bash commands, from the candidates of UERJ's "Exame de qualificação" over many years. This list is biased towards young people's names, but I think it is still okay to use. It is interesting because I noticed that the algorithm considers "Lourdes" a relatively rare surname, when in reality it is common as the second name in composite names, maybe because of this age bias (and I need to think about how to deal with it).
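One way to look at the "Lourdes" effect is to count how often a token appears as the last name versus in the middle of a full name. The snippet below is only a minimal sketch under assumptions, not the code in this PR: the sample names, the `PARTICLES` set and the `position_counts` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical sample; the real list has ~370k names.
NAMES = [
    "Maria de Lourdes Silva",
    "Ana Lourdes Souza",
    "João Pereira Silva",
    "Carlos Souza Lourdes",
]

# Assumed set of connecting particles to ignore.
PARTICLES = {"de", "da", "do", "das", "dos", "e"}


def tokens(full_name):
    """Lowercase the name and drop connecting particles."""
    return [t for t in full_name.lower().split() if t not in PARTICLES]


def position_counts(names):
    """Count each token as a last name and as a middle name."""
    last, middle = Counter(), Counter()
    for name in names:
        parts = tokens(name)
        if len(parts) < 2:
            continue  # a single token carries no surname information
        last[parts[-1]] += 1
        middle.update(parts[1:-1])
    return last, middle


last, middle = position_counts(NAMES)
print(last["lourdes"], middle["lourdes"])
```

A token that is rare in the last position but frequent in the middle position is probably part of a composite first name rather than a rare surname.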

One important modification was an improvement to the evaluation method: it no longer uses the deputies' names to collect statistics. This is important because doing so was "leaking" test data into the training, so my reported precision was not that reliable. Using this whole name set, I could still get around 80% precision, but I noticed that it sometimes varies a lot (75%-90%), maybe because of the relatively small amount of "positive data".
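The idea of keeping the statistics separate from the evaluation data can be sketched roughly as follows. This is an illustration under assumptions, not the classifier from this PR: the rule "same last name and rare surname", the `threshold` value and all sample names are hypothetical.

```python
from collections import Counter


def surname_frequencies(names):
    """Relative frequency of each last token, from an external name list only."""
    counts = Counter(name.lower().split()[-1] for name in names if name.strip())
    total = sum(counts.values())
    return {surname: count / total for surname, count in counts.items()}


def predict_related(name_a, name_b, freq, threshold=0.5):
    """Assumed toy rule: same last name, and that surname is rare."""
    a, b = name_a.lower().split()[-1], name_b.lower().split()[-1]
    return a == b and freq.get(a, 0.0) < threshold


def precision(pairs, labels, freq):
    """True positives divided by all predicted positives."""
    predicted = [predict_related(a, b, freq) for a, b in pairs]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    return tp / (tp + fp) if tp + fp else 0.0


# Statistics come from an external list (hypothetical sample)...
external = ["Ana Silva", "Bia Silva", "Caio Silva", "Davi Souza", "Eva Bragança"]
freq = surname_frequencies(external)

# ...while the labeled pairs are used only for evaluation.
pairs = [
    ("João Bragança", "Pedro Bragança"),
    ("Rui Silva", "Lia Silva"),
    ("Nina Souza", "Tito Souza"),
]
labels = [True, False, False]
score = precision(pairs, labels, freq)
print(score)
```

With only a handful of positive pairs, a single extra false positive moves the score a lot, which is consistent with the variation described above.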

The only thing that I still don't like is the _get_relatives() function, but I think most of the code is good.

https://dl.dropboxusercontent.com/u/48174591/family_classifier.zip
www.vestibular.uerj.br/portal_vestibular_uerj/2013/exame_de_qualificacao/eq_resultados.php

@cuducos
Collaborator

cuducos commented Nov 11, 2016

The only thing that I still don't like is the _get_relatives() function, but I think most of the code is good.

I just checked and I see some room for improvement, but I have some doubts. Are you up for pairing, so we can work on that function together?

@gabriel-almeida
Author

I will only be able to work on it on Friday or over the weekend. Do you have
any questions that I could answer?

@cuducos
Collaborator

cuducos commented Nov 15, 2016

I will only be able to work on it on Friday or over the weekend. Do you have
any questions that I could answer?

I mean, I believe I can help improve the _get_relatives() function, but as it's not documented I'm not sure what the expected output is. I see you load the CEAP dataset and the names dataset, and try to build a dictionary with congressperson_id as key and congressperson_name as value. And so on…
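For reference, the id-to-name mapping mentioned above could look roughly like the sketch below. It is a guess at the intent rather than the actual _get_relatives() code: the inline sample data, the `supplier` column and the `congressperson_names` helper are hypothetical, and only the congressperson_id and congressperson_name columns come from this discussion.

```python
import csv
import io

# Hypothetical rows standing in for the CEAP dataset.
SAMPLE_CEAP = """congressperson_id,congressperson_name,supplier
1,FULANO DE TAL,Posto A
1,FULANO DE TAL,Hotel B
2,BELTRANA SOUZA,Posto A
"""


def congressperson_names(csv_file):
    """Map each congressperson_id to the first name seen for it."""
    names = {}
    for row in csv.DictReader(csv_file):
        names.setdefault(row["congressperson_id"], row["congressperson_name"])
    return names


names = congressperson_names(io.StringIO(SAMPLE_CEAP))
print(names)
```

Note that csv.DictReader yields every value as a string, so the keys here are "1" and "2" rather than integers.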

If you can help me clarify what the expected output is, I can help you achieve that in a neat way. This kind of question is usually easier to address when working together, which is why I suggested pairing.

@gabriel-almeida
Author

Forgot to comment earlier but I think it is done. Any thoughts?

@cuducos
Collaborator

cuducos commented Nov 24, 2016

Hi @gabriel-almeida! Many thanks for your comment. I'd say it's done too. Right now we're focused on some infrastructure to do the machine learning job and connect it to our web platform. Once that is done and working, we will focus on the nepotism roadmap/hypothesis, so in a few days we'll get back to this PR. Is that ok? Once more, many many thanks. I hope our priorities don't let you down; it's a valuable contribution indeed ; )

@gabriel-almeida
Author

gabriel-almeida commented Nov 26, 2016 via email

@willianpaixao
Contributor

Hello @gabriel-almeida, do you have any updates on this PR?

@gabriel-almeida
Author

No, it has been some time since I last touched it.
Are you guys planning to use it?

@cuducos
Collaborator

cuducos commented Oct 3, 2018

As far as I can remember, @gabriel-almeida made a good PR. We were supposed to work on a nepotism classifier, but we didn't: back then we had no info on the partners of companies (suppliers), so the classifier was impossible.

However, in these 2 years the Federal Revenue has opened the dataset of company partners, so we can once more try to develop a classifier using @gabriel-almeida's (great) algorithm.
