Initial family name classifier to detect nepotism #119

gabriel-almeida · 2016-11-09T01:19:50Z

I'm very sorry about the mess I made with PR #107, but I think this is a better way to clean things up.
I made most of the suggested refactors, but did not finish some of the features that I wanted.

gabriel-almeida · 2016-11-11T19:50:27Z

Made it without messing with the repo!

I temporarily uploaded my trained model to my Dropbox, together with a list of ~370k distinct names scraped from the candidate lists of UERJ's "Exame de qualificação" over many years, using some Linux bash commands. This list is biased towards the names of young people, but I think it is still okay to use. One interesting effect: the algorithm considers "Lourdes" a relatively rare surname, when in reality it is a common second given name in composite names. That may be a result of this age bias, and I still need to think about how to deal with it.

One important change was an improvement to the evaluation method: it no longer uses the deputies' names to collect statistics. This matters because doing so was "leaking" test data into training, so my reported precision was not that reliable. Using this whole name set, I still get around 80% precision, but I noticed that it sometimes varies a lot (75%-90%), maybe because there is relatively little "positive data".
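To make the leakage point concrete, here is a minimal sketch, not the code in this PR. It assumes a toy rule (two people count as related when they share a surname that is rare in the external list), and every name, label, helper function and the `max_frequency` threshold below is made up for illustration. The only thing it is meant to show is that the surname statistics come from the external list alone, while the labelled pairs are used only to measure precision.

```python
from collections import Counter


def surname_frequencies(full_names):
    """Relative frequency of each last token in an external list of names."""
    counts = Counter(name.split()[-1].lower() for name in full_names if name.split())
    total = sum(counts.values())
    return {surname: count / total for surname, count in counts.items()}


def share_rare_surname(name_a, name_b, frequencies, max_frequency=0.3):
    """Toy rule: same last token, and that token is rare in the external list."""
    last_a = name_a.split()[-1].lower()
    last_b = name_b.split()[-1].lower()
    # Surnames never seen in the external list get frequency 0.0 (treated as rare).
    return last_a == last_b and frequencies.get(last_a, 0.0) <= max_frequency


def precision(predictions, labels):
    """Fraction of predicted positives that are true positives."""
    flagged = [label for predicted, label in zip(predictions, labels) if predicted]
    return sum(flagged) / len(flagged) if flagged else 0.0


# Hypothetical external list, standing in for the scraped names.
external_names = ["Maria Silva", "Joao Silva", "Ana Silva", "Pedro Souza", "Rita Almeida"]
frequencies = surname_frequencies(external_names)

# Hypothetical labelled pairs (name, other name, is_relative).
# They are never used to compute the frequencies above.
test_pairs = [
    ("Carlos Almeida", "Rita Almeida", True),
    ("Carlos Souza", "Lia Souza", False),
    ("Jose Silva", "Ana Silva", False),
    ("Beto Ramos", "Lia Costa", False),
]
predictions = [share_rare_surname(a, b, frequencies) for a, b, _ in test_pairs]
labels = [label for _, _, label in test_pairs]
print(precision(predictions, labels))
```

With only a handful of true positives, a single extra false positive moves the precision a lot, which is consistent with the spread I mentioned.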

The only thing that I still don't like is the _get_relatives() function, but I think most of the code is good.

https://dl.dropboxusercontent.com/u/48174591/family_classifier.zip
www.vestibular.uerj.br/portal_vestibular_uerj/2013/exame_de_qualificacao/eq_resultados.php

cuducos · 2016-11-11T20:22:54Z

> The only thing that I still don't like is the _get_relatives() function, but I think most of the code is good.

I just checked and I see some room for improvement. But I have some doubts. Are you up for pairing, so we can work on that function together?

gabriel-almeida · 2016-11-14T15:19:34Z

I should only be able to work on it on Friday or over the weekend. Do you have
any questions that I could answer?

cuducos · 2016-11-15T20:18:09Z

> I should only be able to work on it on Friday or over the weekend. Do you have
> any questions that I could answer?

I mean, I believe I can help improve the _get_relatives() function, but since it's not documented I'm not sure what the expected output is. I see you load the CEAP dataset and the names dataset, then try to build a dictionary with congressperson_id as the key and congressperson_name as the value. And so on…

If you can help me clarify what the expected output is, I can help you achieve it in a neat way. This kind of question is usually easier to address when working together, which is why I suggested pairing.
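For reference, the part I think I understand could be sketched roughly like this. It is not the code in this PR: the sample rows, the `net_value` column and the `congressperson_names` helper are made up, and only the two column names come from what I described above. I'm also assuming the dataset has several rows per congressperson, so the first name seen for each id is kept.

```python
import csv
import io

# Made-up rows standing in for the CEAP dataset. Only the congressperson_id and
# congressperson_name column names come from the discussion; net_value is hypothetical.
SAMPLE_CEAP = io.StringIO(
    "congressperson_id,congressperson_name,net_value\n"
    "1,FULANO DE TAL,100.0\n"
    "1,FULANO DE TAL,50.0\n"
    "2,BELTRANA SOUZA,75.5\n"
)


def congressperson_names(csv_file):
    """Map each congressperson_id to its congressperson_name (first occurrence wins)."""
    names = {}
    for row in csv.DictReader(csv_file):
        names.setdefault(row["congressperson_id"], row["congressperson_name"])
    return names


names_by_id = congressperson_names(SAMPLE_CEAP)
print(names_by_id)
```

What happens after this dictionary is built is the part I can't infer from the code alone.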

gabriel-almeida · 2016-11-21T13:34:40Z

Forgot to comment earlier but I think it is done. Any thoughts?

cuducos · 2016-11-24T19:30:50Z

Hi @gabriel-almeida! Many thanks for your comment. I'd say it's done too. Right now we're focused on some infrastructure to do the machine learning job (https://github.com/datasciencebr/rosie) and connect it to our web platform (https://github.com/datasciencebr/jarbas). When that's done and working, our focus will shift to the nepotism roadmap/hypothesis (https://github.com/datasciencebr/serenata-de-amor/milestones), so we'll get back to this PR in a few days. Is that ok? Once more, many, many thanks. I hope our priorities don't let you down; it's a valuable contribution indeed ; )

gabriel-almeida · 2016-11-26T15:43:08Z

Don't worry, take your time. Maybe this code makes more sense in that other repo.

willianpaixao · 2018-10-03T13:10:28Z

Hello @gabriel-almeida, do you have any updates on this PR?

gabriel-almeida · 2018-10-03T15:34:27Z

No, it has been some time since I last touched it.
Are you guys planning to use it?

cuducos · 2018-10-03T16:42:27Z

As far as I can remember, @gabriel-almeida made a good PR. We were supposed to work on a nepotism classifier, but we didn't: back then we had no information on the partners of companies (suppliers), so the classifier was impossible.

However, in the past 2 years the Federal Revenue has opened the dataset of company partners, so we can once more try to develop a classifier using @gabriel-almeida's (great) algorithm.

initial family name classifier with scikit-learn

d47b30c

cuducos added the work in progress and hacktoberfest labels on Nov 9, 2016

major refactor and improved evaluation

07d35ed

Improvements in documentation, pre processing and data load methods.

ef37871

cuducos mentioned this pull request Dec 8, 2016

Find clusters of politicians spending with companies owned by each others relatives #18

Open

cuducos mentioned this pull request Mar 24, 2017

Develop an algorithm to give the probability of a certain partner be a relative of the congressperson #98

Closed

cuducos removed the hacktoberfest label Apr 24, 2017

cuducos force-pushed the master branch from d4a5126 to e23bf1e on September 13, 2018 15:51

This was referenced Oct 4, 2018

Using camara.leg.br API #224

Open

Get names of immediate relatives of each deputy and senator #15

Open
