New ind_11 Inconsistent contract objects across the crisis #8

Open
NiccoloSalvini opened this issue Jan 11, 2023 · 3 comments
Labels: enhancement (New feature or request)

NiccoloSalvini commented Jan 11, 2023

text mining brainstorm

There are a number of techniques that may do the job, but each of them has one or more of the following drawbacks:

  • computationally expensive
  • brings large dependencies
  • not really designed for the Italian language 🇮🇹

As a consequence, we (@giuliogcantone and I) tried to figure out what we can do; these are some of the proposals:

  • a sort of topic mining on contract objects, compared against a dictionary manually annotated by domain experts ❌
  • zero-shot classification (in Italian, using RoBERTa) on very specific critical terms that may suggest there is something suspicious going on ❌
  • take the exact CPV description and compare it against the contract objects using a number of similarity measures (Levenshtein, Jaccard, etc.) ✅

We are currently investigating the third solution, but we are very open to discussing the other two.
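To make the third proposal concrete, here is a minimal self-contained Python sketch of the two similarity measures mentioned (Levenshtein and Jaccard). The CPV description and contract object strings below are invented for illustration; the real pipeline lives in the repo and may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_sim(a: str, b: str) -> float:
    """Normalise edit distance into a [0, 1] similarity."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on bag-of-words sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# Hypothetical example: CPV description vs. contract object text
cpv_desc = "lavori di costruzione di strade"
contract_obj = "costruzione e manutenzione strade comunali"
scores = (jaccard(cpv_desc, contract_obj),
          levenshtein_sim(cpv_desc, contract_obj))
```

A low score on both measures would flag the contract object as inconsistent with its declared CPV; the threshold would have to be calibrated on the data.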

@NiccoloSalvini NiccoloSalvini added the enhancement New feature or request label Jan 11, 2023
@giuliogcantone

I am working on building a pipeline that builds a dictionary from a training set.

Preliminary work is here:

18d7111

Some main issues must be solved in a certain order:

  1. Explicit categories of contracts are not a good classification method; there are too many of them and they are too thin. To solve this, we probably need a different level of aggregation through CPV.
  2. Tokens must be weighted (tf-idf).
  3. I am working on a mock training set; a proper training set should be identified.
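Point 2 (tf-idf weighting) can be sketched in a few lines of plain Python; the three toy "documents" below are invented, whereas presumably the real pipeline would tokenise the contract objects within each CPV aggregation level:

```python
import math
from collections import Counter

def tfidf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Per-document tf-idf weights: tf = count / doc length, idf = log(N / df)."""
    n = len(corpus)
    df = Counter()                       # document frequency of each token
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus: tokenised contract objects (made up for illustration)
docs = [["strade", "asfalto", "strade"],
        ["scuole", "edilizia"],
        ["strade", "manutenzione"]]
w = tfidf(docs)
```

Tokens that appear in every document get weight 0, which is exactly the behaviour wanted here: boilerplate contract language is suppressed, and category-specific terms dominate the dictionary.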

On point 3, consider this:

  • If all the data at our disposal are employed to build the dictionary, any use of ind_11 would just reflect 'divergence from standard patterns' of text, which is still an acceptable property for a semantic indicator, but strictly speaking a textual divergence is not sufficient to characterise a semantic difference. Example: a contracting agency bought some bathrobes, but the contract was classified under bath products. This is a weird bath product, but semantically reasonable, i.e. not suspicious.
  • I do not think CORE needs it, but a dedicated approach would process the semantics of the tokens with a pre-trained model.
  • If all the data at our disposal are employed to build the dictionary, any bias (i.e. a corruption attempt) that is clustered (i.e. it happens more than once; it's a scheme) will be processed as a standard pattern. However, given how I intend to build the dictionary, standard patterns must be really very frequent, hence the risk of false negatives is minimal, at the cost of a high risk of false positives in the test ('low power'). It means that ind_11 will frequently be set to 1 (ON). This is fine, given that this is a system of 12 indicators; it could mess with the weighting schemes of a composite indicator, though.
  • If only safe data (e.g. contracts where the other indicators are set to 0) are employed as the training set, then ind_11 will be stochastically collinear by design with the previous 10 indicators.

The last two points, IMO, are solved by adopting an FA/PCA approach for the composite indicator, which cleans the multivariate structure of multicollinearity.
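A minimal sketch of the PCA idea, assuming the input is a contracts × indicators matrix of 0/1 flags (the matrix below is random, with one column duplicated on purpose to mimic the collinearity concern). This is numpy-only, via eigendecomposition of the correlation matrix, not the actual project code:

```python
import numpy as np

def pca_scores(X: np.ndarray, k: int = 2):
    """Project rows of X onto the top-k principal components of the
    correlation matrix; collinear indicators collapse onto shared axes."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise columns
    corr = np.corrcoef(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)             # eigenvalues ascending
    order = np.argsort(vals)[::-1][:k]            # top-k components
    return Z @ vecs[:, order], vals[order] / vals.sum()

# Hypothetical indicator matrix: 200 contracts x 12 binary indicators
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 12)).astype(float)
X[:, 11] = X[:, 10]   # make "ind_11" collinear with "ind_10" on purpose
scores, explained = pca_scores(X, k=2)
```

The two collinear indicators load on the same component, so the composite score built from the components no longer double-counts them, which is the "polishing" effect described above.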


I also want to know more about RoBERTa and what exactly it (she?) could do; it is pre-trained, right? Because if we need to train it, it would run into the same issues as above.

@giuliogcantone

| macroclass | n of contracts | Gini | Atkinson |
|-----------:|---------------:|-----:|---------:|
| 3 | 430 | 0.267 | 0.1 |
| 9 | 9188 | 0.757 | 0.517 |
| 14 | 455 | 0.297 | 0.113 |
| 15 | 2923 | 0.539 | 0.261 |
| 16 | 226 | 0.347 | 0.14 |
| 18 | 2091 | 0.491 | 0.232 |
| 19 | 575 | 0.385 | 0.162 |
| 22 | 2622 | 0.535 | 0.264 |
| 24 | 2427 | 0.503 | 0.253 |
| 30 | 10799 | 0.69 | 0.438 |
| 31 | 5429 | 0.551 | 0.295 |
| 32 | 4780 | 0.598 | 0.328 |
| 33 | 187380 | 0.829 | 0.628 |
| 34 | 13941 | 0.652 | 0.384 |
| 35 | 2301 | 0.471 | 0.225 |
| 37 | 658 | 0.467 | 0.206 |
| 38 | 4494 | 0.513 | 0.264 |
| 39 | 5018 | 0.568 | 0.311 |
| 41 | 470 | 0.293 | 0.1 |
| 42 | 5902 | 0.626 | 0.383 |
| 43 | 516 | 0.345 | 0.126 |
| 44 | 10078 | 0.644 | 0.386 |
| 45 | 557834 | 0.916 | 0.776 |
| 48 | 4359 | 0.585 | 0.325 |
| 50 | 228193 | 0.959 | 0.886 |
| 51 | 980 | 0.366 | 0.137 |
| 55 | 9027 | 0.727 | 0.479 |
| 60 | 6170 | 0.65 | 0.393 |
| 63 | 3170 | 0.531 | 0.258 |
| 64 | 3744 | 0.621 | 0.342 |
| 65 | 6453 | 0.684 | 0.438 |
| 66 | 6627 | 0.732 | 0.486 |
| 70 | 317 | 0.279 | 0.083 |
| 71 | 61628 | 0.823 | 0.615 |
| 72 | 18364 | 0.735 | 0.491 |
| 73 | 1331 | 0.369 | 0.142 |
| 75 | 1520 | 0.446 | 0.192 |
| 76 | 300 | 0.245 | 0.078 |
| 77 | 7378 | 0.708 | 0.46 |
| 79 | 23322 | 0.741 | 0.494 |
| 80 | 4414 | 0.612 | 0.343 |
| 85 | 26699 | 0.762 | 0.521 |
| 90 | 32129 | 0.773 | 0.538 |
| 92 | 6215 | 0.597 | 0.323 |
| 98 | 32351 | 0.885 | 0.718 |
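For reference, the Gini and Atkinson columns above can be computed per macroclass with standard formulas like the ones below. The input vector here is invented (presumably the real input is contract amounts or frequencies within each macroclass), and the inequality-aversion parameter ε used for the table's Atkinson index is not stated, so ε = 0.5 below is just a placeholder:

```python
def gini(values: list[float]) -> float:
    """Gini coefficient via the sorted-values formula:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with 1-indexed i."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum(i * x for i, x in enumerate(xs, 1))
    return 2 * cum / (n * total) - (n + 1) / n

def atkinson(values: list[float], eps: float = 0.5) -> float:
    """Atkinson index for eps != 1: one minus the ratio of the
    equally-distributed-equivalent value to the mean."""
    n = len(values)
    mu = sum(values) / n
    ede = (sum((x / mu) ** (1 - eps) for x in values) / n) ** (1 / (1 - eps))
    return 1 - ede

# Perfectly equal distributions score 0 on both indices
equal_g = gini([1, 1, 1, 1])
equal_a = atkinson([5, 5, 5])
```

Both indices rise toward 1 as the distribution concentrates, so the large macroclasses (45, 50) being near the top on both columns is internally consistent.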

@NiccoloSalvini
Collaborator Author

Code and business logic are in 18d7111 and 9c80688, plus some other chunks here and there!

@NiccoloSalvini NiccoloSalvini linked a pull request Feb 14, 2023 that will close this issue