New ind_11 Inconsistent contract objects across the crisis #8

Open
NiccoloSalvini opened this issue Jan 11, 2023 · 3 comments
Labels: enhancement (New feature or request)

NiccoloSalvini commented Jan 11, 2023

text mining brainstorm

There are a number of techniques that may do the job, but each of them has one or more of the following drawbacks:

  • computationally expensive
  • brings large dependencies
  • not really designed for the Italian language 🇮🇹

As a consequence, we (@giuliogcantone and I) tried to figure out what we can do; these are some of the proposals:

  • a sort of topic mining on contract objects, compared against a dictionary manually annotated by domain experts ❌
  • zero-shot classification (in Italian, using RoBERTa) on very specific critical terms that may suggest there is something suspicious going on ❌
  • take the exact CPV description and compare it against the contract objects using a number of similarity measures (Levenshtein, Jaccard, etc.) ✅

We are currently investigating the third solution, but we are very open to discussing the other two.
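To make the third proposal concrete, here is a minimal self-contained Python sketch of the two similarity measures mentioned (Levenshtein and Jaccard). The CPV description and contract object strings below are invented for illustration; the real pipeline lives in the repo and may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_sim(a: str, b: str) -> float:
    """Normalise edit distance into a [0, 1] similarity."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on bag-of-words sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# Hypothetical example: CPV description vs. contract object text
cpv_desc = "lavori di costruzione di strade"
contract_obj = "costruzione e manutenzione strade comunali"
scores = (jaccard(cpv_desc, contract_obj),
          levenshtein_sim(cpv_desc, contract_obj))
```

A low score on both measures would flag the contract object as inconsistent with its declared CPV; the threshold would have to be calibrated on the data.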

@NiccoloSalvini NiccoloSalvini added the enhancement New feature or request label Jan 11, 2023
@giuliogcantone

I am working on building a pipeline that builds a dictionary from a training set.

Preliminary work is here:

18d7111

Some main issues must be solved in a certain order:

  1. Explicit categories of contracts are not a good classification method; there are too many of them and they are too thin. To solve this, we probably need a different level of aggregation through CPV.
  2. Tokens must be weighted (tf-idf).
  3. I am working on a mock training set; a proper training set should be identified.
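Point 2 (tf-idf weighting) can be sketched in a few lines of plain Python; the three toy "documents" below are invented, whereas presumably the real pipeline would tokenise the contract objects within each CPV aggregation level:

```python
import math
from collections import Counter

def tfidf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Per-document tf-idf weights: tf = count / doc length, idf = log(N / df)."""
    n = len(corpus)
    df = Counter()                       # document frequency of each token
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus: tokenised contract objects (made up for illustration)
docs = [["strade", "asfalto", "strade"],
        ["scuole", "edilizia"],
        ["strade", "manutenzione"]]
w = tfidf(docs)
```

Tokens that appear in every document get weight 0, which is exactly the behaviour wanted here: boilerplate contract language is suppressed, and category-specific terms dominate the dictionary.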

On point 3, consider this:

  • If all the data at our disposal are employed to build the dictionary, any use of ind_11 would just reflect 'divergence from standard patterns' of text, which is still an acceptable property for a semantic indicator, but strictly speaking a textual divergence is not sufficient to characterise a semantic difference. Example: a contracting agency bought some bathrobes, but the contract was classified under bath products. This is a weird bath product, but semantically reasonable, i.e. not suspicious.
  • I do not think CORE needs it, but a dedicated approach would process the semantics of the tokens with a pre-trained model.
  • If all the data at our disposal are employed to build the dictionary, any bias (i.e. a corruption attempt) that is clustered (i.e. it happens more than once; it's a scheme) will be processed as a standard pattern. However, given how I intend to build the dictionary, standard patterns must be really very frequent, hence the risk of false negatives is minimal, at the cost of a high risk of false positives in the test ('low power'). It means that ind_11 will frequently be set to 1 (ON). This is fine, given that this is a system of 12 indicators; it could mess with the weighting schemes of a composite indicator, though.
  • If only safe data (e.g. contracts where the other indicators are set to 0) are employed as the training set, then ind_11 will be stochastically collinear by design with the previous 10 indicators.

The last two points, IMO, are solved by adopting an FA/PCA approach for the composite indicator, which cleans the multivariate structure of multicollinearity.
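A minimal sketch of the PCA idea, assuming the input is a contracts × indicators matrix of 0/1 flags (the matrix below is random, with one column duplicated on purpose to mimic the collinearity concern). This is numpy-only, via eigendecomposition of the correlation matrix, not the actual project code:

```python
import numpy as np

def pca_scores(X: np.ndarray, k: int = 2):
    """Project rows of X onto the top-k principal components of the
    correlation matrix; collinear indicators collapse onto shared axes."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise columns
    corr = np.corrcoef(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)             # eigenvalues ascending
    order = np.argsort(vals)[::-1][:k]            # top-k components
    return Z @ vecs[:, order], vals[order] / vals.sum()

# Hypothetical indicator matrix: 200 contracts x 12 binary indicators
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 12)).astype(float)
X[:, 11] = X[:, 10]   # make "ind_11" collinear with "ind_10" on purpose
scores, explained = pca_scores(X, k=2)
```

The two collinear indicators load on the same component, so the composite score built from the components no longer double-counts them, which is the "polishing" effect described above.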


I also want to know more about RoBERTa and what exactly it (she?) could do; it is pre-trained, right? Because if we need to train it, it would run into the same issues as above.

@giuliogcantone

| macroclass | n of contracts | Gini | Atkinson |
|-----------:|---------------:|-----:|---------:|
| 3 | 430 | 0.267 | 0.1 |
| 9 | 9188 | 0.757 | 0.517 |
| 14 | 455 | 0.297 | 0.113 |
| 15 | 2923 | 0.539 | 0.261 |
| 16 | 226 | 0.347 | 0.14 |
| 18 | 2091 | 0.491 | 0.232 |
| 19 | 575 | 0.385 | 0.162 |
| 22 | 2622 | 0.535 | 0.264 |
| 24 | 2427 | 0.503 | 0.253 |
| 30 | 10799 | 0.69 | 0.438 |
| 31 | 5429 | 0.551 | 0.295 |
| 32 | 4780 | 0.598 | 0.328 |
| 33 | 187380 | 0.829 | 0.628 |
| 34 | 13941 | 0.652 | 0.384 |
| 35 | 2301 | 0.471 | 0.225 |
| 37 | 658 | 0.467 | 0.206 |
| 38 | 4494 | 0.513 | 0.264 |
| 39 | 5018 | 0.568 | 0.311 |
| 41 | 470 | 0.293 | 0.1 |
| 42 | 5902 | 0.626 | 0.383 |
| 43 | 516 | 0.345 | 0.126 |
| 44 | 10078 | 0.644 | 0.386 |
| 45 | 557834 | 0.916 | 0.776 |
| 48 | 4359 | 0.585 | 0.325 |
| 50 | 228193 | 0.959 | 0.886 |
| 51 | 980 | 0.366 | 0.137 |
| 55 | 9027 | 0.727 | 0.479 |
| 60 | 6170 | 0.65 | 0.393 |
| 63 | 3170 | 0.531 | 0.258 |
| 64 | 3744 | 0.621 | 0.342 |
| 65 | 6453 | 0.684 | 0.438 |
| 66 | 6627 | 0.732 | 0.486 |
| 70 | 317 | 0.279 | 0.083 |
| 71 | 61628 | 0.823 | 0.615 |
| 72 | 18364 | 0.735 | 0.491 |
| 73 | 1331 | 0.369 | 0.142 |
| 75 | 1520 | 0.446 | 0.192 |
| 76 | 300 | 0.245 | 0.078 |
| 77 | 7378 | 0.708 | 0.46 |
| 79 | 23322 | 0.741 | 0.494 |
| 80 | 4414 | 0.612 | 0.343 |
| 85 | 26699 | 0.762 | 0.521 |
| 90 | 32129 | 0.773 | 0.538 |
| 92 | 6215 | 0.597 | 0.323 |
| 98 | 32351 | 0.885 | 0.718 |
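For reference, the Gini and Atkinson columns above can be computed per macroclass with standard formulas like the ones below. The input vector here is invented (presumably the real input is contract amounts or frequencies within each macroclass), and the inequality-aversion parameter ε used for the table's Atkinson index is not stated, so ε = 0.5 below is just a placeholder:

```python
def gini(values: list[float]) -> float:
    """Gini coefficient via the sorted-values formula:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with 1-indexed i."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum(i * x for i, x in enumerate(xs, 1))
    return 2 * cum / (n * total) - (n + 1) / n

def atkinson(values: list[float], eps: float = 0.5) -> float:
    """Atkinson index for eps != 1: one minus the ratio of the
    equally-distributed-equivalent value to the mean."""
    n = len(values)
    mu = sum(values) / n
    ede = (sum((x / mu) ** (1 - eps) for x in values) / n) ** (1 / (1 - eps))
    return 1 - ede

# Perfectly equal distributions score 0 on both indices
equal_g = gini([1, 1, 1, 1])
equal_a = atkinson([5, 5, 5])
```

Both indices rise toward 1 as the distribution concentrates, so the large macroclasses (45, 50) being near the top on both columns is internally consistent.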

@NiccoloSalvini
Collaborator Author

Code and business logic are in 18d7111 and 9c80688, plus some other chunks here and there!

@NiccoloSalvini NiccoloSalvini linked a pull request Feb 14, 2023 that will close this issue