validate-parlamint speedup #846

matyaskopp · 2024-02-28T16:18:39Z

I have been exploring why the validation is so slow.

jing

jing can validate multiple files against the same schema in parallel. These are the timings on a 64-thread CPU, in seconds:

# of files	loading schema	validating	total time
1	.335	.751	1.086
112	.314	25.774	26.088
297	.338	59.574	59.912
1149	.335	248.152	248.487

We can speed up jing about 5 times, but the order of the output will be different - not file by file.
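A rough check of the 5x figure, using only the file counts and total times from the table above. It assumes that validating files one at a time costs the single-file total (1.086 s) per file, which is an assumption about how the one-by-one run behaves, not a separate measurement.

```python
# Timings copied from the table above: (number of files, total seconds).
runs = [(1, 1.086), (112, 26.088), (297, 59.912), (1149, 248.487)]

# Assumed cost of one jing invocation on a single file (first row).
single_file_cost = runs[0][1]


def speedup(n_files, total_seconds):
    """Ratio of the one-file-per-run cost to the batched per-file cost."""
    per_file = total_seconds / n_files
    return single_file_cost / per_file


for n, total in runs[1:]:
    print(f"{n:5d} files: {total / n:.3f} s/file, speedup {speedup(n, total):.1f}x")
```

On these numbers the per-file cost drops to roughly 0.2 s, so the ratio lands between about 4.7x and 5.4x depending on the row.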

matyaskopp · 2024-02-28T16:19:36Z

> We can speed up jing about 5 times, but the order of the output will be different - not file by file.

@TomazErjavec do we insist on this order?
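If the file-by-file order does matter, the messages from a batched run could be regrouped afterwards. A minimal sketch, assuming each jing message starts with `path:line:column:` (an assumption about the output format, not something confirmed in this thread); `group_by_file` is a hypothetical helper, not part of validate-parlamint.pl.

```python
import re
from collections import defaultdict

# Assumed message shape: "<path>:<line>:<column>: <rest of message>".
MESSAGE = re.compile(r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): (?P<rest>.*)$")


def group_by_file(output_lines, file_order):
    """Reorder messages so they follow file_order, keeping per-file order."""
    by_file = defaultdict(list)
    for raw in output_lines:
        match = MESSAGE.match(raw)
        key = match.group("path") if match else None
        by_file[key].append(raw)

    ordered = []
    for path in file_order:
        ordered.extend(by_file.pop(path, []))
    # Anything left (unknown files, lines that did not match) goes last.
    for leftover in by_file.values():
        ordered.extend(leftover)
    return ordered


mixed = ["b.xml:3:1: error: x", "a.xml:1:2: error: y", "b.xml:9:4: error: z"]
print("\n".join(group_by_file(mixed, ["a.xml", "b.xml"])))
```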

TomazErjavec · 2024-02-29T08:11:03Z

I actually don't think jing is the bottleneck; rather, it is the XSLT validation that is slow. Also, validate-parlamint.pl takes files one by one, so it would be difficult to just run jing in parallel. In short, I don't think it's worth trying to give jing multiple files.

matyaskopp · 2024-02-29T19:25:25Z

> I actually don't think jing is the bottleneck; rather, it is the XSLT validation that is slow. Also, validate-parlamint.pl takes files one by one, so it would be difficult to just run jing in parallel. In short, I don't think it's worth trying to give jing multiple files.

I have tried it, and validate-parlamint is about 25% faster (tested on LV) with Jing passing multiple files to jing.
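What "passing multiple files to jing" could look like is sketched below. It relies on jing accepting a schema followed by any number of XML files on the command line. The executable name `jing` on the PATH, the batch size, and the helper names are assumptions for illustration; this is not the staged change to validate-parlamint.pl.

```python
import subprocess


def batches(files, size):
    """Yield consecutive slices of at most `size` files."""
    for start in range(0, len(files), size):
        yield files[start:start + size]


def jing_command(schema, files):
    """Build one jing invocation: schema first, then the files to validate."""
    return ["jing", schema, *files]


def validate_in_batches(schema, files, size=100):
    """Run jing once per batch; return (files, exit code, output) per batch."""
    results = []
    for chunk in batches(files, size):
        proc = subprocess.run(
            jing_command(schema, chunk), capture_output=True, text=True
        )
        # Keep both streams, since where messages land may vary by setup.
        results.append((chunk, proc.returncode, proc.stdout + proc.stderr))
    return results
```

Batching keeps each command line to a bounded length while still paying the schema-loading cost once per batch rather than once per file.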

TomazErjavec · 2024-02-29T21:21:07Z

> about 25% faster (tested on LV)

ok, but I still think it is not worth it given the other problems with this approach. This might save 10% processing time, if that.

> with Jing passing multiple files to jing.

Huh?

matyaskopp · 2024-03-01T12:47:43Z

Ok, I have staged my changes.

Another place where we could speed things up is the link-checker: transform teiCorpus/teiHeader into a smaller temporary XML file that contains just a list of elements with IDs. Parsing this file could be faster, but the impact would be small too...
So no speedup for now, and I am moving this to the Future milestone...
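The ID-list idea can be sketched as follows, assuming the header is available as a single XML string with XIncludes already resolved. The output element names `ids` and `id` and both function names are hypothetical; the actual link-checker may need more than the bare ID values.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def collect_ids(xml_text):
    """Return all xml:id values in document order."""
    root = ET.fromstring(xml_text)
    return [el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None]


def build_id_index(ids):
    """Serialize the IDs as a small standalone XML document."""
    index = ET.Element("ids")
    for value in ids:
        ET.SubElement(index, "id").text = value
    return ET.tostring(index, encoding="unicode")


sample = (
    '<teiHeader xmlns="http://www.tei-c.org/ns/1.0">'
    '<person xml:id="p1"/><org xml:id="o1"><name>x</name></org>'
    "</teiHeader>"
)
print(build_id_index(collect_ids(sample)))
```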

TomazErjavec · 2024-03-01T16:42:10Z

> Another place where we could speed things up is the link-checker: transform teiCorpus/teiHeader into a smaller temporary XML file that contains just a list of elements with IDs. Parsing this file could be faster, but the impact would be small too...

Yes, I think very small - the complete teiHeader (with everything XIncluded) fits into the memory of any computer powerful enough to process the corpus.

matyaskopp added the enhancement label Feb 28, 2024

matyaskopp self-assigned this Feb 28, 2024

matyaskopp added this to the Future milestone Mar 1, 2024
