I am parsing the Spanish Wikipedia XML dumps using WikiTextParser and getting the error below. By the end of the run there are 4000+ IndexOutOfBoundsException warnings.
Another odd thing about the Spanish Wikipedia parse, which may or may not be related: no Namespace.CATEGORY pages appear to be parsed, and the category_members table ends up very small. Is it possible that the category pages are the ones triggering the IndexOutOfBoundsException?
22:14:08.314 [pool-4-thread-8] INFO org.wikibrain.parser.wiki.LocalLinkVisitor - Visited link #4000000
22:14:09.392 [pool-4-thread-6] INFO org.wikibrain.utils.ParallelForEach - processing iterable 420000
22:14:10.670 [pool-4-thread-7] WARN org.wikibrain.parser.wiki.WikiTextDumpParser - exception while parsing unknown
java.lang.IndexOutOfBoundsException: Index: 2938, Size: 2938
at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_66]
at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_66]
at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.SpanManager.getSrcPos(SpanManager.java:63) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.buildNestedList(ModularParser.java:1234) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parseSections(ModularParser.java:592) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parse(ModularParser.java:401) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
at org.wikibrain.parser.wiki.WikiTextParser.parse(WikiTextParser.java:64) ~[classes/:?]
at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:97) [classes/:?]
at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:76) [classes/:?]
at org.wikibrain.utils.ParallelForEach$4.run(ParallelForEach.java:177) [classes/:?]
at org.wikibrain.utils.ParallelForEach$BoundedExecutor$1.run(ParallelForEach.java:257) [classes/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]
22:14:14.156 [pool-4-thread-8] INFO org.wikibrain.utils.ParallelForEach - processing iterable 430000
22:14:20.485 [pool-4-thread-6] INFO org.wikibrain.utils.ParallelForEach - processing iterable 440000
22:14:29.073 [pool-4-thread-8] INFO org.wikibrain.utils.ParallelForEach - processing iterable 450000
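To check whether the failures cluster in category pages, one option would be to count the exceptions per namespace instead of only logging them. The sketch below is not WikiBrain code: `Page`, `tally`, and the toy parser are hypothetical stand-ins, and the toy parser merely throws the same exception type as in the trace above. The real check would need the page's namespace to be available where WikiTextDumpParser catches the exception, which I am assuming is possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

public class ParseFailureTally {

    /** Hypothetical stand-in for a dump page: a namespace label plus raw wikitext. */
    public static final class Page {
        final String namespace;
        final String text;

        public Page(String namespace, String text) {
            this.namespace = namespace;
            this.text = text;
        }
    }

    /**
     * Runs the given parser over every page and counts how many pages in each
     * namespace fail with an IndexOutOfBoundsException. Other exceptions propagate.
     */
    public static Map<String, Integer> tally(List<Page> pages, Consumer<String> parser) {
        Map<String, Integer> failures = new TreeMap<>();
        for (Page page : pages) {
            try {
                parser.accept(page.text);
            } catch (IndexOutOfBoundsException e) {
                // Increment the failure count for this page's namespace.
                failures.merge(page.namespace, 1, Integer::sum);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Toy parser (assumption): fails on list markup by reading past the end
        // of an empty list, which raises IndexOutOfBoundsException.
        Consumer<String> toyParser = text -> {
            if (text.startsWith("*")) {
                new ArrayList<Integer>().get(0);
            }
        };

        List<Page> pages = Arrays.asList(
                new Page("ARTICLE", "* item"),
                new Page("ARTICLE", "plain text"),
                new Page("CATEGORY", "* a"),
                new Page("CATEGORY", "* b"));

        System.out.println(tally(pages, toyParser)); // prints {ARTICLE=1, CATEGORY=2}
    }
}
```

If nearly all of the 4000+ failures landed in the category namespace, that would support the idea that the two problems are related; an even spread across namespaces would suggest they are separate issues.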