java.lang.IndexOutOfBoundsException in WikiTextParser #257

Open
cheetah90 opened this issue Dec 8, 2015 · 1 comment


I am parsing the Spanish Wikipedia XML dumps using WikiTextParser and getting the following error. By the end of the run, there are more than 4,000 IndexOutOfBoundsException errors.

Another odd thing about parsing the Spanish Wikipedia, which may or may not be related, is that no Namespace.CATEGORY pages are parsed and the category_members table ends up very small. Is it possible that the category pages are the ones triggering the IndexOutOfBoundsException errors?

22:14:08.314 [pool-4-thread-8] INFO  org.wikibrain.parser.wiki.LocalLinkVisitor - Visited link #4000000
22:14:09.392 [pool-4-thread-6] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 420000
22:14:10.670 [pool-4-thread-7] WARN  org.wikibrain.parser.wiki.WikiTextDumpParser - exception while parsing unknown
java.lang.IndexOutOfBoundsException: Index: 2938, Size: 2938
    at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_66]
    at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_66]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.SpanManager.getSrcPos(SpanManager.java:63) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.buildNestedList(ModularParser.java:1234) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parseSections(ModularParser.java:592) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parse(ModularParser.java:401) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at org.wikibrain.parser.wiki.WikiTextParser.parse(WikiTextParser.java:64) ~[classes/:?]
    at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:97) [classes/:?]
    at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:76) [classes/:?]
    at org.wikibrain.utils.ParallelForEach$4.run(ParallelForEach.java:177) [classes/:?]
    at org.wikibrain.utils.ParallelForEach$BoundedExecutor$1.run(ParallelForEach.java:257) [classes/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]
22:14:14.156 [pool-4-thread-8] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 430000
22:14:20.485 [pool-4-thread-6] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 440000
22:14:29.073 [pool-4-thread-8] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 450000
@shilad
Owner

shilad commented Dec 9, 2015

These exceptions occur regularly; they come from the underlying parsing library (de.tudarmstadt) and appear to be mostly benign.
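
One way to check whether the failures cluster in category pages is to catch the exception per page and count failures by namespace. The sketch below is an illustration under stated assumptions, not WikiBrain or de.tudarmstadt code: `Page`, `tallyFailures`, and `toyParser` are hypothetical names, and the toy parser only reproduces an off-by-one `ArrayList.get(size)` call of the kind shown in the trace (`Index: 2938, Size: 2938`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Sketch: isolate per-page parser failures and tally them by namespace. */
class ParseFailureTally {

    /** Hypothetical stand-in for a dump page; not a WikiBrain class. */
    static final class Page {
        final String namespace;
        final String title;
        final String wikitext;

        Page(String namespace, String title, String wikitext) {
            this.namespace = namespace;
            this.title = title;
            this.wikitext = wikitext;
        }
    }

    /**
     * Runs the given parser on every page. A runtime exception on one page
     * is caught and counted under that page's namespace, so the remaining
     * pages are still processed.
     */
    static Map<String, Integer> tallyFailures(List<Page> pages, Consumer<String> parser) {
        Map<String, Integer> failures = new TreeMap<>();
        for (Page page : pages) {
            try {
                parser.accept(page.wikitext);
            } catch (RuntimeException e) {
                failures.merge(page.namespace, 1, Integer::sum);
            }
        }
        return failures;
    }

    /**
     * Toy parser (an assumption for illustration): on text starting with
     * "*" it reads index == size, which throws IndexOutOfBoundsException.
     */
    static void toyParser(String wikitext) {
        List<Character> chars = new ArrayList<>();
        for (char c : wikitext.toCharArray()) {
            chars.add(c);
        }
        if (wikitext.startsWith("*")) {
            chars.get(chars.size()); // valid indices end at size - 1
        }
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                new Page("CATEGORY", "Category:Example", "* list item"),
                new Page("ARTICLE", "Example", "plain text"),
                new Page("ARTICLE", "Another", "* another list"));
        // Prints the failure count per namespace.
        System.out.println(tallyFailures(pages, ParseFailureTally::toyParser));
    }
}
```

If the counts for the category namespace are near zero while category_members is still small, the exceptions are probably not the cause of the missing categories, and the namespace detection for the Spanish dump would be the next thing to look at.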
