Imagine that you are writing a simple web crawler that locates a user-selected element on a website with frequently changing information. You regularly run into the problem that the crawler fails to find the element after minor page updates. After some analysis, you decide to make your analyzer tolerant of minor website changes so that you don’t have to update the code every time.
Write a program that analyzes HTML and finds a specific element, even after changes, using a set of extracted attributes.
- The selector can be set using Typesafe Config's notation, as a JVM system property:
-Dselector=#make-everything-ok-button
- Logs (Logback) are at DEBUG level by default, so the algorithm prints data for some of its steps
- The log level can be set from the command line:
-Dlog.level=INFO
- A prebuilt binary of the algorithm is in the build/libs folder (i.e. dist-1.0-SNAPSHOT-all.jar)
- The binary can be built with the command:
./gradlew shadowJar
- Output for sample pages:
$ java -jar build/libs/dist-1.0-SNAPSHOT-all.jar sample-files/origin.html sample-files/other-1.html
22:10:49.680 [main] DEBUG Application - Selector is: #make-everything-ok-button
22:10:49.683 [main] DEBUG Application - Original path is: sample-files/origin.html
22:10:49.683 [main] DEBUG Application - Other path is: sample-files/other-1.html
22:10:49.943 [main] DEBUG Processor - Data gathered from initial document: HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#make-everything-ok-button, attributes={id=[make-everything-ok-button], class=[btn, btn-success], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okDone(), return false]})
22:10:50.432 [main] DEBUG HtmlInspectionService - Possible element data: WeightAndData(weight=21310.0, data=HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-body > a.btn.btn-success, attributes={class=[btn, btn-success], href=[#check-and-ok], title=[Make-Button], rel=[done], onclick=[javascript:window.okDone(), return false]}))
#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-body > a.btn.btn-success
$ java -jar build/libs/dist-1.0-SNAPSHOT-all.jar sample-files/origin.html sample-files/other-2.html
22:11:47.308 [main] DEBUG Application - Selector is: #make-everything-ok-button
22:11:47.310 [main] DEBUG Application - Original path is: sample-files/origin.html
22:11:47.311 [main] DEBUG Application - Other path is: sample-files/other-2.html
22:11:47.540 [main] DEBUG Processor - Data gathered from initial document: HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#make-everything-ok-button, attributes={id=[make-everything-ok-button], class=[btn, btn-success], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okDone(), return false]})
22:11:48.025 [main] DEBUG HtmlInspectionService - Possible element data: WeightAndData(weight=11410.0, data=HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-body > div.some-container > a.btn.test-link-ok, attributes={class=[btn, test-link-ok], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okComplete(), return false]}))
#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-body > div.some-container > a.btn.test-link-ok
$ java -jar build/libs/dist-1.0-SNAPSHOT-all.jar sample-files/origin.html sample-files/other-3.html
22:12:12.450 [main] DEBUG Application - Selector is: #make-everything-ok-button
22:12:12.454 [main] DEBUG Application - Original path is: sample-files/origin.html
22:12:12.455 [main] DEBUG Application - Other path is: sample-files/other-3.html
22:12:12.685 [main] DEBUG Processor - Data gathered from initial document: HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#make-everything-ok-button, attributes={id=[make-everything-ok-button], class=[btn, btn-success], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okDone(), return false]})
22:12:13.176 [main] DEBUG HtmlInspectionService - Possible element data: WeightAndData(weight=21400.0, data=HtmlOriginData(tagName=a, text=[Do, anything, perfect], cssPath=#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-footer > a.btn.btn-success, attributes={class=[btn, btn-success], href=[#ok], title=[Do-Link], rel=[next], onclick=[javascript:window.okDone(), return false]}))
#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-footer > a.btn.btn-success
$ java -jar build/libs/dist-1.0-SNAPSHOT-all.jar sample-files/origin.html sample-files/other-4.html
22:12:36.723 [main] DEBUG Application - Selector is: #make-everything-ok-button
22:12:36.727 [main] DEBUG Application - Original path is: sample-files/origin.html
22:12:36.727 [main] DEBUG Application - Other path is: sample-files/other-4.html
22:12:37.087 [main] DEBUG Processor - Data gathered from initial document: HtmlOriginData(tagName=a, text=[Make, everything, OK], cssPath=#make-everything-ok-button, attributes={id=[make-everything-ok-button], class=[btn, btn-success], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okDone(), return false]})
22:12:37.569 [main] DEBUG HtmlInspectionService - Possible element data: WeightAndData(weight=21400.0, data=HtmlOriginData(tagName=a, text=[Do, all, GREAT], cssPath=#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-footer > a.btn.btn-success, attributes={class=[btn, btn-success], href=[#ok], title=[Make-Button], rel=[next], onclick=[javascript:window.okFinalize(), return false]}))
#page-wrapper > div.row:nth-child(3) > div.col-lg-8 > div.panel.panel-default > div.panel-footer > a.btn.btn-success
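The weight values in the log output above come from comparing each candidate's attributes with those gathered from the original element. A minimal sketch of that idea follows, assuming a plain sum of per-attribute weights over shared values. The class name `WeightSketch`, the `score` method, and all weight numbers are hypothetical; they do not reproduce the project's actual weights or the totals shown in the logs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WeightSketch {

    // Hypothetical per-attribute weights; the project's real values are not listed in this README.
    static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("id", 100.0);
        WEIGHTS.put("class", 10.0);
        WEIGHTS.put("href", 10.0);
        WEIGHTS.put("title", 10.0);
    }

    // Weight used for attributes that have no explicit entry above (also an assumption).
    static final double DEFAULT_WEIGHT = 1.0;

    /** Adds the attribute's weight once for every origin value that the candidate also has. */
    static double score(Map<String, List<String>> origin, Map<String, List<String>> candidate) {
        double total = 0.0;
        for (Map.Entry<String, List<String>> entry : origin.entrySet()) {
            List<String> candidateValues = candidate.getOrDefault(entry.getKey(), List.of());
            double weight = WEIGHTS.getOrDefault(entry.getKey(), DEFAULT_WEIGHT);
            for (String value : entry.getValue()) {
                if (candidateValues.contains(value)) {
                    total += weight;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Attribute values copied from the origin.html / other-1.html log lines.
        Map<String, List<String>> origin = Map.of(
                "id", List.of("make-everything-ok-button"),
                "class", List.of("btn", "btn-success"),
                "href", List.of("#ok"),
                "title", List.of("Make-Button"));
        Map<String, List<String>> candidate = Map.of(
                "class", List.of("btn", "btn-success"),
                "href", List.of("#check-and-ok"),
                "title", List.of("Make-Button"));
        System.out.println(score(origin, candidate));
    }
}
```

With these made-up weights the candidate shares both class values and the title but neither the id nor the href, so the program prints 30.0. The candidate with the highest score would be the one reported.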
- Configurable weights for different attributes, with the ability to override them in config or at runtime
- Somewhat layered code, ready for a DI framework
- Similar text parts of the element partially contribute to the final weight
- It was fun to work on this task
- Planned to add https://github.com/tdebatty/java-string-similarity for string similarity scoring
- Parents are not used in the detection algorithm, although information about them is collected
- No case-insensitive string comparison
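
The notes above on partial text contribution, planned string similarity, and missing case-insensitive comparison can be illustrated together. The following is a sketch only, using a standard-library Jaccard similarity over lower-cased tokens; it is not the project's current behavior and not the API of java-string-similarity. The class name `TextSimilaritySketch`, its methods, and the `textWeight` value are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TextSimilaritySketch {

    /** Lower-cases every token so that "OK" and "ok" compare as equal. */
    static Set<String> normalize(List<String> tokens) {
        Set<String> out = new HashSet<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Jaccard similarity: shared tokens divided by all distinct tokens, in the range [0, 1]. */
    static double similarity(List<String> a, List<String> b) {
        Set<String> left = normalize(a);
        Set<String> right = normalize(b);
        if (left.isEmpty() && right.isEmpty()) {
            return 1.0; // two empty texts are treated as identical
        }
        Set<String> shared = new HashSet<>(left);
        shared.retainAll(right);
        Set<String> all = new HashSet<>(left);
        all.addAll(right);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        List<String> origin = List.of("Make", "everything", "OK");
        double textWeight = 100.0; // hypothetical weight for the element text

        // Same words in a different case: full contribution.
        System.out.println(textWeight * similarity(origin, List.of("make", "everything", "ok")));
        // Two of four distinct tokens shared: half of the weight.
        System.out.println(textWeight * similarity(origin, List.of("Make", "it", "OK")));
        // No shared tokens, as with the text in other-3.html: no contribution.
        System.out.println(textWeight * similarity(origin, List.of("Do", "anything", "perfect")));
    }
}
```

The three lines printed are 100.0, 50.0, and 0.0. Scaling a similarity in [0, 1] by a weight is one way to let partially matching text add a proportional share to the final weight.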