SiteCheck is a Haskell library for web developers and web site administrators which allows for the creation of crawlers which can find and report problems within a single domain. It is inspired by QuickCheck and meant to fill a space somewhere between QuickCheck and tools like Canoo WebTest and Selenium.
The core feature of SiteCheck is that it will simply crawl every page of a web site. In order to successfully crawl a site, all links must actually work. If they do not work (they return a non 200 status code) they will be reported. Crawling a web site is not as easy as it sounds. Every site is different. Perhaps a site has a secure area which requires a login or maybe even a legal disclaimer that must be agreed to in order to continue. A new SiteCheck crawler can be created from a set of decisions. Each decision specifies some action to take when a specific URL is encountered. Decisions allow custom logic and actions to be added to the crawler.
Another aspect of crawling is how new links are added to the list of links to check. SiteCheck will read all href attributes from a page into a list of strings. You have complete control over how these strings are converted to URLs. Functions may be provided to map and filter these links. This gives you complete control over what is crawled and how it is crawled.
SiteCheck will automatically handle cookies and secure connections.
SiteCheck will automatically follow redirects until either a non redirecting status code is returned or a cycle is detected. Redirect status codes are not reported unless a cycle is detected. Otherwise the final status code is reported. Output can be configured to show all redirects.
Once SiteCheck can successfully crawl a web site, it provides you with only a minimal amount of useful information. It gets more interesting when you begin to add extensions. In general, an extension is something which may be added to the crawler to allow it to report any information that can be obtained from analyzing the contents of a web page.
The first extension, which is currently under development, is a CSS Cleaner which will report all CSS rules which are not being used (this is the original problem which I set out to solve). It is not hard to imagine other useful extensions.
Once the CSS Cleaner is implemented, it should be easy for other developers to provide their own extensions. The next major goal will be to create an external DSL which will allow for test scripts to be written in a concise syntax outside of Haskell (because, seriously, how many web developers and web site administrators want to deal with Haskell?).
Finally, this is a new project so there is no documentation other than this file and the source code. To get started, look at the example script in Main.hs.
To work with this project, use some combination of the following make commands:
make clean
make sitecheck
make tests
make cov
To use this library, you will need to have the Haskell Platform installed. Next, clone this repository, navigate into the top level directory and run the following commands:
cabal configure
cabal build
cabal install
When we get to version 0.1.0, sitecheck will be added to hackage.
Other than bringing this library into a stable condition, I do have some ideas for improvements and added usefulness.
- Read and Check all hrefs, not just anchors.
- Check that redirect cycles are being reported. If not then fix it.
- Create the CSS Cleaner Extension.
- Create an external DSL so that users will not get to write scripts in Haskell.
- Read parent pages from an output and then recheck them to ensure that problems have been resolved.
Copyright © 2011 Brenton Ashworth
Distributed under the BSD3 license. See the file LICENSE.